Thanks for the clarification, Shirish. Is the current 'ijv' format matrix output of Univar-Stats.dml used in any other built-in script?
If not, I'd like to suggest a small change, in addition to (or instead of) the user-friendly version, that makes the output easier to read: switch 'i' and 'j' in the output. That is, order the rows of the matrix by variable (the original j column), then by statistic type (the original i column). This way the information for each variable is grouped together. There might be situations where grouping by statistic type makes more sense, but I feel the other way is more commonly used.

Ethan

On Thu, Feb 4, 2016 at 10:10 PM, Shirish Tatikonda <[email protected]> wrote:

> Just to clarify: the current output is actually a matrix, in which rows
> denote stats and columns denote input variables. So, the output you see is
> simply the univariate stats matrix in IJV format.
> In a general case, the primary data type for input/output and computations
> in SystemML is a *matrix* (of course, *scalar* as well) -- with one
> exception of a *frame* type (which is used only in the context of
> *transform*).
>
> I agree with you that providing user-friendly output as in R output is very
> useful for data scientists -- it however requires a lot of effort to
> support such a functionality.
>
> Shirish
>
> On Wed, Feb 3, 2016 at 9:09 PM, Ethan Xu <[email protected]> wrote:
>
> > Thank you Deron. From my personal experience printing a single type of
> > user-friendly result on console is usually enough for a quick inspection.
> > However that's in an interactive environment (like R interactive session),
> > where recreating the printout is simple.
> >
> > Since calling a dml script on hadoop might trigger a MapReduce job maybe
> > it's better to save the user-friendly version as a file too? Or perhaps
> > it's helpful to have a script that takes the original summary (plus some
> > metadata) as input, and produces the user-friendly output?
> >
> > Best,
> >
> > Ethan
> >
> >
> > From: Deron Eriksson <[email protected]>
> > To: [email protected]
> > Date: 02/03/2016 01:13 AM
> > Subject: Re: User friendly output of univariate statistics
> >
> >
> > Hi Ethan,
> >
> > I think you make a great point with regards to the readability of the
> > output from Univar-Stats.dml.
> >
> > Do you think outputting the user-friendly results in the format you
> > describe to the console while still writing the more mathematical results
> > to a file would be the type of behavior that you would find most useful?
> > Or would you also like to see the user-friendly results also sent to a
> > file?
> >
> > Also, I was wondering, do you think a single user-friendly format is
> > sufficient, or do you think that data scientists would like (or expect)
> > to be able to have multiple formats such as you described?
> >
> > The table format is very interesting. Currently DML has a basic print
> > statement, but I don't believe it can be used to format data into
> > columns, such as in your table format example. It might be very nice to
> > add a c-style "printf" statement, which would allow results to be written
> > to the console in a more columnar format.
> >
> > Does anyone else have any thoughts?
> >
> > Deron
> >
> >
> > On Tue, Feb 2, 2016 at 8:32 AM, Ethan Xu <[email protected]> wrote:
> >
> > > dml is quite amazing. I was wondering if there is a user friendly (more
> > > human readable) version of outputs from Univar-Stats.dml? I ran the
> > > Univar-Stats.dml on my data set that contains 7 variables: two
> > > continuous, one categorical. The output is a csv file on HDFS that
> > > looks like this:
> > >
> > > 1 1 10.0
> > > 2 1 123.0
> > > 2 7 469.0
> > > 3 1 122.0
> > > 3 7 419.0
> > > 4 1 34.852512104922082
> > > 4 7 0.40786451178676335
> > > 5 1 613.6600902369631
> > > 5 7 1.5322171660886
> > > 6 1 25.566777079580508
> > > 6 7 5.54382044429201915
> > > 7 1 0.219263232610989764
> > > 7 7 12.14558700418414E-4
> > > 8 1 0.5323447433694138
> > > 8 7 1.23151883029726626
> > > 9 1 0.28352047550156284
> > > 9 7 23.25049533659206
> > > 10 1 -0.5348573740280274
> > > 10 7 2023.294658877635
> > > 11 1 2.874872545380876E-4
> > > 11 7 1.874872545380876E-4
> > > 12 1 6.0017749742760714085
> > > 12 7 0.00237749742760714085
> > > 13 1 12.0
> > > 14 1 30.56066514110724
> > > 15 2 4.0
> > > ---- truncated (numbers randomly modified)
> > >
> > > According to the documentation on
> > > http://apache.github.io/incubator-systemml/algorithms-descriptive-statistics.html#univariate-statistics
> > > , the first column of the matrix represents statistics type (minimum,
> > > mean, etc.), the second column represents variable ID and the last
> > > column gives the statistics value.
> > >
> > > While the documentation is very clear and the results are consistent
> > > with outputs of other software like R, I found the format a bit
> > > inconvenient since I have to refer to the reference Table (table 1 in
> > > aforementioned link) to understand the summary statistics.
> > >
> > > I understand that the pure numeric matrix format is easy to use as
> > > machine input for future steps. An additional table that is more human
> > > readable would be nice since the main purpose of uni-variate statistics
> > > is often exploratory data analysis and a clear summary is essential.
> > >
> > > Suggestions to consider in the readable summary if there's not already
> > > one:
> > > 1. Order the rows according to variables (column 2) instead of
> > > statistics type (column 1), so that summary statistics of the same
> > > variable are grouped together.
> > > 2. Use actual statistics labels ("min", "mean", "skewness" etc) instead
> > > of IDs (1, 2, etc).
> > > 3. Use actual predictor labels ("age", "gender", etc) instead of IDs
> > > (1,2, etc).
> > > 4. Use level labels for categorical predictors ("male", "female", etc)
> > > instead of IDs (1,2, etc).
> > > 5. Add counts of cases in each level for categorical variable in
> > > addition to modes. This gives the distribution information of the
> > > variable.
> > > 6. If the amount of data in the summary is manageable perhaps
> > > automatically pull the output of Univar-Stats.dml from HDFS to local
> > > machine and display the readable version on terminal?
> > >
> > > So the output could look like:
> > >
> > > age min 10
> > > age max 123
> > > age range 113
> > > age mean 60
> > > ...
> > > gender female.count 1000
> > > gender male.count 2000
> > > gender mode male
> > > ...
> > >
> > > or even a table format like in R:
> > >
> > >       age           gender
> > > min    10    female  1000
> > > max   123    male    2000
> > > range 113    mode    male
> > > mean   60    ...
> > > ...
> > >
> > > Thanks much,
> > >
> > > Ethan Xu
> > >

--
Yifan "Ethan" Xu, PhD
Data Scientist / Statistician
Explorys, IBM Watson Health
Adjunct Faculty
Department of Epidemiology and Biostatistics
Case Western Reserve University
--------------
Email: [email protected]
Phone: (607) 760-6817
--------------
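
P.S. A rough sketch of the reordering proposed at the top of this message, written in Python rather than DML as a post-processing step on the IJV output: it reads the triples, sorts them by variable and then by statistic, and swaps IDs for labels. The label maps, the function names and the fallback names (such as var_7) are made up for illustration; the statistic IDs should be checked against Table 1 of the documentation linked in the quoted thread. The sample rows are the first few from the truncated output quoted in the thread.

```python
# Hypothetical label maps: confirm the statistic IDs against Table 1 of the
# documentation and supply your own variable names.
STAT_LABELS = {1: "min", 2: "max", 3: "range", 4: "mean"}
VAR_LABELS = {1: "age"}


def parse_ijv(text):
    """Parse whitespace- or comma-separated 'i j v' lines into (i, j, v) tuples."""
    triples = []
    for line in text.splitlines():
        fields = line.replace(",", " ").split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        i, j, v = fields
        triples.append((int(i), int(j), float(v)))
    return triples


def readable_rows(triples, stat_labels=STAT_LABELS, var_labels=VAR_LABELS):
    """Sort by variable (j) first, then by statistic (i), and attach labels."""
    rows = []
    for i, j, v in sorted(triples, key=lambda t: (t[1], t[0])):
        var = var_labels.get(j, f"var_{j}")      # fall back to a generic name
        stat = stat_labels.get(i, f"stat_{i}")   # fall back to a generic name
        rows.append((var, stat, v))
    return rows


sample = """1 1 10.0
2 1 123.0
2 7 469.0
3 1 122.0
3 7 419.0"""

for var, stat, value in readable_rows(parse_ijv(sample)):
    print(f"{var:<8}{stat:<8}{value:g}")
```

Since the sort key is (j, i), all statistics of one variable come out together, which is the grouping suggested above; swapping the key back to (i, j) gives the current ordering.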
