You can accomplish that by writing the transpose of the stats matrix, i.e., by changing write(baseStats, $STATS) to write(t(baseStats), $STATS) in Univar-Stats.dml. After the transpose, rows denote input variables and columns denote stats.
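To make the effect concrete, here is a rough sketch in plain Python (not DML) of what the transpose does to the IJV listing. It assumes the cells are written in row-major order, as in the sample output quoted below; the function name transpose_ijv is made up for this illustration and is not part of SystemML.

```python
def transpose_ijv(cells):
    """Swap the row and column index of each (i, j, v) cell, then re-sort row-major.

    Hypothetical helper for illustration only; it mimics what writing
    t(baseStats) instead of baseStats does to the IJV listing, assuming
    the cells are emitted in row-major order.
    """
    return sorted((j, i, v) for (i, j, v) in cells)


# A few cells copied from the sample output quoted below:
# i = statistic ID, j = variable ID, v = value.
cells = [
    (1, 1, 10.0),
    (2, 1, 123.0),
    (2, 7, 469.0),
    (3, 1, 122.0),
    (3, 7, 419.0),
]

# After the swap, the first column is the variable ID,
# so all stats of one variable sit next to each other.
transposed = transpose_ijv(cells)
for var_id, stat_id, value in transposed:
    print(var_id, stat_id, value)
```

The statistic IDs still have to be looked up in the documentation table; the transpose only changes the grouping, not the labels.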

On Thu, Feb 4, 2016 at 7:33 PM, Ethan Xu <[email protected]> wrote:

> Thanks for the clarification Shirish. Is the current 'ijv' format matrix of
> the Univar-Stats.dml output used in any other built-in script?
>
> If not I'd like to suggest a small change besides (or without) the user
> friendly version that makes outcomes easier to read: switch 'i' and 'j' in
> the outcome. That is, order rows of the matrix according to variables
> (original j column) then the statistics type (original i column). This way
> the info of one variable is grouped together.
>
> There might be situations where grouping by statistics types makes more
> sense, but I felt the other way is more commonly used.
>
> Ethan
>
> On Thu, Feb 4, 2016 at 10:10 PM, Shirish Tatikonda <[email protected]> wrote:
>
> > Just to clarify: the current output is actually a matrix, in which rows
> > denote stats and columns denote input variables. So, the output you see
> > is simply the univariate stats matrix in IJV format.
> > In a general case, the primary data type for input/output and
> > computations in SystemML is a *matrix* (of course, *scalar* as well) --
> > with one exception of a *frame* type (which is used only in the context
> > of *transform*).
> >
> > I agree with you that providing user-friendly output as in R output is
> > very useful for data scientists -- it however requires a lot of effort
> > to support such a functionality.
> >
> > Shirish
> >
> > On Wed, Feb 3, 2016 at 9:09 PM, Ethan Xu <[email protected]> wrote:
> >
> > > Thank you Deron. From my personal experience printing a single type of
> > > user-friendly result on console is usually enough for a quick
> > > inspection. However that's in an interactive environment (like R
> > > interactive session), where recreating the printout is simple.
> > >
> > > Since calling a dml script on hadoop might trigger a MapReduce job
> > > maybe it's better to save the user-friendly version as a file too?
> > > Or perhaps it's helpful to have a script that takes the original
> > > summary (plus some metadata) as input, and produces the user-friendly
> > > output?
> > >
> > > Best,
> > >
> > > Ethan
> > >
> > > From: Deron Eriksson <[email protected]>
> > > To: [email protected]
> > > Date: 02/03/2016 01:13 AM
> > > Subject: Re: User friendly output of univariate statistics
> > >
> > > Hi Ethan,
> > >
> > > I think you make a great point with regards to the readability of the
> > > output from Univar-Stats.dml.
> > >
> > > Do you think outputting the user-friendly results in the format you
> > > describe to the console while still writing the more mathematical
> > > results to a file would be the type of behavior that you would find
> > > most useful? Or would you also like to see the user-friendly results
> > > sent to a file?
> > >
> > > Also, I was wondering, do you think a single user-friendly format is
> > > sufficient, or do you think that data scientists would like (or
> > > expect) to be able to have multiple formats such as you described?
> > >
> > > The table format is very interesting. Currently DML has a basic print
> > > statement, but I don't believe it can be used to format data into
> > > columns, such as in your table format example. It might be very nice
> > > to add a c-style "printf" statement, which would allow results to be
> > > written to the console in a more columnar format.
> > >
> > > Does anyone else have any thoughts?
> > >
> > > Deron
> > >
> > > On Tue, Feb 2, 2016 at 8:32 AM, Ethan Xu <[email protected]> wrote:
> > >
> > > > dml is quite amazing. I was wondering if there is a user friendly
> > > > (more human readable) version of outputs from Univar-Stats.dml? I
> > > > ran the Univar-Stats.dml on my data set that contains 7 variables:
> > > > two continuous, one categorical. The output is a csv file on HDFS
> > > > that looks like this:
> > > >
> > > > 1 1 10.0
> > > > 2 1 123.0
> > > > 2 7 469.0
> > > > 3 1 122.0
> > > > 3 7 419.0
> > > > 4 1 34.852512104922082
> > > > 4 7 0.40786451178676335
> > > > 5 1 613.6600902369631
> > > > 5 7 1.5322171660886
> > > > 6 1 25.566777079580508
> > > > 6 7 5.54382044429201915
> > > > 7 1 0.219263232610989764
> > > > 7 7 12.14558700418414E-4
> > > > 8 1 0.5323447433694138
> > > > 8 7 1.23151883029726626
> > > > 9 1 0.28352047550156284
> > > > 9 7 23.25049533659206
> > > > 10 1 -0.5348573740280274
> > > > 10 7 2023.294658877635
> > > > 11 1 2.874872545380876E-4
> > > > 11 7 1.874872545380876E-4
> > > > 12 1 6.0017749742760714085
> > > > 12 7 0.00237749742760714085
> > > > 13 1 12.0
> > > > 14 1 30.56066514110724
> > > > 15 2 4.0
> > > > ---- truncated (numbers randomly modified)
> > > >
> > > > According to the documentation on
> > > >
> > > > http://apache.github.io/incubator-systemml/algorithms-descriptive-statistics.html#univariate-statistics
> > > >
> > > > , the first column of the matrix represents statistics type
> > > > (minimum, mean, etc.), the second column represents variable ID and
> > > > the last column gives the statistics value.
> > > >
> > > > While the documentation is very clear and the results are consistent
> > > > with outputs of other software like R, I found the format a bit
> > > > inconvenient since I have to refer to the reference Table (table 1
> > > > in aforementioned link) to understand the summary statistics.
> > > >
> > > > I understand that the pure numeric matrix format is easy to use as
> > > > machine input for future steps. An additional table that is more
> > > > human readable would be nice since the main purpose of uni-variate
> > > > statistics is often exploratory data analysis and a clear summary is
> > > > essential.
> > > >
> > > > Suggestions to consider in the readable summary if there's not
> > > > already one:
> > > > 1. Order the rows according to variables (column 2) instead of
> > > > statistics type (column 1), so that summary statistics of the same
> > > > variable are grouped together.
> > > > 2. Use actual statistics labels ("min", "mean", "skewness" etc)
> > > > instead of IDs (1, 2, etc).
> > > > 3. Use actual predictor labels ("age", "gender", etc) instead of IDs
> > > > (1,2, etc).
> > > > 4. Use level labels for categorical predictors ("male", "female",
> > > > etc) instead of IDs (1,2, etc).
> > > > 5. Add counts of cases in each level for categorical variable in
> > > > addition to modes. This gives the distribution information of the
> > > > variable.
> > > > 6. If the amount of data in the summary is manageable perhaps
> > > > automatically pull the output of Univar-Stats.dml from HDFS to local
> > > > machine and display the readable version on terminal?
> > > >
> > > > So the output could look like:
> > > >
> > > > age min 10
> > > > age max 123
> > > > age range 113
> > > > age mean 60
> > > > ...
> > > > gender female.count 1000
> > > > gender male.count 2000
> > > > gender mode male
> > > > ...
> > > >
> > > > or even a table format like in R:
> > > >
> > > >            age          gender
> > > > min         10   female   1000
> > > > max        123   male     2000
> > > > range      113   mode     male
> > > > mean        60   ...
> > > > ...
> > > >
> > > > Thanks much,
> > > >
> > > > Ethan Xu
>
> --
> Yifan "Ethan" Xu, PhD
>
> Data Scientist / Statistician
> Explorys, IBM Watson Health
>
> Adjunct Faculty
> Department of Epidemiology and Biostatistics
> Case Western Reserve University
>
> --------------
> Email: [email protected]
> Phone: (607) 760-6817
> --------------
