Re: User friendly output of univariate statistics

Shirish Tatikonda Thu, 04 Feb 2016 20:18:42 -0800

You can accomplish that by writing the transpose of the stats matrix,
i.e., by changing write(baseStats, $STATS) to write(t(baseStats), $STATS).
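
To illustrate what the transpose does to an IJV listing, here is a minimal
Python sketch. It is not DML and not part of SystemML; the function name
transpose_ijv is made up for this example. It swaps the row and column index
of each "i j v" line and sorts explicitly by the new row index, on the
assumption that the order of cells in the written file should not be relied
on. The sample lines are taken from the output quoted further down.

```python
def transpose_ijv(lines):
    """Swap the row and column index of 'i j v' text lines, then sort.

    Values are kept as strings so that no precision is lost.
    """
    cells = []
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        i, j, v = int(parts[0]), int(parts[1]), parts[2]
        cells.append((j, i, v))  # transposed cell: variable first, statistic second
    # Explicit sort: group by variable, then by statistic ID.
    cells.sort(key=lambda c: (c[0], c[1]))
    return [f"{i} {j} {v}" for i, j, v in cells]


sample = ["1 1 10.0", "2 1 123.0", "2 7 469.0", "3 1 122.0", "3 7 419.0"]
for row in transpose_ijv(sample):
    print(row)
```

After the swap, all statistics for variable 1 come first, followed by those
for variable 7, which is the grouping asked for below.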




On Thu, Feb 4, 2016 at 7:33 PM, Ethan Xu <[email protected]> wrote:

> Thanks for the clarification, Shirish. Is the current 'ijv' format matrix of
> the Univar-Stats.dml output used in any other built-in script?
>
> If not, I'd like to suggest a small change besides (or without) the user
> friendly version that makes the output easier to read: switch 'i' and 'j' in
> the output. That is, order rows of the matrix according to variables
> (original j column) and then the statistics type (original i column). This
> way the info for one variable is grouped together.
>
> There might be situations where grouping by statistics type makes more
> sense, but I felt the other way is more commonly used.
>
> Ethan
>
> On Thu, Feb 4, 2016 at 10:10 PM, Shirish Tatikonda <
> [email protected]> wrote:
>
> > Just to clarify: the current output is actually a matrix, in which rows
> > denote stats and columns denote input variables. So, the output you see
> > is simply the univariate stats matrix in IJV format.
> > In the general case, the primary data type for input/output and
> > computations in SystemML is a *matrix* (of course, *scalar* as well) --
> > with the one exception of the *frame* type (which is used only in the
> > context of *transform*).
> >
> > I agree with you that providing user-friendly output, as in R, is very
> > useful for data scientists -- it does, however, require a lot of effort
> > to support such functionality.
> >
> > Shirish
> >
> > On Wed, Feb 3, 2016 at 9:09 PM, Ethan Xu <[email protected]> wrote:
> >
> > > Thank you Deron. From my personal experience, printing a single type of
> > > user-friendly result on the console is usually enough for a quick
> > > inspection. However, that's in an interactive environment (like an R
> > > interactive session), where recreating the printout is simple.
> > >
> > > Since calling a DML script on Hadoop might trigger a MapReduce job, maybe
> > > it's better to save the user-friendly version as a file too? Or perhaps
> > > it's helpful to have a script that takes the original summary (plus some
> > > metadata) as input and produces the user-friendly output?
> > >
> > > Best,
> > >
> > > Ethan
> > >
> > >
> > >
> > > From:   Deron Eriksson <[email protected]>
> > > To:     [email protected]
> > > Date:   02/03/2016 01:13 AM
> > > Subject:        Re: User friendly output of univariate statistics
> > >
> > >
> > >
> > > Hi Ethan,
> > >
> > > I think you make a great point with regards to the readability of the
> > > output from Univar-Stats.dml.
> > >
> > > Do you think outputting the user-friendly results in the format you
> > > describe to the console, while still writing the more mathematical
> > > results to a file, would be the type of behavior that you would find
> > > most useful? Or would you also like to see the user-friendly results
> > > sent to a file?
> > >
> > > Also, I was wondering, do you think a single user-friendly format is
> > > sufficient, or do you think that data scientists would like (or expect)
> > > to be able to have multiple formats such as you described?
> > >
> > > The table format is very interesting. Currently DML has a basic print
> > > statement, but I don't believe it can be used to format data into
> > > columns, such as in your table format example. It might be very nice to
> > > add a C-style "printf" statement, which would allow results to be
> > > written to the console in a more columnar format.
> > >
> > > Does anyone else have any thoughts?
> > >
> > > Deron
> > >
> > >
> > > On Tue, Feb 2, 2016 at 8:32 AM, Ethan Xu <[email protected]> wrote:
> > >
> > > > dml is quite amazing. I was wondering if there is a user friendly (more
> > > > human readable) version of outputs from Univar-Stats.dml? I ran
> > > > Univar-Stats.dml on my data set that contains 7 variables: two
> > > > continuous, one categorical. The output is a csv file on HDFS that
> > > > looks like this:
> > > >
> > > > 1 1 10.0
> > > > 2 1 123.0
> > > > 2 7 469.0
> > > > 3 1 122.0
> > > > 3 7 419.0
> > > > 4 1 34.852512104922082
> > > > 4 7 0.40786451178676335
> > > > 5 1 613.6600902369631
> > > > 5 7 1.5322171660886
> > > > 6 1 25.566777079580508
> > > > 6 7 5.54382044429201915
> > > > 7 1 0.219263232610989764
> > > > 7 7 12.14558700418414E-4
> > > > 8 1 0.5323447433694138
> > > > 8 7 1.23151883029726626
> > > > 9 1 0.28352047550156284
> > > > 9 7 23.25049533659206
> > > > 10 1 -0.5348573740280274
> > > > 10 7 2023.294658877635
> > > > 11 1 2.874872545380876E-4
> > > > 11 7 1.874872545380876E-4
> > > > 12 1 6.0017749742760714085
> > > > 12 7 0.00237749742760714085
> > > > 13 1 12.0
> > > > 14 1 30.56066514110724
> > > > 15 2 4.0
> > > > ---- truncated (numbers randomly modified)
> > > >
> > > > According to the documentation at
> > > > http://apache.github.io/incubator-systemml/algorithms-descriptive-statistics.html#univariate-statistics
> > > > the first column of the matrix represents the statistics type (minimum,
> > > > mean, etc.), the second column represents the variable ID, and the last
> > > > column gives the value of the statistic.
> > > >
> > > > While the documentation is very clear and the results are consistent
> > > > with outputs of other software like R, I found the format a bit
> > > > inconvenient, since I have to refer to the reference table (Table 1 in
> > > > the aforementioned link) to understand the summary statistics.
> > > >
> > > > I understand that the pure numeric matrix format is easy to use as
> > > > machine input for future steps. An additional table that is more human
> > > > readable would be nice, since the main purpose of univariate statistics
> > > > is often exploratory data analysis and a clear summary is essential.
> > > >
> > > > Suggestions to consider in the readable summary, if there's not already
> > > > one:
> > > > 1. Order the rows according to variables (column 2) instead of
> > > >    statistics type (column 1), so that summary statistics of the same
> > > >    variable are grouped together.
> > > > 2. Use actual statistics labels ("min", "mean", "skewness", etc.)
> > > >    instead of IDs (1, 2, etc.).
> > > > 3. Use actual predictor labels ("age", "gender", etc.) instead of IDs
> > > >    (1, 2, etc.).
> > > > 4. Use level labels for categorical predictors ("male", "female", etc.)
> > > >    instead of IDs (1, 2, etc.).
> > > > 5. Add counts of cases in each level for categorical variables, in
> > > >    addition to modes. This gives the distribution information of the
> > > >    variable.
> > > > 6. If the amount of data in the summary is manageable, perhaps
> > > >    automatically pull the output of Univar-Stats.dml from HDFS to the
> > > >    local machine and display the readable version on the terminal?
> > > >
> > > > So the output could look like:
> > > >
> > > > age min 10
> > > > age max 123
> > > > age range 113
> > > > age mean 60
> > > > ...
> > > > gender female.count 1000
> > > > gender male.count 2000
> > > > gender mode male
> > > > ...
> > > >
> > > > or even a table format like in R:
> > > >
> > > > age              gender
> > > > min    10        female 1000
> > > > max    123       male   2000
> > > > range  113       mode   male
> > > > mean   60        ...
> > > > ...
> > > >
> > > > Thanks much,
> > > >
> > > > Ethan Xu
> > > >
> > > >
> > >
> > >
> > >
> > >
> > >
> >
>
>
>
> --
> Yifan "Ethan" Xu, PhD
>
> Data Scientist / Statistician
> Explorys, IBM Watson Health
>
> Adjunct Faculty
> Department of Epidemiology and Biostatistics
> Case Western Reserve University
>
> --------------
> Email: [email protected]
> Phone: (607) 760-6817
> --------------
>
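
The first three suggestions in the original message quoted above (group by
variable, label the statistics, label the variables), together with the
columnar printing Deron mentions, can be sketched as a small post-processing
script in Python. This is only an illustration under stated assumptions: the
function readable_summary and the names STAT_LABELS and var_names are
hypothetical, the ID-to-label mapping covers just the first four statistics
as suggested by the example in the thread and should be checked against
Table 1 of the documentation, and the variable names must be supplied by the
user because the IJV file does not carry them.

```python
# Assumed mapping from statistic IDs to labels (first four rows only);
# verify against Table 1 of the documentation before relying on it.
STAT_LABELS = {1: "min", 2: "max", 3: "range", 4: "mean"}


def readable_summary(ijv_lines, var_names, stat_labels=STAT_LABELS):
    """Turn 'stat_id var_id value' lines into labeled, fixed-width rows."""
    rows = []
    for line in ijv_lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        stat_id, var_id, value = int(parts[0]), int(parts[1]), parts[2]
        rows.append((var_id, stat_id, value))
    # Group by variable first, then by statistic ID.
    rows.sort(key=lambda r: (r[0], r[1]))
    out = []
    for var_id, stat_id, value in rows:
        # Fall back to a generic label when an ID is not in the mapping.
        var = var_names.get(var_id, f"var_{var_id}")
        stat = stat_labels.get(stat_id, f"stat_{stat_id}")
        out.append(f"{var:<10}{stat:<10}{value:>12}")
    return out


# Hypothetical variable names, borrowed from the example in the thread.
names = {1: "age", 7: "gender"}
for row in readable_summary(["1 1 10.0", "2 1 123.0", "2 7 469.0"], names):
    print(row)
```

Values are passed through as strings, so nothing is rounded. Level labels and
per-level counts for categorical variables (suggestions 4 and 5) would need
extra metadata and are left out of this sketch.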
