[ https://issues.apache.org/jira/browse/SPARK-15890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-15890.
----------------------------------
    Resolution: Incomplete

> Support Stata-like tabulation of values in a single column, optionally with 
> weights
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-15890
>                 URL: https://issues.apache.org/jira/browse/SPARK-15890
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Shafique Jamal
>            Priority: Minor
>              Labels: bulk-closed
>
> In Stata, one can tabulate the values in a single column of a dataset and 
> optionally provide weights. For example, if your data looks like this:
>      +-----------------+
>      | id   gender   w |
>      |-----------------|
>   1. |  1        M   2 |
>   2. |  2        M   4 |
>   3. |  3        M   1 |
>   4. |  4        F   1 |
>   5. |  5        F   3 |
>      +-----------------+
> (where w is weight), you can tabulate the values of gender and get this 
> result:
> . tab gender
>      gender |      Freq.     Percent        Cum.
> ------------+-----------------------------------
>           F |          2       40.00       40.00
>           M |          3       60.00      100.00
> ------------+-----------------------------------
>       Total |          5      100.00
> You can apply weights to this tabulation as follows:
> . tab gender [aw=w]
>      gender |      Freq.     Percent        Cum.
> ------------+-----------------------------------
>           F | 1.81818182       36.36       36.36
>           M | 3.18181818       63.64      100.00
> ------------+-----------------------------------
>       Total |          5      100.00
> I would like to have the same capability with Spark DataFrames. Here is what 
> I have done:
> https://github.com/shafiquejamal/spark/commit/24ed3151db1ed2188ad67b2b5ccbf2883adf7af2
> This allows me to do the following:
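>     // toDF needs the session implicits (assumes a SparkSession named spark, as in spark-shell)
>     import spark.implicits._
>     // Each tuple is (id, gender, age, educ, w), where w is the weight
>     val obs1 = ("1", "M", 10, "P", 2d)
>     val obs2 = ("2", "M", 12, "S", 4d)
>     val obs3 = ("3", "M", 13, "B", 1d)
>     val obs4 = ("4", "F", 11, "P", 1d)
>     val obs5 = ("5", "F", 13, "M", 3d)
>     val df = Seq(obs1, obs2, obs3, obs4, obs5)
>       .toDF("id", "gender", "age", "educ", "w")
>     // stat.tab is the method added by the commit linked above
>     val tabWithoutWeights = df.stat.tab("gender")
>     val tabWithWeights = df.stat.tab("gender", "w")
>     tabWithoutWeights.show()
>     tabWithWeights.show()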
> This yields the following:
> +------+-------------+---------+----------+
> |gender|count(gender)|Frequency|Proportion|
> +------+-------------+---------+----------+
> |     F|            2|      2.0|       0.4|
> |     M|            3|      3.0|       0.6|
> +------+-------------+---------+----------+
> +------+-------------+------------------+-------------------+
> |gender|count(gender)|         Frequency|         Proportion|
> +------+-------------+------------------+-------------------+
> |     F|            2|1.8181818181818181|0.36363636363636365|
> |     M|            3|3.1818181818181817| 0.6363636363636364|
> +------+-------------+------------------+-------------------+
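
The Frequency and Proportion figures quoted above are consistent with rescaling the weights so that they sum to the number of observations: the weights total 11, F carries 1 + 3 = 4 of that, so its proportion is 4/11 and its frequency is 5 * 4/11. The sketch below reproduces that arithmetic with plain Scala collections, so it runs without Spark. It is only an illustration and not the implementation in the linked commit; the names TabRow, WeightedTab and WeightedTabDemo are hypothetical.

```scala
// Plain-Scala sketch of the weighted tabulation arithmetic.
// Assumption: weights are rescaled to sum to the number of observations.
final case class TabRow(value: String, count: Long, frequency: Double, proportion: Double)

object WeightedTab {
  // obs holds (value, weight) pairs. Use weight 1.0 everywhere for an
  // unweighted tabulation.
  def tab(obs: Seq[(String, Double)]): Seq[TabRow] = {
    val n = obs.size.toDouble            // number of observations
    val totalWeight = obs.map(_._2).sum  // sum of all weights
    obs.groupBy(_._1).toSeq.sortBy(_._1).map { case (value, rows) =>
      // Share of the total weight carried by this value
      val proportion = rows.map(_._2).sum / totalWeight
      // Frequency is the proportion scaled back up to the observation count
      TabRow(value, rows.size.toLong, proportion * n, proportion)
    }
  }
}

object WeightedTabDemo {
  def main(args: Array[String]): Unit = {
    // gender and w columns from the example data in the issue
    val obs = Seq(("M", 2.0), ("M", 4.0), ("M", 1.0), ("F", 1.0), ("F", 3.0))
    WeightedTab.tab(obs).foreach(println)
  }
}
```

Passing a weight of 1.0 for every row gives the unweighted case, where the frequency equals the raw count. A DataFrame version would presumably do the same with a grouped sum of the weight column divided by the overall sum.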


