[
https://issues.apache.org/jira/browse/SPARK-15890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon resolved SPARK-15890.
----------------------------------
Resolution: Incomplete
> Support Stata-like tabulation of values in a single column, optionally with
> weights
> -----------------------------------------------------------------------------------
>
> Key: SPARK-15890
> URL: https://issues.apache.org/jira/browse/SPARK-15890
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Reporter: Shafique Jamal
> Priority: Minor
> Labels: bulk-closed
>
> In Stata, one can tabulate the values in a single column of a dataset and
> optionally provide weights. For example, if your data looks like this:
>      +-----------------+
>      | id   gender   w |
>      |-----------------|
>   1. |  1        M   2 |
>   2. |  2        M   4 |
>   3. |  3        M   1 |
>   4. |  4        F   1 |
>   5. |  5        F   3 |
>      +-----------------+
> (where w is weight), you can tabulate the values of gender and get this
> result:
> . tab gender
>      gender |      Freq.     Percent        Cum.
> ------------+-----------------------------------
>           F |          2       40.00       40.00
>           M |          3       60.00      100.00
> ------------+-----------------------------------
>       Total |          5      100.00
> You can apply weights to this tabulation as follows:
> . tab gender [aw=w]
>      gender |      Freq.     Percent        Cum.
> ------------+-----------------------------------
>           F | 1.81818182       36.36       36.36
>           M | 3.18181818       63.64      100.00
> ------------+-----------------------------------
>       Total |          5      100.00
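For readers unfamiliar with Stata's analytic weights: the weighted frequencies
above are consistent with rescaling the weights so that they sum to the number
of observations (5). F has weights 1 + 3 = 4 and M has 2 + 4 + 1 = 7, out of a
total of 11, so F is 4/11 * 5 (about 1.818) and M is 7/11 * 5 (about 3.182).
A minimal plain-Scala sketch of that arithmetic follows. The names WeightedTab,
Row, and weightedTab are hypothetical and exist only for illustration; this is
not how Stata or the linked commit is implemented.

```scala
// Plain-Scala sketch (no Spark) of the weighted tabulation arithmetic.
// WeightedTab, Row and weightedTab are hypothetical names for illustration.
object WeightedTab {
  final case class Row(value: String, freq: Double, proportion: Double)

  // obs: one (category, weight) pair per observation.
  def weightedTab(obs: Seq[(String, Double)]): Seq[Row] = {
    val n = obs.size.toDouble            // number of observations
    val totalWeight = obs.map(_._2).sum  // sum of all weights
    obs.groupBy(_._1).toSeq.sortBy(_._1).map { case (value, rows) =>
      // Share of the total weight held by this category.
      val proportion = rows.map(_._2).sum / totalWeight
      // Rescale so that the frequencies sum to n.
      Row(value, proportion * n, proportion)
    }
  }

  def main(args: Array[String]): Unit = {
    val data = Seq("M" -> 2.0, "M" -> 4.0, "M" -> 1.0, "F" -> 1.0, "F" -> 3.0)
    weightedTab(data).foreach(println)
  }
}
```

Passing a weight of 1.0 for every observation reduces this to the unweighted
table, which is why a single code path can serve both cases.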
> I would like to have the same capability with Spark dataframes. Here is what
> I have done:
> https://github.com/shafiquejamal/spark/commit/24ed3151db1ed2188ad67b2b5ccbf2883adf7af2
> This allows me to do the following:
> // toDF needs `import spark.implicits._` (already in scope in spark-shell).
> val obs1 = ("1", "M", 10, "P", 2d)
> val obs2 = ("2", "M", 12, "S", 4d)
> val obs3 = ("3", "M", 13, "B", 1d)
> val obs4 = ("4", "F", 11, "P", 1d)
> val obs5 = ("5", "F", 13, "M", 3d)
> val df = Seq(obs1, obs2, obs3, obs4, obs5)
>   .toDF("id", "gender", "age", "educ", "w")
> // `tab` comes from the commit linked above; it is not in upstream Spark.
> val tabWithoutWeights = df.stat.tab("gender")
> val tabWithWeights = df.stat.tab("gender", "w")
> tabWithoutWeights.show()
> tabWithWeights.show()
> This yields the following:
> +------+-------------+---------+----------+
> |gender|count(gender)|Frequency|Proportion|
> +------+-------------+---------+----------+
> |     F|            2|      2.0|       0.4|
> |     M|            3|      3.0|       0.6|
> +------+-------------+---------+----------+
> +------+-------------+------------------+-------------------+
> |gender|count(gender)|         Frequency|         Proportion|
> +------+-------------+------------------+-------------------+
> |     F|            2|1.8181818181818181|0.36363636363636365|
> |     M|            3|3.1818181818181817| 0.6363636363636364|
> +------+-------------+------------------+-------------------+
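Since the issue was closed without the patch being merged, here is a rough
sketch of how a comparable table might be produced with the stock DataFrame
API (groupBy, agg, count, sum). The TabSketch object and its tab helper are
hypothetical names, not Spark API, and this is not the code from the linked
commit. It assumes a non-empty DataFrame with non-null weights, casts the
weight column to double, and drops rows where the tabulated column is null.
It also runs a separate job to collect the totals to the driver, which a more
careful implementation might avoid.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, count, lit, sum}

// Hypothetical helper for illustration; not part of Spark.
object TabSketch {
  def tab(df: DataFrame, column: String, weight: Option[String] = None): DataFrame = {
    // Unweighted tabulation is the special case where every weight is 1.0.
    val w = weight.map(c => col(c).cast("double")).getOrElse(lit(1.0))
    val nonNull = df.filter(col(column).isNotNull)

    // Number of observations and total weight (assumes at least one row).
    val totals = nonNull.agg(count(lit(1)), sum(w)).head()
    val n = totals.getLong(0).toDouble
    val totalWeight = totals.getDouble(1)

    nonNull.groupBy(column)
      .agg(count(col(column)), sum(w).as("weightSum"))
      .withColumn("Frequency", col("weightSum") / totalWeight * n)
      .withColumn("Proportion", col("weightSum") / totalWeight)
      .drop("weightSum")
      .orderBy(column)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("tab-sketch").getOrCreate()
    import spark.implicits._

    val df = Seq(
      ("1", "M", 2d), ("2", "M", 4d), ("3", "M", 1d),
      ("4", "F", 1d), ("5", "F", 3d)
    ).toDF("id", "gender", "w")

    tab(df, "gender").show()
    tab(df, "gender", Some("w")).show()
    spark.stop()
  }
}
```

Keeping the helper outside DataFrameStatFunctions means it works against an
unmodified Spark build, at the cost of the df.stat.tab(...) call syntax shown
in the issue.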