[ 
https://issues.apache.org/jira/browse/SPARK-18940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell updated SPARK-18940:
--------------------------------------
    Description: 
I have a frequency distribution table with the following entries:
{noformat}
Age, No of persons
21, 10
22, 15
23, 18 
..
..
30, 14
{noformat}

It is common to keep data in frequency distribution format and then compute
percentiles or the median from it. With the current implementation, finding a
percentile over such a table is difficult and complex.
I therefore propose enhancing the current Percentile and ApproxPercentile
implementations to take a frequency column into account.
Current Percentile definition 
{noformat}
percentile(col, array(percentage1 [, percentage2]...))
case class Percentile(
  child: Expression,
  percentageExpression: Expression,
  mutableAggBufferOffset: Int = 0,
  inputAggBufferOffset: Int = 0) {
  def this(child: Expression, percentageExpression: Expression) = {
    this(child, percentageExpression, 0, 0)
  }
}
{noformat}
Proposed changes 
{noformat}
percentile(col, [frequency], array(percentage1 [, percentage2]...))
case class Percentile(
  child: Expression,
  frequency: Expression,
  percentageExpression: Expression,
  mutableAggBufferOffset: Int = 0,
  inputAggBufferOffset: Int = 0) {
  def this(child: Expression, percentageExpression: Expression) = {
    this(child, Literal(1L), percentageExpression, 0, 0)
  }
  def this(
      child: Expression,
      frequency: Expression,
      percentageExpression: Expression) = {
    this(child, frequency, percentageExpression, 0, 0)
  }
}
{noformat}
Although this definition differs from the Hive implementation, it would be
useful functionality for many Spark users.
Moreover, the changes are local to the Percentile and ApproxPercentile
implementations.
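
To make the intended semantics concrete, here is a minimal sketch in plain
Scala (no Spark dependency) of an exact percentile computed directly from
(value, frequency) pairs. FrequencyPercentile and its percentile method are
hypothetical names used only for illustration, and the linear interpolation
between neighbouring ranks is an assumption about the desired behaviour, not a
description of the existing Percentile code.

```scala
// Hypothetical helper for illustration only; not part of Spark.
object FrequencyPercentile {

  /** Exact percentile of a frequency distribution given as (value, frequency) pairs.
    * `p` is a fraction in [0, 1]. Assumes linear interpolation between ranks. */
  def percentile(table: Seq[(Double, Long)], p: Double): Double = {
    require(p >= 0.0 && p <= 1.0, "percentage must be in [0, 1]")

    // Merge duplicate values, drop rows with a non-positive frequency, sort by value.
    val sorted = table
      .filter(_._2 > 0)
      .groupBy(_._1)
      .map { case (value, rows) => (value, rows.map(_._2).sum) }
      .toSeq
      .sortBy(_._1)
    require(sorted.nonEmpty, "need at least one row with a positive frequency")

    // cumulative(i) is the number of observations <= sorted(i)._1.
    val cumulative = sorted.scanLeft(0L)(_ + _._2).tail
    val total = cumulative.last

    // Value found at a 0-based rank of the (never materialised) expanded data set.
    def valueAt(rank: Long): Double =
      sorted(cumulative.indexWhere(_ > rank))._1

    val position = p * (total - 1)
    val lower = math.floor(position).toLong
    val higher = math.ceil(position).toLong
    if (lower == higher) valueAt(lower)
    else (higher - position) * valueAt(lower) + (position - lower) * valueAt(higher)
  }
}
```

With only the three complete rows shown in the table above (ages 21, 22, 23
with counts 10, 15, 18), FrequencyPercentile.percentile(rows, 0.5) returns
22.0, which matches the median of the 43 rows obtained by repeating each age
as many times as its count. The point of the frequency argument is that this
expansion never has to be built.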



> Percentile and approximate percentile support for frequency distribution table
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-18940
>                 URL: https://issues.apache.org/jira/browse/SPARK-18940
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.2
>            Reporter: gagan taneja


