[ 
https://issues.apache.org/jira/browse/SPARK-25841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-25841:
--------------------------------
    Description: 
As I was reviewing the Spark API changes for 2.4, I found that the current
Scala API for window functions has evolved organically and ad hoc, and is now
pretty bad.
  
 To illustrate the problem, we have two rangeBetween functions in the Window
class:
  
class Window {
  def unboundedPreceding: Long
  ...
  def rangeBetween(start: Long, end: Long): WindowSpec
  def rangeBetween(start: Column, end: Column): WindowSpec
}
 
 The Column version of rangeBetween was added in Spark 2.3 because the previous
version (Long) could only express integral offsets, not time intervals. Then,
in order to support specifying unboundedPreceding in the rangeBetween(Column,
Column) API, we added an unboundedPreceding function that returns a Column to
functions.scala.
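
From a user's point of view, the overlap looks roughly like this (a minimal
sketch against the 2.3/2.4 API as I read it; the call sites are illustrative):

  import org.apache.spark.sql.expressions.Window
  import org.apache.spark.sql.functions

  // Same name, different class, different return type:
  val longBoundary: Long = Window.unboundedPreceding    // for rangeBetween(Long, Long)
  val colBoundary = functions.unboundedPreceding()      // for rangeBetween(Column, Column)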
  
 There are a few issues I have with the API:
  
 1. To the end user, this is simply confusing. Why are there two
unboundedPreceding functions, in different classes, with the same name but
different return types?
  
 2. Using Column in the parameter signature implies the argument can be an
arbitrary Column, but in practice rangeBetween only accepts literal values.
  
 3. We added the new APIs to support intervals, but they do not actually work:
the implementation tries to validate that the start is less than the end, but
calendar interval types are not comparable, so the check fails at runtime with
a type mismatch: scala.MatchError: CalendarIntervalType (of class
org.apache.spark.sql.types.CalendarIntervalType$)
  
 4. To make intervals work at all, users must construct a CalendarInterval,
which is an internal class with no documentation and no stable API. (A sketch
of the failing usage follows this list.)
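
Concretely, here is roughly what a user has to write today, and where it
breaks. This is a minimal sketch, assuming a SparkSession and a DataFrame df
with columns "ts" and "value"; the CalendarInterval constructor follows the
2.x (months, microseconds) signature, and the exact failure point is as I
understand the frame validation:

  import org.apache.spark.sql.expressions.Window
  import org.apache.spark.sql.functions._
  import org.apache.spark.unsafe.types.CalendarInterval  // internal, undocumented

  // A frame from one hour before the current row's ts up to the current row.
  // Negative microseconds express a preceding bound.
  val minusOneHour = new CalendarInterval(0, -3600L * 1000 * 1000)

  val spec = Window
    .orderBy(col("ts"))
    .rangeBetween(lit(minusOneHour), currentRow())

  // Passing a real column compiles, because the signature says Column, but it
  // is rejected since only literal boundaries are supported (issue 2 above):
  //   Window.orderBy(col("ts")).rangeBetween(col("start"), col("end"))

  // And the literal-interval version fails once the frame is validated,
  // because CalendarIntervalType is not comparable (issue 3 above):
  //   scala.MatchError: CalendarIntervalType
  //     (of class org.apache.spark.sql.types.CalendarIntervalType$)
  val out = df.withColumn("hourly_sum", sum(col("value")).over(spec))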
  
  

> Redesign window function rangeBetween API
> -----------------------------------------
>
>                 Key: SPARK-25841
>                 URL: https://issues.apache.org/jira/browse/SPARK-25841
>             Project: Spark
>          Issue Type: Umbrella
>          Components: SQL
>    Affects Versions: 2.3.2, 2.4.0
>            Reporter: Reynold Xin
>            Assignee: Reynold Xin
>            Priority: Major