Li Jin created SPARK-22947:
------------------------------

             Summary: SPIP: as-of join in Spark SQL
                 Key: SPARK-22947
                 URL: https://issues.apache.org/jira/browse/SPARK-22947
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.2.1
            Reporter: Li Jin


h2. Background and Motivation
Time series analysis is one of the most common kinds of analysis on financial 
data, and as-of join is a very common operation within it. Supporting as-of 
join in Spark SQL will enable many time series analysis use cases on Spark 
SQL.

As-of join is a “join on time” with inexact time matching criteria. Several 
libraries implement as-of join or similar functionality:
* Kdb: https://code.kx.com/wiki/Reference/aj
* Pandas: http://pandas.pydata.org/pandas-docs/version/0.19.0/merging.html#merging-merge-asof
* R: this functionality is called “Last Observation Carried Forward”: https://www.rdocumentation.org/packages/zoo/versions/1.8-0/topics/na.locf
* JuliaDB: http://juliadb.org/latest/api/joins.html#IndexedTables.asofjoin
* Flint: https://github.com/twosigma/flint#temporal-join-functions
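
To make “inexact time matching” concrete, here is a minimal sketch of a backward as-of join in plain Python. It is only an illustration of the semantics, not the proposed Spark API: the function name {{asof_join}}, the {{tolerance}} parameter, and the sample {{trades}}/{{quotes}} rows are all hypothetical. Each left row is matched with the most recent right row whose time is less than or equal to the left row's time.

```python
from bisect import bisect_right

def asof_join(left, right, tolerance=None):
    """Backward as-of join over two lists of (time, value) rows.

    Both inputs must already be sorted by time. For each left row, attach the
    value of the latest right row with time <= left time, or None if there is
    no such row. `tolerance` (hypothetical parameter) bounds how far back a
    match may be.
    """
    right_times = [t for t, _ in right]
    joined = []
    for t, left_val in left:
        # Index of the last right row whose time is <= t (-1 if none).
        i = bisect_right(right_times, t) - 1
        match = None
        if i >= 0 and (tolerance is None or t - right_times[i] <= tolerance):
            match = right[i][1]
        joined.append((t, left_val, match))
    return joined

# Hypothetical sample data: times are plain integers.
trades = [(1, "t1"), (5, "t2"), (10, "t3")]
quotes = [(2, "q1"), (4, "q2"), (9, "q3")]
result = asof_join(trades, quotes)
```

The trade at time 1 has no earlier quote and gets no match, while the trades at times 5 and 10 pick up the quotes at times 4 and 9. An exact equi-join on time would match none of these rows.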

This proposal advocates introducing a new API in Spark SQL to support as-of join.

h2. Target Personas
Data scientists, data engineers

h2. Goals
* A new API in Spark SQL that supports as-of join
* As-of join of multiple tables (>2) should be performant, because users very 
commonly need to join multiple data sources together for further analysis.
* Define the distribution, partitioning, and shuffle strategy for ordered time 
series data
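
As an illustration of what a partitioning strategy for ordered time series data has to account for, the sketch below shows one possible approach; it is an assumption for discussion, not the design proposed in this SPIP. Rows are range-partitioned by time, and a right-side row is additionally copied into any later partition that could still look back to it, so that each partition can be joined locally. The function name {{partitions_for}}, the {{boundaries}} list, and the bounded {{lookback}} window are all hypothetical.

```python
from bisect import bisect_right

def partitions_for(t, boundaries, lookback=0):
    """Return the partition indices that need the row with time `t`.

    `boundaries` is a sorted list of split points: partition 0 covers times
    below boundaries[0], partition k covers [boundaries[k-1], boundaries[k]),
    and the last partition covers everything from boundaries[-1] onward.

    With lookback == 0 (left-side rows) a row lands in exactly one partition.
    With lookback > 0 (right-side rows) the row is also sent to every later
    partition containing a time in [t, t + lookback], since a left row there
    may match it.
    """
    first = bisect_right(boundaries, t)
    last = bisect_right(boundaries, t + lookback)
    return list(range(first, last + 1))

# Hypothetical split points giving three partitions: <10, [10, 20), >=20.
boundaries = [10, 20]
left_row_partitions = partitions_for(10, boundaries)               # left row at time 10
right_row_partitions = partitions_for(8, boundaries, lookback=2)   # right row at time 8
```

Here the right row at time 8 is sent to both partitions 0 and 1, because a left row at time 10 in partition 1 may look back up to 2 time units. The sketch assumes a bounded lookback; an unbounded as-of join would need a different way to carry the last row of each partition forward.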

h2. Non-Goals
These are out of scope for this SPIP and should be considered in future 
SPIPs as improvements to Spark’s time series analysis capabilities:
* Utilize partition information from the data source, i.e. the begin/end of 
each partition, to reduce sorting/shuffling
* Define an API for users to implement an as-of join time spec on a business 
calendar (e.g. look back one business day; this is very common in financial 
data analysis because of market calendars)
* Support broadcast join

h2. Proposed API Changes
See attachment



