[ https://issues.apache.org/jira/browse/SPARK-15689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16052321#comment-16052321 ]
Russell Spitzer commented on SPARK-15689:
-----------------------------------------

I've been trying to make Catalyst aware of Cassandra partitioning, and there seem to be two major blockers.

The first is that DataSourceScanExec has no way to learn what the underlying partitioning should be from the BaseRelation it comes from. I'm currently able to get around this by running the DataSourceStrategy plan and then transforming the resulting DataSourceScanExec.

The second is that the Partitioning trait is sealed. I want to define a new partitioning which is clustered but not hash-based on certain columns. It would look almost identical to the HashPartitioning class, except that the expression which produces a valid partition id from the given expressions would be different.

So for V2 I would really like the ability to specify the physical partitioning, and also to be able to define new custom partitionings.

> Data source API v2
> ------------------
>
>                 Key: SPARK-15689
>                 URL: https://issues.apache.org/jira/browse/SPARK-15689
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Reynold Xin
>              Labels: releasenotes
>
> This ticket tracks progress in creating v2 of the data source API. This new
> API should focus on:
> 1. Have a small surface so it is easy to freeze and maintain compatibility
> for a long time. Ideally, this API should survive architectural rewrites and
> user-facing API revamps of Spark.
> 2. Have a well-defined column batch interface for high performance.
> Convenience methods should exist to convert row-oriented formats into column
> batches for data source developers.
> 3. Still support filter push down, similar to the existing API.
> 4. Nice-to-have: support additional common operators, including limit and
> sampling.
> Note that both 1 and 2 are problems that the current data source API (v1)
> suffers from.
> The current data source API has a wide surface and depends on
> DataFrame/SQLContext, making data source API compatibility dependent on
> the upper-level API. The current data source API is also row-oriented only
> and has to go through an expensive conversion from external data types to
> internal data types.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
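The custom partitioning described in the comment above can be sketched roughly as follows. This is an illustrative, non-compiling sketch only: the real Partitioning trait is sealed inside org.apache.spark.sql.catalyst.plans.physical (which is exactly the limitation being raised), and the names CassandraPartitioning and CassandraTokenId are hypothetical, not part of any real connector.

```scala
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.catalyst.plans.physical._

// Hypothetical sketch: cannot actually be compiled outside the Catalyst
// package because Partitioning is sealed. It shows the shape of a
// clustered-but-not-hash-based partitioning analogous to HashPartitioning.
case class CassandraPartitioning(
    expressions: Seq[Expression],  // the Cassandra partition key columns
    numPartitions: Int)            // e.g. the number of token ranges
  extends Partitioning {

  // Satisfies a ClusteredDistribution over the same columns, just as
  // HashPartitioning does, so Catalyst could elide a shuffle for
  // joins/aggregations keyed on the partition key.
  override def satisfies(required: Distribution): Boolean = required match {
    case UnspecifiedDistribution => true
    case ClusteredDistribution(clustering) =>
      expressions.forall(e => clustering.exists(_.semanticEquals(e)))
    case _ => false
  }

  // Where HashPartitioning's partitionIdExpression is essentially
  // Pmod(Murmur3Hash(expressions), numPartitions), a Cassandra-aware
  // variant would map the token of the partition key to a partition id.
  // CassandraTokenId is a made-up expression for illustration.
  def partitionIdExpression: Expression =
    CassandraTokenId(expressions, numPartitions)
}
```

Because the trait is sealed, none of this is possible today without patching Spark itself, which is why the comment asks for V2 to expose both physical partitioning reporting and user-defined partitionings.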
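For context on goal 3 ("still support filter push down, similar to the existing API"), the v1 mechanism is the PrunedFilteredScan trait in org.apache.spark.sql.sources. A minimal sketch of a relation using it is below; ExampleRelation and the elided scan body are hypothetical, only the trait and its buildScan signature come from the v1 API.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, EqualTo, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types._

// Hypothetical relation for illustration; not a real connector.
class ExampleRelation(override val sqlContext: SQLContext)
  extends BaseRelation with PrunedFilteredScan {

  override def schema: StructType = StructType(Seq(
    StructField("id", IntegerType),
    StructField("name", StringType)))

  // Spark hands the source the required columns and candidate filters.
  // The source may push down any subset it understands; Spark re-applies
  // every filter afterwards anyway, so ignoring some is safe.
  override def buildScan(requiredColumns: Array[String],
                         filters: Array[Filter]): RDD[Row] = {
    val pushed = filters.collect { case f @ EqualTo(_, _) => f }
    // ... build an RDD[Row] reading only `requiredColumns`,
    // applying `pushed` at the source ...
    ???
  }
}
```

Note that this interface also illustrates goals 1 and 2 of the ticket: it returns row-oriented RDD[Row] in external data types (hence the expensive conversion mentioned above) and ties the source to SQLContext.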