[ https://issues.apache.org/jira/browse/SPARK-15689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16234640#comment-16234640 ]

Russell Spitzer edited comment on SPARK-15689 at 11/1/17 7:58 PM:
------------------------------------------------------------------

Something I just noticed: it may be helpful to also pass the "required columns" 
to supportPushdownFilter. This could enable systems that can quickly estimate 
counts (without actually materializing records) to respond to certain filters.

In my example system, I can pick out a small number of records based on an 
index very quickly, but as the matching records grow as a proportion of the 
data, the usefulness of the pushdown decreases and it eventually becomes a 
hindrance to performance. With counts in particular it is almost always 
beneficial to use the index, since no rows are returned, but I cannot tell 
whether a count is being performed from the base supportsPushdown add-in, 
since I cannot tell which columns are being requested. 
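
A rough sketch of the idea (the trait SupportsPushDownFiltersWithColumns, the 
IndexedSource class, and the selectiveEnough cost check below are all 
hypothetical, not the actual API):

{code:scala}
import org.apache.spark.sql.sources.Filter
import org.apache.spark.sql.types.StructType

// Hypothetical variant of the pushdown hook: the source sees the required
// columns alongside the filters when deciding which filters to accept.
trait SupportsPushDownFiltersWithColumns {
  // Returns the filters the source could NOT handle; Spark re-applies those.
  def pushFilters(filters: Array[Filter],
                  requiredColumns: StructType): Array[Filter]
}

// Example: a source backed by a secondary index that answers counts cheaply
// but materializes rows slowly for low-selectivity predicates.
class IndexedSource extends SupportsPushDownFiltersWithColumns {
  private def selectiveEnough(f: Filter): Boolean = ??? // cost-estimate stub

  override def pushFilters(filters: Array[Filter],
                           requiredColumns: StructType): Array[Filter] = {
    // An empty required schema suggests a pure count, e.g. SELECT count(*):
    // the index can answer it without materializing any rows.
    if (requiredColumns.isEmpty) Array.empty
    else filters.filterNot(selectiveEnough) // push only the selective filters
  }
}
{code}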



> Data source API v2
> ------------------
>
>                 Key: SPARK-15689
>                 URL: https://issues.apache.org/jira/browse/SPARK-15689
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Reynold Xin
>            Assignee: Wenchen Fan
>            Priority: Major
>              Labels: SPIP, releasenotes
>         Attachments: SPIP Data Source API V2.pdf
>
>
> This ticket tracks progress in creating v2 of the data source API. The new 
> API should focus on:
> 1. Having a small surface, so it is easy to freeze and maintain 
> compatibility for a long time. Ideally, this API should survive 
> architectural rewrites and user-facing API revamps of Spark.
> 2. Having a well-defined column batch interface for high performance. 
> Convenience methods should exist for data source developers to convert 
> row-oriented formats into column batches.
> 3. Still supporting filter push down, similar to the existing API.
> 4. Nice-to-have: supporting additional common operators, including limit 
> and sampling.
> Note that both 1 and 2 are problems the current data source API (v1) 
> suffers from. The current data source API has a wide surface with a 
> dependency on DataFrame/SQLContext, making data source API compatibility 
> depend on the upper-level API. The current data source API is also 
> row-oriented only and has to go through an expensive conversion from 
> external data types to internal data types.
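
As a plain-Scala illustration of goals 1-3 (the names V2Reader, ColumnBatch, 
and the method names are invented for this sketch, not taken from the 
attached SPIP):

{code:scala}
import org.apache.spark.sql.sources.Filter
import org.apache.spark.sql.types.StructType

// Stand-in for a vectorized batch; a real design would pin down the
// column vector layout here.
trait ColumnBatch

// Goal 1: a small surface with no DataFrame/SQLContext dependency, so the
// contract can stay frozen across architectural rewrites.
trait V2Reader {
  // Goal 3: the source keeps the filters it can evaluate and returns the
  // remainder for Spark to re-apply.
  def pushFilters(filters: Array[Filter]): Array[Filter]

  // Column pruning keeps the scan narrow.
  def pruneColumns(required: StructType): Unit

  // Goal 2: data comes back as column batches rather than one
  // externally-typed row at a time.
  def readBatches(): Iterator[ColumnBatch]
}
{code}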



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to