[jira] [Commented] (SPARK-19351) Support for obtaining file splits from underlying InputFormat

Reynold Xin (JIRA) Fri, 24 Feb 2017 09:46:18 -0800

    [ 
https://issues.apache.org/jira/browse/SPARK-19351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15883177#comment-15883177
 ]


Reynold Xin commented on SPARK-19351:
-------------------------------------

Approach 1 should be supported today. I actually think our data source API 
should support approach 2 as well in the future, so we can leave the ticket 
open for that.


> Support for obtaining file splits from underlying InputFormat
> -------------------------------------------------------------
>
>                 Key: SPARK-19351
>                 URL: https://issues.apache.org/jira/browse/SPARK-19351
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Vinoth Chandar
>
> This is a request for a feature that enables SparkSQL to obtain the files 
> for a Hive partition by calling inputFormat.getSplits(), as opposed to 
> listing files directly, while still using Spark's optimized Parquet readers 
> for the actual IO. (Note that the difference between this and falling back 
> entirely to Hive via spark.sql.hive.convertMetastoreParquet=false is that we 
> get to realize benefits such as the new Parquet reader, schema merging, etc. 
> in SparkSQL.)
> Some background and context, using our use case at Uber: we have Hive tables 
> where each partition contains versioned files (whenever records in a file 
> change, we produce a new version, to speed up database ingestion), and such 
> tables are registered with a custom InputFormat that filters out old 
> versions and returns only the latest version of each file to the query. 
> We have had this working for 5 months now across Hive/Spark/Presto, as follows:
> - Hive: Works out of the box by calling inputFormat.getSplits, so we are 
> good there.
> - Presto: We made a fix in Presto, similar to what's proposed here.
> - Spark: We set convertMetastoreParquet=false. Perf is actually comparable 
> for our use cases, but we run into schema merging issues now and then.
> We have explored a few approaches here and would like to get more feedback 
> from you all before we go further.
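
For illustration only, the "return just the latest version of each file" 
behaviour described in the quoted issue could be sketched as below. The 
issue does not say how versions are encoded, so the 
<file_id>_<version>.parquet naming scheme and the latest_versions helper 
are hypothetical; this is not the reporter's actual InputFormat.

```python
import posixpath
import re
from typing import Dict, Iterable, List, Tuple

# ASSUMPTION: files are named <file_id>_<version>.parquet. The issue does not
# state how versions are encoded; this scheme is purely illustrative.
_VERSIONED_NAME = re.compile(r"^(?P<file_id>.+)_(?P<version>\d+)\.parquet$")


def latest_versions(paths: Iterable[str]) -> List[str]:
    """Return, per (directory, file_id), only the path with the highest version."""
    latest: Dict[Tuple[str, str], Tuple[int, str]] = {}
    for path in paths:
        directory, name = posixpath.split(path)
        match = _VERSIONED_NAME.match(name)
        if match is None:
            # Skip files that do not follow the assumed scheme (e.g. _SUCCESS).
            continue
        key = (directory, match.group("file_id"))
        version = int(match.group("version"))
        if key not in latest or version > latest[key][0]:
            latest[key] = (version, path)
    return sorted(path for _, path in latest.values())
```

Listing a partition directory directly would return every version, while a 
filter like this one (applied inside getSplits() in the setup the issue 
describes) hands only the surviving files to the reader.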



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
