[ https://issues.apache.org/jira/browse/SPARK-19351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15883177#comment-15883177 ]
Reynold Xin commented on SPARK-19351:
-------------------------------------
Approach 1 should be supported today. I actually think our data source API
should support approach 2 as well in the future, so we can leave the ticket
open for that.
> Support for obtaining file splits from underlying InputFormat
> -------------------------------------------------------------
>
> Key: SPARK-19351
> URL: https://issues.apache.org/jira/browse/SPARK-19351
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Vinoth Chandar
>
> This is a request for a feature that enables SparkSQL to obtain the files
> for a Hive partition by calling inputFormat.getSplits(), as opposed to
> listing files directly, while still using Spark's optimized Parquet readers
> for actual IO. (Note that the difference between this and falling back
> entirely to Hive via spark.sql.hive.convertMetastoreParquet=false is that we
> get to realize benefits such as the new Parquet reader, schema merging, etc.
> in SparkSQL.)
> Some background on the context, using our use case at Uber. We have Hive
> tables where each partition contains versioned files (whenever records in a
> file change, we produce a new version, to speed up database ingestion), and
> such tables are registered with a custom InputFormat that filters out old
> versions and returns only the latest version of each file to the query.
> We have had this working for 5 months now across Hive/Spark/Presto, as
> follows:
> - Hive: Works out of the box by calling inputFormat.getSplits(), so we are
> good there.
> - Presto: We made the fix in Presto, similar to what's proposed here.
> - Spark: We set convertMetastoreParquet=false. Perf is actually comparable
> for our use cases, but we run into schema merging issues now and then.
> We have explored a few approaches here and would like to get more feedback
> from you all before we go further.
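
For illustration only, the "return just the latest version of each file"
behavior described in the quoted issue can be sketched as below. This is not
the reporter's InputFormat. The naming scheme "<fileId>_<version>.parquet" and
the class and method names are hypothetical assumptions, and the sketch works
on a plain list of file names rather than on Hadoop splits.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Assumption (not from the ticket): files are named "<fileId>_<version>.parquet",
// e.g. "a_3.parquet", where <version> is a non-negative integer.
final class LatestVersionFilter {

    // Returns one file name per fileId, keeping the highest version seen.
    // Names that do not match the assumed scheme are skipped.
    static List<String> latestVersions(List<String> fileNames) {
        Map<String, String> latestNameById = new TreeMap<>();
        Map<String, Long> latestVersionById = new TreeMap<>();
        for (String name : fileNames) {
            int sep = name.lastIndexOf('_');
            int dot = name.lastIndexOf('.');
            if (sep < 0 || dot < sep) {
                continue; // no "_<version>." part
            }
            long version;
            try {
                version = Long.parseLong(name.substring(sep + 1, dot));
            } catch (NumberFormatException e) {
                continue; // version part is not a number
            }
            String fileId = name.substring(0, sep);
            Long seen = latestVersionById.get(fileId);
            if (seen == null || version > seen) {
                latestVersionById.put(fileId, version);
                latestNameById.put(fileId, name);
            }
        }
        // TreeMap iteration gives results ordered by fileId.
        return new ArrayList<>(latestNameById.values());
    }

    public static void main(String[] args) {
        List<String> files = List.of(
                "a_1.parquet", "a_3.parquet", "b_2.parquet", "a_2.parquet");
        System.out.println(latestVersions(files)); // [a_3.parquet, b_2.parquet]
    }
}
```

A custom InputFormat would presumably apply a filter of this kind while
computing its splits, which is why listing the partition directory directly
returns stale versions as well.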
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)