[
https://issues.apache.org/jira/browse/SPARK-19351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15852907#comment-15852907
]
Vinoth Chandar commented on SPARK-19351:
----------------------------------------
[~rxin] Approach 1 seems to work based on the tests we have run so far (basic
filters, group-bys, three-table joins). Will do more testing next week. Do you
have insight into a) how path filters are used today and b) whether path
filters will continue to be supported going forward? (We might need to add a
unit test around this.)
Approach 2 is more direct, and I can work on a patch if that seems the better
way.
Without a deep understanding of Spark SQL, I am unable to judge the soundness
of these approaches, so I would appreciate some feedback/advice on them.
> Support for obtaining file splits from underlying InputFormat
> -------------------------------------------------------------
>
> Key: SPARK-19351
> URL: https://issues.apache.org/jira/browse/SPARK-19351
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Vinoth Chandar
>
> This is a feature request to let Spark SQL obtain the files for a Hive
> partition by calling inputFormat.getSplits(), as opposed to listing files
> directly, while still using Spark's optimized Parquet readers for the actual
> IO. (Note that the difference between this and falling back entirely to Hive
> via spark.sql.hive.convertMetastoreParquet=false is that we still get to
> realize benefits such as the new Parquet reader, schema merging, etc. in
> Spark SQL.)
> Some background on the context, using our use case at Uber. We have Hive
> tables where each partition contains versioned files (whenever records in a
> file change, we produce a new version, to speed up database ingestion), and
> such tables are registered with a custom InputFormat that filters out old
> versions and returns only the latest version of each file to the query.
> We have had this working for 5 months now across Hive/Spark/Presto, as
> follows:
> - Hive: Works out of the box, by calling inputFormat.getSplits, so we are
> good there.
> - Presto: We made the fix in Presto, similar to what's proposed here.
> - Spark: We set convertMetastoreParquet=false. Performance is actually
> comparable for our use cases, but we run into schema merging issues now and
> then.
> We have explored a few approaches here and would like to get more feedback
> from you all before we go further.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]