Michael Lotkowski created SPARK-30024:
-----------------------------------------
Summary: Support subdirectories when accessing partitioned Parquet
Hive table
Key: SPARK-30024
URL: https://issues.apache.org/jira/browse/SPARK-30024
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.4.4
Reporter: Michael Lotkowski
Hi all,
We have run into issues when trying to read a partitioned Parquet table created
by Hive. I think I have narrowed down the cause to how
[InMemoryFileIndex|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L95]
creates a parent -> file mapping.
The folder structure created by Hive is as follows:
s3://bucket/table/date=2019-11-25/subdir1/data1.parquet
s3://bucket/table/date=2019-11-25/subdir2/data2.parquet
Looking through the code, it seems that InMemoryFileIndex builds a mapping from
each leaf directory to the files it contains, yielding the following:
val leafDirToChildrenFiles = Map(
  "s3://bucket/table/date=2019-11-25/subdir1" ->
    "s3://bucket/table/date=2019-11-25/subdir1/data1.parquet",
  "s3://bucket/table/date=2019-11-25/subdir2" ->
    "s3://bucket/table/date=2019-11-25/subdir2/data2.parquet"
)
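For context, the shape of that computation is roughly a group-by on the parent
path (a minimal sketch, not the actual InMemoryFileIndex code; leafFiles is a
hypothetical stand-in for the result of the S3 listing):

import org.apache.hadoop.fs.{FileStatus, Path}

// Simplified sketch: group every discovered leaf file by its immediate
// parent directory, which is why files under subdir1/subdir2 end up keyed
// by the subdirectory rather than by the partition directory.
def buildLeafDirMapping(leafFiles: Seq[FileStatus]): Map[Path, Seq[FileStatus]] =
  leafFiles.groupBy(_.getPath.getParent)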
This map is then used in
[PartitioningAwareFileIndex|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala#L83]
to prune partitions. From my understanding, pruning works by looking up the
partition path in leafDirToChildrenFiles; here the partition path is
s3://bucket/table/date=2019-11-25, which is not a key in the map, so no files
are found for this partition.
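To make the failure concrete, here is a small self-contained illustration of
that lookup, using the mapping from above:

// The map is keyed by the subdirectories, so looking up the partition
// directory itself finds nothing.
val leafDirToChildrenFiles = Map(
  "s3://bucket/table/date=2019-11-25/subdir1" ->
    "s3://bucket/table/date=2019-11-25/subdir1/data1.parquet",
  "s3://bucket/table/date=2019-11-25/subdir2" ->
    "s3://bucket/table/date=2019-11-25/subdir2/data2.parquet"
)
val partitionPath = "s3://bucket/table/date=2019-11-25"
println(leafDirToChildrenFiles.get(partitionPath)) // None: no files found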
My suggested fix is to change how InMemoryFileIndex builds the mapping: instead
of mapping each parent directory to its files, map each root path to its files.
More concretely:
[https://gist.github.com/lotkowskim/76e8ff265493efd0b2b7175446805a82]
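The gist has the concrete patch; the underlying idea is roughly the following
(a hypothetical sketch, assuming each leaf file can be attributed to the root
path it was listed under; listLeafFiles is a made-up helper, not a Spark API):

import org.apache.hadoop.fs.{FileStatus, Path}

// Sketch of the proposed change: key the mapping by the root path a file
// was discovered under, rather than by its immediate parent, so files
// nested in subdirectories are still attributed to their partition
// directory. listLeafFiles recursively lists all files under a path.
def buildRootPathMapping(
    rootPaths: Seq[Path],
    listLeafFiles: Path => Seq[FileStatus]): Map[Path, Seq[FileStatus]] =
  rootPaths.map(root => root -> listLeafFiles(root)).toMap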
I have tested this by updating the jar running on EMR, and we can now correctly
read the data from these partitioned tables. It's also worth noting that we can
read the data, without any modifications to the code, if we use the following
settings:
"spark.sql.hive.convertMetastoreParquet" to "false",
"spark.hive.mapred.supports.subdirectories" to "true",
"spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive" to "true"
However, with these settings we lose the ability to prune partitions, causing
us to read the entire table every time, since we are no longer using a Spark
relation.
I want to start a discussion on whether this is the correct change, or whether
we are missing something more obvious. In either case, I would be happy to
fully implement the change.
Thanks,
Michael