We have run into issues when trying to read a partitioned Parquet table created
by Hive. I think I have narrowed down the cause to how
creates a parent -> file mapping.
The folder structure created by Hive is as follows:
Looking through the code, it seems that InMemoryFileIndex creates a mapping
of leaf files to their parent directories, which yields the following:
val leafDirToChildrenFiles = Map(
Which then in turn is used in
to prune the partitions. From my understanding, pruning works by looking up the
partition path in leafDirToChildrenFiles, which in this case is
s3://bucket/table/date=2019-11-25/, and it therefore fails to find any files for
that partition.
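The failed lookup can be reproduced with a plain Scala map. This is only a minimal sketch, not Spark's actual code: it assumes the data file sits one level below the partition directory, and the names "subdir" and "part-00000.parquet" are made up for illustration.

```scala
object LeafDirLookupSketch {
  // Partition directory taken from the example path in this message.
  val partitionDir = "s3://bucket/table/date=2019-11-25"

  // Hypothetical leaf file nested in a subdirectory of the partition.
  val leafFile = s"$partitionDir/subdir/part-00000.parquet"

  // Parent directory of a path: everything before the last '/'.
  def parentOf(path: String): String =
    path.substring(0, path.lastIndexOf('/'))

  // Mapping keyed by the immediate parent directory of each leaf file.
  val leafDirToChildrenFiles: Map[String, Seq[String]] =
    Seq(leafFile).groupBy(parentOf)

  // Lookup by partition directory: the only key is the subdirectory,
  // so no files are found for the partition.
  val filesForPartition: Seq[String] =
    leafDirToChildrenFiles.getOrElse(partitionDir, Seq.empty)

  def main(args: Array[String]): Unit = {
    println(s"keys: ${leafDirToChildrenFiles.keys.mkString(", ")}")
    println(s"files found for partition: ${filesForPartition.size}")
  }
}
```

The only key in the map is the nested subdirectory, so a lookup by the partition directory comes back empty.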
My suggested fix is to change how InMemoryFileIndex builds the mapping: instead
of a map from parent directory to file, it would build a map from rootPath to
file. More concretely
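As a rough illustration of the idea (not the actual patch to InMemoryFileIndex), the mapping could be keyed by root path rather than by immediate parent. The sketch assumes that rootPath is the partition directory handed to the index, and it reuses a hypothetical nested file name.

```scala
object RootPathMappingSketch {
  // Assumed input: the root paths the index was created with.
  val rootPaths: Seq[String] = Seq("s3://bucket/table/date=2019-11-25")

  // Hypothetical leaf file discovered below a root path, nested one level down.
  val leafFiles: Seq[String] =
    Seq("s3://bucket/table/date=2019-11-25/subdir/part-00000.parquet")

  // Group every leaf file under the root path that contains it,
  // regardless of how deeply it is nested.
  val rootPathToFiles: Map[String, Seq[String]] =
    rootPaths.map { root =>
      root -> leafFiles.filter(_.startsWith(root + "/"))
    }.toMap

  def main(args: Array[String]): Unit = {
    // A lookup by partition directory now finds the nested file.
    println(rootPathToFiles("s3://bucket/table/date=2019-11-25").size)
  }
}
```

With this shape, a lookup by partition directory succeeds even when Hive has written the files into subdirectories.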
I have tested this by updating the jar running on EMR, and we can now correctly
read the data from these partitioned tables. It's also worth noting that we can
read the data, without any modifications to the code, if we use the following
settings:
"spark.sql.hive.convertMetastoreParquet" to "false",
"spark.hive.mapred.supports.subdirectories" to "true",
"spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive" to "true"
However, with these settings we lose the ability to prune partitions, so we
read the entire table every time because we aren't using a Spark relation.
I want to start a discussion on whether this is the correct change, or whether
we are missing something more obvious. In either case, I would be happy to fully
implement the change.