Matei Zaharia created SPARK-3091:
------------------------------------
Summary: Add support for caching metadata on Parquet files
Key: SPARK-3091
URL: https://issues.apache.org/jira/browse/SPARK-3091
Project: Spark
Issue Type: New Feature
Components: SQL
Reporter: Matei Zaharia
Assignee: Matei Zaharia
For larger Parquet files, reading the file footers (done in parallel on up
to 5 threads) and fetching the HDFS block locations (done serially) can take
multiple seconds. ParquetInputFormat caches footers only within each
instance of ParquetInputFormat, not across instances, so the cached data is
lost whenever a new instance is created. We can add an option to cache this
data within FilteringParquetInputFormat instead.
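
As a rough illustration of caching across instances, the sketch below keeps
metadata in a map that outlives any single input format object. It is not the
Spark or Parquet implementation: the class name FileMetadataCache, the choice
of (path, modification time) as the key, and the loader callback are all
assumptions made for this example, and the cached value is left generic
rather than tied to a real footer or block-location type.

```java
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical helper for illustration; not part of Spark or Parquet. */
final class FileMetadataCache<V> {

    /** Key on path plus modification time so a rewritten file is reloaded. */
    private static final class Key {
        final String path;
        final long modificationTime;

        Key(String path, long modificationTime) {
            this.path = path;
            this.modificationTime = modificationTime;
        }

        @Override
        public boolean equals(Object other) {
            if (!(other instanceof Key)) {
                return false;
            }
            Key that = (Key) other;
            return modificationTime == that.modificationTime && path.equals(that.path);
        }

        @Override
        public int hashCode() {
            return Objects.hash(path, modificationTime);
        }
    }

    private final ConcurrentHashMap<Key, V> entries = new ConcurrentHashMap<>();

    /** Returns the cached value, invoking the loader only on a cache miss. */
    V get(String path, long modificationTime, Function<String, V> loader) {
        return entries.computeIfAbsent(
                new Key(path, modificationTime), key -> loader.apply(path));
    }

    /** Number of cached entries, exposed mainly for testing. */
    int size() {
        return entries.size();
    }
}
```

To share entries across input format instances, one such cache would be held
in a static field (or another process-wide holder) rather than in an instance
field. A real version would probably also need a size bound or expiry, which
this sketch omits.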