cen yuhai created SPARK-16591:
---------------------------------
Summary: HadoopFsRelation will list and cache all parquet file paths
Key: SPARK-16591
URL: https://issues.apache.org/jira/browse/SPARK-16591
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.6.2, 1.6.1, 1.6.0
Reporter: cen yuhai
HadoopFsRelation has a fileStatusCache that lists all paths and then caches
every file status, whether or not you specify partition columns. This can
cause an OOM when reading a Parquet table.
In the HiveMetastoreCatalog file, Spark converts a MetaStoreRelation to a
ParquetRelation by calling the convertToParquetRelation method.
That method calls metastoreRelation.getHiveQlPartitions() to request all
partitions from the Hive metastore service, without any filters, and then
passes all partition paths to the ParquetRelation's paths member.
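To make the difference concrete, here is a minimal, self-contained Scala
sketch that contrasts handing on every partition path with pruning first. It
is a toy model, not Spark or Hive code: Partition, allPartitionPaths,
prunedPartitionPaths and the /warehouse/t/day=N layout are hypothetical names
invented for illustration.

```scala
// Toy model only: none of these types are Spark or Hive classes.
final case class Partition(spec: Map[String, String], path: String)

object PartitionPathsSketch {
  // Mirrors the behaviour described above: every partition path is returned,
  // regardless of what the query filters on.
  def allPartitionPaths(partitions: Seq[Partition]): Seq[String] =
    partitions.map(_.path)

  // Hypothetical alternative: apply the partition predicate first, so only
  // the matching paths are handed on.
  def prunedPartitionPaths(
      partitions: Seq[Partition],
      predicate: Map[String, String] => Boolean): Seq[String] =
    partitions.filter(p => predicate(p.spec)).map(_.path)

  // 365 hypothetical daily partitions of one table.
  val partitions: Seq[Partition] = (1 to 365).map { d =>
    Partition(Map("day" -> d.toString), s"/warehouse/t/day=$d")
  }
}
```

With a predicate such as _.get("day").contains("7"), the pruned variant
yields a single path, while the unfiltered variant always yields all 365.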
FileStatusCache's refresh method then lists all of those paths:
"val files = listLeafFiles(paths)"
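The effect of such an eager refresh can be sketched with a small stand-in
cache. This is an illustration under stated assumptions, not the Spark
implementation: ToyFileStatusCache, FileEntry, listDir and the demo helper
are hypothetical, and the listing here is flat rather than recursive.

```scala
import scala.collection.mutable

// Toy model only: a stand-in for a file status, not Hadoop's FileStatus.
final case class FileEntry(path: String, sizeBytes: Long)

// listDir is a hypothetical stand-in for a file system listing call.
final class ToyFileStatusCache(
    paths: Seq[String],
    listDir: String => Seq[FileEntry]) {

  private val cache = mutable.LinkedHashMap.empty[String, FileEntry]

  // Eagerly lists every path it was given and keeps every result,
  // so the cache grows with the total number of files in the table.
  def refresh(): Unit = {
    cache.clear()
    val files = paths.flatMap(listDir)
    files.foreach(f => cache.update(f.path, f))
  }

  def size: Int = cache.size
}

object ToyFileStatusCacheDemo {
  // Builds a cache over numPartitions directories with filesPerPartition
  // files each; all names and sizes are made up.
  def build(numPartitions: Int, filesPerPartition: Int): ToyFileStatusCache = {
    val paths = (1 to numPartitions).map(d => s"/warehouse/t/day=$d")
    val listDir: String => Seq[FileEntry] = dir =>
      (0 until filesPerPartition).map(i => FileEntry(s"$dir/part-$i.parquet", 1024L))
    new ToyFileStatusCache(paths, listDir)
  }
}
```

In this model, a table with 1000 partitions of 10 files each leaves 10000
entries in the cache after refresh(), even if a query only needs one
partition.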
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)