cen yuhai created SPARK-16591:
---------------------------------

             Summary: HadoopFsRelation will list and cache all parquet file paths
                 Key: SPARK-16591
                 URL: https://issues.apache.org/jira/browse/SPARK-16591
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.6.2, 1.6.1, 1.6.0
            Reporter: cen yuhai


HadoopFsRelation has a fileStatusCache which lists all paths and then caches
every FileStatus, whether or not you specify partition columns. This can
cause an OOM when reading a Parquet table.

In HiveMetastoreCatalog, Spark converts a MetastoreRelation to a
ParquetRelation by calling the convertToParquetRelation method.
That method calls metastoreRelation.getHiveQlPartitions() to ask the Hive
metastore service for all partitions, without any filters, and then passes
all partition paths to ParquetRelation's paths member.
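
To make the difference concrete, here is a minimal sketch of unfiltered versus
filtered partition path collection. It is a toy model, not Spark's code:
ToyPartition, PartitionPathSketch, allPaths, prunedPaths and the /warehouse/t
layout are all hypothetical names chosen for illustration.

```scala
// Toy model, not Spark's code: every name here is hypothetical.
final case class ToyPartition(values: Map[String, String], path: String)

object PartitionPathSketch {
  // Stand-in for a metastore holding one partition per day.
  val partitions: Seq[ToyPartition] =
    (1 to 365).map { d =>
      ToyPartition(Map("day" -> d.toString), s"/warehouse/t/day=$d")
    }

  // The behaviour described in this report: every partition path is
  // collected, regardless of the query's predicate.
  def allPaths: Seq[String] = partitions.map(_.path)

  // What a filtered lookup would collect for a predicate on partition values.
  def prunedPaths(pred: Map[String, String] => Boolean): Seq[String] =
    partitions.filter(p => pred(p.values)).map(_.path)
}
```

In this model, a query on a single day still hands all 365 paths to the
relation when allPaths is used, while prunedPaths hands over only one.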

In FileStatusCache's refresh method, all of these paths are then listed:
"val files = listLeafFiles(paths)"
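
The memory cost of such an eager refresh can be sketched as follows. This is
an illustrative model under stated assumptions, not the actual Spark
implementation: ToyFileStatus, ToyFileStatusCache, cachedCount and the
injected listLeafFiles function are hypothetical stand-ins.

```scala
import scala.collection.mutable

// Toy model, not Spark's code: all names are hypothetical.
final case class ToyFileStatus(path: String, length: Long)

final class ToyFileStatusCache(
    paths: Seq[String],
    listLeafFiles: String => Seq[ToyFileStatus]) {

  private val leafFiles = mutable.LinkedHashMap.empty[String, ToyFileStatus]

  // Eager refresh: lists every root path and keeps every status in memory,
  // so the cache size grows with the total number of files in the table.
  def refresh(): Unit = {
    leafFiles.clear()
    for (root <- paths; status <- listLeafFiles(root)) {
      leafFiles(status.path) = status
    }
  }

  def cachedCount: Int = leafFiles.size
}

object FileStatusCacheSketch {
  // Fake listing: pretends each directory holds 50 parquet files.
  def fakeListing(root: String): Seq[ToyFileStatus] =
    (0 until 50).map(i => ToyFileStatus(s"$root/part-$i.parquet", 128L))
}
```

With 100 partition directories of 50 files each, one refresh() leaves 5000
entries in this toy cache, even if the query only needs a single partition.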



