[jira] [Updated] (SPARK-16591) HadoopFsRelation will list , cache all parquet file paths

cen yuhai (JIRA) Sun, 17 Jul 2016 00:52:40 -0700

     [ 
https://issues.apache.org/jira/browse/SPARK-16591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


cen yuhai updated SPARK-16591:
------------------------------
    Description: 
HadoopFsRelation has a fileStatusCache that lists all paths and then caches 
every FileStatus, whether or not you specify partition columns. This may 
cause an OOM when reading a Parquet table.

In HiveMetastoreCatalog, Spark converts a MetastoreRelation to a 
ParquetRelation by calling the convertToParquetRelation method.
That method calls metastoreRelation.getHiveQlPartitions() to request all 
partitions from the Hive metastore service, without filters, and then passes 
all partition paths to the ParquetRelation's paths member.

FileStatusCache's refresh method then lists all of these paths: "val files = 
listLeafFiles(paths)"

  was:
HadoopFsRelation has a fileStatusCache that lists all paths and then caches 
every FileStatus, whether or not you specify partition columns. This will 
cause an OOM when reading a Parquet table.

In HiveMetastoreCatalog, Spark converts a MetastoreRelation to a 
ParquetRelation by calling the convertToParquetRelation method.
That method calls metastoreRelation.getHiveQlPartitions() to request all 
partitions from the Hive metastore service, without filters, and then passes 
all partition paths to the ParquetRelation's paths member.

FileStatusCache's refresh method then lists all of these paths: "val files = 
listLeafFiles(paths)"


> HadoopFsRelation will list , cache all parquet file paths
> ---------------------------------------------------------
>
>                 Key: SPARK-16591
>                 URL: https://issues.apache.org/jira/browse/SPARK-16591
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.0, 1.6.1, 1.6.2
>            Reporter: cen yuhai
>
> HadoopFsRelation has a fileStatusCache that lists all paths and then caches 
> every FileStatus, whether or not you specify partition columns. This may 
> cause an OOM when reading a Parquet table.
> In HiveMetastoreCatalog, Spark converts a MetastoreRelation to a 
> ParquetRelation by calling the convertToParquetRelation method.
> That method calls metastoreRelation.getHiveQlPartitions() to request all 
> partitions from the Hive metastore service, without filters, and then 
> passes all partition paths to the ParquetRelation's paths member.
> FileStatusCache's refresh method then lists all of these paths: "val files = 
> listLeafFiles(paths)"
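
To make the reported behaviour concrete, here is a minimal sketch in plain 
Scala. It is not Spark code: Partition, ListingSketch, the in-memory fs map, 
eagerListing and prunedListing are all hypothetical names invented for 
illustration, and only the shape of listLeafFiles(paths) is taken from the 
report. It contrasts listing every partition path with listing only the 
partitions that match a predicate.

```scala
// Hypothetical model of the behaviour described in this issue; not Spark code.
final case class Partition(values: Map[String, String], path: String)

object ListingSketch {
  // Stand-in for a file system: partition directory -> leaf file names.
  val fs: Map[String, Seq[String]] = Map(
    "/t/dt=2016-07-15" -> Seq("part-0.parquet", "part-1.parquet"),
    "/t/dt=2016-07-16" -> Seq("part-0.parquet"),
    "/t/dt=2016-07-17" -> Seq("part-0.parquet", "part-1.parquet", "part-2.parquet")
  )

  // Stand-in for the partitions returned by the metastore without filters.
  val partitions: Seq[Partition] = fs.keys.toSeq.sorted.map { p =>
    Partition(Map("dt" -> p.split("=").last), p)
  }

  // Same shape as "val files = listLeafFiles(paths)": every path passed in is listed.
  def listLeafFiles(paths: Seq[String]): Seq[String] =
    paths.flatMap(p => fs.getOrElse(p, Seq.empty).map(f => s"$p/$f"))

  // Reported behaviour: all partition paths are listed, so all files end up cached.
  def eagerListing(): Seq[String] =
    listLeafFiles(partitions.map(_.path))

  // Assumed alternative: filter partitions first, then list only the survivors.
  def prunedListing(pred: Map[String, String] => Boolean): Seq[String] =
    listLeafFiles(partitions.filter(p => pred(p.values)).map(_.path))

  def main(args: Array[String]): Unit = {
    println(eagerListing().size)                                         // 6
    println(prunedListing(values => values("dt") == "2016-07-17").size)  // 3
  }
}
```

In this toy model the eager listing grows with the total number of files in 
the table, while the pruned listing grows only with the partitions a query 
actually touches, which is the memory concern raised above.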



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

