[
https://issues.apache.org/jira/browse/SPARK-16591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
cen yuhai updated SPARK-16591:
------------------------------
Description:
HadoopFsRelation has a fileStatusCache that lists all paths and caches every
FileStatus, whether or not partition columns are specified. This may cause an
OOM when reading a Parquet table.
In HiveMetastoreCatalog, Spark converts a MetaStoreRelation to a
ParquetRelation by calling the convertToParquetRelation method.
That method calls metastoreRelation.getHiveQlPartitions() to request all
partitions from the Hive metastore service without any filters, and then
passes every partition path to the ParquetRelation's paths member.
FileStatusCache's refresh method then lists all of those paths: "val files =
listLeafFiles(paths)"
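The listing behavior described above can be illustrated with a small toy model. This is only a sketch and does not use Spark: Partition, FakeFs, listEagerly and listPruned are hypothetical names made up for illustration, and the listLeafFiles here is a stand-in that merely counts how many directories it is asked to list. It contrasts listing every partition path up front with listing only the paths that survive a partition filter.

```scala
// Toy model only: none of these types or methods are Spark APIs.
object EagerListingSketch {
  // A partition as a metastore might report it: column values plus a directory.
  final case class Partition(spec: Map[String, String], path: String)

  // Stand-in for a file system; counts how many directories get listed.
  final class FakeFs(files: Map[String, Seq[String]]) {
    var listCalls: Int = 0
    def listLeafFiles(paths: Seq[String]): Seq[String] =
      paths.flatMap { p =>
        listCalls += 1
        files.getOrElse(p, Seq.empty)
      }
  }

  // Behavior described in the report: every partition path is listed,
  // regardless of any partition filter in the query.
  def listEagerly(fs: FakeFs, parts: Seq[Partition]): Seq[String] =
    fs.listLeafFiles(parts.map(_.path))

  // Alternative: apply the partition filter first, then list what survives.
  def listPruned(fs: FakeFs, parts: Seq[Partition])(
      keep: Map[String, String] => Boolean): Seq[String] =
    fs.listLeafFiles(parts.filter(p => keep(p.spec)).map(_.path))

  // Returns (directories listed eagerly, directories listed after pruning).
  def demo(): (Int, Int) = {
    val parts = (1 to 1000).map { d =>
      Partition(Map("day" -> d.toString), s"/warehouse/t/day=$d")
    }
    val files = parts.map(p => p.path -> Seq(s"${p.path}/part-00000.parquet")).toMap
    val eagerFs = new FakeFs(files)
    val prunedFs = new FakeFs(files)
    val eagerFiles = listEagerly(eagerFs, parts)
    val prunedFiles = listPruned(prunedFs, parts)(_("day") == "42")
    println(s"files cached eagerly: ${eagerFiles.size}, after pruning: ${prunedFiles.size}")
    (eagerFs.listCalls, prunedFs.listCalls)
  }

  def main(args: Array[String]): Unit = {
    val (eager, pruned) = demo()
    println(s"directories listed eagerly: $eager, after pruning: $pruned")
  }
}
```

In this model a query that touches a single day still lists all 1000 partition directories on the eager path, while the pruned path lists one. The number of cached entries grows with the table's total partition count, not with the partitions the query needs, which is the memory pressure the report is concerned with.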
was:
HadoopFsRelation has a fileStatusCache that lists all paths and caches every
FileStatus, whether or not partition columns are specified. This will cause an
OOM when reading a Parquet table.
In HiveMetastoreCatalog, Spark converts a MetaStoreRelation to a
ParquetRelation by calling the convertToParquetRelation method.
That method calls metastoreRelation.getHiveQlPartitions() to request all
partitions from the Hive metastore service without any filters, and then
passes every partition path to the ParquetRelation's paths member.
FileStatusCache's refresh method then lists all of those paths: "val files =
listLeafFiles(paths)"
> HadoopFsRelation will list and cache all parquet file paths
> -----------------------------------------------------------
>
> Key: SPARK-16591
> URL: https://issues.apache.org/jira/browse/SPARK-16591
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.6.0, 1.6.1, 1.6.2
> Reporter: cen yuhai
>