[ https://issues.apache.org/jira/browse/SPARK-16591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15925631#comment-15925631 ]
Takeshi Yamamuro commented on SPARK-16591:
------------------------------------------
Since HadoopFsRelation has changed completely, I'll close this. If you still
have a problem, feel free to update the description and reopen this. Thanks.
> HadoopFsRelation will list , cache all parquet file paths
> ---------------------------------------------------------
>
> Key: SPARK-16591
> URL: https://issues.apache.org/jira/browse/SPARK-16591
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.6.0, 1.6.1, 1.6.2
> Reporter: cen yuhai
>
> HadoopFsRelation has a fileStatusCache that lists all paths and then caches
> every file status, whether or not you specify partition columns. This
> may cause an OOM when reading a Parquet table.
> In HiveMetastoreCatalog, Spark converts a MetaStoreRelation to a
> ParquetRelation by calling the convertToParquetRelation method.
> That method calls metastoreRelation.getHiveQlPartitions() to request all
> partitions from the Hive metastore service, without filters, and then passes
> all partition paths to ParquetRelation's paths member.
> FileStatusCache's refresh method then lists all of those paths:
> "val files = listLeafFiles(paths)"
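
To make the reported behavior concrete, here is a minimal sketch in plain Scala with no Spark dependency. It contrasts listing every partition path with filtering partitions before listing. Everything in it is an assumption made for illustration: the Partition case class, the in-memory fakeFs map, and the listAll and listPruned helpers are hypothetical, and only the name listLeafFiles is borrowed from the description above. It is not Spark's implementation, and it does not describe how Spark later changed HadoopFsRelation.

```scala
// Hypothetical model of a partitioned table; none of these names are Spark APIs.
final case class Partition(values: Map[String, String], path: String)

object PartitionListingSketch {
  // In-memory stand-in for a file system: partition directory -> files in it.
  val fakeFs: Map[String, Seq[String]] = Map(
    "/warehouse/t/dt=2016-07-01" -> Seq("part-00000.parquet", "part-00001.parquet"),
    "/warehouse/t/dt=2016-07-02" -> Seq("part-00000.parquet"),
    "/warehouse/t/dt=2016-07-03" -> Seq("part-00000.parquet", "part-00001.parquet", "part-00002.parquet")
  )

  // One partition per directory, with the "dt" value parsed from the path.
  val partitions: Seq[Partition] = fakeFs.keys.toSeq.sorted.map { p =>
    Partition(Map("dt" -> p.split("=").last), p)
  }

  // Returns the full path of every file under the given directories.
  def listLeafFiles(paths: Seq[String]): Seq[String] =
    paths.flatMap(p => fakeFs.getOrElse(p, Seq.empty).map(f => s"$p/$f"))

  // Behavior described in the report: all partition paths are listed,
  // regardless of any partition filter in the query.
  def listAll(): Seq[String] =
    listLeafFiles(partitions.map(_.path))

  // Assumed alternative: apply the partition predicate first, so only the
  // matching directories are listed.
  def listPruned(keep: Partition => Boolean): Seq[String] =
    listLeafFiles(partitions.filter(keep).map(_.path))
}
```

With this toy data, PartitionListingSketch.listAll() returns 6 file paths, while PartitionListingSketch.listPruned(_.values("dt") == "2016-07-02") returns only the single file under that partition. On a table with many partitions, the first pattern is the one the report associates with the OOM.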