[jira] [Created] (SPARK-24974) Spark put all file's paths into SharedInMemoryCache even for unused partitions.

[email protected] (JIRA) Mon, 30 Jul 2018 15:36:05 -0700

[email protected] created SPARK-24974:
----------------------------------------------------


             Summary: Spark put all file's paths into SharedInMemoryCache even 
for unused partitions.
                 Key: SPARK-24974
                 URL: https://issues.apache.org/jira/browse/SPARK-24974
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.2.1
            Reporter: [email protected]


SharedInMemoryCache holds the FileStatus of every file, whether or not the query 
filters on partition columns. This causes long load times for queries that use 
only a couple of partitions, because Spark loads the paths of files from all 
partitions.

I partitioned the files by type, and I have a directory structure like 

{code}

report_date=2018-07-24/type=A/file_1

{code}

 

I am trying to execute 

{code}

val count = spark.read.parquet("/custom_path/report_date=2018-07-24")
  .filter("type == 'A'")
  .count

{code}

 

In my query I only need to load files of type A, which is just a couple of 
files. But Spark loads all 19K files into SharedInMemoryCache, which takes about 
60 seconds, and only after that discards the unused partitions.
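
For comparison, here is a minimal sketch of a narrower read that points Spark 
at the type=A directory directly. It is not part of the original report and has 
not been verified against 2.2.1. It assumes the layout shown above and relies on 
the reader's "basePath" option so that `type` is still exposed as a partition 
column. The object name and app name are hypothetical. If the slowdown comes 
from listing every partition directory, this read should only need to list the 
files under type=A.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical wrapper object, used only to make the sketch self-contained.
object NarrowPartitionRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("narrow-partition-read") // hypothetical app name
      .getOrCreate()

    // Paths taken from the report; the type=A subdirectory follows its layout.
    val basePath = "/custom_path/report_date=2018-07-24"
    val typeAPath = s"$basePath/type=A"

    // Read only the type=A directory. "basePath" tells partition discovery
    // where the partition tree starts, so `type` remains a column.
    val count = spark.read
      .option("basePath", basePath)
      .parquet(typeAPath)
      .count()

    println(count)
    spark.stop()
  }
}
```

This only sidesteps the problem for queries whose partition values are known 
up front. It does not change how the filtered read in the report behaves.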

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

