Pankaj Bhootra created SPARK-34648:
--------------------------------------

             Summary: Reading Parquet Files in Spark Extremely Slow for Large 
Number of Files?
                 Key: SPARK-34648
                 URL: https://issues.apache.org/jira/browse/SPARK-34648
             Project: Spark
          Issue Type: Question
          Components: SQL
    Affects Versions: 2.3.0
            Reporter: Pankaj Bhootra


Hello Team

I am new to Spark, and this question may be a duplicate of the issue 
raised here: https://issues.apache.org/jira/browse/SPARK-9347 

We have a large dataset partitioned by calendar date, and within each date 
partition, we are storing the data as *parquet* files in 128 parts.

We are trying to run an aggregation on this dataset for 366 dates at a time 
with Spark SQL on Spark 2.3.0, so our Spark job reads 366*128=46848 
partitions, all of which are parquet files. There is currently no *_metadata* 
or *_common_metadata* file available for this dataset.
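For clarity, the file count being scanned works out as:

```python
# File-count arithmetic for the scan described above.
dates = 366
parts_per_date = 128
total_files = dates * parts_per_date
print(total_files)  # 46848
```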

The problem we are facing is that when we run *spark.read.parquet* over the 
above 46848 partitions, our data reads are extremely slow. Even a simple map 
task (no shuffling, no aggregation or group by) takes a long time to run.
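For reference, the read pattern described above looks roughly like the sketch 
below. The root path, partition column, and column name are hypothetical 
placeholders, not our actual schema:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-read").getOrCreate()

# Hypothetical layout: /data/events/date=YYYY-MM-DD/part-000NN.parquet
# 366 date partitions x 128 part files = 46848 parquet files in one scan.
df = (spark.read
      .option("basePath", "/data/events")          # keep `date` as a partition column
      .parquet("/data/events/date=2020-*"))        # glob over the 366 dates

# Even a narrow, shuffle-free transformation like this is slow at this file count.
df.select("some_column").write.parquet("/tmp/out")
```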

I read through the above issue and I think I generally understand the ideas 
around the *_common_metadata* file. However, that issue was raised against 
Spark 1.3.1, and for Spark 2.3.0 I have not found any documentation related 
to this metadata file so far.

I would like to clarify:
 # What is the current best practice for efficiently reading a large number of 
parquet files?
 # Does this involve passing any additional options to spark.read.parquet? How 
would that work?
 # Are there other possible reasons for slow reads, apart from reading the 
metadata of every part file? We are trying to migrate our existing Spark 
pipeline from CSV files to parquet, but in my hands-on experience so far, 
parquet's read time seems slower than CSV's. This contradicts the popular 
opinion that parquet performs better in terms of both computation and 
storage.
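Regarding question 2, a hedged sketch of the kind of tuning sometimes 
suggested for many-file parquet scans. The option names below are real Spark 
configuration keys as of 2.3.0, but the values, paths, and schema are 
illustrative assumptions, not a confirmed fix for our case:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = (SparkSession.builder
         .appName("parquet-read-tuning")
         # Do not merge schemas across 46k+ file footers (false is the default,
         # but worth confirming it has not been enabled elsewhere).
         .config("spark.sql.parquet.mergeSchema", "false")
         # Parallelize file listing once the path count exceeds this threshold.
         .config("spark.sql.sources.parallelPartitionDiscovery.threshold", "32")
         .getOrCreate())

# Supplying an explicit schema lets Spark skip schema inference, which
# otherwise reads parquet footers up front. Columns here are hypothetical.
schema = StructType([
    StructField("date", StringType()),
    StructField("user_id", LongType()),
])

df = spark.read.schema(schema).parquet("/data/events")
```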



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
