[
https://issues.apache.org/jira/browse/SPARK-34648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Pankaj Bhootra reopened SPARK-34648:
------------------------------------
Reopening this as there has been no response on the email channel. Please help
with clarification.
> Reading Parquet Files in Spark Extremely Slow for Large Number of Files?
> ------------------------------------------------------------------------
>
> Key: SPARK-34648
> URL: https://issues.apache.org/jira/browse/SPARK-34648
> Project: Spark
> Issue Type: Question
> Components: SQL
> Affects Versions: 2.3.0
> Reporter: Pankaj Bhootra
> Priority: Major
>
> Hello Team,
> I am new to Spark, and this question may be a duplicate of the issue
> highlighted here: https://issues.apache.org/jira/browse/SPARK-9347
> We have a large dataset partitioned by calendar date, and within each date
> partition the data is stored as *parquet* files in 128 parts.
> We are trying to run aggregations on this dataset for 366 dates at a time with
> Spark SQL on Spark version 2.3.0, so our Spark job reads
> 366*128=46848 partitions, all of which are parquet files. There are currently
> no *_metadata* or *_common_metadata* files available for this dataset.
> The problem we are facing is that when we run *spark.read.parquet* on
> the above 46848 partitions, our data reads are extremely slow. Even a
> simple map task (no shuffling), without any aggregation or group-by,
> takes a long time to run.
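> For illustration, our read pattern looks roughly like the sketch below; the
> bucket layout, partition column name (date) and column names are placeholders,
> not our actual schema:
> {code:scala}
> import org.apache.spark.sql.SparkSession
>
> val spark = SparkSession.builder().appName("parquet-read-sketch").getOrCreate()
>
> // One input path per calendar-date partition; in our case there are 366 of
> // them, and each holds 128 parquet part files (366 * 128 = 46848 files).
> val dates = Seq("2020-01-01", "2020-01-02") // ... up to 366 dates
> val paths = dates.map(d => s"s3://our-bucket/our-dataset/date=$d")
>
> // spark.read.parquet accepts multiple paths.
> val df = spark.read.parquet(paths: _*)
>
> // Even a simple narrow transformation (no shuffle) is very slow for us.
> df.select("some_column").show()
> {code}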
> I read through the above issue and I think I broadly understand the idea
> behind the *_common_metadata* file. However, that issue was raised for
> Spark 1.3.1, and so far I have not found any documentation on this metadata
> file for Spark 2.3.0.
> I would like to clarify:
> # What is the current best practice for reading a large number of parquet
> files efficiently?
> # Does this involve using any additional options with *spark.read.parquet*?
> How would that work? (An illustrative sketch follows this list.)
> # Are there other possible reasons for slow data reads, apart from reading
> the metadata of every part file? We are trying to migrate our existing
> Spark pipeline from CSV files to parquet, but from my hands-on experience so
> far, parquet's read time appears to be slower than CSV's. This contradicts
> the popular opinion that parquet performs better in terms of both
> computation and storage.
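> To make question 2 concrete, the sketch below shows the kind of read options
> and partition pruning we are asking about; the basePath, partition column
> name (date) and option choices are only illustrative guesses on our part:
> {code:scala}
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.functions.col
>
> val spark = SparkSession.builder().getOrCreate()
>
> // Read the partitioned root and let Spark discover the date partitions.
> val df = spark.read
>   .option("basePath", "s3://our-bucket/our-dataset") // keep the partition column in the schema
>   .option("mergeSchema", "false")                    // do not merge schemas across part files
>   .parquet("s3://our-bucket/our-dataset")
>
> // Rely on partition pruning on the date column instead of listing all
> // 46848 part file paths up front.
> val oneYear = df.filter(col("date") >= "2020-01-01" && col("date") <= "2020-12-31")
> oneYear.select("some_column").show()
> {code}
> Whether options like these actually help with the slow listing and reads is
> exactly what we would like to clarify.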
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]