[
https://issues.apache.org/jira/browse/SPARK-17998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16757326#comment-16757326
]
Nicholas Resnick commented on SPARK-17998:
------------------------------------------
Answering my own question: it is in fact a function of the number of cores
available. The key definition is here:
[https://github.com/apache/spark/blob/c0632cec04e5b0f3fb3c3f27c21a2d3f3fbb4f7e/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FilePartition.scala#L86]
```
Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
```
The first two values are configurable, and the third is roughly the size of the
file divided by the number of cores (as reported by sc.defaultParallelism). If
defaultMaxSplitBytes > bytesPerCore > openCostInBytes, then maxSplitBytes
equals bytesPerCore, which means we should expect roughly one partition per
core.
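To make the clamping concrete, here is a minimal Python sketch of that min/max expression. The config names in the comments are Spark's real settings (spark.sql.files.maxPartitionBytes defaults to 128 MB, spark.sql.files.openCostInBytes to 4 MB); the 600 MB dataset size is just an illustrative assumption, not taken from this issue:
```python
def max_split_bytes(total_file_bytes, default_parallelism,
                    default_max_split_bytes=128 * 1024 * 1024,  # spark.sql.files.maxPartitionBytes
                    open_cost_in_bytes=4 * 1024 * 1024):        # spark.sql.files.openCostInBytes
    """Sketch of the maxSplitBytes computation in FilePartition.scala."""
    bytes_per_core = total_file_bytes // default_parallelism
    # Clamp bytesPerCore between openCostInBytes and defaultMaxSplitBytes.
    return min(default_max_split_bytes, max(open_cost_in_bytes, bytes_per_core))

# With an assumed 600 MB of Parquet data and defaultParallelism = 6,
# bytesPerCore = 100 MB, which sits between the two bounds, so
# maxSplitBytes == bytesPerCore and we get roughly one split per core.
split = max_split_bytes(600 * 1024 * 1024, 6)
```
Note the two clamping regimes: very small inputs are clamped up to openCostInBytes (so many tiny files still get coalesced), and very large inputs are clamped down to defaultMaxSplitBytes (so a huge file is still split).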
> Reading Parquet files coalesces parts into too few in-memory partitions
> -----------------------------------------------------------------------
>
> Key: SPARK-17998
> URL: https://issues.apache.org/jira/browse/SPARK-17998
> Project: Spark
> Issue Type: Bug
> Components: PySpark, SQL
> Affects Versions: 2.0.0, 2.0.1
> Environment: Spark Standalone Cluster (not "local mode")
> Windows 10 and Windows 7
> Python 3.x
> Reporter: Shea Parkes
> Priority: Major
>
> Reading a Parquet file into a DataFrame is resulting in far too few
> in-memory partitions. In prior versions of Spark, the resulting DataFrame
> would have a number of partitions often equal to the number of parts in the
> parquet folder.
> Here's a minimal reproducible sample:
> {quote}
> df_first = session.range(start=1, end=100000000, numPartitions=13)
> assert df_first.rdd.getNumPartitions() == 13
> assert session._sc.defaultParallelism == 6
> path_scrap = r"c:\scratch\scrap.parquet"
> df_first.write.parquet(path_scrap)
> df_second = session.read.parquet(path_scrap)
> print(df_second.rdd.getNumPartitions())
> {quote}
> For me, the code above reports only 7 partitions in the DataFrame created by
> reading the Parquet data back into memory. Why is it no longer just the
> number of part files in the Parquet folder (which is 13 in the example
> above)?
> I'm filing this as a bug because it has gotten so bad that we can't work with
> the underlying RDD without first repartitioning the DataFrame, which is
> costly and wasteful. I really doubt this was the intended effect of moving
> to Spark 2.0.
> I've tried to research where the number of in-memory partitions is
> determined, but my Scala skills have proven inadequate. I'd be happy to dig
> further if someone could point me in the right direction...
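Given the split-size formula quoted in the comment above, one possible alternative to repartitioning after the read is to shrink the maximum split size so the reader produces more partitions up front. This is only a sketch: spark.sql.files.maxPartitionBytes is a real Spark SQL setting, but the 16 MB value is illustrative and the effect on this specific dataset is untested here.
```python
# Hypothetical workaround sketch (assumes an existing SparkSession `session`
# and the `path_scrap` Parquet path from the repro above).
# Lowering maxPartitionBytes lowers maxSplitBytes, so the scan is cut into
# more, smaller partitions without a post-read repartition/shuffle.
session.conf.set("spark.sql.files.maxPartitionBytes", 16 * 1024 * 1024)
df_more_parts = session.read.parquet(path_scrap)
print(df_more_parts.rdd.getNumPartitions())
```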
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)