[
https://issues.apache.org/jira/browse/SPARK-17998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16757326#comment-16757326
]
Nicholas Resnick commented on SPARK-17998:
------------------------------------------
Answering my own question: it is in fact a function of the number of cores
available. The key definition is here:
[https://github.com/apache/spark/blob/c0632cec04e5b0f3fb3c3f27c21a2d3f3fbb4f7e/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FilePartition.scala#L86]
```
Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
```
The first two values are configurable, and the third is roughly the size of the
file divided by the number of cores (as reported by sc.defaultParallelism). If
defaultMaxSplitBytes > bytesPerCore > openCostInBytes, then maxSplitBytes
equals bytesPerCore, which means we should expect roughly one partition per
core.
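To make the clamping concrete, here is a minimal Python sketch of that min/max expression. The config names in the comments are Spark's real settings (spark.sql.files.maxPartitionBytes defaults to 128 MB, spark.sql.files.openCostInBytes to 4 MB); the 600 MB dataset size is just an illustrative assumption, not taken from this issue:
```python
def max_split_bytes(total_file_bytes, default_parallelism,
                    default_max_split_bytes=128 * 1024 * 1024,  # spark.sql.files.maxPartitionBytes
                    open_cost_in_bytes=4 * 1024 * 1024):        # spark.sql.files.openCostInBytes
    """Sketch of the maxSplitBytes computation in FilePartition.scala."""
    bytes_per_core = total_file_bytes // default_parallelism
    # Clamp bytesPerCore between openCostInBytes and defaultMaxSplitBytes.
    return min(default_max_split_bytes, max(open_cost_in_bytes, bytes_per_core))

# With an assumed 600 MB of Parquet data and defaultParallelism = 6,
# bytesPerCore = 100 MB, which sits between the two bounds, so
# maxSplitBytes == bytesPerCore and we get roughly one split per core.
split = max_split_bytes(600 * 1024 * 1024, 6)
```
Note the two clamping regimes: very small inputs are clamped up to openCostInBytes (so many tiny files still get coalesced), and very large inputs are clamped down to defaultMaxSplitBytes (so a huge file is still split).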
> Reading Parquet files coalesces parts into too few in-memory partitions
> -----------------------------------------------------------------------
>
> Key: SPARK-17998
> URL: https://issues.apache.org/jira/browse/SPARK-17998
> Project: Spark
> Issue Type: Bug
> Components: PySpark, SQL
> Affects Versions: 2.0.0, 2.0.1
> Environment: Spark Standalone Cluster (not "local mode")
> Windows 10 and Windows 7
> Python 3.x
> Reporter: Shea Parkes
> Priority: Major
>
> Reading a Parquet file into a DataFrame is resulting in far too few
> in-memory partitions. In prior versions of Spark, the resulting DataFrame
> would have a number of partitions often equal to the number of parts in the
> parquet folder.
> Here's a minimal reproducible sample:
> {quote}
> df_first = session.range(start=1, end=100000000, numPartitions=13)
> assert df_first.rdd.getNumPartitions() == 13
> assert session._sc.defaultParallelism == 6
> path_scrap = r"c:\scratch\scrap.parquet"
> df_first.write.parquet(path_scrap)
> df_second = session.read.parquet(path_scrap)
> print(df_second.rdd.getNumPartitions())
> {quote}
> For me, the code above reports only 7 partitions in the DataFrame created by
> reading the Parquet data back into memory. Why is it no longer just the
> number of part files in the Parquet folder (which is 13 in the example
> above)?
> I'm filing this as a bug because it has gotten so bad that we can't work with
> the underlying RDD without first repartitioning the DataFrame, which is
> costly and wasteful. I really doubt this was the intended effect of moving
> to Spark 2.0.
> I've tried to research where the number of in-memory partitions is
> determined, but my Scala skills have proven inadequate. I'd be happy to dig
> further if someone could point me in the right direction...
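Given the split-size formula quoted in the comment above, one possible alternative to repartitioning after the read is to shrink the maximum split size so the reader produces more partitions up front. This is only a sketch: spark.sql.files.maxPartitionBytes is a real Spark SQL setting, but the 16 MB value is illustrative and the effect on this specific dataset is untested here.
```python
# Hypothetical workaround sketch (assumes an existing SparkSession `session`
# and the `path_scrap` Parquet path from the repro above).
# Lowering maxPartitionBytes lowers maxSplitBytes, so the scan is cut into
# more, smaller partitions without a post-read repartition/shuffle.
session.conf.set("spark.sql.files.maxPartitionBytes", 16 * 1024 * 1024)
df_more_parts = session.read.parquet(path_scrap)
print(df_more_parts.rdd.getNumPartitions())
```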
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)