[
https://issues.apache.org/jira/browse/SPARK-17998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16756801#comment-16756801
]
Nicholas Resnick commented on SPARK-17998:
------------------------------------------
I reproduced the OP's steps above on my local machine and got 5 instead of 7.
I've also noticed this behavior on an EMR cluster, where the same Dataset has
been read into a varying number of partitions, seemingly nondeterministically.
The trend I've noticed is that the number of partitions equals the number of
cores allocated to the Spark app at the time of the read. Could the number of
cores affect this? Has anything changed since the last comment (1/2018)?
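For context, this is my reading of the file-scan planning in the Spark 2.x source (FileSourceScanExec), so treat the details as approximate: Spark no longer maps one Parquet part file to one partition. It computes a split size from spark.sql.files.maxPartitionBytes (default 128 MB), spark.sql.files.openCostInBytes (default 4 MB), and defaultParallelism, then bin-packs file splits into read partitions. Because bytes-per-core enters the split-size formula, the partition count tracks the core count. A standalone Python sketch of that planning logic (the file sizes in the usage note below are illustrative assumptions, not measured values):

```python
# Sketch of Spark 2.x file-read partition planning (paraphrased from
# FileSourceScanExec; treat as an approximation, not the exact implementation).

OPEN_COST = 4 * 1024 * 1024        # spark.sql.files.openCostInBytes default
MAX_PARTITION = 128 * 1024 * 1024  # spark.sql.files.maxPartitionBytes default

def plan_read_partitions(file_sizes, default_parallelism):
    """Return the number of read partitions Spark would plan for these files."""
    total = sum(size + OPEN_COST for size in file_sizes)
    bytes_per_core = total / default_parallelism
    max_split = int(min(MAX_PARTITION, max(OPEN_COST, bytes_per_core)))
    # 1. Split each file into chunks of at most max_split bytes.
    splits = []
    for size in file_sizes:
        for offset in range(0, size, max_split):
            splits.append(min(max_split, size - offset))
    # 2. Pack splits, largest first, into partitions; each split also pays
    #    the open cost, so many small files still coalesce together.
    splits.sort(reverse=True)
    partitions, current = 0, 0
    for split in splits:
        if current > 0 and current + split > max_split:
            partitions += 1
            current = 0
        current += split + OPEN_COST
    return partitions + (1 if current > 0 else 0)
```

Assuming 13 part files of roughly 30 MB each and defaultParallelism == 6, this sketch plans 7 partitions, which matches the count reported in the issue below, and the result scales with defaultParallelism, which would explain the cores correlation.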
> Reading Parquet files coalesces parts into too few in-memory partitions
> -----------------------------------------------------------------------
>
> Key: SPARK-17998
> URL: https://issues.apache.org/jira/browse/SPARK-17998
> Project: Spark
> Issue Type: Bug
> Components: PySpark, SQL
> Affects Versions: 2.0.0, 2.0.1
> Environment: Spark Standalone Cluster (not "local mode")
> Windows 10 and Windows 7
> Python 3.x
> Reporter: Shea Parkes
> Priority: Major
>
> Reading a Parquet file into a DataFrame results in far too few in-memory
> partitions. In prior versions of Spark, the resulting DataFrame would have a
> number of partitions often equal to the number of part files in the parquet
> folder.
> Here's a minimal reproducible sample:
> {code:python}
> df_first = session.range(start=1, end=100000000, numPartitions=13)
> assert df_first.rdd.getNumPartitions() == 13
> assert session._sc.defaultParallelism == 6
> path_scrap = r"c:\scratch\scrap.parquet"
> df_first.write.parquet(path_scrap)
> df_second = session.read.parquet(path_scrap)
> print(df_second.rdd.getNumPartitions())
> {code}
> For me, the above prints 7 partitions for the DataFrame created by reading
> the Parquet data back into memory. Why is it no longer the number of part
> files in the Parquet folder (13 in the example above)?
> I'm filing this as a bug because it has gotten so bad that we can't work with
> the underlying RDD without first repartitioning the DataFrame, which is
> costly and wasteful. I really doubt this was the intended effect of moving
> to Spark 2.0.
> I've tried to research where the number of in-memory partitions is
> determined, but my Scala skills have proven inadequate. I'd be happy to dig
> further if someone could point me in the right direction...
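If the bin-packing behavior sketched above is indeed the cause, one possible mitigation (untested here; the setting names exist in Spark 2.x, but the values below are illustrative, not tuned recommendations) is to lower the split ceiling so reads produce more, smaller partitions, e.g. in spark-defaults.conf:

```
# Illustrative values only, not tuned recommendations
spark.sql.files.maxPartitionBytes   16777216
spark.sql.files.openCostInBytes     1048576
```

The same keys can also be set per session via session.conf.set(...) before the read.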
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)