[
https://issues.apache.org/jira/browse/SPARK-17998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Shea Parkes closed SPARK-17998.
-------------------------------
Resolution: Information Provided
> Reading Parquet files coalesces parts into too few in-memory partitions
> -----------------------------------------------------------------------
>
> Key: SPARK-17998
> URL: https://issues.apache.org/jira/browse/SPARK-17998
> Project: Spark
> Issue Type: Bug
> Components: PySpark, SQL
> Affects Versions: 2.0.0, 2.0.1
> Environment: Spark Standalone Cluster (not "local mode")
> Windows 10 and Windows 7
> Python 3.x
> Reporter: Shea Parkes
>
> Reading a parquet ~file into a DataFrame is resulting in far too few
> in-memory partitions. In prior versions of Spark, the resulting DataFrame
> would have a number of partitions often equal to the number of parts in the
> parquet folder.
> Here's a minimal reproducible sample:
> {quote}
> df_first = session.range(start=1, end=100000000, numPartitions=13)
> assert df_first.rdd.getNumPartitions() == 13
> assert session._sc.defaultParallelism == 6
> path_scrap = r"c:\scratch\scrap.parquet"
> df_first.write.parquet(path_scrap)
> df_second = session.read.parquet(path_scrap)
> print(df_second.rdd.getNumPartitions())
> {quote}
> The above shows only 7 partitions in the DataFrame that was created by
> reading the Parquet back into memory for me. Why is it no longer just the
> number of part files in the Parquet folder? (Which is 13 in the example
> above.)
> I'm filing this as a bug because it has gotten so bad that we can't work with
> the underlying RDD without first repartitioning the DataFrame, which is
> costly and wasteful. I really doubt this was the intended effect of moving
> to Spark 2.0.
> I've tried to research where the number of in-memory partitions is
> determined, but my Scala skills have proven in-adequate. I'd be happy to dig
> further if someone could point me in the right direction...
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]