[
https://issues.apache.org/jira/browse/SPARK-16321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15359339#comment-15359339
]
Jeff Zhang commented on SPARK-16321:
------------------------------------
This could be due to a lot of things: maybe reading the parquet file, the Python lambda,
or some other Spark SQL changes from 1.6 to 2.0. First, could you verify whether
it is due to reading the parquet file by running the following code:
{code}
df = sqlctx.read.parquet(path)
df.where('id > some_id').count()
{code}
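To make the comparison concrete, here is a rough timing sketch (just a suggestion; `sqlctx`, `path`, and `some_id` are the same placeholders as in the snippets above and need real values) that times the filter-plus-count path and the full flatMap/collect path separately:
{code}
import time

# Rough timing sketch; `sqlctx`, `path` and `some_id` are the same
# placeholders as in the snippets above and must be filled in.
df = sqlctx.read.parquet(path)

start = time.time()
df.where('id > some_id').count()
print('filter + count only: %.1f s' % (time.time() - start))

start = time.time()
df.where('id > some_id').rdd.flatMap(
    lambda r: [r.id] if not r.id % 100000 else []).collect()
print('full flatMap + collect: %.1f s' % (time.time() - start))
{code}
Running both on 1.6 and 2.0 should show whether the regression is already visible in the scan/filter stage or only once the rows are handed to the Python lambda.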
> Pyspark 2.0 performance drop vs pyspark 1.6
> -------------------------------------------
>
> Key: SPARK-16321
> URL: https://issues.apache.org/jira/browse/SPARK-16321
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.0.0
> Reporter: Maciej Bryński
>
> I did some tests on a parquet file with many nested columns (about 30 GB in
> 400 partitions), and Spark 2.0 is 2x slower.
> {code}
> df = sqlctx.read.parquet(path)
> df.where('id > some_id').rdd.flatMap(
>     lambda r: [r.id] if not r.id % 100000 else []).collect()
> {code}
> Spark 1.6 -> 2.3 min
> Spark 2.0 -> 4.6 min (2x slower)
> I used BasicProfiler for this task and the cumulative time was:
> Spark 1.6 - 4300 sec
> Spark 2.0 - 5800 sec
> Should I expect such a drop in performance?
> I don't know how to prepare sample data to show the problem.
> Any ideas? Or public data with many nested columns?
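Regarding the question about preparing sample data: one possible way to fabricate a nested-column parquet file for a reproduction (the schema, row count, and output path below are invented for illustration, not taken from the report) is to build nested Rows directly:
{code}
from pyspark.sql import Row

# Illustrative only: the struct layout, row count and output path are
# made up; scale the row count up until the file is large enough.
nested = sc.parallelize(range(1000000), 400).map(
    lambda i: Row(id=i,
                  nested=Row(a=i * 2,
                             b=str(i),
                             c=[i, i + 1, i + 2])))
sqlctx.createDataFrame(nested).write.parquet('/tmp/nested_sample')
{code}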