Hi,
Has anyone measured the performance of Spark 2.0 vs. Spark 1.6?
I ran some tests on a Parquet file with many nested columns (about 30 GB
in 400 partitions), and Spark 2.0 is sometimes 2x slower.
I tested the following queries:
1) select count(*) where id > some_id
Here predicate pushdown (PPD) works and performance is similar (about 1 sec).
2) select count(*) where nested_column.id > some_id
Spark 1.6 -> 1.6 min
Spark 2.0 -> 2.1 min
Is it expected that neither version does PPD on the nested column?
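A quick way to check this (a sketch; it assumes df = sqlContext.read.parquet(path) and a concrete value in place of some_id):

df.filter('nested_column.id > 12345').explain()
# the Parquet scan node in the physical plan lists PushedFilters: [...]
# if the nested predicate is missing there, it wasn't pushed down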
3) PySpark:
df.where('id > some_id').rdd.flatMap(lambda r: [r.id] if not r.id % 100000 else []).collect()
Spark 1.6 -> 2.3 min
Spark 2.0 -> 4.6 min (2x slower)
I used BasicProfiler for this task, and the cumulative time was:
Spark 1.6 - 4300 sec
Spark 2.0 - 5800 sec
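(For reference, a sketch of how the profiler can be enabled; BasicProfiler is the default once spark.python.profile is set:)

from pyspark import SparkConf, SparkContext

conf = SparkConf().set('spark.python.profile', 'true')
sc = SparkContext(conf=conf)
# ... run the job ...
sc.show_profiles()  # prints cumulative profile stats per RDD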
Should I expect such a drop in performance?
BTW: why did DataFrame lose its map and flatMap methods in Spark 2.0?
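(As far as I know, DataFrame became an alias for Dataset[Row] on the JVM side, where the typed map/flatMap now live on Dataset; in PySpark the workaround is to go through .rdd. A sketch:)

# Spark 2.0 PySpark: df.flatMap is gone, use the underlying RDD instead
ids = df.rdd.flatMap(lambda r: [r.id]).collect()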
I don't know how to prepare sample data that reproduces the problem.
Any ideas? Or is there public data with many nested columns?
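(If it helps, a sketch for generating synthetic nested data; the schema, sizes, and path below are made up and would need scaling toward the 30 GB / 400-partition shape:)

from pyspark.sql import Row

def make_row(i):
    # hypothetical schema: a top-level id plus one nested struct column
    return Row(id=i, nested_column=Row(id=i, payload='x' * 100))

rdd = sc.parallelize(range(1000000), 400).map(make_row)
sqlContext.createDataFrame(rdd).write.parquet('/tmp/nested_sample')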
I'd like to create a JIRA for it, but the Apache server is down at the moment.
Regards,
--
Maciek Bryński