Re: Spark 2.0 Performance drop
The patch we use in production is for 1.5. We're porting the patch to master (and downstream to 2.0, which is presently very similar) with the intention of submitting a PR "soon". We'll push it here when it's ready: https://github.com/VideoAmp/spark-public.

Regarding benchmarking, we have a suite of Spark SQL regression tests which we run to check correctness and performance. I can share our findings when I have them.

Cheers,

Michael

> On Jun 29, 2016, at 2:39 PM, Maciej Bryński wrote:
>
> 2016-06-29 23:22 GMT+02:00 Michael Allman:
>> I'm sorry I don't have any concrete advice for you, but I hope this helps
>> shed some light on the current support in Spark for projection pushdown.
>>
>> Michael
>
> Michael,
> Thanks for the answer. This resolves one of my questions.
> Which Spark version have you patched? 1.6? Are you planning to publish
> this patch, or is it only for the 2.0 branch?
>
> I'd gladly help with some benchmarking in my environment.
>
> Regards,
> --
> Maciek Bryński
Re: Spark 2.0 Performance drop
2016-06-29 23:22 GMT+02:00 Michael Allman:
> I'm sorry I don't have any concrete advice for you, but I hope this helps
> shed some light on the current support in Spark for projection pushdown.
>
> Michael

Michael,

Thanks for the answer. This resolves one of my questions.

Which Spark version have you patched? 1.6? Are you planning to publish this patch, or is it only for the 2.0 branch?

I'd gladly help with some benchmarking in my environment.

Regards,
--
Maciek Bryński
Re: Spark 2.0 Performance drop
Hi Maciej,

In Spark, projection pushdown is currently limited to top-level columns (StructFields). VideoAmp has very large parquet-based tables (many billions of records accumulated per day) with deeply nested schemas (four or five levels), and we've spent a considerable amount of time optimizing query performance on these tables.

We have a patch internally that extends Spark to support projection pushdown for arbitrarily nested fields. This has resulted in a *huge* performance improvement for many of our queries, like 10x to 100x in some cases.

I'm still putting the finishing touches on our port of this patch to Spark master and 2.0. We haven't done any specific benchmarking between versions, but I will do that when our patch is complete. We hope to contribute this functionality to the Spark project at some point in the near future, but it is not ready yet.

I'm sorry I don't have any concrete advice for you, but I hope this helps shed some light on the current support in Spark for projection pushdown.

Michael

> On Jun 29, 2016, at 1:48 PM, Maciej Bryński wrote:
>
> Hi,
> Did anyone measure performance of Spark 2.0 vs Spark 1.6?
>
> I did some tests on a parquet file with many nested columns (about 30G in
> 400 partitions) and Spark 2.0 is sometimes 2x slower.
>
> I tested the following queries:
> 1) select count(*) where id > some_id
> In this query we have PPD and performance is similar (about 1 sec).
>
> 2) select count(*) where nested_column.id > some_id
> Spark 1.6 -> 1.6 min
> Spark 2.0 -> 2.1 min
> Is it normal that both versions didn't do PPD?
>
> 3) Spark connected with Python
> df.where('id > some_id').rdd.flatMap(lambda r: [r.id] if not r.id % 10 else []).collect()
> Spark 1.6 -> 2.3 min
> Spark 2.0 -> 4.6 min (2x slower)
>
> I used BasicProfiler for this task and the cumulative time was:
> Spark 1.6 - 4300 sec
> Spark 2.0 - 5800 sec
>
> Should I expect such a drop in performance?
>
> BTW: why did DataFrame lose the map and flatMap methods in Spark 2.0?
>
> I don't know how to prepare sample data to show the problem.
> Any ideas? Or public data with many nested columns?
>
> I'd like to create a JIRA for it, but the Apache server is down at the moment.
>
> Regards,
> --
> Maciek Bryński
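As a rough illustration of the top-level-only pushdown described above, here is a minimal PySpark sketch against the 2.0 API; the file path and the "meta" / "meta.user.id" column names are invented for this example and are not from the patch or the thread:

    # Minimal sketch; the path and column names are hypothetical.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("nested-pushdown-sketch").getOrCreate()
    df = spark.read.parquet("/data/events")  # hypothetical nested-schema table

    # Flat column: Parquet column pruning reads only "id" from disk.
    df.select("id").explain()

    # Nested field: without nested-field pushdown, the whole "meta" struct
    # column is read and the field is extracted afterwards in memory.
    df.select("meta.user.id").explain()

Comparing the two physical plans (and the Parquet read metrics) shows why queries over deeply nested schemas can pay for far more I/O than the few fields they actually touch.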
Spark 2.0 Performance drop
Hi,

Did anyone measure performance of Spark 2.0 vs Spark 1.6?

I did some tests on a parquet file with many nested columns (about 30G in 400 partitions) and Spark 2.0 is sometimes 2x slower.

I tested the following queries:

1) select count(*) where id > some_id
In this query we have PPD and performance is similar (about 1 sec).

2) select count(*) where nested_column.id > some_id
Spark 1.6 -> 1.6 min
Spark 2.0 -> 2.1 min
Is it normal that both versions didn't do PPD?

3) Spark connected with Python
df.where('id > some_id').rdd.flatMap(lambda r: [r.id] if not r.id % 10 else []).collect()
Spark 1.6 -> 2.3 min
Spark 2.0 -> 4.6 min (2x slower)

I used BasicProfiler for this task and the cumulative time was:
Spark 1.6 - 4300 sec
Spark 2.0 - 5800 sec

Should I expect such a drop in performance?

BTW: why did DataFrame lose the map and flatMap methods in Spark 2.0?

I don't know how to prepare sample data to show the problem. Any ideas? Or public data with many nested columns?

I'd like to create a JIRA for it, but the Apache server is down at the moment.

Regards,
--
Maciek Bryński
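On the sample-data question above: one way to fabricate a dataset that reproduces the nested-vs-flat difference is to write a Parquet file with a flat id column plus a nested struct mirroring nested_column.id. A minimal sketch (PySpark, 2.0-style API; the row count, payload size, and output path are invented and would need to be scaled to roughly match the 30G / 400-partition table described above):

    # Sketch only; scale the row count / payload to approximate the real table.
    from pyspark.sql import Row, SparkSession

    spark = SparkSession.builder.appName("nested-benchmark-data").getOrCreate()

    def make_row(i):
        # Flat "id" plus a nested struct, mimicking "nested_column.id".
        return Row(id=i, nested_column=Row(id=i, payload="x" * 100))

    rows = spark.sparkContext.parallelize(range(10 * 1000 * 1000), 400).map(make_row)
    spark.createDataFrame(rows).write.mode("overwrite").parquet("/tmp/nested_bench")

Running queries 1) and 2) against a file generated this way should show whether the nested predicate is pushed down in a given Spark build.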
Re: Bitmap Indexing to increase OLAP query performance
Is it traditional bitmap indexing? I would not recommend it for big data. You could use bloom filters and min/max indexes in memory, which look to be more appropriate. However, if you want to use bitmap indexes, then you would have to do it as you say. Keep in mind that bitmap indexes may consume a lot of memory, so I am not sure that simply caching them in memory is desirable.

> On 29 Jun 2016, at 19:49, Nishadi Kirielle wrote:
>
> Hi All,
>
> I am a CSE undergraduate and, as our final year project, we plan to build a
> cluster-based, bit-oriented analytic platform (storage engine) to provide
> fast query performance for OLAP, using novel bitmap indexing techniques
> when and where appropriate.
>
> For that we are expecting to use Spark SQL. We will need to implement a way
> to cache the bitmap indexes and incorporate bitmap indexing at the Catalyst
> optimizer level where possible.
>
> I would highly appreciate your feedback regarding the proposed approach.
>
> Thank you & Regards
>
> Nishadi Kirielle
> Department of Computer Science and Engineering
> University of Moratuwa
> Sri Lanka
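On the min/max suggestion: Parquet already stores per-row-group min/max statistics, and Spark can use them to skip row groups when predicate pushdown is enabled, so a coarse min/max "index" comes essentially for free. A minimal sketch (PySpark; the path is hypothetical, and the config key is on by default in recent releases but worth verifying against your Spark version):

    # Sketch: rely on Parquet row-group min/max statistics via predicate pushdown.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("minmax-pushdown-sketch").getOrCreate()
    spark.conf.set("spark.sql.parquet.filterPushdown", "true")

    df = spark.read.parquet("/data/fact_table")  # hypothetical path
    # Row groups whose [min, max] range for "id" cannot satisfy the predicate
    # are skipped without scanning their pages.
    df.filter(df.id > 1000000).count()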
Bitmap Indexing to increase OLAP query performance
Hi All,

I am a CSE undergraduate and, as our final year project, we plan to build a cluster-based, bit-oriented analytic platform (storage engine) to provide fast query performance for OLAP, using novel bitmap indexing techniques when and where appropriate.

For that we are expecting to use Spark SQL. We will need to implement a way to cache the bitmap indexes and incorporate bitmap indexing at the Catalyst optimizer level where possible.

I would highly appreciate your feedback regarding the proposed approach.

Thank you & Regards

Nishadi Kirielle
Department of Computer Science and Engineering
University of Moratuwa
Sri Lanka