Re: Spark 2.0 Performance drop

2016-06-29 Thread Michael Allman
The patch we use in production is for 1.5. We're porting the patch to master 
(and downstream to 2.0, which is presently very similar) with the intention of 
submitting a PR "soon". We'll push it here when it's ready: 
https://github.com/VideoAmp/spark-public.

Regarding benchmarking, we have a suite of Spark SQL regression tests which we 
run to check correctness and performance. I can share our findings when I have 
them.

Cheers,

Michael

> On Jun 29, 2016, at 2:39 PM, Maciej Bryński wrote:
> 
> 2016-06-29 23:22 GMT+02:00 Michael Allman:
>> I'm sorry I don't have any concrete advice for you, but I hope this helps 
>> shed some light on the current support in Spark for projection pushdown.
>> 
>> Michael
> 
> Michael,
> Thanks for the answer. This resolves one of my questions.
> Which Spark version have you patched? 1.6? Are you planning to
> publish this patch, or is it just for the 2.0 branch?
> 
> I'd gladly help with some benchmarking in my environment.
> 
> Regards,
> -- 
> Maciek Bryński





Re: Spark 2.0 Performance drop

2016-06-29 Thread Maciej Bryński
2016-06-29 23:22 GMT+02:00 Michael Allman:
> I'm sorry I don't have any concrete advice for you, but I hope this helps 
> shed some light on the current support in Spark for projection pushdown.
>
> Michael

Michael,
Thanks for the answer. This resolves one of my questions.
Which Spark version have you patched? 1.6? Are you planning to
publish this patch, or is it just for the 2.0 branch?

I'd gladly help with some benchmarking in my environment.

Regards,
-- 
Maciek Bryński




Re: Spark 2.0 Performance drop

2016-06-29 Thread Michael Allman
Hi Maciej,

In Spark, projection pushdown is currently limited to top-level columns 
(StructFields). VideoAmp has very large parquet-based tables (many billions of 
records accumulated per day) with deeply nested schema (four or five levels), 
and we've spent a considerable amount of time optimizing query performance on 
these tables.
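
To illustrate the limitation, here's a minimal PySpark sketch (the schema,
path, and 2.0-style 'spark' session are hypothetical):

    # hypothetical table: events(id: long, meta: struct<id: long, ts: long>)
    df = spark.read.parquet("/data/events")

    # top-level column: the Parquet reader materializes only 'id'
    df.select("id").explain()

    # nested field: the entire 'meta' struct is read from Parquet and
    # 'meta.id' is extracted afterwards, with no nested-field pruning
    df.select("meta.id").explain()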

We have a patch internally that extends Spark to support projection pushdown 
for arbitrarily nested fields. This has resulted in a *huge* performance 
improvement for many of our queries, like 10x to 100x in some cases.

I'm still putting the finishing touches on our port of this patch to Spark 
master and 2.0. We haven't done any specific benchmarking between versions, but 
I will do that when our patch is complete. We hope to contribute this 
functionality to the Spark project at some point in the near future, but it is 
not ready yet.

I'm sorry I don't have any concrete advice for you, but I hope this helps shed 
some light on the current support in Spark for projection pushdown.

Michael

> On Jun 29, 2016, at 1:48 PM, Maciej Bryński wrote:
> 
> Hi,
> Has anyone measured the performance of Spark 2.0 vs. Spark 1.6?
> 
> I ran some tests on a Parquet file with many nested columns (about 30 GB
> in 400 partitions), and Spark 2.0 is sometimes 2x slower.
> 
> I tested the following queries:
> 1) select count(*) where id > some_id
> This query gets predicate pushdown (PPD), and performance is similar
> (about 1 sec).
> 
> 2) select count(*) where nested_column.id > some_id
> Spark 1.6 -> 1.6 min
> Spark 2.0 -> 2.1 min
> Is it normal that neither version did PPD?
> 
> 3) Spark from Python:
> df.where('id > some_id').rdd.flatMap(lambda r: [r.id] if not r.id % 10 else []).collect()
> Spark 1.6 -> 2.3 min
> Spark 2.0 -> 4.6 min (2x slower)
> 
> I used BasicProfiler for this task, and the cumulative time was:
> Spark 1.6 - 4300 sec
> Spark 2.0 - 5800 sec
> 
> Should I expect such a drop in performance?
> 
> BTW: why did DataFrame lose its map and flatMap methods in Spark 2.0?
> 
> I don't know how to prepare sample data that shows the problem.
> Any ideas? Or public data with many nested columns?
> 
> I'd like to create a JIRA for it, but the Apache server is down at the moment.
> 
> Regards,
> -- 
> Maciek Bryński





Spark 2.0 Performance drop

2016-06-29 Thread Maciej Bryński
Hi,
Has anyone measured the performance of Spark 2.0 vs. Spark 1.6?

I ran some tests on a Parquet file with many nested columns (about 30 GB
in 400 partitions), and Spark 2.0 is sometimes 2x slower.

I tested the following queries:
1) select count(*) where id > some_id
This query gets predicate pushdown (PPD), and performance is similar
(about 1 sec).

2) select count(*) where nested_column.id > some_id
Spark 1.6 -> 1.6 min
Spark 2.0 -> 2.1 min
Is it normal that neither version did PPD?

3) Spark from Python:
df.where('id > some_id').rdd.flatMap(lambda r: [r.id] if not r.id % 10 else []).collect()
Spark 1.6 -> 2.3 min
Spark 2.0 -> 4.6 min (2x slower)

I used BasicProfiler for this task, and the cumulative time was:
Spark 1.6 - 4300 sec
Spark 2.0 - 5800 sec
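
In code, the three tests look roughly like this (the path and threshold are
placeholders, shown with the 2.0-style 'spark' session):

    df = spark.read.parquet("/data/nested")  # ~30 GB, 400 partitions, nested schema

    # 1) top-level column: predicate pushdown works in both versions
    df.where("id > 1000000").count()

    # 2) nested column: neither version pushes the predicate down to Parquet
    df.where("nested_column.id > 1000000").count()

    # 3) DataFrame-to-RDD round trip through Python
    df.where("id > 1000000").rdd.flatMap(
        lambda r: [r.id] if not r.id % 10 else []).collect()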

Should I expect such a drop in performance?

BTW: why did DataFrame lose its map and flatMap methods in Spark 2.0?
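As far as I can tell, the workaround in 2.0 is to go through the underlying
RDD, e.g.:

    # the Python DataFrame no longer exposes map/flatMap in 2.0;
    # dropping to the RDD appears to be the replacement
    df.rdd.map(lambda r: r.id).collect()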

I don't know how to prepare sample data that shows the problem.
Any ideas? Or public data with many nested columns?

I'd like to create a JIRA for it, but the Apache server is down at the moment.

Regards,
-- 
Maciek Bryński




Re: Bitmap Indexing to increase OLAP query performance

2016-06-29 Thread Jörn Franke

Is it the traditional bitmap indexing? I would not recommend it for big data. 
You could use bloom filters and min/max indexes in-memory, which look more 
appropriate. If you want to use bitmap indexes, then you would have to do it 
as you say. However, bitmap indexes may consume a lot of memory, so I am not 
sure that simply caching them in-memory is desirable.
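
A minimal sketch of the min/max idea (the column name, the file list, and the
'spark' session are all assumptions):

    from pyspark.sql import functions as F

    # build a small in-memory min/max "index", one entry per partition file;
    # partition_paths is assumed to list the Parquet files of the table
    minmax = {}
    for path in partition_paths:
        row = spark.read.parquet(path).agg(F.min("id"), F.max("id")).first()
        minmax[path] = (row[0], row[1])

    def candidate_paths(lower_bound):
        # for a predicate "id > lower_bound", skip every file whose
        # maximum id cannot possibly satisfy it
        return [p for p, (_, mx) in minmax.items() if mx > lower_bound]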

> On 29 Jun 2016, at 19:49, Nishadi Kirielle wrote:
> 
> Hi All,
> 
> I am a CSE undergraduate, and for our final year project we are expecting 
> to construct a cluster-based, bit-oriented analytic platform (storage engine) 
> to provide fast query performance for OLAP, using novel 
> bitmap indexing techniques when and where appropriate. 
> 
> For that we are expecting to use Spark SQL. We will need to implement a way 
> to cache the bitmap indexes and incorporate the use of bitmap indexing at 
> the Catalyst optimizer level where possible.
> 
> I would highly appreciate your feedback regarding the proposed approach.
> 
> Thank you & Regards
> 
> Nishadi Kirielle
> Department of Computer Science and Engineering
> University of Moratuwa
> Sri Lanka 




Bitmap Indexing to increase OLAP query performance

2016-06-29 Thread Nishadi Kirielle
Hi All,

I am a CSE undergraduate, and for our final year project we are
expecting to construct a cluster-based, bit-oriented analytic platform
(storage engine) to provide fast query performance for OLAP, using
novel bitmap indexing techniques when and where appropriate.

For that we are expecting to use Spark SQL. We will need to implement a way
to cache the bitmap indexes and incorporate the use of bitmap indexing at
the Catalyst optimizer level where possible.
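
To make the idea concrete, here is a toy bitmap index over a single
low-cardinality column (plain Python, purely illustrative, not the proposed
Spark integration):

    from collections import defaultdict

    def build_bitmap_index(values):
        # map each distinct value to a bitmap whose bit i is set
        # exactly when row i holds that value
        index = defaultdict(int)
        for row_id, v in enumerate(values):
            index[v] |= 1 << row_id
        return index

    index = build_bitmap_index(["US", "LK", "US", "DE", "LK"])

    # "country IN ('US', 'LK')" becomes a bitwise OR of two bitmaps
    matches = index["US"] | index["LK"]
    row_ids = [i for i in range(5) if (matches >> i) & 1]   # -> [0, 1, 2, 4]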

I would highly appreciate your feedback regarding the proposed approach.

Thank you & Regards

Nishadi Kirielle
Department of Computer Science and Engineering
University of Moratuwa
Sri Lanka

