In reference to https://issues.apache.org/jira/browse/SPARK-16320, the code 
path for reading data from Parquet files has been refactored extensively. The 
fact that Maciej was testing performance on a table with 400 partitions makes 
me wonder if my PR for https://issues.apache.org/jira/browse/SPARK-15968 will 
make a difference for repeated queries on partitioned tables. That PR was 
merged into master and backported to 2.0. The commit short hash is d5d2457.
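
One way to sanity-check that a given build actually contains it, assuming 
I'm remembering the build-info change that went into 2.0 correctly, is to 
print the revision the binaries were built from:

    // In spark-shell: the git revision and branch the build was cut from.
    // If the revision is d5d2457 or a later commit on branch-2.0, the fix
    // is in the build. (These vals are populated from the version-info
    // properties generated for 2.0 builds.)
    println(org.apache.spark.SPARK_REVISION)
    println(org.apache.spark.SPARK_BRANCH)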

Maciej, can you rerun your test on your original dataset with a build of 
Spark 2.0 that includes that commit? Please run it more than once, and when 
you compare first-query performance, start with a fresh spark-shell or 
spark-sql session for each run so caching is not a factor.
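
To be concrete, here's roughly the harness I have in mind. It's just a 
sketch, with the table name "events" and partition column "part" standing in 
for whatever your real schema uses:

    // Paste into a *fresh* spark-shell for each trial so cached relations
    // and file listings from a prior run can't skew the numbers.
    def time[T](label: String)(body: => T): T = {
      val start = System.nanoTime()
      val result = body
      println(f"$label: ${(System.nanoTime() - start) / 1e9}%.3f s")
      result
    }

    // "events" and "part" are placeholders. The first run pays any partition
    // discovery/listing cost; later runs should be faster if the cached
    // relation is being reused correctly.
    time("run 1") { spark.sql("SELECT COUNT(*) FROM events WHERE part = 1").collect() }
    time("run 2") { spark.sql("SELECT COUNT(*) FROM events WHERE part = 1").collect() }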

As for the reports that initial query performance on a partitioned table, and 
query performance on an unpartitioned table, are inferior, I can do a quick 
test to see if I can reproduce either issue on our end. Assuming there is a 
perf regression, I may be able to spend some time debugging today. I've spent 
a substantial amount of time debugging and optimizing Parquet table query 
performance in Spark, and we've been using 2.0 for at least a month now. I'm 
not sure I'll have time to dig that deep, though.
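
If anyone else wants to try the same comparison, something like the following 
is where I'd start. It's only a sketch; the row count, partition count, and 
paths are made up for illustration:

    import org.apache.spark.sql.functions.col

    // Synthetic data with ~400 partitions, mirroring Maciej's table layout.
    val df = spark.range(0L, 100000000L)
      .withColumn("part", (col("id") % 400).cast("int"))

    // Write one partitioned copy and one unpartitioned copy so the
    // first-query latency of each can be compared.
    df.write.partitionBy("part").parquet("/tmp/perf/partitioned")
    df.write.parquet("/tmp/perf/unpartitioned")

    // In a new spark-shell, time the first query against each copy.
    spark.read.parquet("/tmp/perf/partitioned").where("part = 17").count()
    spark.read.parquet("/tmp/perf/unpartitioned").count()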

Michael


> On Jul 20, 2016, at 9:23 AM, Marcin Tustin <mtus...@handybook.com> wrote:
> 
> I refer to Maciej Bryński's (mac...@brynski.pl) emails of 29 and 30 June 
> 2016 to this list. He said that his benchmarking suggested that Spark 2.0 
> was slower than 1.6.
> 
> I'm wondering if that was ever investigated and, if so, whether the 
> performance is back up.
> 
> On Wed, Jul 20, 2016 at 12:18 PM, Michael Allman <mich...@videoamp.com> wrote:
> Marcin,
> 
> I'm not sure what you're referring to. Can you be more specific?
> 
> Cheers,
> 
> Michael
> 
>> On Jul 20, 2016, at 9:10 AM, Marcin Tustin <mtus...@handybook.com> wrote:
>> 
>> Whatever happened with the query regarding benchmarks? Is that resolved?
>> 
>> On Tue, Jul 19, 2016 at 10:35 PM, Reynold Xin <r...@databricks.com> wrote:
>> Please vote on releasing the following candidate as Apache Spark version 
>> 2.0.0. The vote is open until Friday, July 22, 2016 at 20:00 PDT and passes 
>> if a majority of at least 3 +1 PMC votes are cast.
>> 
>> [ ] +1 Release this package as Apache Spark 2.0.0
>> [ ] -1 Do not release this package because ...
>> 
>> 
>> The tag to be voted on is v2.0.0-rc5 
>> (13650fc58e1fcf2cf2a26ba11c819185ae1acc1f).
>> 
>> This release candidate resolves ~2500 issues: 
>> https://s.apache.org/spark-2.0.0-jira
>> 
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc5-bin/
>> 
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>> 
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1195/
>> 
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc5-docs/
>> 
>> 
>> =================================
>> How can I help test this release?
>> =================================
>> If you are a Spark user, you can help us test this release by taking an 
>> existing Spark workload and running it on this release candidate, then 
>> reporting any regressions from 1.x.
>> 
>> ==========================================
>> What justifies a -1 vote for this release?
>> ==========================================
>> Critical bugs impacting major functionality.
>> 
>> Bugs already present in 1.x, missing features, or bugs related to new 
>> features will not necessarily block this release. Note that, historically, 
>> Spark documentation has been published on the website separately from the 
>> main release, so we do not need to block the release due to documentation 
>> errors either.
>> 
