Good points raised. Some comments.

Re: #1
It seems like there is a misunderstanding of the purpose of the Daytona
Gray benchmark. The purpose of the benchmark is to see how fast you can
sort 100 TB of data (technically, your sort rate during the operation)
using *any* hardware or software config, within the common rules laid
out at http://sortbenchmark.org. Though people will naturally want to
compare one benchmarked system to another, the Gray benchmark does not
control the hardware to make such a comparison useful. So you're right
that it's apples to oranges to compare Databricks's Spark run to
Yahoo's Hadoop run in this type of benchmark, but that's just inherent
to the definition of the benchmark. I wouldn't fault Databricks or
Yahoo for this.

That said, it's nice that Databricks went with a public cloud to do
this benchmark, which makes it more likely that future benchmarks done
on the same cloud can be compared meaningfully. The same can't be said
of Yahoo's benchmark, for example, which was done in a private
datacenter.

Re: #2

EC2 is a good place to run a reproducible benchmark since it's publicly
accessible and the instance types are well defined. If you had trouble
reproducing the AMPLab benchmark there, I would raise that with the
AMPLab team. I'd assume they would be interested in correcting any
problems with reproducing it, as it definitely detracts from the value
of the benchmark.

Nick

On Saturday, November 1, 2014, RJ Nowling <rnowl...@gmail.com> wrote:

> Two thoughts here:
>
> 1. The real flaw with the sort benchmark was that Hadoop wasn't run
> on the same hardware. Given the advances in networking (availability
> of 10Gb Ethernet) and disks (SSDs) since the Hadoop benchmarks it was
> compared to, it's an apples-to-oranges comparison. Without that, it
> doesn't tell me whether the improvement is due to Spark or just
> hardware.
>
> To me, that's the biggest flaw -- not the reproducibility of it. As
> you say, most people won't have the financial means to access those
> resources to reproduce it.
>
> And that's the same sort of flaw every other marketing benchmark has
> -- apples-to-oranges comparisons.
>
> 2. The BDD benchmark is hard to run outside of EC2, and I and other
> users were not able to access all of the data via S3. I could
> reproduce some of the data using HiBench, but not the web corpus
> subsample. As a result, for all the hard work put into documenting
> it, it's still hard to reproduce :(
>
> On Friday, October 31, 2014, Nicholas Chammas
> <nicholas.cham...@gmail.com> wrote:
>
>> I believe that benchmark has a pending certification on it. See
>> http://sortbenchmark.org under "Process".
>>
>> It's true they did not share enough details on the blog for readers
>> to reproduce the benchmark, but they will have to share enough with
>> the committee behind the benchmark in order to be certified. Given
>> that this is a benchmark not many people will be able to reproduce
>> due to size and complexity, I don't see it as a big negative that
>> the details are not laid out, as long as there is independent
>> certification from a third party.
>>
>> From what I've seen so far, the best big data benchmark anywhere is
>> this: https://amplab.cs.berkeley.edu/benchmark/
>>
>> It has all the details you'd expect, including hosted datasets, to
>> allow anyone to reproduce the full benchmark, covering a number of
>> systems. I look forward to the next update to that benchmark (a lot
>> has changed since Feb).
>> And from what I can tell, it's produced by the same people behind
>> Spark (Patrick being among them).
>>
>> So I disagree that the Spark community "hasn't been any better" in
>> this regard.
>>
>> Nick
>>
>> On Friday, October 31, 2014, Steve Nunez <snu...@hortonworks.com>
>> wrote:
>>
>>> To be fair, we (Spark community) haven't been any better, for
>>> example this benchmark:
>>>
>>> https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
>>>
>>> For which no details or code have been released to allow others to
>>> reproduce it. I would encourage anyone doing a Spark benchmark in
>>> the future to avoid the stigma of vendor-reported benchmarks and
>>> publish enough information and code to let others repeat the
>>> exercise easily.
>>>
>>> - Steve
>>>
>>> On 10/31/14, 11:30, "Nicholas Chammas" <nicholas.cham...@gmail.com>
>>> wrote:
>>>
>>>> Thanks for the response, Patrick.
>>>>
>>>> I guess the key takeaways are 1) the tuning/config details are
>>>> everything (they're not laid out here), 2) the benchmark should be
>>>> reproducible (it's not), and 3) reach out to the relevant devs
>>>> before publishing (didn't happen).
>>>>
>>>> Probably key takeaways for any kind of benchmark, really...
>>>>
>>>> Nick
>>>>
>>>> On Friday, October 31, 2014, Patrick Wendell <pwend...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hey Nick,
>>>>>
>>>>> Unfortunately Citus Data didn't contact any of the Spark or Spark
>>>>> SQL developers when running this. It is really easy to make one
>>>>> system look better than others when you are running a benchmark
>>>>> yourself, because tuning and sizing can lead to a 10X performance
>>>>> improvement. This benchmark doesn't share the mechanism in a
>>>>> reproducible way.
>>>>>
>>>>> There are a bunch of things that aren't clear here:
>>>>>
>>>>> 1. Spark SQL has optimized Parquet features; were these turned
>>>>> on?
>>>>> 2. It doesn't mention computing statistics in Spark SQL, but it
>>>>> does this for Impala and Parquet. Statistics allow Spark SQL to
>>>>> broadcast small tables, which can make a 10X difference in TPC-H.
>>>>> 3. For data larger than memory, Spark SQL often performs better
>>>>> if you don't call "cache"; did they try this?
>>>>>
>>>>> Basically, a self-reported marketing benchmark like this that
>>>>> *shocker* concludes this vendor's solution is the best is not
>>>>> particularly useful.
>>>>>
>>>>> If Citus Data wants to run a credible benchmark, I'd invite them
>>>>> to directly involve Spark SQL developers in the future. Until
>>>>> then, I wouldn't give much credence to this or any other similar
>>>>> vendor benchmark.
>>>>>
>>>>> - Patrick
>>>>>
>>>>> On Fri, Oct 31, 2014 at 10:38 AM, Nicholas Chammas
>>>>> <nicholas.cham...@gmail.com> wrote:
>>>>>> I know we don't want to be jumping at every benchmark someone
>>>>>> posts out there, but this one surprised me:
>>>>>>
>>>>>> http://www.citusdata.com/blog/86-making-postgresql-scale-hadoop-style
>>>>>>
>>>>>> This benchmark has Spark SQL failing to complete several queries
>>>>>> in the TPC-H benchmark. I don't understand much about the
>>>>>> details of performing benchmarks, but this was surprising.
>>>>>>
>>>>>> Are these results expected?
>>>>>>
>>>>>> Related HN discussion here:
>>>>>> https://news.ycombinator.com/item?id=8539678
>>>>>>
>>>>>> Nick
>
> --
> em rnowl...@gmail.com
> c 954.496.2314
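
P.S. To make Patrick's three tuning points above more concrete, here is
a rough sketch of what they might look like in code. This is only
illustrative, not what Citus Data actually ran: it assumes Spark
1.1-era APIs, a HiveContext with the TPC-H tables (e.g., lineitem,
nation) already registered in the metastore, and config names whose
availability and defaults vary by Spark version
(spark.sql.parquet.filterPushdown in particular only exists in newer
releases).

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    object TpchTuningSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("tpch-tuning-sketch"))
        val hc = new HiveContext(sc)

        // 1. Turn on Parquet-specific optimizations. Flag names vary
        //    by Spark version; filterPushdown is one example.
        hc.setConf("spark.sql.parquet.filterPushdown", "true")

        // 2. Compute table statistics so the planner can broadcast
        //    small tables in joins (the 10X TPC-H difference Patrick
        //    mentions).
        hc.sql("ANALYZE TABLE nation COMPUTE STATISTICS noscan")
        hc.setConf("spark.sql.autoBroadcastJoinThreshold",
          (10 * 1024 * 1024).toString)

        // 3. For data larger than memory, query the tables directly
        //    instead of calling cacheTable()/cache() first.
        val q = hc.sql(
          "SELECT l_returnflag, COUNT(*) FROM lineitem " +
          "GROUP BY l_returnflag")
        q.collect().foreach(println)

        sc.stop()
      }
    }

Whether settings like these were applied is exactly the kind of detail
a credible benchmark report needs to spell out.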