Re: Crunch Experiences

Josh Wills Sun, 17 Jun 2012 22:00:38 -0700

Hey Stefan,

Reply inlined.


On Sat, Jun 16, 2012 at 6:03 AM,  <[email protected]> wrote:
> Hey Josh,
>
> @TupleWritables: Yes I saw that. Didn't think too much about the performance
> implications though. Writing all the classinfo is unnecessary because it is
> statically known, and an identifier will help with that. But using
> reflection will use much more CPU time than if you can avoid it, as proven
> by my little experiment, and it will not help with that.
>
> Kryo uses a neat trick for cutting down serialization size. By forcing the
> user to register all classes in well defined sequence, it can use the index
> for a class in that sequence to describe it. Not as good as describing a
> whole schema, but generally applicable for serialization.
>
> Using avro internally sounds like a good idea. I don't know it too well
> though. So you would give up the TupleWritable's completely, replace them
> with AvroWritables, and use the mentioned commit to use Writables within
> Avro? What do you mean with "I think there would be some cool stuff we could
> do if we could assume how all shuffle serializations worked, ..."?
> Well the reflection call would stay, but that can be solved independently I
> guess (for example by having an optional factory argument in
> BytesToWritableMapFn).

Re: cool stuff, I suspect that having Avro for all of the intermediate
data transfer would dramatically simplify the implementation of MSCR
fusion, which I've been putting off b/c having an abstraction that
would handle fusion for both Writables and Avro makes my head hurt.
I'm going to start a thread on [email protected]
advocating for it.

>
> What still bothers me is that the partitioner for the join, which only has
> to ensure that the bit indicating whether it's left or right is ignored,
> still has to do much too much work. If he could just get the serialized
> bytes for the key, he could compute the hashcode for the byte array
> directly, just ignoring the last byte. The deserialization there is
> unnecessary and is actually, at least in my profiling, what seemed to hurt
> the performance of the join so badly. Maybe some more benchmarking is needed
> though for this.

That doesn't sound right-- isn't the Partitioner call is done at the
end of the map task, before the data is ever serialized?

>
> Guess I should contact Alexy Khravbrov then.
>
> Cheers,
> Stefan
>
> PS: Sorry, replied directly instead to the mailing list.
>
> Am Freitag, 15. Juni 2012 17:37:06 UTC+2 schrieb Josh Wills:
>>
>> On Fri, Jun 15, 2012 at 2:24 AM,  <[email protected]> wrote:
>> > Hey Josh,
>> >
>> > I actually first found pangool and from there concluded that Crunch is
>> > worth
>> > a try. So you are saying that the TupleWritable's are in general quite
>> > slow,
>> > and the performance of Crunch is in that case not comparable to pure
>> > Hadoop?
>> > If it really makes a difference we should look into generating
>> > TupleWritable
>> > subclasses for our project (if there will be another round of
>> > benchmarks).
>>
>> The impl of TupleWritable in Crunch is too conservative, in that it
>> passes along the names of the Writable classes that it serialized in
>> the tuple along with the actual data, which leads to a pretty massive
>> blowup in the amount of data that gets passed around. My initial
>> thought in doing this was that I wanted to be sure that the Crunch
>> output was always readable by anything-- e.g., you didn't have to use
>> Crunch in order to read the data back, since all of the information on
>> what the data contained was there. This is essentially what Avro gets
>> you, although the Avro file format is smarter about just putting the
>> schema at the head of the file and not copying it over and over again,
>> which is why we generally recommend Avro + Crunch. The #s in the
>> pangool benchmark were all based on Avro.
>>
>> The tuple-oriented frameworks handle this by having integer
>> identifiers for the different data types are supported, and that's
>> certainly one way to improve the performance here. I've also been
>> tossing around the idea of using Avro for everything internal to the
>> framework (e.g., any data transferred during the shuffle stage), since
>> I just added a way for AvroTypes to support arbitrary writables:
>>
>>
>> https://github.com/cloudera/crunch/commit/224102ac4813fc0e124114026438a2e3884f858b
>>
>> I think there would be some cool stuff we could do if we could assume
>> how all shuffle serializations worked, but I haven't benchmarked it
>> yet or tossed the idea around with the other committers.
>>
>> >
>> > You didn't comment on the the SourceTargetHelper. Do glob patterns now
>> > work
>> > fine as inputs? It just failed in 0.2.4 when I specified lineitems* as
>> > input
>> > in local mode (no hdfs), while it worked fine in Scoobi.
>>
>> Sorry I missed that-- I was under the impression it did work, but I'll
>> take a look at it and report back.
>>
>> >
>> > Right now I just need to finish my thesis, so currently I am not
>> > planning to
>> > commit anywhere.
>>
>> It never hurts to ask. I saw your threads on the spark and scoobi
>> mailing lists and enjoyed them-- please keep us posted on what you're
>> up to. Alexy Khravbrov at Klout was also interested in an abstraction
>> layer for Scala MapReduce vs. Spark-- did you talk to him?
>>
>> >
>> > Cheers,
>> > Stefan
>> >
>> > Am Donnerstag, 14. Juni 2012 16:42:43 UTC+2 schrieb Josh Wills:
>> >>
>> >> Hey Stefan,
>> >>
>> >> Thanks for your email. Re: join performance, I agree with you that the
>> >> current implementation that uses Writables is pretty terrible, and
>> >> I've been thinking about good ways to do away with it for awhile now.
>> >> The good news is that the Avro-based implementation of PTypes is
>> >> pretty fast, and that's what most people end up using in practice. For
>> >> example, when the Pangool guys were benchmarking frameworks, they used
>> >> the Avro PTypes, and it usually runs pretty close to native MR:
>> >> http://pangool.net/benchmark.html
>> >>
>> >> Re: the gist, I will take a look at it. The good news about the pull
>> >> request is that we just submitted Crunch to the Apache Incubator and
>> >> are in the process of moving over all of the infrastructure. It would
>> >> be great to have your patches in when we're done-- we're always on the
>> >> lookout for people who are interested in becoming committers.
>> >>
>> >> Best,
>> >> Josh
>> >>
>> >> On Thu, Jun 14, 2012 at 6:53 AM,  <[email protected]> wrote:
>> >> > Hi,
>> >> >
>> >> > I did some more investigations. You set the partitioner correctly,
>> >> > but
>> >> > you
>> >> > do not set the comparator. But actually the comparator might not be
>> >> > needed,
>> >> > because the value with 0 will always come first by the default
>> >> > comparator?
>> >> > If you rely on that, then maybe you should put a comment there as it
>> >> > is
>> >> > not
>> >> > immediately obvious.
>> >> >
>> >> > The performance problem with joins (as shown by profiling) is that
>> >> > Hadoop
>> >> > does not know about the PType's. So to deserialize during sorting the
>> >> > general TupleWritable is used, which is very inefficient as it uses
>> >> > reflection. I added a cache for the Class.forName call  in readFields
>> >> > and it
>> >> > improves performance a lot, but it is not a definitive answer to the
>> >> > problem. Maybe you could add a special writable for TaggedKey's,
>> >> > which
>> >> > knows
>> >> > that one of them is an Integer. To define the partitioning, the
>> >> > actual
>> >> > content does not matter, just that the tagging value is ignored.
>> >> > Scoobi
>> >> > even
>> >> > generates classes for TaggedKeys and TaggedValues, that would most
>> >> > likely be
>> >> > even faster.
>> >> >
>> >> > Here are some numbers for a 1GB join on my laptop, second column has
>> >> > the
>> >> > total amount of seconds.
>> >> > My version of a join, which tags the values by adding a boolean to
>> >> > them.
>> >> > Note that in most cases, where the size of one group does not come
>> >> > close
>> >> > to
>> >> > the available RAM, this is totally fine. Implemented in [1].
>> >> > crunch_4    33.42    13    2.06    26.82    6526368    86%
>> >> > Crunch's original join
>> >> > crunch_4    150.11    13    2.79    143.50    7924496    97%
>> >> > Crunch's join with my caching changes in TupleWritable:
>> >> > crunch_4    69.67    13    2.51    59.99    7965808    89%
>> >> >
>> >> > Too lazy to open a pull request, here is a gist of my changes for the
>> >> > join.
>> >> > https://gist.github.com/2930414
>> >> > These are for 0.2.4, I hope it's still applicable for trunk.
>> >> >
>> >> > Regards,
>> >> > Stefan Ackermann
>> >> >
>> >> > [1]
>> >> >
>> >> >
>> >> > https://github.com/Stivo/Distributed/blob/master/crunch/src/main/scala/ch/epfl/distributed/utils/CrunchUtils.scala#L83
>> >> >
>> >> > Am Donnerstag, 14. Juni 2012 12:24:34 UTC+2 schrieb [email protected]:
>> >> >>
>> >> >> Hi,
>> >> >>
>> >> >> I am writing a Distributed Collections DSL with compiler
>> >> >> optimizations
>> >> >> as
>> >> >> my master thesis. I have added Crunch as a backend, since we were
>> >> >> not
>> >> >> happy
>> >> >> with the performance we were getting from our other backend for
>> >> >> Hadoop.
>> >> >> And
>> >> >> I must say, I am quite impressed by Crunch's performance.
>> >> >> Here is some sample code our DSL generates, in case you are
>> >> >> interested:
>> >> >>
>> >> >>
>> >> >>
>> >> >> https://github.com/Stivo/Distributed/blob/bigdata2012/crunch/src/main/scala/generated/v4/
>> >> >>
>> >> >> We are impressed with the performance, except for the joins. In
>> >> >> local
>> >> >> benchmarks, joins performed 100x worse than my own join
>> >> >> implementation
>> >> >> (no
>> >> >> joke, 100x). Also it seems like the current implementation has some
>> >> >> bugs.
>> >> >> You are setting the partitioner class to the comparator in
>> >> >>
>> >> >>
>> >> >>
>> >> >> https://github.com/cloudera/crunch/blob/master/src/main/java/com/cloudera/crunch/lib/Join.java#L144
>> >> >> . Also you are not setting the partitioner class. Seems to me the
>> >> >> code
>> >> >> is
>> >> >> all there, it's just not linked up correctly.
>> >> >> For the performance, maybe you should have different versions of the
>> >> >> comparator / partitioner for different key types instead of writing
>> >> >> the
>> >> >> key
>> >> >> just to compare it. String comparison for example only checks
>> >> >> characters
>> >> >> until it finds a difference.
>> >> >>
>> >> >> I encountered lots of problems with SourceTargetHelper in 0.2.4. I
>> >> >> know
>> >> >> you changed it since then, but I think I also had troubles with the
>> >> >> new
>> >> >> version. Does it support using glob patterns or directories as
>> >> >> input?
>> >> >> In any
>> >> >> case, it should not prevent the program from running imho. I spent
>> >> >> quite a
>> >> >> while just trying to work around that bug.
>> >> >>
>> >> >> Regards,
>> >> >> Stefan Ackermann
>> >> >
>> >> >
>> >>
>> >>
>> >> --
>> >> Director of Data Science
>> >> Cloudera
>> >> Twitter: @josh_wills
>>
>>
>>
>> --
>> Director of Data Science
>> Cloudera
>> Twitter: @josh_wills



-- 
Director of Data Science
Cloudera
Twitter: @josh_wills

Re: Crunch Experiences

Reply via email to