On Saturday 26 July 2008 00:53:48 Joydeep Sen Sarma wrote:
> Just as an aside - there is probably a general perception that streaming
> is really slow (at least I had it).
>
> The last I did some profiling (in 0.15) - the primary overheads from
> streaming came from the scripting language (python is ssslllloooow). For
Beg pardon, it's a question of good design. And yes, in some cases, a
microscopically tiny amount of C code (in our case Pyrex code to parse
log lines). E.g. our driver script that parses the log lines is 4 times
slower than cat, which, IMHO, considering the amount of work it does to
parse a custom CLF logfile, is good value. C/C++ is not that cool, nor
that fast, when you need to deal with arbitrarily long lines. That might
not be the case for CLF files, but you wouldn't believe what kind of
garbage I've seen ;) In practice, especially on the long-line processing
job, Python easily beats Hadoop's Java code into submission. Not that
I'm happy about that, because it makes some things quite a bit more
complicated for me.

> an insanely fast script (bin/cat), I saw significant overheads in java
> function/data path that drowned out streaming overheads by huge margin
> (lot of those overheads have been fixed in recent versions - thanks to
> the hadoop team).
>
> Writing a c/c++ streaming program is pretty good way of getting good
> performance (and some performance sensitive apps in our environment
> ended up doing just that).

Well, in our environment, architectural issues are of far higher
relevance - e.g. how to access data from a database that is needed to
process the input, but which is too huge to just copy to each node or to
include somehow in the input. Refactoring the application code to hide
more of the latency of accessing that data can easily speed up the
application by an order of magnitude.

Andreas
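For readers following the thread: a streaming mapper that parses CLF-style log lines might look roughly like the sketch below. This is purely illustrative, not Andreas's actual Pyrex driver; the regex and the per-status counting are assumptions, and a real custom CLF format would need its own field layout.

```python
import re
import sys

# Apache Common Log Format, e.g.:
# 127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /x.gif HTTP/1.0" 200 2326
CLF_RE = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)

def parse_clf(line):
    """Return a dict of CLF fields, or None if the line doesn't parse."""
    m = CLF_RE.match(line)
    return m.groupdict() if m else None

def mapper(stdin=sys.stdin, stdout=sys.stdout):
    # Hadoop Streaming convention: emit "key<TAB>value" per line;
    # here we count requests per HTTP status code, summed by the reducer.
    for line in stdin:
        rec = parse_clf(line)
        if rec is not None:
            stdout.write(rec['status'] + '\t1\n')

if __name__ == '__main__':
    mapper()
```

Run under streaming in the usual way (`-mapper mapper.py`), with e.g. `aggregate` or a trivial summing script as the reducer. Silently skipping unparseable lines, as above, is one way to survive the kind of garbage input the message alludes to.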
