On Saturday 26 July 2008 00:53:48 Joydeep Sen Sarma wrote:
> Just as an aside - there is probably a general perception that streaming
> is really slow (at least I had it).
>
> The last I did some profiling (in 0.15) - the primary overheads from
> streaming came from the scripting language (python is ssslllloooow). For

Beg pardon, but it's a question of good design. And yes, in some cases, a 
microscopically tiny amount of C code (in our case Pyrex code to parse log lines).

E.g. our driver script that parses the log lines is only 4 times slower than cat, 
which IMHO is good value, considering the amount of work it does to parse a 
custom CLF logfile.
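To make that concrete, here is a rough sketch of what such a streaming mapper looks like. This is hypothetical illustration code, not our actual driver (which does the hot parsing loop in Pyrex); the key/value choice in the emit is made up for the example.

```python
import sys
import re

# Sketch of a CLF-parsing Hadoop streaming mapper. The regex covers
# the standard Common Log Format fields; real-world logs contain
# garbage lines, so unparseable input is skipped rather than fatal.
CLF = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
)

def parse_line(line):
    """Return (host, status, size) from one CLF line, or None."""
    m = CLF.match(line)
    if m is None:
        return None
    size = m.group('size')
    return (m.group('host'), m.group('status'),
            0 if size == '-' else int(size))

def run(stream, out):
    for line in stream:
        rec = parse_line(line)
        if rec is None:
            continue  # skip garbage lines instead of crashing the task
        host, status, size = rec
        # Emit key<TAB>value, the format the streaming shuffle expects.
        out.write('%s\t%d\n' % (status, size))
```

A real mapper would simply call run(sys.stdin, sys.stdout) at module scope and be passed to Hadoop via -mapper.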

C/C++ is not that cool, nor that fast, when you need to deal with arbitrarily 
long lines. That might not be the case for CLF files, but you wouldn't 
believe what kind of garbage I've seen ;)

In practice, especially on the long-line processing job, Python easily beats 
Hadoop's Java code into submission. Not that I'm happy about that, because it 
makes some things quite a bit more complicated for me.

> an insanely fast script (bin/cat), I saw significant overheads in java
> function/data path that drowned out streaming overheads by huge margin
> (lot of those overheads have been fixed in recent versions - thanks to
> the hadoop team).
>
> Writing a c/c++ streaming program is pretty good way of getting good
> performance (and some performance sensitive apps in our environment
> ended up doing just that).

Well, in our environment, the architectural issues are of much higher 
relevance: e.g. how to access data from the database that is needed to process 
the input, but which is too huge to copy to each node or to include in the 
input somehow. Refactoring the application code to hide more of the latency of 
accessing that data can easily speed up the application by an order of magnitude.
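As a sketch of what I mean by hiding that latency: instead of one database round trip per record, batch the lookups and cache the results. The fetch_many function and batch size below are hypothetical placeholders for whatever the application actually uses (e.g. one IN-list query per batch).

```python
# Hypothetical sketch: amortize database round-trip latency by
# batching key lookups and memoizing results across records.
class BatchedLookup(object):
    def __init__(self, fetch_many, batch_size=500):
        # fetch_many(keys) -> {key: row}; supplied by the application,
        # e.g. a single SELECT ... WHERE id IN (...) per batch.
        self.fetch_many = fetch_many
        self.batch_size = batch_size
        self.cache = {}

    def get_all(self, keys):
        """Return {key: row} for all known keys, fetching the
        uncached ones in batches instead of one by one."""
        missing = [k for k in set(keys) if k not in self.cache]
        for i in range(0, len(missing), self.batch_size):
            batch = missing[i:i + self.batch_size]
            self.cache.update(self.fetch_many(batch))
        return {k: self.cache[k] for k in keys if k in self.cache}
```

The point is not this particular class but the shape of the refactoring: the per-record code asks the wrapper, and the wrapper decides when the network is actually hit.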

Andreas
