Happy holidays! I'm just a little curious. Are you using Hadoop with J?
> From: [email protected]
> Date: Fri, 26 Dec 2014 05:41:10 -0500
> To: [email protected]
> Subject: [Jchat] hadoop map/reduce from a J user's perspective
>
> Hadoop's use of the "map/reduce" term would suggest, to the J
> programmer, something like:
>
>    v/ u"0
>
> Or maybe:
>
>    v/ u"1
>
> But that's actually not a very good representation of what it's
> doing. This is closer:
>
>    v;.2 ;/:~ <;.2 u;.2
>
> (Oh, and early releases of hadoop did not have the "reduce" step;
> they were just a mechanism to farm jobs out across a bunch of
> different machines.)
>
> More specifically, it's set up to handle a unix pipeline (reading
> from stdin, writing to stdout, implicitly operating on lines
> terminated by ascii character 10), farming the job out across
> multiple machines for the "map" operation, then sorting the result
> and farming that out again across multiple machines for the "reduce"
> operation.
>
> Also, "u" and "v" are expected to be java classes (and you can
> specify which *.jar file contains those classes), though there are
> prebuilt implementations which let you run any arbitrary unix command
> (for example hadoop-streaming.jar).
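>
> For instance - and treat this as a sketch, since the jar's location
> and exact options vary between hadoop releases - a streaming job that
> uses ordinary unix commands for the map and reduce steps might be
> launched like this, with hypothetical HDFS paths for input and
> output:
>
>    hadoop jar hadoop-streaming.jar \
>        -input /user/me/in -output /user/me/out \
>        -mapper /bin/cat -reducer /usr/bin/wc
>
> Here, /bin/cat and /usr/bin/wc play the roles of "u" and "v".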
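>
> And, to connect that back to the J phrasing above, here is a toy J
> session which imitates the map / sort / reduce pipeline on
> LF-terminated lines - a word count with an identity "map" step. (This
> is just a sketch of the shape of the pipeline; hadoop distributes
> each phase across machines, which this obviously does not.)
>
>    in =: 'apple',LF,'pear',LF,'apple',LF
>    ]mapped =: <;._2 in       NB. "map": one box per input line
> +-----+----+-----+
> |apple|pear|apple|
> +-----+----+-----+
>    ]sorted =: /:~ mapped     NB. the sort hadoop does between phases
> +-----+-----+----+
> |apple|apple|pear|
> +-----+-----+----+
>    (~. ; #/.~) sorted        NB. "reduce": distinct lines and counts
> +------------+---+
> |+-----+----+|2 1|
> ||apple|pear||   |
> |+-----+----+|   |
> +------------+---+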
> (Of course, there are a lot of ornate details, most of which usually
> get ignored (having to do with tuning the environment and/or catering
> to somewhat arbitrary constraints of somewhat arbitrary programs).
> Also, there are numerous versions and flavors, all slightly
> incompatible.)
>
> But what if the data you are processing is not "lines" but something
> else? Files? Directories extracted from archives? Etc...?
>
> Well, for that case you should probably set things up so that a
> "line" is a descriptor of work to be done, and have the code being
> run download/process/upload the real work, while stdout becomes lines
> briefly summarizing the work which was done. (This assumes you are
> only "mapping".) Alternatively, if you are "reducing", lines output
> from the mapper would describe the next phase of work to be done (and
> this assumes that sorting is a useful mechanism in this context).
> There are also resource constraints you have to stay within: use too
> much time, too much memory, or too much disk, and a process can get
> shut down. The limits vary, but they are relatively small by
> default - maybe a gigabyte of memory and 10 minutes of processing
> time, for example.
>
> (On the flip side, once you are running code in a hadoop cluster, you
> pretty much have unrestricted access to the machine - as long as you
> stay within your resource constraints - since the machine instance
> gets discarded when you are done with it.)
>
> Finally, speed... when running a hadoop job, it's several seconds to
> set up a small cluster and minutes to set up a larger one. I've not
> pushed the boundaries enough to know what a really large job looks
> like. The good documentation seems to be mostly for early versions of
> hadoop (capable of handling only 4000 machines in a cluster), and I
> have noticed that some features documented for early versions are no
> longer present [or, at least, are syntactically different] in more
> recent versions.
>
> But I guess that's pretty normal for all software projects. It makes
> them a pain to use, though, when old documentation just plain doesn't
> work and you can't find the relevant newer documentation. (And,
> sadly, search engines seem to be getting worse and worse at locating
> relevant documentation, as opposed to auto-generated spam... though
> perhaps that also has to do with my idea of relevance?)
>
> Anyways, enough musing for now, time to get back to debugging...
>
> Thanks,
>
> --
> Raul
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
