Happy holidays! I'm just a little curious. Are you using Hadoop with J?
> From: [email protected]
> Date: Fri, 26 Dec 2014 05:41:10 -0500
> To: [email protected]
> Subject: [Jchat] hadoop map/reduce from a J user's perspective
>
> Hadoop's use of the "map/reduce" term would suggest, to the J
> programmer, something like:
>
>    v/ u"0
>
> Or maybe:
>
>    v/ u"1
>
> But that's actually not a very good representation of what it's
> doing. This is closer:
>
>    v;.2 ;/:~ <;.2 u;.2
>
> (Oh, and early releases of hadoop did not have the "reduce" step;
> they were just a mechanism to farm jobs out across a bunch of
> different machines.)
>
> More specifically, it's set up to handle a unix pipeline (reading
> from stdin, writing to stdout, implicitly operating on lines
> terminated by ascii character 10), farming the job out across
> multiple machines for the "map" operation, then sorting the result
> and farming that out again across multiple machines for the "reduce"
> operation.
>
> Also, "u" and "v" are expected to be java classes (and you can
> specify which *.jar file contains those classes), though there are
> prebuilt implementations which let you run any arbitrary unix command
> (for example hadoop-streaming.jar).
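>
> For instance - and treat this as a sketch, since the jar's location
> and exact options vary between hadoop releases - a streaming job that
> uses ordinary unix commands for the map and reduce steps might be
> launched like this, with hypothetical HDFS paths for input and
> output:
>
>    hadoop jar hadoop-streaming.jar \
>        -input /user/me/in -output /user/me/out \
>        -mapper /bin/cat -reducer /usr/bin/wc
>
> Here, /bin/cat and /usr/bin/wc play the roles of "u" and "v".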
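>
> And, to connect that back to the J phrasing above, here is a toy J
> session which imitates the map / sort / reduce pipeline on
> LF-terminated lines - a word count with an identity "map" step. (This
> is just a sketch of the shape of the pipeline; hadoop distributes
> each phase across machines, which this obviously does not.)
>
>    in =: 'apple',LF,'pear',LF,'apple',LF
>    ]mapped =: <;._2 in       NB. "map": one box per input line
> +-----+----+-----+
> |apple|pear|apple|
> +-----+----+-----+
>    ]sorted =: /:~ mapped     NB. the sort hadoop does between phases
> +-----+-----+----+
> |apple|apple|pear|
> +-----+-----+----+
>    (~. ; #/.~) sorted        NB. "reduce": distinct lines and counts
> +------------+---+
> |+-----+----+|2 1|
> ||apple|pear||   |
> |+-----+----+|   |
> +------------+---+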
> (Of course, there are a lot of ornate details, most of which usually
> get ignored (having to do with tuning the environment and/or catering
> to somewhat arbitrary constraints of somewhat arbitrary programs).
> Also, there are numerous versions and flavors, all slightly
> incompatible.)
>
> But what if the data you are processing is not "lines" but something
> else? Files? Directories extracted from archives? Etc...?
>
> Well, for that case you should probably set things up so that a
> "line" is a descriptor of work to be done, and have the code being
> run download/process/upload the real work, while stdout becomes lines
> briefly summarizing the work which was done. (This assumes you are
> only "mapping".) Alternatively, if you are "reducing", lines output
> from the mapper would describe the next phase of work to be done (and
> this assumes that sorting is a useful mechanism in this context).
> There are also resource constraints you have to stay within: use too
> much time, too much memory, or too much disk, and a process can get
> shut down. The limits vary, but they are relatively small by
> default - maybe a gigabyte of memory and 10 minutes of processing
> time, for example.
>
> (On the flip side, once you are running code in a hadoop cluster, you
> pretty much have unrestricted access to the machine - as long as you
> stay within your resource constraints - since the machine instance
> gets discarded when you are done with it.)
>
> Finally, speed... when running a hadoop job, it's several seconds to
> set up a small cluster and minutes to set up a larger one. I've not
> pushed the boundaries enough to know what a really large job looks
> like. The good documentation seems to be mostly for early versions of
> hadoop (capable of handling only 4000 machines in a cluster), and I
> have noticed that some features documented for early versions are no
> longer present [or, at least, are syntactically different] in more
> recent versions.
>
> But I guess that's pretty normal for all software projects. It makes
> them a pain to use, though, when old documentation just plain doesn't
> work and you can't find the relevant newer documentation. (And,
> sadly, search engines seem to be getting worse and worse at locating
> relevant documentation, as opposed to auto-generated spam... though
> perhaps that also has to do with my idea of relevance?)
>
> Anyways, enough musing for now, time to get back to debugging...
>
> Thanks,
>
> --
> Raul
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
