As far as I understand, the problem isn't adding multiple inputs; you
can do that exactly as the documentation you linked shows. The problem
(which is what we're trying to solve in MAHOUT-537) is how to tell,
within the Mapper/Reducer itself, which input path the current data
came from; there's no way to know for sure where the row you're
currently operating on originated, and that information is essential
to, say, matrix-matrix multiplication.
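(For what it's worth: when a job uses a plain FileInputFormat with
several input paths, rather than MultipleInputs, the originating file
can usually be recovered from the input split inside the mapper. A
rough sketch against the new 0.20.2 mapreduce API, with a made-up
class name; note the cast fails once the split is wrapped, as the
old-API MultipleInputs does with its package-private TaggedInputSplit.)

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class SourceAwareMapper
    extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Recover the file this record came from and tag the output with it.
    Path source = ((FileSplit) context.getInputSplit()).getPath();
    context.write(new Text(source.getName()), value);
  }
}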
However, if all you want is for your algorithm to take multiple input
files and treat them more or less as "one big file", then I don't
think this approach will give you any problems; a quick sketch of that
case is below.
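(A minimal driver sketch for the "one big file" case, against the new
0.20.2 API; the paths and class name are placeholders. Each call to
addInputPath appends another input, and all of them feed the same
mapper as if they were a single file.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class OneBigInputJob {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "one-big-input");
    job.setJarByClass(OneBigInputJob.class);
    // All of the paths below feed the same (default identity) mapper.
    FileInputFormat.addInputPath(job, new Path("/data/part-1"));
    FileInputFormat.addInputPath(job, new Path("/data/part-2"));
    FileOutputFormat.setOutputPath(job, new Path("/data/out"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}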
On 5/28/11 5:43 PM, Dhruv Kumar wrote:
Isabel and Dmitry,
Thank you for your input on this. I've noticed that Mahout's code uses
the new mapreduce package, so I have been following the new APIs. This
was also suggested by Sean w.r.t. MAHOUT-294.
Multiple inputs are a requirement for my project, and I was planning
to use the old mapred.lib.MultipleInputs class, which is not marked as
deprecated in 0.20.2:
http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/lib/MultipleInputs.html
Is this advisable, and if not, what are my options for handling
multiple inputs?
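(For reference, here is roughly how I'd expect to use it; the paths
and mapper classes below are made up. Each input path gets its own
mapper, which also gives a way to tell records apart by origin.)

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.MultipleInputs;

public class OldApiMultipleInputsDriver {

  // Made-up mapper that tags each record as coming from input A.
  public static class TagAMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable key, Text value,
        OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      out.collect(new Text("A"), value);
    }
  }

  // Made-up mapper that tags each record as coming from input B.
  public static class TagBMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable key, Text value,
        OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      out.collect(new Text("B"), value);
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(OldApiMultipleInputsDriver.class);
    conf.setJobName("old-api-multiple-inputs");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    // One mapper class per input path.
    MultipleInputs.addInputPath(conf, new Path("/data/in-a"),
        TextInputFormat.class, TagAMapper.class);
    MultipleInputs.addInputPath(conf, new Path("/data/in-b"),
        TextInputFormat.class, TagBMapper.class);
    FileOutputFormat.setOutputPath(conf, new Path("/data/out"));
    JobClient.runJob(conf);
  }
}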
On Sat, May 28, 2011 at 5:59 PM, Dmitriy Lyubimov <[email protected]> wrote:
Dhruv,
Just a warning before you lock yourself into the new APIs:
Yes, the new APIs are preferable, but it is not always possible to use
them, because 0.20.2 lacks _a lot_ of bare necessities in the new-API
realm (multiple inputs/outputs come to mind at once).
I think I did weasel my way out of those in some cases, but I have not
tested it at scale yet, and it is certainly not an official way to do it.
Either way, it's probably not worth it for anything beyond basic MR
functionality until we switch to something that actually does have the
'new API'; 0.20.2 ships a much-truncated version that is far from
complete.
-d
On Fri, May 27, 2011 at 3:19 AM, Isabel Drost <[email protected]> wrote:
On 18.05.2011 Dhruv Kumar wrote:
For the GSoC project which version of Hadoop's API should I follow?
Try to use the new M/R APIs where possible. We had the same discussion
in an earlier thread on spectral clustering; in addition, Sean just
opened an issue concerning upgrading to newer Hadoop versions, so you
can take a look there as well.
Isabel