As far as I understand, the problem isn't adding multiple inputs; you
can do that exactly as the documentation you linked shows. The problem
(which is what we're trying to solve in MAHOUT-537) is how to tell,
within the Mapper/Reducer itself, which input path the current data
came from; there's no way to know for sure where the row you're
currently operating on originated, and that information is essential
to, say, matrix-matrix multiplication.
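(For what it's worth: when a job uses a plain FileInputFormat with
several input paths, rather than MultipleInputs, the originating file
can usually be recovered from the input split inside the mapper. A
rough sketch against the new 0.20.2 mapreduce API, with a made-up
class name; note the cast fails once the split is wrapped, as the
old-API MultipleInputs does with its package-private TaggedInputSplit.)

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class SourceAwareMapper
    extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Recover the file this record came from and tag the output with it.
    Path source = ((FileSplit) context.getInputSplit()).getPath();
    context.write(new Text(source.getName()), value);
  }
}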
However, if all you want is for your algorithm to take multiple input
files and treat them more or less as "one big file", then I don't
think this approach will give you any problems; a quick sketch of that
case is below.
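(A minimal driver sketch for the "one big file" case, against the new
0.20.2 API; the paths and class name are placeholders. Each call to
addInputPath appends another input, and all of them feed the same
mapper as if they were a single file.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class OneBigInputJob {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "one-big-input");
    job.setJarByClass(OneBigInputJob.class);
    // All of the paths below feed the same (default identity) mapper.
    FileInputFormat.addInputPath(job, new Path("/data/part-1"));
    FileInputFormat.addInputPath(job, new Path("/data/part-2"));
    FileOutputFormat.setOutputPath(job, new Path("/data/out"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}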
On 5/28/11 5:43 PM, Dhruv Kumar wrote:
Isabel and Dmitry,
Thank you for your input on this. I've noticed that Mahout's code uses
the new mapreduce package, so I have been following the new APIs. This
was also suggested by Sean w.r.t. MAHOUT-294.
Multiple inputs are a requirement for my project, and I was planning
to use the old mapred.lib.MultipleInputs class, which is not marked as
deprecated in 0.20.2:
http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/lib/MultipleInputs.html
Is this advisable, and if not, what are my options for handling
multiple inputs?
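(For reference, here is roughly how I'd expect to use it; the paths
and mapper classes below are made up. Each input path gets its own
mapper, which also gives a way to tell records apart by origin.)

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.MultipleInputs;

public class OldApiMultipleInputsDriver {

  // Made-up mapper that tags each record as coming from input A.
  public static class TagAMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable key, Text value,
        OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      out.collect(new Text("A"), value);
    }
  }

  // Made-up mapper that tags each record as coming from input B.
  public static class TagBMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable key, Text value,
        OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      out.collect(new Text("B"), value);
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(OldApiMultipleInputsDriver.class);
    conf.setJobName("old-api-multiple-inputs");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    // One mapper class per input path.
    MultipleInputs.addInputPath(conf, new Path("/data/in-a"),
        TextInputFormat.class, TagAMapper.class);
    MultipleInputs.addInputPath(conf, new Path("/data/in-b"),
        TextInputFormat.class, TagBMapper.class);
    FileOutputFormat.setOutputPath(conf, new Path("/data/out"));
    JobClient.runJob(conf);
  }
}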
On Sat, May 28, 2011 at 5:59 PM, Dmitriy Lyubimov <[email protected]> wrote:
Dhruv,
Just a warning before you lock yourself into the new APIs:
Yes, the new APIs are preferable, but it is not always possible to use
them, because 0.20.2 lacks _a lot_ of bare necessities in the new-API
realm (multiple inputs/outputs come to mind at once).
I think I did weasel my way out of those in some cases, but I have not
tested it at scale yet, and it is certainly not an official way to do it.
Either way, it's probably not worth it for anything beyond basic MR
functionality until we switch to something that actually does have the
'new API'; 0.20.2 ships a much-truncated version that is far from
complete.
-d
On Fri, May 27, 2011 at 3:19 AM, Isabel Drost <[email protected]> wrote:
On 18.05.2011 Dhruv Kumar wrote:
For the GSoC project which version of Hadoop's API should I follow?
Try to use the new M/R APIs where possible. We had the same discussion
in an earlier thread on spectral clustering; in addition, Sean just
opened an issue concerning upgrading to newer Hadoop versions, so you
can take a look there as well.
Isabel