[
https://issues.apache.org/jira/browse/MAHOUT-537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Shannon Quinn updated MAHOUT-537:
---------------------------------
Attachment: MAHOUT-537.patch
Attached is a patch that drops the custom Writable I wrote and uses
NamedVector instead.
It seems (to me) that there are two options for eliminating the two extra M/R
tasks I had to create in lieu of the CompositeInputFormat's joins:
1) Have each row of a DistributedRowMatrix labeled when it is first created.
Since DRM is little more than a glorified wrapper, its constructor cannot do
this labeling itself, so this option looks infeasible from a scope
perspective.
2) Guarantee the ordering of two given rows in the Iterable object of a
Combiner/Reducer, so we know one of them belongs to the multiplicand, the other
to the multiplier.
Option #2 seems the more technically feasible of the two; however, my limited
understanding of the inner workings of Hadoop prevents me from knowing where to
start. I've looked at Partitioner, RecordReader, and various InputFormats, and
none of them has given me any intuition. Any thoughts on how to do this? Or is
there another method entirely?
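To make Option #2 concrete, here is a minimal plain-Java sketch of the ordering
guarantee I have in mind. It has no Hadoop dependency and is not taken from the
patch. The names TaggedRow, MULTIPLICAND, MULTIPLIER, SORT_ORDER and shuffle are
all hypothetical. The idea is to sort on (row index, source tag) but group on
row index alone, so the multiplicand row always arrives first within a group.
In a real job this would presumably correspond to the "secondary sort" pattern
(Job.setSortComparatorClass, Job.setGroupingComparatorClass, and a Partitioner
keyed on row index only), but I have not verified that against
DistributedRowMatrix.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TaggedRowOrdering {

    // Hypothetical tags: the lower value sorts first within a row group.
    static final int MULTIPLICAND = 0;
    static final int MULTIPLIER = 1;

    // Hypothetical composite record: (row index, source tag) plus the row values.
    static final class TaggedRow {
        final int rowIndex;
        final int sourceTag;
        final double[] values;

        TaggedRow(int rowIndex, int sourceTag, double[] values) {
            this.rowIndex = rowIndex;
            this.sourceTag = sourceTag;
            this.values = values;
        }
    }

    // Stands in for a sort comparator: row index first, then source tag.
    static final Comparator<TaggedRow> SORT_ORDER =
        Comparator.comparingInt((TaggedRow r) -> r.rowIndex)
                  .thenComparingInt(r -> r.sourceTag);

    // Stands in for the shuffle: sort on the full key, group on row index only.
    static Map<Integer, List<TaggedRow>> shuffle(List<TaggedRow> mapOutput) {
        List<TaggedRow> sorted = new ArrayList<>(mapOutput);
        sorted.sort(SORT_ORDER);
        Map<Integer, List<TaggedRow>> groups = new LinkedHashMap<>();
        for (TaggedRow row : sorted) {
            groups.computeIfAbsent(row.rowIndex, k -> new ArrayList<>()).add(row);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Map output deliberately arrives out of order.
        List<TaggedRow> mapOutput = new ArrayList<>();
        mapOutput.add(new TaggedRow(1, MULTIPLIER, new double[] {5, 6}));
        mapOutput.add(new TaggedRow(0, MULTIPLIER, new double[] {3, 4}));
        mapOutput.add(new TaggedRow(1, MULTIPLICAND, new double[] {7, 8}));
        mapOutput.add(new TaggedRow(0, MULTIPLICAND, new double[] {1, 2}));

        for (Map.Entry<Integer, List<TaggedRow>> group : shuffle(mapOutput).entrySet()) {
            List<TaggedRow> rows = group.getValue();
            // Within each group, index 0 is the multiplicand row, index 1 the multiplier row.
            System.out.println("row " + group.getKey()
                + ": first tag=" + rows.get(0).sourceTag
                + ", second tag=" + rows.get(1).sourceTag);
        }
    }
}
```

If something like this carries over to Hadoop, the reducer could treat the
first value in its Iterable as the multiplicand row and the second as the
multiplier row without any extra join job.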
> Bring DistributedRowMatrix into compliance with Hadoop 0.20.2
> -------------------------------------------------------------
>
> Key: MAHOUT-537
> URL: https://issues.apache.org/jira/browse/MAHOUT-537
> Project: Mahout
> Issue Type: Improvement
> Affects Versions: 0.4
> Reporter: Shannon Quinn
> Assignee: Shannon Quinn
> Attachments: MAHOUT-537.patch, MAHOUT-537.patch, MAHOUT-537.patch,
> MAHOUT-537.patch
>
>
> Convert the current DistributedRowMatrix to use the newer Hadoop 0.20.2 API,
> in particular eliminate dependence on the deprecated JobConf, using instead
> the separate Job and Configuration objects.