[
https://issues.apache.org/jira/browse/MAHOUT-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14166991#comment-14166991
]
ASF GitHub Bot commented on MAHOUT-1615:
----------------------------------------
Github user dlyubimov commented on the pull request:
https://github.com/apache/mahout/pull/58#issuecomment-58672862
I already mentioned that I don't want any whiff of Hadoop stuff in
math-scala, mostly because it is impossible to pinpoint the exact Hadoop
API version a third party wants to use. With that approach there will always
be applications claiming incompatibility with this or that in Hadoop.
It might make sense to create another module for "all things Hadoop" and
make the engine-specific modules depend on it, but I am not sure the amount of
current code really justifies that yet. mrlegacy is kind of that, but it is
legacy. Maybe it makes sense to move some things from mrlegacy to that "all
things Hadoop" module (e.g. sequence file iterators and such), although those
are not used anywhere beyond mrlegacy right now.
> SparkEngine drmFromHDFS returning the same Key for all Key,Vec Pairs for
> Text-Keyed SequenceFiles
> -------------------------------------------------------------------------------------------------
>
> Key: MAHOUT-1615
> URL: https://issues.apache.org/jira/browse/MAHOUT-1615
> Project: Mahout
> Issue Type: Bug
> Reporter: Andrew Palumbo
> Assignee: Andrew Palumbo
> Fix For: 1.0
>
>
> When reading in seq2sparse output from HDFS in the spark-shell, of the form
> <Text,VectorWritable>, SparkEngine's drmFromHDFS method is creating RDDs
> with the same Key for all Pairs:
> {code}
> mahout> val drmTFIDF= drmFromHDFS( path =
> "/tmp/mahout-work-andy/20news-test-vectors/part-r-00000")
> {code}
> Has keys:
> {...}
> key: /talk.religion.misc/84570
> key: /talk.religion.misc/84570
> key: /talk.religion.misc/84570
> {...}
> for the entire set. This is the last Key in the set.
> The problem can be traced to the first line of drmFromHDFS(...) in
> SparkEngine.scala:
> {code}
> val rdd = sc.sequenceFile(path, classOf[Writable], classOf[VectorWritable],
> minPartitions = parMin)
> // Get rid of VectorWritable
> .map(t => (t._1, t._2.get()))
> {code}
> which gives the same key for all t._1.
>
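
A possible explanation for the symptom quoted above, not confirmed in this thread: Hadoop record readers reuse a single Writable instance across records, and Spark's documentation for sequenceFile advises copying such objects in a map before caching or collecting them. Holding on to t._1 without copying would then leave every pair pointing at the same object, which ends up showing the last key read. The sketch below simulates that behaviour in plain Scala, without Spark or Hadoop. MutableKey, KeyReuseSketch, readAll, and the first two sample keys are hypothetical names and values made up for illustration; only the last key comes from the report.

```scala
// Stand-in for a mutable Hadoop Writable such as Text.
// Hypothetical class for illustration; this is NOT the Hadoop API.
final class MutableKey(var value: String) {
  override def toString: String = value
}

object KeyReuseSketch {
  // Sample keys: the first two are made up, the last one is from the report.
  val keys: Seq[String] =
    Seq("/alt.atheism/1", "/sci.space/2", "/talk.religion.misc/84570")

  // Mimics a record reader that reuses ONE key instance and overwrites
  // its contents for every record it returns.
  def readAll(input: Seq[String]): Iterator[(MutableKey, Int)] = {
    val shared = new MutableKey("")
    input.iterator.zipWithIndex.map { case (k, i) =>
      shared.value = k
      (shared, i)
    }
  }

  // Buggy: materializes references to the shared object first, so by the
  // time the keys are inspected they all show the last value written.
  def buggy: Seq[String] =
    readAll(keys).toList.map(_._1.toString)

  // Fixed: copies each key into an immutable String while its record is
  // still the current one, before anything is materialized.
  def fixed: Seq[String] =
    readAll(keys).map { case (k, i) => (k.toString, i) }.toList.map(_._1)

  def main(args: Array[String]): Unit = {
    println(buggy)
    println(fixed)
  }
}
```

If the same cause applies to drmFromHDFS, the analogous change would be to copy the key inside the existing map (for Text keys, for example, by converting it to a String) rather than passing t._1 through unchanged. That is an assumption about the fix, not something stated in the issue.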