[ https://issues.apache.org/jira/browse/MAHOUT-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148797#comment-14148797 ]
ASF GitHub Bot commented on MAHOUT-1615:
----------------------------------------

Github user dlyubimov commented on the pull request:

    https://github.com/apache/mahout/pull/52#issuecomment-56920785

I think the more standard types we support, the better. The original Mahout DRM concept did not assume any limitation on the key type; it could be a tree, for all we know. The limitation stems from trying to build higher-level subroutines like drmFromHDFS for the algebraic optimizer, where the user is not assumed to supply a Writable-to-value converter on their own; at least, not by default. Maybe we just need an additional version of drmFromHDFS where such a conversion function can be supplied explicitly by the user, in order to handle compound key types such as trees or collections.

The reason int, long and String were chosen as the necessary minimum set of conversion types is mostly that known algorithms rely on them: String is used in seq2sparse output, Int has a special meaning for transposition, and Long turns up from time to time in legacy Mahout code too.

The idea with implicit converters was to support anything that is already defined in implicit scope, without a match. Think about it: implicit scope is just a special case of a scope, like a package. As such, it might be possible to enumerate those conversions at runtime. Unfortunately I haven't found a Scala construct that would allow enumerating them and applying the found conversion elegantly, so this point is probably just a distraction. Let's forget about it and perhaps assume a match over a set of basic types, as defined in Spark. We should require a hard stop if we don't know the converter (and the user hasn't supplied one).

> SparkEngine drmFromHDFS returning the same Key for all Key,Vec Pairs for Text-Keyed SequenceFiles
> --------------------------------------------------------------------------------------------------
>
>          Key: MAHOUT-1615
>          URL: https://issues.apache.org/jira/browse/MAHOUT-1615
>      Project: Mahout
>   Issue Type: Bug
>     Reporter: Andrew Palumbo
>      Fix For: 1.0
>
>
> When reading seq2sparse output of the form <Text, VectorWritable> from HDFS in the spark-shell, SparkEngine's drmFromHDFS method creates RDDs with the same key for all pairs:
> {code}
> mahout> val drmTFIDF = drmFromHDFS( path = "/tmp/mahout-work-andy/20news-test-vectors/part-r-00000")
> {code}
> has keys:
> {...}
> key: /talk.religion.misc/84570
> key: /talk.religion.misc/84570
> key: /talk.religion.misc/84570
> {...}
> for the entire set. This is the last key in the set.
> The problem can be traced to the first line of drmFromHDFS(...) in SparkEngine.scala:
> {code}
> val rdd = sc.sequenceFile(path, classOf[Writable], classOf[VectorWritable], minPartitions = parMin)
>   // Get rid of VectorWritable
>   .map(t => (t._1, t._2.get()))
> {code}
> which yields the same key for all t._1, because the record reader reuses the same Writable key instance for every record.
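Putting the two points together (the Writable reuse described in the issue, and the explicit user-suppliable converter suggested in the comment), a rough sketch could look like the following. This is only an illustration of the idea; drmRddFromSeqFile, keyFn, and defaultKeyFn are made-up names, not the actual SparkEngine API or the committed fix.

{code}
import org.apache.hadoop.io.{IntWritable, LongWritable, Text, Writable}
import org.apache.mahout.math.VectorWritable
import org.apache.spark.SparkContext

// Illustrative default conversion for the minimum set of supported key types
// (Int, Long, String); anything else must be handled by a caller-supplied keyFn.
def defaultKeyFn(w: Writable): Any = w match {
  case t: Text         => t.toString   // copies into a new String, not the reused Text buffer
  case i: IntWritable  => i.get()
  case l: LongWritable => l.get()
  case other => sys.error(s"No key converter known for ${other.getClass}; supply keyFn explicitly.")
}

// Hypothetical reader: materialize each key into an immutable value *before*
// Hadoop's record reader reuses the Writable instance, so every row keeps a distinct key.
def drmRddFromSeqFile(sc: SparkContext, path: String, parMin: Int = 1,
                      keyFn: Writable => Any = defaultKeyFn) =
  sc.sequenceFile(path, classOf[Writable], classOf[VectorWritable], minPartitions = parMin)
    .map { case (wKey, wVec) => (keyFn(wKey), wVec.get()) }
{code}

With something along these lines, drmFromHDFS could keep a match over the basic Spark key types as the default and still hard-stop, as suggested above, when it meets a key type it does not know and no converter was supplied.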