[ https://issues.apache.org/jira/browse/MAHOUT-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148797#comment-14148797 ]

ASF GitHub Bot commented on MAHOUT-1615:
----------------------------------------

Github user dlyubimov commented on the pull request:

    https://github.com/apache/mahout/pull/52#issuecomment-56920785
  
    I think the more standard types we support, the better. The original Mahout 
DRM concept did not assume any limitation on the key type; it could be a tree, 
for all we know. The limitation stems from trying to build higher-level 
subroutines like drmFromHDFS for the algebraic optimizer, where the user is not 
assumed to supply a Writable-to-value converter on their own; at least, not by 
default. Maybe we just need an additional version of drmFromHDFS where such a 
conversion function can be supplied explicitly by the user, in order to handle 
compound key types such as trees or collections.
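
    As a rough illustration (the name and signature below are hypothetical, not 
the actual Mahout API), such a variant could take the key conversion as an 
explicit argument:

{code}
import org.apache.hadoop.io.{Text, Writable}
import org.apache.mahout.math.{Vector, VectorWritable}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

import scala.reflect.ClassTag

// Hypothetical variant of drmFromHDFS: the caller supplies the Writable => K
// conversion instead of the engine guessing it from the key class.
def rddFromHDFS[K: ClassTag](sc: SparkContext, path: String, parMin: Int = 1)
                            (keyFn: Writable => K): RDD[(K, Vector)] =
  sc.sequenceFile(path, classOf[Writable], classOf[VectorWritable], minPartitions = parMin)
    // keyFn must copy the data out of the reused Writable; VectorWritable.get()
    // unwraps the vector payload.
    .map { case (w, v) => (keyFn(w), v.get()) }

// e.g. for Text-keyed seq2sparse output:
// val rdd = rddFromHDFS(sc, path)(w => w.asInstanceOf[Text].toString)
{code}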
    
    The reason int, long, and String were chosen as the necessary minimum set of 
convertible key types is mostly that there are known algorithms relying on them: 
String is used in seq2sparse output, Int has a special meaning for transposition, 
and Long... Long happens from time to time in legacy Mahout code too.
    
     The idea with implicit converters was to support anything that is already 
defined in implicit scope, without a match. Think about it: implicit scope is 
just a special case of a scope, like a package. As such, it might be possible to 
enumerate those conversions at runtime. Unfortunately, I haven't found a Scala 
construct that would allow doing that and applying the found conversion 
elegantly, so this point is probably just a distraction. Let's forget about it 
and perhaps assume a match over a set of basic types, as defined in Spark. We 
should require a hard stop if we don't know the converter (and the user hasn't 
supplied one).
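
    A minimal sketch of that "match on basic types, hard stop otherwise" idea 
(the exact set of supported types and the error handling are open for 
discussion):

{code}
import org.apache.hadoop.io.{IntWritable, LongWritable, Text, Writable}

// Convert a sequence file key Writable to one of the basic key types.
// Anything outside the known set is a hard stop, unless the user has
// supplied a converter of their own.
def key2val(w: Writable): Any = w match {
  case i: IntWritable  => i.get       // Int: special meaning for transposition
  case l: LongWritable => l.get       // Long: shows up in legacy Mahout code
  case t: Text         => t.toString  // String: seq2sparse output
  case other => throw new IllegalArgumentException(
    s"No key converter known for ${other.getClass.getName}; please supply one explicitly.")
}
{code}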
    



> SparkEngine drmFromHDFS returning the same Key for all Key,Vec Pairs for 
> Text-Keyed SequenceFiles
> -------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-1615
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1615
>             Project: Mahout
>          Issue Type: Bug
>            Reporter: Andrew Palumbo
>             Fix For: 1.0
>
>
> When reading seq2sparse output of the form <Text,VectorWritable> from HDFS in 
> the spark-shell, SparkEngine's drmFromHDFS method creates RDDs with the same 
> key for all pairs:
> {code}
> mahout> val drmTFIDF= drmFromHDFS( path = 
> "/tmp/mahout-work-andy/20news-test-vectors/part-r-00000")
> {code}
> Has keys:
> {...} 
>     key: /talk.religion.misc/84570
>     key: /talk.religion.misc/84570
>     key: /talk.religion.misc/84570
> {...}
> for the entire set.  This is the last Key in the set.
> The problem can be traced to the first line of drmFromHDFS(...) in 
> SparkEngine.scala: 
> {code}
> val rdd = sc.sequenceFile(path, classOf[Writable], classOf[VectorWritable], minPartitions = parMin)
>         // Get rid of VectorWritable
>         .map(t => (t._1, t._2.get()))
> {code}
> which gives the same key for all t._1, because the Hadoop record reader reuses 
> the same Writable instance for every record and the key is never copied out of it.
>   
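> A minimal sketch of one way around the reuse problem for this Text-keyed case 
> (the more general multi-type converter question is discussed in the comment 
> above) is to materialize the key inside the map:
> {code}
> val rdd = sc.sequenceFile(path, classOf[Text], classOf[VectorWritable], minPartitions = parMin)
>     // Copy the key out of the reused Text instance and unwrap the vector.
>     .map { case (k, v) => (k.toString, v.get()) }
> {code}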



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
