[ https://issues.apache.org/jira/browse/MAHOUT-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143378#comment-14143378 ]
ASF GitHub Bot commented on MAHOUT-1615:
----------------------------------------

Github user andrewpalumbo commented on the pull request:

    https://github.com/apache/mahout/pull/52#issuecomment-56401176

Still a work in progress (and still in need of some cleanup). The latest commits now solve the original key-object reuse problem by method (2): reading the key type in from the SequenceFile headers and then matching on it:

{code}
mahout> val drmTFIDF= drmFromHDFS( path = "/tmp/mahout-work-andy/20news-test-vectors/part-r-00000")
14/09/22 11:20:33 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
drmTFIDF: org.apache.mahout.math.drm.CheckpointedDrm[_] = org.apache.mahout.sparkbindings.drm.CheckpointedDrmSpark@adf7236

mahout> val rowLabels=drmTFIDF.getRowLabelBindings
rowLabels: java.util.Map[String,Integer] = {/soc.religion.christian/21427=6141, /comp.graphics/38427=422, /comp.sys.ibm.pc.hardware/60526=1281, /misc.forsale/76295=2495, /soc.religion.christian/21332=6103, /sci.med/59045=5265, /sci.electronics/54343=5096, /comp.sys.ibm.pc.hardware/60928=1404, /rec.sport.hockey/54173=4205, /rec.motorcycles/104596=3282, /rec.autos/103326=2968, /talk.politics.misc/179110=7333, /comp.windows.x/66966=1944, /rec.autos/103707=3053, /comp.windows.x/67474=2146, /rec.sport.baseball/105011=3850, /talk.religion.misc/83812=7424, /comp.graphics/38707=522, /comp.graphics/38597=484, /sci.electronics/54317=5083, /rec.motorcycles/104708=3322, /rec.sport.hockey/53627=3994, /comp.sys.mac.hardware/51633=1601, /sci.crypt/16088=4686, /sci.electronics/53714=4840, /rec.sport.ho...

mahout> rowLabels.size
res15: Int = 7598
{code}

which is what I am expecting.

Two problems that I am still having:

(1) It's not yet solving the problem of setting the DrmLike[_] ClassTag.
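For context, method (2) can be sketched roughly as below. This is a minimal illustration only, assuming the classic Hadoop SequenceFile.Reader API and the common Writable key types; keyClassTagFor is a hypothetical helper name, not code from the actual patch:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.{IntWritable, LongWritable, SequenceFile, Text}
import scala.reflect.ClassTag

// Hypothetical helper: open the SequenceFile, read the key class from its
// header, and match on it to recover a concrete ClassTag for the DRM keys.
def keyClassTagFor(path: String, conf: Configuration = new Configuration()): ClassTag[_] = {
  val reader = new SequenceFile.Reader(FileSystem.get(conf), new Path(path), conf)
  try {
    reader.getKeyClass match {
      case c if c == classOf[Text]         => ClassTag(classOf[String]) // Text keys become String row labels
      case c if c == classOf[IntWritable]  => ClassTag.Int
      case c if c == classOf[LongWritable] => ClassTag.Long
      case c                               => ClassTag(c) // fall back to the raw Writable class
    }
  } finally reader.close()
}
```

A tag recovered this way could then be passed explicitly to whatever constructs the CheckpointedDrm, so the resulting DrmLike[_] would carry e.g. ClassTag[String] rather than Object.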
{code}
mahout> def getKeyClassTag[K: ClassTag](drm:DrmLike[K]) = implicitly[ClassTag[K]]

mahout> getKeyClassTag(drmTFIDF)
res13: scala.reflect.ClassTag[_] = Object
{code}

I believe that this is just because I'm not setting it correctly, due to my limited Scala abilities.

(2) DRM DFS i/o (local) is failing. I believe that this may be a downside to integrating the HDFS I/O code into the spark module; I'm not positive I'm setting the configuration correctly inside of drmFromHDFS(...). I have no problem reading in the files from within the spark-shell, but the spark `DRM DFS i/o (local)` test is failing with:

{code}
DRM DFS i/o (local) *** FAILED ***
  java.io.FileNotFoundException: /home/andy/sandbox/mahout/spark/tmp/UploadedDRM (Is a directory)
{code}

I believe this may be because SequenceFile.readHeader(...) is trying to read from HDFS while the test is writing locally. I will continue to look into this.


> SparkEngine drmFromHDFS returning the same Key for all Key,Vec Pairs for Text-Keyed SequenceFiles
> -------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-1615
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1615
>             Project: Mahout
>          Issue Type: Bug
>            Reporter: Andrew Palumbo
>             Fix For: 1.0
>
>
> When reading in seq2sparse output of form <Text,VectorWritable> from HDFS in the spark-shell, SparkEngine's drmFromHDFS method is creating RDDs with the same Key for all Pairs:
> {code}
> mahout> val drmTFIDF= drmFromHDFS( path = "/tmp/mahout-work-andy/20news-test-vectors/part-r-00000")
> {code}
> Has keys:
> {...}
> key: /talk.religion.misc/84570
> key: /talk.religion.misc/84570
> key: /talk.religion.misc/84570
> {...}
> for the entire set. This is the last Key in the set.
> The problem can be traced to the first line of drmFromHDFS(...) in SparkEngine.scala:
> {code}
> val rdd = sc.sequenceFile(path, classOf[Writable], classOf[VectorWritable], minPartitions = parMin)
>   // Get rid of VectorWritable
>   .map(t => (t._1, t._2.get()))
> {code}
> which gives the same key for all t._1.


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
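For reference, the key-reuse behavior described in this issue comes from Hadoop's record reader handing back the same Writable key instance for every record, so an RDD that keeps references to t._1 ends up with many references to one object holding the last key read. A minimal sketch of a fix, assuming Text keys and using a hypothetical helper name (textKeyedDrmRdd) rather than the committed patch, is to copy the key out of the Writable in the first map:

```scala
import org.apache.hadoop.io.Text
import org.apache.mahout.math.VectorWritable
import org.apache.spark.SparkContext

// Hypothetical sketch, not the committed fix: materialize each Text key into
// a fresh String before Spark caches or shuffles the pairs, which breaks the
// aliasing on the single reused Writable instance.
def textKeyedDrmRdd(sc: SparkContext, path: String, parMin: Int = 1) =
  sc.sequenceFile(path, classOf[Text], classOf[VectorWritable], minPartitions = parMin)
    // key.toString copies the bytes out of the reused Text;
    // vec.get() unwraps the Vector as before
    .map { case (key, vec) => (key.toString, vec.get()) }
```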