[ https://issues.apache.org/jira/browse/MAHOUT-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143378#comment-14143378 ]
ASF GitHub Bot commented on MAHOUT-1615:
----------------------------------------

Github user andrewpalumbo commented on the pull request:

    https://github.com/apache/mahout/pull/52#issuecomment-56401176

Still a work in progress (and still in need of some cleanup). The latest commits now solve the original key-object reuse problem by method (2): reading the key type in from the SequenceFile headers and then matching on it:

{code}
mahout> val drmTFIDF= drmFromHDFS( path = "/tmp/mahout-work-andy/20news-test-vectors/part-r-00000")
14/09/22 11:20:33 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
drmTFIDF: org.apache.mahout.math.drm.CheckpointedDrm[_] = org.apache.mahout.sparkbindings.drm.CheckpointedDrmSpark@adf7236

mahout> val rowLabels=drmTFIDF.getRowLabelBindings
rowLabels: java.util.Map[String,Integer] = {/soc.religion.christian/21427=6141, /comp.graphics/38427=422, /comp.sys.ibm.pc.hardware/60526=1281, /misc.forsale/76295=2495, /soc.religion.christian/21332=6103, /sci.med/59045=5265, /sci.electronics/54343=5096, /comp.sys.ibm.pc.hardware/60928=1404, /rec.sport.hockey/54173=4205, /rec.motorcycles/104596=3282, /rec.autos/103326=2968, /talk.politics.misc/179110=7333, /comp.windows.x/66966=1944, /rec.autos/103707=3053, /comp.windows.x/67474=2146, /rec.sport.baseball/105011=3850, /talk.religion.misc/83812=7424, /comp.graphics/38707=522, /comp.graphics/38597=484, /sci.electronics/54317=5083, /rec.motorcycles/104708=3322, /rec.sport.hockey/53627=3994, /comp.sys.mac.hardware/51633=1601, /sci.crypt/16088=4686, /sci.electronics/53714=4840, /rec.sport.ho...

mahout> rowLabels.size
res15: Int = 7598
{code}

which is what I am expecting.

Two problems that I am still having:

(1) It's not yet solving the problem of setting the DrmLike[_] ClassTag.
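For context, method (2) can be sketched roughly as below. This is a minimal illustration only, assuming the classic Hadoop SequenceFile.Reader API and the common Writable key types; keyClassTagFor is a hypothetical helper name, not code from the actual patch:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.{IntWritable, LongWritable, SequenceFile, Text}
import scala.reflect.ClassTag

// Hypothetical helper: open the SequenceFile, read the key class from its
// header, and match on it to recover a concrete ClassTag for the DRM keys.
def keyClassTagFor(path: String, conf: Configuration = new Configuration()): ClassTag[_] = {
  val reader = new SequenceFile.Reader(FileSystem.get(conf), new Path(path), conf)
  try {
    reader.getKeyClass match {
      case c if c == classOf[Text]         => ClassTag(classOf[String]) // Text keys become String row labels
      case c if c == classOf[IntWritable]  => ClassTag.Int
      case c if c == classOf[LongWritable] => ClassTag.Long
      case c                               => ClassTag(c) // fall back to the raw Writable class
    }
  } finally reader.close()
}
```

A tag recovered this way could then be passed explicitly to whatever constructs the CheckpointedDrm, so the resulting DrmLike[_] would carry e.g. ClassTag[String] rather than Object.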
{code}
mahout> def getKeyClassTag[K: ClassTag](drm:DrmLike[K]) = implicitly[ClassTag[K]]

mahout> getKeyClassTag(drmTFIDF)
res13: scala.reflect.ClassTag[_] = Object
{code}

I believe that this is just because I'm not setting it correctly, due to my limited Scala abilities.

(2) DRM DFS i/o (local) is failing. I believe that this may be a downside to integrating the HDFS I/O code into the spark module; I'm not positive I'm setting the configuration correctly inside of drmFromHDFS(...). I have no problem reading in the files from within the spark-shell, but the spark `DRM DFS i/o (local)` test is failing with:

{code}
DRM DFS i/o (local) *** FAILED ***
  java.io.FileNotFoundException: /home/andy/sandbox/mahout/spark/tmp/UploadedDRM (Is a directory)
{code}

I believe this may be because SequenceFile.readHeader(...) is trying to read from HDFS while the test is writing locally. I will continue to look into this.


> SparkEngine drmFromHDFS returning the same Key for all Key,Vec Pairs for Text-Keyed SequenceFiles
> -------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-1615
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1615
>             Project: Mahout
>          Issue Type: Bug
>            Reporter: Andrew Palumbo
>             Fix For: 1.0
>
>
> When reading in seq2sparse output of form <Text,VectorWritable> from HDFS in the spark-shell, SparkEngine's drmFromHDFS method is creating RDDs with the same Key for all Pairs:
> {code}
> mahout> val drmTFIDF= drmFromHDFS( path = "/tmp/mahout-work-andy/20news-test-vectors/part-r-00000")
> {code}
> Has keys:
> {...}
> key: /talk.religion.misc/84570
> key: /talk.religion.misc/84570
> key: /talk.religion.misc/84570
> {...}
> for the entire set. This is the last Key in the set.
> The problem can be traced to the first line of drmFromHDFS(...) in SparkEngine.scala:
> {code}
> val rdd = sc.sequenceFile(path, classOf[Writable], classOf[VectorWritable], minPartitions = parMin)
>   // Get rid of VectorWritable
>   .map(t => (t._1, t._2.get()))
> {code}
> which gives the same key for all t._1.


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
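For reference, the key-reuse behavior described in this issue comes from Hadoop's record reader handing back the same Writable key instance for every record, so an RDD that keeps references to t._1 ends up with many references to one object holding the last key read. A minimal sketch of a fix, assuming Text keys and using a hypothetical helper name (textKeyedDrmRdd) rather than the committed patch, is to copy the key out of the Writable in the first map:

```scala
import org.apache.hadoop.io.Text
import org.apache.mahout.math.VectorWritable
import org.apache.spark.SparkContext

// Hypothetical sketch, not the committed fix: materialize each Text key into
// a fresh String before Spark caches or shuffles the pairs, which breaks the
// aliasing on the single reused Writable instance.
def textKeyedDrmRdd(sc: SparkContext, path: String, parMin: Int = 1) =
  sc.sequenceFile(path, classOf[Text], classOf[VectorWritable], minPartitions = parMin)
    // key.toString copies the bytes out of the reused Text;
    // vec.get() unwraps the Vector as before
    .map { case (key, vec) => (key.toString, vec.get()) }
```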