Yes, but TupleWritable is pretty inefficient, since it writes the class name of each value with every record. This seems wasteful given that the class is always the same.
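To make the overhead concrete, here is a rough, self-contained sketch (not Mahout or Hadoop code). It compares a record that carries a class name in front of each field with a fixed-type pair that has a default constructor and writes only its fields. The names PairSizeDemo, IntDoublePair, writeTagged and writeFixed are made up for illustration, and writeUTF() is only a stand-in for however TupleWritable actually encodes the class name, so the exact byte counts will differ from the real thing.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical demo class; nothing here is part of Hadoop or Mahout.
final class PairSizeDemo {

    // Fixed-type pair in the spirit of a concrete per-type subclass.
    // The no-arg constructor is what lets a framework create it reflectively.
    static final class IntDoublePair {
        int first;
        double second;

        IntDoublePair() { }

        IntDoublePair(int first, double second) {
            this.first = first;
            this.second = second;
        }

        // Only the field values are written; the types are implied by the class.
        void write(DataOutputStream out) throws IOException {
            out.writeInt(first);
            out.writeDouble(second);
        }

        void readFields(DataInputStream in) throws IOException {
            first = in.readInt();
            second = in.readDouble();
        }
    }

    // Tagged style: a class name precedes each value in every record.
    static byte[] writeTagged(Integer first, Double second) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeUTF(first.getClass().getName());   // stand-in encoding
            out.writeInt(first);
            out.writeUTF(second.getClass().getName());  // stand-in encoding
            out.writeDouble(second);
            out.flush();
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Fixed style: serialize through the concrete pair class.
    static byte[] writeFixed(int first, double second) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            new IntDoublePair(first, second).write(out);
            out.flush();
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Read a fixed-style record back, starting from the default constructor.
    static IntDoublePair readFixed(byte[] bytes) {
        try {
            IntDoublePair pair = new IntDoublePair();
            pair.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
            return pair;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("tagged bytes per record: " + writeTagged(7, 3.5).length);
        System.out.println("fixed bytes per record:  " + writeFixed(7, 3.5).length);
    }
}
```

The fixed-type version pays for the type information once, in the class passed to setMapOutputValueClass(), rather than in every record. The cost is the one already mentioned in the thread: one small class per combination of types.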

On 18 Dec, 2011, at 9:19 PM, Lance Norskog wrote:

> JIRA is acting up, so posting here instead.
>
> You have already made RandomPermuteJob extend AbstractJob. Never mind.
>
> bq. Does this seem like a reasonable approach? It would require that a
> class be created for each object type of interest which is somewhat
> painfull. However I can't see a simpler approach since
> setMapOutputValueClass() needs to take a class that has a default
> constructor (and PairWritable doesn't have a default constructor since
> it doesn't know how to call new for first and second since it doesn't
> know what class first and second belong to).
>
> TupleWritable handles this by writing the classname. Looking at this
> again, can't this just use TupleWritable?
>
> http://grepcode.com/file/repo1.maven.org/maven2/org.jvnet.hudson.hadoop/hadoop-core/0.19.1-hudson-3/org/apache/hadoop/mapred/join/TupleWritable.java
>
> On Sun, Dec 18, 2011 at 7:48 PM, Raphael Cendrillon (Commented) (JIRA)
> <j...@apache.org> wrote:
>>
>> [
>> https://issues.apache.org/jira/browse/MAHOUT-904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13172021#comment-13172021
>> ]
>>
>> Raphael Cendrillon commented on MAHOUT-904:
>> -------------------------------------------
>>
>> Hi Lance. Is that a general comment, or specifically for the issue regarding
>> PairWritable/IntVectorWritable?
>>
>>> SplitInput should support randomizing the input
>>> -----------------------------------------------
>>>
>>>                 Key: MAHOUT-904
>>>                 URL: https://issues.apache.org/jira/browse/MAHOUT-904
>>>             Project: Mahout
>>>          Issue Type: Improvement
>>>            Reporter: Grant Ingersoll
>>>            Assignee: Raphael Cendrillon
>>>              Labels: MAHOUT_INTRO_CONTRIBUTE
>>>         Attachments: MAHOUT-904.patch, MAHOUT-904.patch, MAHOUT-904.patch
>>>
>>>
>>> For some learning tasks, we need the input to be randomized (SGD) instead
>>> of blocks of labels all at once. SplitInput is a useful tool for setting
>>> up train/test files but it currently doesn't support randomizing the input.
>>
>> --
>> This message is automatically generated by JIRA.
>> If you think it was sent incorrectly, please contact your JIRA
>> administrators:
>> https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
>> For more information on JIRA, see: http://www.atlassian.com/software/jira
>
> --
> Lance Norskog
> goks...@gmail.com