Yes, but TupleWritable is pretty inefficient since it stores the class name with 
every record.  This seems wasteful given that the class is always the same.
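
To put a rough number on that overhead, here is a minimal stand-alone sketch. It uses only java.io, not Hadoop, and the class and method names (RecordSizeDemo, sizeWithClassNames, sizeWithFixedTypes) are made up for illustration. It serializes an (int, double) pair once with a class name in front of each field and once with the types fixed in advance. It only imitates the idea of a self-describing record and is not TupleWritable's actual wire format.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical demo class: plain java.io only, no Hadoop classes involved.
public class RecordSizeDemo {

    // Self-describing record: each field is preceded by its class name,
    // so the names are repeated in every single record.
    public static int sizeWithClassNames(int first, double second) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(Integer.class.getName());
            out.writeInt(first);
            out.writeUTF(Double.class.getName());
            out.writeDouble(second);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    // Fixed-type record: reader and writer agree on the field types up front,
    // so only the payload is written.
    public static int sizeWithFixedTypes(int first, double second) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(first);
            out.writeDouble(second);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    public static void main(String[] args) {
        System.out.println("with class names: " + sizeWithClassNames(7, 0.5) + " bytes");
        System.out.println("fixed types:      " + sizeWithFixedTypes(7, 0.5) + " bytes");
    }
}
```

In this toy case the class names account for most of each record, which is the cost I would like to avoid by using a dedicated class per value type.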

On 18 Dec, 2011, at 9:19 PM, Lance Norskog wrote:

> JIRA is acting up, so posting here instead.
> 
> You have already made RandomPermuteJob extend AbstractJob. Never mind.
> 
> bq. Does this seem like a reasonable approach? It would require that a
> class be created for each object type of interest which is somewhat
> painful. However I can't see a simpler approach since
> setMapOutputValueClass() needs to take a class that has a default
> constructor (and PairWritable doesn't have a default constructor since
> it doesn't know how to call new for first and second since it doesn't
> know what class first and second belong to).
> 
> TupleWritable handles this by writing the classname. Looking at this
> again, can't this just use TupleWritable?
> 
> http://grepcode.com/file/repo1.maven.org/maven2/org.jvnet.hudson.hadoop/hadoop-core/0.19.1-hudson-3/org/apache/hadoop/mapred/join/TupleWritable.java
> 
> On Sun, Dec 18, 2011 at 7:48 PM, Raphael Cendrillon (Commented) (JIRA)
> <j...@apache.org> wrote:
>> 
>>    [ 
>> https://issues.apache.org/jira/browse/MAHOUT-904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13172021#comment-13172021
>>  ]
>> 
>> Raphael Cendrillon commented on MAHOUT-904:
>> -------------------------------------------
>> 
>> Hi Lance. Is that a general comment, or specifically for the issue regarding 
>> PairWritable/IntVectorWritable?
>> 
>>> SplitInput should support randomizing the input
>>> -----------------------------------------------
>>> 
>>>                 Key: MAHOUT-904
>>>                 URL: https://issues.apache.org/jira/browse/MAHOUT-904
>>>             Project: Mahout
>>>          Issue Type: Improvement
>>>            Reporter: Grant Ingersoll
>>>            Assignee: Raphael Cendrillon
>>>              Labels: MAHOUT_INTRO_CONTRIBUTE
>>>         Attachments: MAHOUT-904.patch, MAHOUT-904.patch, MAHOUT-904.patch
>>> 
>>> 
>>> For some learning tasks, we need the input to be randomized (SGD) instead 
>>> of blocks of labels all at once.  SplitInput is a useful tool for setting 
>>> up train/test files but it currently doesn't support randomizing the input.
>> 
>> --
>> This message is automatically generated by JIRA.
>> If you think it was sent incorrectly, please contact your JIRA 
>> administrators: 
>> https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
>> For more information on JIRA, see: http://www.atlassian.com/software/jira
>> 
>> 
> 
> 
> 
> -- 
> Lance Norskog
> goks...@gmail.com
