[ https://issues.apache.org/jira/browse/MAPREDUCE-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13269420#comment-13269420 ]
Harsh J commented on MAPREDUCE-2001:
------------------------------------

bq. The users would then in their mapper or reducer configure or setup method call SequenceFileOutputFormat.setMetadata with the appropriate metadata object that they would create.

The problem with this approach is that it runs into a possible inconsistency/bug in the framework:

| Record Writer Instantiation | Old API | New API |
| Map Task | Before Mapper | After Mapper |
| Reduce Task | After Reducer | After Reducer |

See MapTask.java/ReduceTask.java in 1.x for instance, methods run{Old/New}{Mapper/Reducer}. This has been the case for a very long time, and I think changing it may break behavior for several users out there, including some of the code I wrote at my former workplace. Though yes, it's highly strange that no spec doc exists for this; we ought to have one via another JIRA.

Hence the mapper.configure() approach with a static method would unfortunately fail on old API runs, for map-only jobs.

bq. Then we make the SequenceFileOutputFormat JobConfigurable so that ReflectionUtils.newInstance will call configure on it and load the metadata.

I imagine this working in a much better way. New API users may still be able to sneak in changes per map/reduce task, and otherwise (on the old API) rely on the driver to provide these.

bq. I think we should avoid users having to subclass SequenceFileOutputFormat. Thoughts?

Agreed, given your new approach via jobconf. Let's also make sure we serialize with base64 encoding or so, to allow for special chars in metadata if users so wish (because job.xml dislikes special chars).
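To make the base64 suggestion concrete, here is a minimal sketch of how metadata key/value pairs could be packed into a single job.xml-safe string and unpacked again. Everything in it is an assumption for illustration: the class name MetadataCodec, the property name in METADATA_KEY, and the "key:value,key:value" layout are hypothetical and are not part of Hadoop or of the attached patch. It uses java.util.Base64 (Java 8+) only to stay self-contained; a patch against this code base would more likely use an existing codec library.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, not part of Hadoop. */
final class MetadataCodec {
    // Hypothetical property name, chosen for illustration only.
    static final String METADATA_KEY = "mapreduce.output.seqfile.metadata";

    private static String b64(String s) {
        return Base64.getEncoder().encodeToString(s.getBytes(StandardCharsets.UTF_8));
    }

    private static String unb64(String s) {
        return new String(Base64.getDecoder().decode(s), StandardCharsets.UTF_8);
    }

    /**
     * Encodes each entry as base64(key) ':' base64(value), joined by ','.
     * Neither ':' nor ',' occurs in the standard base64 alphabet, so the
     * separators are unambiguous and the result contains no XML specials.
     */
    static String encode(Map<String, String> metadata) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            if (sb.length() > 0) {
                sb.append(',');
            }
            sb.append(b64(e.getKey())).append(':').append(b64(e.getValue()));
        }
        return sb.toString();
    }

    /** Reverses encode(); a null or empty string yields an empty map. */
    static Map<String, String> decode(String encoded) {
        Map<String, String> out = new LinkedHashMap<>();
        if (encoded == null || encoded.isEmpty()) {
            return out;
        }
        for (String pair : encoded.split(",")) {
            int sep = pair.indexOf(':');
            out.put(unb64(pair.substring(0, sep)), unb64(pair.substring(sep + 1)));
        }
        return out;
    }
}
```

Under this sketch, the driver would store the result of encode(...) in the job configuration under METADATA_KEY, and the output format's configure(JobConf) would read it back with decode(...) and build the metadata object from the resulting map. Because the value is set in the driver, it does not depend on when the record writer is instantiated relative to the mapper or reducer.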
> Enhancement to SequenceFileOutputFormat to allow user to set MetaData
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-2001
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2001
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>    Affects Versions: 0.20.2
>            Reporter: David Rosenstrauch
>            Priority: Minor
>         Attachments: MAPREDUCE-2001.patch
>
>
> The org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat class
> currently does not provide a way for the user to pass in a MetaData object to
> be written to the SequenceFile.
> Currently the only way for a developer to implement this functionality appears
> to be to create a subclass which overrides the SequenceFileOutputFormat's
> getRecordWriter() method, which is a bit of a kludge.
> This seems to be a common enough request to warrant a fix of some sort.
> (It's already been brought up twice in the past year:
> http://www.mail-archive.com/common-user@hadoop.apache.org/msg02198.html and
> http://www.mail-archive.com/mapreduce-user@hadoop.apache.org/msg00904.html)
> A couple of possible solutions:
> 1) Provide a static method SequenceFileOutputFormat.setMetaData(Job, MetaData)
> 2) Provide a (non-static) setMetaData() method on the
> SequenceFileOutputFormat class. The user would create a subclass of
> SequenceFileOutputFormat which, say, implements Configurable. Then in the
> setConf() method, the user could create the MetaData object (using data from
> the Configuration), and then call setMetaData. The SequenceFileOutputFormat
> would then use this MetaData object when creating the SequenceFile. (Note
> that the user would have to create a subclass of SequenceFileOutputFormat to
> make this solution work.)

--
This message is automatically generated by JIRA.