Re: How to set SequenceFile.Metadata from within SequenceFileOutputFormat?
You may also propose to extend the existing SequenceFileOutputFormat to allow this on JIRA or the dev mailing list :)

On Mon, Aug 9, 2010 at 8:09 PM, David Rosenstrauch dar...@darose.net wrote:
> On 08/07/2010 02:06 AM, Harsh J wrote:
>> On Sat, Aug 7, 2010 at 11:20 AM, David Rosenstrauch dar...@darose.net wrote:
>>> I'm using a SequenceFileOutputFormat. But I'd like to be able to set some SequenceFile.Metadata on the SequenceFile.Writer that's getting created. Doesn't look like there's any easy way to do that, other than overriding the SequenceFileOutputFormat.getRecordWriter() method. Am I overlooking anything?
>>
>> Doesn't seem like you are on a wrong path, so don't worry and go ahead!
>>
>> P.s.: A reply by Tom White to a similar question: http://www.mail-archive.com/common-u...@hadoop.apache.org/msg02218.html
>
> Thanks for the link! Sounds like I'll have to sub-class, then. A bit of a kludge, IMO, but I can live with it. That's life on the bleeding edge!
>
> Thanks again,
> DR

--
Harsh J
www.harshj.com
Re: How to set SequenceFile.Metadata from within SequenceFileOutputFormat?
Not sure whether the devs would want to implement a change like this, but it couldn't hurt to at least file it and make them aware.

Done: https://issues.apache.org/jira/browse/MAPREDUCE-2001

Thanks,
DR

On 08/09/2010 12:16 PM, Harsh J wrote:
> You may also propose to extend the existing SequenceFileOutputFormat to allow this on JIRA or the dev mailing list :)
>
> On Mon, Aug 9, 2010 at 8:09 PM, David Rosenstrauch dar...@darose.net wrote:
>> On 08/07/2010 02:06 AM, Harsh J wrote:
>>> On Sat, Aug 7, 2010 at 11:20 AM, David Rosenstrauch dar...@darose.net wrote:
>>>> I'm using a SequenceFileOutputFormat. But I'd like to be able to set some SequenceFile.Metadata on the SequenceFile.Writer that's getting created. Doesn't look like there's any easy way to do that, other than overriding the SequenceFileOutputFormat.getRecordWriter() method. Am I overlooking anything?
>>>
>>> Doesn't seem like you are on a wrong path, so don't worry and go ahead!
>>>
>>> P.s.: A reply by Tom White to a similar question: http://www.mail-archive.com/common-u...@hadoop.apache.org/msg02218.html
>>
>> Thanks for the link! Sounds like I'll have to sub-class, then. A bit of a kludge, IMO, but I can live with it. That's life on the bleeding edge!
>>
>> Thanks again,
>> DR
Re: How to set SequenceFile.Metadata from within SequenceFileOutputFormat?
On 08/07/2010 02:06 AM, Harsh J wrote:
> On Sat, Aug 7, 2010 at 11:20 AM, David Rosenstrauch dar...@darose.net wrote:
>> I'm using a SequenceFileOutputFormat. But I'd like to be able to set some SequenceFile.Metadata on the SequenceFile.Writer that's getting created. Doesn't look like there's any easy way to do that, other than overriding the SequenceFileOutputFormat.getRecordWriter() method. Am I overlooking anything?
>
> Doesn't seem like you are on a wrong path, so don't worry and go ahead!
>
> P.s.: A reply by Tom White to a similar question: http://www.mail-archive.com/common-u...@hadoop.apache.org/msg02218.html
>
>> TIA,
>> DR

On a similar note, it looks like if I want to customize the name/path of the generated SequenceFile my only option currently is to override FileOutputFormat.getDefaultWorkFile().

a) Again, have I got this correct, or am I overlooking something?

b) Would anyone else agree that this is something that can/should be made easier? (And thus worthy of a bug report?)

Thanks,
DR
Re: How to set SequenceFile.Metadata from within SequenceFileOutputFormat?
On 08/09/2010 05:45 PM, David Rosenstrauch wrote:
> On 08/09/2010 04:01 PM, David Rosenstrauch wrote:
>> On a similar note, it looks like if I want to customize the name/path of the generated SequenceFile my only option currently is to override FileOutputFormat.getDefaultWorkFile().
>>
>> a) Again, have I got this correct, or am I overlooking something?
>>
>> b) Would anyone else agree that this is something that can/should be made easier? (And thus worthy of a bug report?)
>>
>> Thanks,
>> DR
>
> Ugh. Actually, this looks even worse than I thought. It looks like there's a bunch of static helper methods in FileOutputFormat which use methods other than getDefaultWorkFile() to determine the file name. It looks like most of them use the method getUniqueFile(). Problem is that getUniqueFile is a *static* method, so I can't override it with an alternate implementation.
>
> Anyone know any short way out of this conundrum without my having to completely reimplement chunks of FileOutputFormat/SequenceFileOutputFormat?
>
> Thanks,
> DR

Hmmm ... on second look, overriding getDefaultWorkFile() should work. That's the method called by SequenceFileOutputFormat.getRecordWriter. So sorry for the noise.

Still, would be helpful if there were a less kludgey way to handle this, I'd think.

Thanks,
DR
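The static-versus-instance point in the message above can be illustrated without Hadoop on the classpath. What follows is a minimal sketch, not Hadoop's actual code: BaseOutputFormat, HidesStaticFormat, OverridesHookFormat, WorkFileDemo and openWriter() are hypothetical stand-ins, and only the method names getUniqueFile and getDefaultWorkFile are borrowed from the thread, with simplified signatures that are assumptions. It shows why redeclaring a static helper in a subclass has no effect on the base class, while overriding the instance hook that the writer-creating method calls does.

```java
import java.util.Locale;

// Hypothetical stand-in for a framework base class (not Hadoop's FileOutputFormat).
class BaseOutputFormat {
    // Static helper: calls to it are bound at compile time, so a subclass
    // can hide it but never override it.
    static String getUniqueFile(String name, int partition) {
        return String.format(Locale.ROOT, "%s-%05d", name, partition);
    }

    // Instance hook: dispatched at run time, so a subclass can override it.
    String getDefaultWorkFile(int partition) {
        return getUniqueFile("part", partition);
    }

    // Stand-in for getRecordWriter(): asks the hook where to write.
    String openWriter(int partition) {
        return "writing to " + getDefaultWorkFile(partition);
    }
}

// Redeclaring the static method only hides it; the base class never calls this version.
class HidesStaticFormat extends BaseOutputFormat {
    static String getUniqueFile(String name, int partition) {
        return "custom-" + partition;
    }
}

// Overriding the instance hook does change the file name the base class uses.
class OverridesHookFormat extends BaseOutputFormat {
    @Override
    String getDefaultWorkFile(int partition) {
        return BaseOutputFormat.getUniqueFile("events", partition) + ".seq";
    }
}

class WorkFileDemo {
    public static void main(String[] args) {
        System.out.println(new HidesStaticFormat().openWriter(3));   // writing to part-00003
        System.out.println(new OverridesHookFormat().openWriter(3)); // writing to events-00003.seq
    }
}
```

The same reasoning is why the override works in the real classes only if the code path you care about goes through the instance method; any caller that invokes the static helper directly would bypass the subclass.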
Re: How to set SequenceFile.Metadata from within SequenceFileOutputFormat?
Another solution would be to create a custom named output using mapred.lib.MultipleOutputs and collecting to that instead of the job-set output format (which one can set to NullOutputFormat so it doesn't complain about existing paths, etc.).

So if you'd want a 'foo' prefix on your 0-N numbered output files (instead of the default 'part'), you'd create it with:

    MultipleOutputs.addNamedOutput(conf, "foo", YourOutFormat.class, Key.class, Value.class);

The extension, I believe, can be changed too, while 'getting' the path from the FileOutputFormat while building your RecordWriter. Something like:

    Path outPath = FileOutputFormat.getTaskOutputPath(job, name + YOUR_EXTENSION);
    // Now create the 'writer' on this path.

On Tue, Aug 10, 2010 at 3:30 AM, David Rosenstrauch dar...@darose.net wrote:
> On 08/09/2010 05:45 PM, David Rosenstrauch wrote:
>> On 08/09/2010 04:01 PM, David Rosenstrauch wrote:
>>> On a similar note, it looks like if I want to customize the name/path of the generated SequenceFile my only option currently is to override FileOutputFormat.getDefaultWorkFile().
>>>
>>> a) Again, have I got this correct, or am I overlooking something?
>>>
>>> b) Would anyone else agree that this is something that can/should be made easier? (And thus worthy of a bug report?)
>>>
>>> Thanks,
>>> DR
>>
>> Ugh. Actually, this looks even worse than I thought. It looks like there's a bunch of static helper methods in FileOutputFormat which use methods other than getDefaultWorkFile() to determine the file name. It looks like most of them use the method getUniqueFile(). Problem is that getUniqueFile is a *static* method, so I can't override it with an alternate implementation.
>>
>> Anyone know any short way out of this conundrum without my having to completely reimplement chunks of FileOutputFormat/SequenceFileOutputFormat?
>>
>> Thanks,
>> DR
>
> Hmmm ... on second look, overriding getDefaultWorkFile() should work. That's the method called by SequenceFileOutputFormat.getRecordWriter. So sorry for the noise.
>
> Still, would be helpful if there were a less kludgey way to handle this, I'd think.
>
> Thanks,
> DR

--
Harsh J
www.harshj.com
Re: How to set SequenceFile.Metadata from within SequenceFileOutputFormat?
On 08/09/2010 09:14 PM, Harsh J wrote:
> Another solution would be to create a custom named output using mapred.lib.MultipleOutputs and collecting to that instead of the job-set output format (which one can set to NullOutputFormat so it doesn't complain about existing paths, etc.).
>
> So if you'd want a 'foo' prefix on your 0-N numbered output files (instead of the default 'part'), you'd create it with:
>
>     MultipleOutputs.addNamedOutput(conf, "foo", YourOutFormat.class, Key.class, Value.class);
>
> The extension, I believe, can be changed too, while 'getting' the path from the FileOutputFormat while building your RecordWriter. Something like:
>
>     Path outPath = FileOutputFormat.getTaskOutputPath(job, name + YOUR_EXTENSION);
>     // Now create the 'writer' on this path.

Tnx for the tip - didn't know about MultipleOutputs. (Though it's probably overkill for what I'm doing.)

Thanks again,
DR
Re: How to set SequenceFile.Metadata from within SequenceFileOutputFormat?
On Sat, Aug 7, 2010 at 11:20 AM, David Rosenstrauch dar...@darose.net wrote:
> I'm using a SequenceFileOutputFormat. But I'd like to be able to set some SequenceFile.Metadata on the SequenceFile.Writer that's getting created. Doesn't look like there's any easy way to do that, other than overriding the SequenceFileOutputFormat.getRecordWriter() method. Am I overlooking anything?

Doesn't seem like you are on a wrong path, so don't worry and go ahead!

P.s.: A reply by Tom White to a similar question: http://www.mail-archive.com/common-u...@hadoop.apache.org/msg02218.html

> TIA,
> DR

--
Harsh J
www.harshj.com
How to set SequenceFile.Metadata from within SequenceFileOutputFormat?
I'm using a SequenceFileOutputFormat. But I'd like to be able to set some SequenceFile.Metadata on the SequenceFile.Writer that's getting created. Doesn't look like there's any easy way to do that, other than overriding the SequenceFileOutputFormat.getRecordWriter() method. Am I overlooking anything?

TIA,
DR