Re: How to set SequenceFile.Metadata from within SequenceFileOutputFormat?

2010-08-09 Thread Harsh J
You could also propose, on JIRA or the dev mailing list, extending the
existing SequenceFileOutputFormat to allow this :)

On Mon, Aug 9, 2010 at 8:09 PM, David Rosenstrauch dar...@darose.net wrote:
> On 08/07/2010 02:06 AM, Harsh J wrote:
>> On Sat, Aug 7, 2010 at 11:20 AM, David Rosenstrauch dar...@darose.net wrote:
>>> I'm using a SequenceFileOutputFormat.  But I'd like to be able to set some
>>> SequenceFile.Metadata on the SequenceFile.Writer that's getting created.
>>> Doesn't look like there's any easy way to do that, other than overriding
>>> the SequenceFileOutputFormat.getRecordWriter() method.
>>>
>>> Am I overlooking anything?
>>
>> Doesn't seem like you are on a wrong path, so don't worry and go ahead!
>>
>> P.s.: A reply by Tom White to a similar question:
>> http://www.mail-archive.com/common-u...@hadoop.apache.org/msg02218.html
>
> Thanks for the link!
>
> Sounds like I'll have to sub-class, then.  A bit of a kludge, IMO, but I can
> live with it.  That's life on the bleeding edge!
>
> Thanks again,
>
> DR




-- 
Harsh J
www.harshj.com


Re: How to set SequenceFile.Metadata from within SequenceFileOutputFormat?

2010-08-09 Thread David Rosenstrauch
Not sure if the devs would want to implement a change like this, but it
couldn't hurt to at least file it and make them aware.


Done:  https://issues.apache.org/jira/browse/MAPREDUCE-2001

Thanks,

DR



Re: How to set SequenceFile.Metadata from within SequenceFileOutputFormat?

2010-08-09 Thread David Rosenstrauch



On a similar note, it looks like if I want to customize the name/path of 
the generated SequenceFile my only option currently is to override 
FileOutputFormat.getDefaultWorkFile().


a) Again, have I got this correct, or am I overlooking something?
b) Would anyone else agree that this is something that can/should be 
made easier?  (And thus worthy of a bug report?)


Thanks,

DR


Re: How to set SequenceFile.Metadata from within SequenceFileOutputFormat?

2010-08-09 Thread David Rosenstrauch

On 08/09/2010 05:45 PM, David Rosenstrauch wrote:
> On 08/09/2010 04:01 PM, David Rosenstrauch wrote:
>> On a similar note, it looks like if I want to customize the name/path of
>> the generated SequenceFile my only option currently is to override
>> FileOutputFormat.getDefaultWorkFile().
>>
>> a) Again, have I got this correct, or am I overlooking something?
>> b) Would anyone else agree that this is something that can/should be
>> made easier? (And thus worthy of a bug report?)
>>
>> Thanks,
>>
>> DR
>
> Ugh. Actually, this looks even worse than I thought.
>
> It looks like there's a bunch of static helper methods in
> FileOutputFormat which use methods other than getDefaultWorkFile() to
> determine the file name.
>
> It looks like most of them use the method getUniqueFile(). Problem is
> that getUniqueFile is a *static* method, so I can't override it with an
> alternate implementation.
>
> Anyone know any short way out of this conundrum without my having to
> completely reimplement chunks of FileOutputFormat/SequenceFileOutputFormat?
>
> Thanks,
>
> DR


Hmmm ... on second look, overriding getDefaultWorkFile() should work. 
That's the method called by SequenceFileOutputFormat.getRecordWriter. 
So sorry for the noise.


Still, would be helpful if there were a less kludgey way to handle this, 
I'd think.


Thanks,

DR
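
The static-versus-instance distinction in this message can be illustrated with a minimal plain-Java sketch. It does not use any Hadoop classes: BaseFormat, HidingFormat, OverridingFormat, uniqueName() and workFileName() are hypothetical stand-ins chosen only to mirror the roles of a static helper (like getUniqueFile) and an overridable instance method (like getDefaultWorkFile). The "part"/"foo" prefixes and the zero-padded numbering are likewise assumptions for illustration.

```java
// Hypothetical stand-ins, not Hadoop classes: BaseFormat plays the role of a
// framework class with a static helper and an overridable instance hook.
class BaseFormat {
    // Static helper: calls to it are bound at compile time.
    static String uniqueName(String prefix, int partition) {
        return String.format("%s-%05d", prefix, partition);
    }

    // Instance hook: dispatched at run time, so a subclass can replace it.
    String workFileName(int partition) {
        return uniqueName("part", partition);
    }
}

// Declaring a static method with the same signature only hides the original.
// The inherited workFileName() still calls BaseFormat.uniqueName().
class HidingFormat extends BaseFormat {
    static String uniqueName(String prefix, int partition) {
        return String.format("foo-%05d", partition);
    }
}

// Overriding the instance hook does change what callers see.
class OverridingFormat extends BaseFormat {
    @Override
    String workFileName(int partition) {
        return uniqueName("foo", partition);
    }
}

class StaticVsOverrideDemo {
    public static void main(String[] args) {
        BaseFormat hiding = new HidingFormat();
        BaseFormat overriding = new OverridingFormat();
        System.out.println(hiding.workFileName(3));     // prints "part-00003"
        System.out.println(overriding.workFileName(3)); // prints "foo-00003"
    }
}
```

Hiding the static helper has no effect on code inherited from the base class, while overriding the instance method that the framework actually calls does. That is consistent with the conclusion above that overriding getDefaultWorkFile() is the workable route.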


Re: How to set SequenceFile.Metadata from within SequenceFileOutputFormat?

2010-08-09 Thread Harsh J
Another solution would be to create a custom named output using
mapred.lib.MultipleOutputs and collecting to that instead of the
job-set output format (which one can set to NullOutputFormat so it
doesn't complain about existing paths, etc.).

So if you want a 'foo' prefix on your 0-N numbered output files
(instead of the default 'part'), you'd create it with:
MultipleOutputs.addNamedOutput(conf, "foo", YourOutFormat.class,
    Key.class, Value.class);

The extension, I believe, can be changed too, when 'getting' the path
from the FileOutputFormat while building your RecordWriter. Something
like:
Path outPath = FileOutputFormat.getTaskOutputPath(job, name + YOUR_EXTENSION);
// Now create the 'writer' on this path.





-- 
Harsh J
www.harshj.com


Re: How to set SequenceFile.Metadata from within SequenceFileOutputFormat?

2010-08-09 Thread David Rosenstrauch



Thanks for the tip - I didn't know about MultipleOutputs.  (Though it's
probably overkill for what I'm doing.)


Thanks again,

DR


Re: How to set SequenceFile.Metadata from within SequenceFileOutputFormat?

2010-08-07 Thread Harsh J
On Sat, Aug 7, 2010 at 11:20 AM, David Rosenstrauch dar...@darose.net wrote:
> I'm using a SequenceFileOutputFormat.  But I'd like to be able to set some
> SequenceFile.Metadata on the SequenceFile.Writer that's getting created.
> Doesn't look like there's any easy way to do that, other than overriding
> the SequenceFileOutputFormat.getRecordWriter() method.
>
> Am I overlooking anything?

Doesn't seem like you are on a wrong path, so don't worry and go ahead!

P.s.: A reply by Tom White to a similar question:
http://www.mail-archive.com/common-u...@hadoop.apache.org/msg02218.html

> TIA,
>
> DR




-- 
Harsh J
www.harshj.com


How to set SequenceFile.Metadata from within SequenceFileOutputFormat?

2010-08-06 Thread David Rosenstrauch
I'm using a SequenceFileOutputFormat.  But I'd like to be able to set 
some SequenceFile.Metadata on the SequenceFile.Writer that's getting 
created.  Doesn't look like there's any easy way to do that, other than 
overriding the SequenceFileOutputFormat.getRecordWriter() method.


Am I overlooking anything?

TIA,

DR