Re: DBInputFormat / DBWritable question

Aaron Kimball Thu, 05 Aug 2010 21:51:23 -0700

The InputFormat instantiates a RecordReader (DBRecordReader) in the same
process as the Mapper. The DBWritable instances are instantiated inside the
RecordReader and fed directly to your mapper.

If your mapper then processes the data and sends it directly to the
OutputFormat (e.g., TextOutputFormat, which just calls toString() on the
key and value), then you do not need to implement the Writable
interface.

If you intend to serialize your data to SequenceFiles (through
SequenceFileOutputFormat, or otherwise) or as intermediate data (to be
consumed by a reducer) then you need to implement Writable.
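
For reference, a minimal sketch of what implementing Writable involves if you
do end up needing it. The EmployeeValue class and its two fields are made up
for illustration, and the Writable interface is declared locally (with the
same two methods as org.apache.hadoop.io.Writable) only so the sketch compiles
without Hadoop on the classpath; in a real job you would import Hadoop's
interface instead.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Stand-in with the same two methods as org.apache.hadoop.io.Writable.
// Assumption: declared here only to keep the sketch self-contained.
interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical value type with two fields.
class EmployeeValue implements Writable {
    private long id;
    private String name = "";

    // Hadoop creates instances reflectively, so keep a no-arg constructor.
    EmployeeValue() {}

    EmployeeValue(long id, String name) {
        this.id = id;
        this.name = name;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        // Field order here must match the order in readFields().
        out.writeLong(id);
        out.writeUTF(name);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        id = in.readLong();
        name = in.readUTF();
    }

    @Override
    public String toString() {
        return id + "\t" + name;
    }
}
```

The only real rule is that write() and readFields() agree on field order and
types; everything else is up to you.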

For that matter, if you don't intend to use DBOutputFormat with this data,
then you don't even need to provide a body for the "void
write(PreparedStatement)" method; just stub it.
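
A rough sketch of a read-only DBWritable along those lines. The EmployeeRecord
class and the "id" and "name" column names are assumptions for illustration,
and the DBWritable interface is declared locally (mirroring the two methods of
org.apache.hadoop.mapreduce.lib.db.DBWritable) only so this compiles without
Hadoop on the classpath; in a real job, import Hadoop's interface and drop the
local one.

```java
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Stand-in with the same two methods as Hadoop's DBWritable.
// Assumption: declared here only to keep the sketch self-contained.
interface DBWritable {
    void write(PreparedStatement statement) throws SQLException;
    void readFields(ResultSet resultSet) throws SQLException;
}

// Hypothetical record type; note that it does not implement Writable.
class EmployeeRecord implements DBWritable {
    private long id;
    private String name;

    @Override
    public void readFields(ResultSet resultSet) throws SQLException {
        // Called by the RecordReader for each row; column names are assumed.
        id = resultSet.getLong("id");
        name = resultSet.getString("name");
    }

    @Override
    public void write(PreparedStatement statement) throws SQLException {
        // Intentionally empty: only needed when writing via DBOutputFormat.
    }

    public long getId() {
        return id;
    }

    public String getName() {
        return name;
    }

    @Override
    public String toString() {
        return id + "\t" + name;
    }
}
```

If you would rather fail loudly on accidental use with DBOutputFormat, the
stub could throw UnsupportedOperationException instead of doing nothing.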

A couple of other tips:
* Consider using DataDrivenDBInputFormat. It's considerably
higher-throughput.
* If you're using CDH (Cloudera's Distribution for Hadoop), rather than
writing your own DBWritable, use Sqoop's code generation capability (sqoop
codegen --connect ... --table ...) to create your Java class for you.
* Related, if all you're doing is importing a copy of the data to HDFS,
Sqoop can handle that for you pretty easily :)

See github.com/cloudera/sqoop and archive.cloudera.com/cdh/3/sqoop for more
info.

Cheers,
- Aaron

On Wed, Aug 4, 2010 at 7:41 PM, Harsh J <[email protected]> wrote:

> AFAIK you don't really need serialization if your job is a map-only
> one; the OutputFormat/RecWriter (if any) should take care of it.
>
> On Thu, Aug 5, 2010 at 7:07 AM, David Rosenstrauch <[email protected]> wrote:
> > I'm working on a M/R job which uses DBInputFormat.  So I have to create my
> > own DBWritable for this.  I'm a little bit confused about how to implement
> > this though.
> >
> > In the sample code in the Javadoc for the DBWritable class, the MyWritable
> > implements both DBWritable and Writable - thereby forcing the author of the
> > MyWritable class to implement the methods to serialize/deserialize it
> > to/from DataInput & DataOutput.  Without getting into too much detail,
> > having to implement this serialization would add a good bit of complexity to
> > my code.
> >
> > However, the DBWritable that I'm writing really doesn't need to exist beyond
> > the Mapper.  I.e., it'll be input to the Mapper, but the Mapper won't emit
> > it out to the sort/reduce steps.  And after doing some reading/digging
> > through the code, it looks to me like the InputFormat and the Mapper always
> > get run on the same host & JVM.  If that's in fact the case, then there'd be
> > no need for me to make my DBWritable implement Writable also and so I could
> > avoid the whole serialization/deserialization issue.
> >
> > So my question is basically:  have I got this correct?  Do the InputFormat
> > and the Mapper always run in the same VM?  (In which case I can do what I'm
> > planning and code the DBWritable without the serialization headaches from
> > the Writable class.)
> >
> > TIA,
> >
> > DR
> >
>
>
>
> --
> Harsh J
> www.harshj.com
>
