Re: SequenceFile as map input

Alan Miller Fri, 09 Jul 2010 10:16:42 -0700

Hi Alex,

My original files are ASCII text. I was using <Object, Text, Text, Text> and everything worked fine.

Because my files are small (>2MB on avg.), I get one map task per file.

For my test I had 2000 files, totalling 5GB, and the whole run took approx. 40 minutes.

I read that I could improve performance by merging my original files into one big SequenceFile.

I did that, and that's why I'm trying to use <Object, BytesWritable, Text, Text>.

My new SequenceFile is only 444MB, so my m/r job triggered 7 map tasks, but apparently my new map() is computationally more intensive and the whole run now takes 64 minutes.
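The seven map tasks are consistent with the input being cut into 64MB splits. The split size is not stated anywhere in this thread, so the 64MB figure is an assumption (it was a common HDFS block size at the time), and Hadoop's real split calculation has more rules than this ceiling division. A rough sketch, with `SplitEstimate` and `estimateSplits` as made-up names:

```java
final class SplitEstimate {
    // Number of fixed-size splits needed to cover a file (ceiling division).
    static long estimateSplits(long fileBytes, long splitBytes) {
        return (fileBytes + splitBytes - 1) / splitBytes;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // 444 MB comes from the message; the 64 MB split size is an assumption.
        System.out.println(estimateSplits(444 * mb, 64 * mb)); // prints 7
    }
}
```

With only seven tasks, each map() call has to work through many whole files, so any per-byte cost inside map() is multiplied accordingly.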

In my map(Text key, BytesWritable value, Context context), value contains the contents of a whole file. I tried to break it down into line-based records, which I send to reduce().


   StringBuilder line = new StringBuilder();
   char linefeed = '\n';
   for (byte byt : value.getBytes()) {
       if ((int) byt == (int) linefeed) {
           line.append((char) byt);
           process_line(line.toString(), context);
           line.delete(0, line.length());
       } else {
           line.append((char) byt);
       }
   }
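One thing worth checking in the loop above: BytesWritable.getBytes() returns the backing buffer, which can be longer than the valid data, so only the first getLength() bytes should be read. A minimal sketch of an index-based split that respects that length is below. It uses plain JDK types so it runs without Hadoop on the classpath; `LineSplitter` and `splitLines` are made-up names, and in the mapper the call would presumably be `splitLines(value.getBytes(), value.getLength())`. Unlike the loop above, it drops the trailing '\n' from each line, and whether it is actually faster for this job is untested.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class LineSplitter {
    // Splits the first `length` bytes of `buffer` into lines, dropping the '\n' terminators.
    static List<String> splitLines(byte[] buffer, int length) {
        List<String> lines = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < length; i++) {
            if (buffer[i] == '\n') {
                // Build each line in one step instead of appending char by char.
                lines.add(new String(buffer, start, i - start, StandardCharsets.US_ASCII));
                start = i + 1;
            }
        }
        if (start < length) { // last line had no trailing newline
            lines.add(new String(buffer, start, length - start, StandardCharsets.US_ASCII));
        }
        return lines;
    }

    public static void main(String[] args) {
        // 8 valid bytes followed by padding, mimicking an oversized backing buffer.
        byte[] padded = "a\tb\nc\td\n\0\0\0".getBytes(StandardCharsets.US_ASCII);
        for (String line : splitLines(padded, 8)) {
            System.out.println(line);
        }
    }
}
```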

Alan

On 07/08/2010 11:22 PM, Alex Kozlov wrote:

Hi Alan,

Is the content of the original file ASCII text? Then you should be using the <Object, Text, Text, Text> signature. By default 'hadoop fs -text ...' will just call toString() on the object. You get the object itself in the map() method and can do whatever you want with it. If Text or BytesWritable does not work for you, you can always write your own class implementing the Writable <http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/Writable.html> interface.
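As a rough illustration of the last suggestion, a custom value class needs the two Writable methods, write(DataOutput) and readFields(DataInput), plus a toString() that returns readable text. The sketch below is not from this thread: `LogFileContents` is a hypothetical name, the length-prefixed layout is just one possible choice, and the class leaves out `implements org.apache.hadoop.io.Writable` so that it compiles with the JDK alone.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical value type; in a Hadoop job it would declare
// "implements org.apache.hadoop.io.Writable".
final class LogFileContents {
    private byte[] data = new byte[0];

    void set(String text) {
        data = text.getBytes(StandardCharsets.US_ASCII);
    }

    // Same shape as Writable.write: a length prefix, then the raw bytes.
    public void write(DataOutput out) throws IOException {
        out.writeInt(data.length);
        out.write(data);
    }

    // Same shape as Writable.readFields: restores what write() produced.
    public void readFields(DataInput in) throws IOException {
        data = new byte[in.readInt()];
        in.readFully(data);
    }

    // Returns the text itself rather than a hex dump.
    @Override
    public String toString() {
        return new String(data, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) throws IOException {
        LogFileContents original = new LogFileContents();
        original.set("peemt114\tfield2\n"); // made-up sample record
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        original.write(new DataOutputStream(sink));

        LogFileContents copy = new LogFileContents();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(sink.toByteArray())));
        System.out.print(copy);
    }
}
```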


Let me know if you need more details on how to do this.

Alex K

On Thu, Jul 8, 2010 at 1:59 PM, Alan Miller <someb...@squareplanet.de> wrote:


    Hi Alex,

    I'm not sure what you mean. I already set my mapper's signature to:

      public class MyMapper extends Mapper<Object, BytesWritable, Text, Text> {
         ...
         public void map(Text key, BytesWritable value, Context context) {
         }
      }

    In my map() loop the contents of value is the text from the
    original file
    and the value.toString() returns a String of bytes as hex pairs
    separated by space.
    But I'd like the original tab separated list of strings (i.e. the
    lines in my original files).

    I see BytesWritable.getBytes() returns a byte[]. I guess I could
    write my own
    RecordReader to convert the byte[] back to text strings but I
    thought this is
    something the framework would provide.

    Alan


    On 07/08/2010 08:42 PM, Alex Loddengaard wrote:

    Hi Alan,

    SequenceFiles keep track of the key and value type, so you should
    be able to use the Writables in the signature.  Though it looks
    like you're using the new API, and I admit that I'm not an expert
    with the new API.  Have you tried using the Writables in the
    signature?

    Alex

    On Thu, Jul 8, 2010 at 6:44 AM, Some Body <someb...@squareplanet.de> wrote:

        To get around the small-file-problem (I have thousands of 2MB
        log files) I wrote
        a class to convert all my log files into a single SequenceFile in
        (Text key,  BytesWritable value) format.  That works fine. I
        can run this:

           hadoop fs -text /my.seq |grep peemt114.log | head -1
           10/07/08 15:02:10 INFO util.NativeCodeLoader: Loaded the native-hadoop library
           10/07/08 15:02:10 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
           10/07/08 15:02:10 INFO compress.CodecPool: Got brand-new decompressor
           peemt114.log    70 65 65 6d 74 31 31 34 09 .........[snip].......

        which shows my file name key (peemt114.log)
        and file contents value which appears to be converted to hex.
        The hex values up to the first tab (09)  translate to my
        hostname.

        I'm trying to adapt my mapper to use the SequenceFile as input.

        I changed the job's inputFormatClass to:
           MyJob.setInputFormatClass(SequenceFileInputFormat.class);
        and modified my mapper signature to:
          public class MyMapper extends Mapper<Object, BytesWritable, Text, Text> {

        but how do I convert the value back to Text? When I print out
        the key and value using:
               System.out.printf("MAPPER INKEY: [%s]\n", key);
               System.out.printf("MAPPER INVAL: [%s]\n", value.toString());
        I get:
           MAPPER INKEY: [peemt114.log]
           MAPPER INVAL: [70 65 65 6d 74 31 31 34 09 .....[snip]......]
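The hex pairs in the output above are just the file's bytes printed two hex digits at a time, so nothing has been lost. The sketch below turns such a dump back into text to confirm that; `HexDump` and `decode` are made-up names. Inside the mapper there should be no need to go through the hex string at all: building a String from value.getBytes() with offset 0 and length value.getLength() ought to give the text directly.

```java
import java.nio.charset.StandardCharsets;

final class HexDump {
    // Turns space-separated hex pairs (e.g. "70 65 09") back into ASCII text.
    static String decode(String hexPairs) {
        String[] parts = hexPairs.trim().split("\\s+");
        byte[] bytes = new byte[parts.length];
        for (int i = 0; i < parts.length; i++) {
            bytes[i] = (byte) Integer.parseInt(parts[i], 16);
        }
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // Hex pairs copied from the output above, up to and including the tab (09).
        String text = decode("70 65 65 6d 74 31 31 34 09");
        System.out.println(text.replace("\t", "\\t")); // prints peemt114\t
    }
}
```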

        Alan
