Hi Alan,

You don't need to do this complex trickery if you write <Object, Text> to the
SequenceFile. How do you create the SequenceFile? In your case it might make
sense to create a <Text, Text> SequenceFile where the first object is the file
name or complete path and the second is the content.
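For example, the conversion might look something like this (a minimal sketch,
not your actual converter: the class name and argument handling are made up,
it reads each whole file into memory, which is fine for ~2MB logs, and it uses
the SequenceFile.createWriter overload that was current in 2010):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    // Hypothetical converter: args[0] is the output SequenceFile,
    // the remaining args are the small log files to pack into it.
    public class LogsToSequenceFile {
        public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            SequenceFile.Writer writer = SequenceFile.createWriter(
                    fs, conf, new Path(args[0]), Text.class, Text.class);
            try {
                for (int i = 1; i < args.length; i++) {
                    Path in = new Path(args[i]);
                    byte[] buf = new byte[(int) fs.getFileStatus(in).getLen()];
                    FSDataInputStream is = fs.open(in);
                    try {
                        is.readFully(buf);   // whole file in memory; OK for ~2MB logs
                    } finally {
                        is.close();
                    }
                    // key = file name, value = entire file contents as Text
                    writer.append(new Text(in.getName()), new Text(buf));
                }
            } finally {
                IOUtils.closeStream(writer);
            }
        }
    }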
Then you just call process_line(value.toString(), context); without having to
do the StringBuilder thing.

Alex K

On Fri, Jul 9, 2010 at 10:10 AM, Alan Miller <someb...@squareplanet.de> wrote:

> Hi Alex,
>
> My original files are ascii text. I was using <Object, Text, Text, Text>
> and everything worked fine. Because my files are small (>2MB on avg.) I
> get one map task per file. For my test I had 2000 files, totalling 5GB,
> and the whole run took approx. 40 minutes.
>
> I read that I could improve performance by merging my original files
> into one big SequenceFile.
>
> I did that, and that's why I'm trying to use <Object, BytesWritable,
> Text, Text>. My new SequenceFile is only 444MB, so my m/r job triggered
> 7 map tasks, but apparently my new map() is computationally more
> intensive and the whole run now takes 64 minutes.
>
> In my map(Text key, BytesWritable value, Context context), value
> contains the contents of a whole file. I tried to break it down into
> line-based records which I send to reduce():
>
>     StringBuilder line = new StringBuilder();
>     char linefeed = '\n';
>     for (byte byt : value.getBytes()) {
>         if ((int) byt == (int) linefeed) {
>             line.append((char) byt);
>             process_line(line.toString(), context);
>             line.delete(0, line.length());
>         } else {
>             line.append((char) byt);
>         }
>     }
>
> Alan
>
>
> On 07/08/2010 11:22 PM, Alex Kozlov wrote:
>
> Hi Alan,
>
> Is the content of the original file ascii text? Then you should be using
> the <Object, Text, Text, Text> signature. By default 'hadoop fs -text ...'
> will just call toString() on the object. You get the object itself in the
> map() method and can do whatever you want with it. If Text or BytesWritable
> does not work for you, you can always write your own class implementing the
> Writable interface:
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/Writable.html
>
> Let me know if you need more details on how to do this.
>
> Alex K
>
> On Thu, Jul 8, 2010 at 1:59 PM, Alan Miller <someb...@squareplanet.de> wrote:
>
>> Hi Alex,
>>
>> I'm not sure what you mean. I already set my mapper's signature to:
>>
>>     public class MyMapper extends Mapper<Object, BytesWritable, Text, Text> {
>>         ...
>>         public void map(Text key, BytesWritable value, Context context) { ... }
>>     }
>>
>> In my map() loop the contents of value is the text from the original
>> file, and value.toString() returns a String of bytes as hex pairs
>> separated by spaces. But I'd like the original tab-separated list of
>> strings (i.e. the lines in my original files).
>>
>> I see BytesWritable.getBytes() returns a byte[]. I guess I could write
>> my own RecordReader to convert the byte[] back to text strings, but I
>> thought this is something the framework would provide.
>>
>> Alan
>>
>>
>> On 07/08/2010 08:42 PM, Alex Loddengaard wrote:
>>
>> Hi Alan,
>>
>> SequenceFiles keep track of the key and value type, so you should be
>> able to use the Writables in the signature. Though it looks like you're
>> using the new API, and I admit that I'm not an expert with the new API.
>> Have you tried using the Writables in the signature?
>>
>> Alex
>>
>> On Thu, Jul 8, 2010 at 6:44 AM, Some Body <someb...@squareplanet.de> wrote:
>>
>>> To get around the small-file problem (I have thousands of 2MB log
>>> files) I wrote a class to convert all my log files into a single
>>> SequenceFile in (Text key, BytesWritable value) format. That works fine.
>>> I can run this:
>>>
>>>     hadoop fs -text /my.seq | grep peemt114.log | head -1
>>>     10/07/08 15:02:10 INFO util.NativeCodeLoader: Loaded the native-hadoop library
>>>     10/07/08 15:02:10 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
>>>     10/07/08 15:02:10 INFO compress.CodecPool: Got brand-new decompressor
>>>     peemt114.log    70 65 65 6d 74 31 31 34 09 .........[snip].......
>>>
>>> which shows my file name key (peemt114.log) and my file contents value,
>>> which appears to be converted to hex. The hex values up to the first
>>> tab (09) translate to my hostname.
>>>
>>> I'm trying to adapt my mapper to use the SequenceFile as input. I
>>> changed the job's inputFormatClass to:
>>>
>>>     MyJob.setInputFormatClass(SequenceFileInputFormat.class);
>>>
>>> and modified my mapper signature to:
>>>
>>>     public class MyMapper extends Mapper<Object, BytesWritable, Text, Text> {
>>>
>>> but how do I convert the value back to Text? When I print out the
>>> key/values using:
>>>
>>>     System.out.printf("MAPPER INKEY: [%s]\n", key);
>>>     System.out.printf("MAPPER INVAL: [%s]\n", value.toString());
>>>
>>> I get:
>>>
>>>     MAPPER INKEY: [peemt114.log]
>>>     MAPPER INVAL: [70 65 65 6d 74 31 31 34 09 .....[snip]......]
>>>
>>> Alan
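For completeness, the decoding Alan is after can be done directly in map()
without a custom RecordReader: build a String from the BytesWritable's bytes.
A minimal sketch (assuming the original files are ASCII/UTF-8 text; the
process_line body below is just a placeholder, since the thread doesn't show
Alan's version):

    import java.io.IOException;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class MyMapper extends Mapper<Text, BytesWritable, Text, Text> {

        @Override
        public void map(Text key, BytesWritable value, Context context)
                throws IOException, InterruptedException {
            // getBytes() returns the backing buffer, which can be longer than
            // the actual record, so always limit it to getLength().
            String contents = new String(value.getBytes(), 0, value.getLength(), "UTF-8");
            // split() drops the trailing '\n', so each line arrives bare
            for (String line : contents.split("\n")) {
                process_line(line, context);
            }
        }

        // Placeholder for Alan's process_line helper (not shown in the
        // thread): parse the line and emit whatever pairs the job needs.
        private void process_line(String line, Context context)
                throws IOException, InterruptedException {
            context.write(new Text("line"), new Text(line));
        }
    }

Note that the declared key type here is Text, matching the map() signature.
With the new API, declaring Mapper<Object, ...> while writing map(Text key,
...) means the method never overrides the base map(), and the framework
silently runs the default identity mapper instead.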