By design, TextInputFormat splits each file into lines and passes each line to the mapper as a separate record.
Overriding isSplitable() to return false does not change that: each file becomes a single split, but the record reader still returns one record per line. If you want the contents of a whole file as one record, the best way is to pack the files into a SequenceFile, one entry per file, with the file name as the key and the file's bytes as the value. Alternatively, you can give the job an input file in which each line is a file name, and open each named file explicitly within the mapper.

On Sat, Jan 23, 2010 at 8:48 AM, prashant ullegaddi <[email protected]> wrote:

> Why don't you extend FileInputFormat, and implement
> isSplittable
> <http://hadoop.apache.org/common/docs/r0.20.1/api/org/apache/hadoop/mapred/FileInputFormat.html#isSplitable%28org.apache.hadoop.fs.FileSystem,%20org.apache.hadoop.fs.Path%29>,
> so that it returns false.
>
> On Sat, Jan 23, 2010 at 10:05 PM, stolikp <[email protected]> wrote:
>
> > I've got some text files in my input directory and I want to pass each single
> > text file (whole file not just a line) to a map (one file per one map). How
> > can I do this ? TextInputFormat splits text into lines and I do not want
> > this to happen.
> > I tried:
> > http://hadoop.apache.org/common/docs/r0.20./streaming.html#How+do+I+process+files%2C+one+per+map%3F
> > but it doesn't work for me, compiler doesn't know what
> > NonSplitableTextInputFormat.class is.
> > I'm using hadoop 0.20.1
> > --
> > View this message in context:
> > http://old.nabble.com/Passing-whole-text-file-to-a-single-map-tp27287649p27287649.html
> > Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
> --
> Thanks,
> Prashant Ullegaddi,
> Search and Information Extraction Lab,
> IIIT-Hyderabad, INDIA.
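
P.S. For the last option in the reply (an input file that lists one file name per line, with each named file opened inside the mapper), here is a minimal sketch of the control flow. It is plain Java, not Hadoop code: java.nio on the local filesystem stands in for HDFS, and the names WholeFilePerRecord, mapOneFile and run are hypothetical, made up for illustration. In a real job the body of mapOneFile would live in your mapper and open the path through Hadoop's FileSystem API instead.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Sketch only: local files stand in for HDFS, and these names are hypothetical.
final class WholeFilePerRecord {

    /** Stand-in for one map() call: the input record is a file name, not file data. */
    static String mapOneFile(String fileNameLine) throws IOException {
        Path path = Paths.get(fileNameLine.trim());
        // Open the named file explicitly and read it as a single unit.
        byte[] contents = Files.readAllBytes(path);
        // "Process" the whole file; here we only emit its name and size in bytes.
        return path.getFileName() + "\t" + contents.length;
    }

    /** Feeds each non-empty line of the listing file to mapOneFile, one call per file. */
    static List<String> run(Path listing) throws IOException {
        List<String> out = new ArrayList<>();
        for (String line : Files.readAllLines(listing, StandardCharsets.UTF_8)) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines in the listing
            }
            out.add(mapOneFile(line));
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java WholeFilePerRecord listing.txt
        for (String result : run(Paths.get(args[0]))) {
            System.out.println(result);
        }
    }
}
```

The point of the design is that the line-splitting input format only ever sees the small listing file, so each record the mapper gets is a file name, and the mapper decides how to read the file behind it.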
