Hi,

The issue here is that your input data is not in fact ASCII: ASCII defines no characters with codes above 127. My guess is that your input is in latin1, or some other encoding that defines one-byte character codes with values > 127.
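To make the byte-level point concrete, here is a minimal sketch in plain Java (no Hadoop involved). It assumes the file really is latin1 (ISO-8859-1), which is only my guess, and the class and method names are made up for illustration. It decodes the same bytes once as UTF-8 and once as latin1:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names here are illustrative only.
class EncodingDemo {
    // "Plachý<TAB>amg" as it would be stored in a latin1 file (an assumption):
    // 'ý' is the single byte 0xFD, which is above 127 and so not ASCII.
    static final byte[] LATIN1_BYTES = {
        'P', 'l', 'a', 'c', 'h', (byte) 0xFD, '\t', 'a', 'm', 'g'
    };

    // Decode as UTF-8: the lone 0xFD byte is not a valid UTF-8 sequence,
    // so Java substitutes the replacement character U+FFFD.
    static String decodeAsUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Decode as latin1: every byte maps directly to one character.
    static String decodeAsLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String asUtf8 = decodeAsUtf8(LATIN1_BYTES);
        String asLatin1 = decodeAsLatin1(LATIN1_BYTES);
        System.out.println("UTF-8 decode contains U+FFFD: "
                + (asUtf8.indexOf('\uFFFD') >= 0));
        System.out.println("latin1 decode matches original: "
                + asLatin1.equals("Plach\u00FD\tamg"));
    }
}
```

In a Hadoop job the same idea might be applied to the raw bytes of the Text value (Text.getBytes() together with getLength()) instead of calling toString(), but I have not tested that against your data, so treat it as an assumption.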
TextInputFormat (and anything else that uses the Text writable type) assumes the input is UTF-8. True ASCII is a subset of UTF-8, whereas latin1 bytes above 127 are generally not valid UTF-8 on their own.

As for the best solution here, I'm not exactly sure. Hopefully someone else can pipe up with a trick to get an InputFormat that works on non-UTF-8 data.

-Todd

On Mon, Jul 27, 2009 at 10:22 AM, pmg <[email protected]> wrote:
>
> I have a tab delimited text file and I read using TextInputFormat. I have
> problems reading lines from the txt file with ascii code > 127 e.g.
>
> P 676827 Martin Plachý amg
>
> gets read as
>
> P 676827 Martin Plach? with missing 3rd tab delimited column. Whats the
> best way to handle this kind of input? thanks
> --
> View this message in context:
> http://www.nabble.com/Text-encoding-tp24684865p24684865.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
