Hi,

The issue here is that your input data is not in fact ASCII: ASCII defines no
characters with codes above 127. My guess is that your input is in Latin-1
(ISO-8859-1) or some other encoding that defines one-byte character codes
above 127.

TextInputFormat (and anything else that uses the Text writable type) assumes
the input is UTF-8. True ASCII is a subset of UTF-8, whereas Latin-1 text
containing bytes above 127 is generally not valid UTF-8.
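
To illustrate the mismatch, here is a minimal standalone sketch in plain Java
(no Hadoop involved). The class and method names are made up for this example,
and the byte value assumes the file really is Latin-1, where 'ý' is the single
byte 0xFD. Decoding that byte as UTF-8 yields the replacement character,
while decoding it as ISO-8859-1 recovers the original letter.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Hadoop.
public class EncodingDemo {
    // Decode raw bytes as UTF-8; malformed sequences become U+FFFD.
    static String decodeAsUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Decode raw bytes as Latin-1; every byte maps to exactly one character.
    static String decodeAsLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "Plachý" as Latin-1 bytes (assumed encoding of the input file).
        byte[] raw = {'P', 'l', 'a', 'c', 'h', (byte) 0xFD};
        System.out.println(decodeAsUtf8(raw));   // last character is U+FFFD
        System.out.println(decodeAsLatin1(raw)); // last character is U+00FD
    }
}
```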

As for the best solution here, I'm not exactly sure. Hopefully someone else
can pipe up with a trick to get an InputFormat that works on non-UTF-8 data.
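
One possibility, untested on a real job, would be to skip Text.toString() in
the mapper and decode the raw line bytes yourself. The sketch below is plain
Java with a hypothetical helper; it assumes the input is Latin-1 and that the
buffer and valid length would come from Text.getBytes() and Text.getLength()
(the backing array can be longer than the valid data).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Hadoop.
public class Latin1Line {
    // Decode only the first `length` bytes as Latin-1, then split on tabs.
    // The -1 limit keeps trailing empty columns instead of dropping them.
    static String[] splitLatin1(byte[] buf, int length) {
        String line = new String(buf, 0, length, StandardCharsets.ISO_8859_1);
        return line.split("\t", -1);
    }
}
```

In a mapper this would be called as splitLatin1(value.getBytes(),
value.getLength()), where value is the Text for the current line.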

-Todd

On Mon, Jul 27, 2009 at 10:22 AM, pmg <[email protected]> wrote:

>
> I have a tab delimited text file and I read using TextInputFormat. I have
> problems reading lines from the txt file with ascii code > 127 e.g.
>
> P   676827      Martin Plachý   amg
>
> gets read as
>
> P   676827      Martin Plach? with missing 3rd tab delimited column. What's
> the best way to handle this kind of input? Thanks
> --
> View this message in context:
> http://www.nabble.com/Text-encoding-tp24684865p24684865.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>
