[
https://issues.apache.org/jira/browse/PIG-2613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13238198#comment-13238198
]
Daniel Dai commented on PIG-2613:
---------------------------------
Can you attach your input?
> Pig substitutes/mangles "upper ASCII" characters (values > 127)
> ---------------------------------------------------------------
>
> Key: PIG-2613
> URL: https://issues.apache.org/jira/browse/PIG-2613
> Project: Pig
> Issue Type: Bug
> Components: data, parser
> Affects Versions: 0.8.1
> Environment: linux
> Reporter: Leo Heska
>
> Create a small dummy input file that contains ASCII 254 (decimal) characters.
> These are often rendered as the Thorn character. A sample line looks like
> this:
> 1þ4þaaaþbbbþcccþdddþ7þ8þ9
> but your browser may not render that correctly. Hex representation of that
> sample line:
> 31FE34FE616161FE626262FE636363FE646464FE37FE38FE390D0A
> or, with spaces added for your convenience in reading:
> 31 FE 34 FE 61 61 61 FE 62 62 62 FE 63 63 63 FE 64 64 64 FE 37 FE 38 FE 39
> 0D 0A
> You can see that this is just a sample line of plain ASCII numerals and
> lower-case letters, separated by the FE (hex) or 254 (decimal) code point.
> Now load, like this:
> dummyts = load '/test/DummyDataTS.txt' using PigStorage(',') as
> (line:chararray);
> A dump
> dump dummyts;
>
> shows this:
> (1�4�aaa�bbb�ccc�ddd�7�8�9)
> The problem does not seem to be with the dump. I have written a UDF that
> counts characters in the line and returns TRUE if the character count is
> correct. When I do this:
>
> fd = filter dummyts by CountRight(line, 254, 8);
> which says "validate that there are 8 instances of the ASCII 254 code
> point/character," I get no results. When I do this:
> fd1 = filter dummyts by CountRight(line, 97, 3);
> which says "validate that there are three instances of the 'a' (ASCII 97)
> character," the results are perfect.
> It looks like something in Pig's load is changing instances of ASCII 254 to
> the following three characters:
> �
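
The symptom can be reproduced outside Pig with a short script. This is only a sketch, and it assumes (the report does not confirm this) that the loader decodes the input as UTF-8 and substitutes U+FFFD for invalid bytes. The helper count_code_point is hypothetical and merely stands in for the reporter's CountRight UDF, whose source is not shown. Python is used here purely for illustration.

```python
# Sample line from the report, as hex (it ends with CR LF).
SAMPLE_HEX = "31FE34FE616161FE626262FE636363FE646464FE37FE38FE390D0A"
raw = bytes.fromhex(SAMPLE_HEX)

# Assumption: the loader treats the bytes as UTF-8. 0xFE is never valid
# in UTF-8, so each occurrence becomes U+FFFD (the replacement character).
as_utf8 = raw.decode("utf-8", errors="replace")

# Latin-1 maps every byte to the code point with the same value,
# so 0xFE survives as U+00FE (thorn).
as_latin1 = raw.decode("latin-1")


def count_code_point(text, code_point):
    """Hypothetical stand-in for CountRight: count one code point in text."""
    return text.count(chr(code_point))


print(count_code_point(as_utf8, 254))      # 0: no thorn left after decoding
print(count_code_point(as_utf8, 0xFFFD))   # 8: one replacement per 0xFE byte
print(count_code_point(as_utf8, 97))       # 3: plain ASCII 'a' is unaffected
print(count_code_point(as_latin1, 254))    # 8: a single-byte decoding keeps them
print("\ufffd".encode("utf-8").hex())      # efbfbd: three bytes per replacement
```

Under that assumption, the filter on 254 matches nothing while the filter on 97 still works, which is the behaviour described above. U+FFFD occupies three bytes (EF BF BD) when written back as UTF-8, which is one possible reading of the "three characters" in the report.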