Laukik Chitnis updated PIG-560:

    Attachment: utf-limit-patch.diff

The patch converts the string to UTF-8 bytes with the String object's 
getBytes(charsetName) method instead of calling writeUTF(). The length can 
then be stored as an int rather than in the 2 bytes that writeUTF() uses. The 
patch also includes the corresponding change for reading a CHARARRAY back in.
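
As a rough illustration of the approach (not the contents of 
utf-limit-patch.diff), a length-prefixed write and the matching read could look 
like the sketch below. The class name Utf8LengthPrefixed and the helpers 
writeString(), readString() and roundTrip() are hypothetical names chosen for 
this example; only the java.io calls are real API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical sketch; not the actual BinStorage code from the patch.
public class Utf8LengthPrefixed {

    // Write a 4-byte int length followed by the string's UTF-8 bytes.
    static void writeString(DataOutput out, String s) throws IOException {
        byte[] utf8 = s.getBytes("UTF-8");
        out.writeInt(utf8.length);
        out.write(utf8);
    }

    // Read the int length, then exactly that many bytes, and decode them as UTF-8.
    static String readString(DataInput in) throws IOException {
        int length = in.readInt();
        byte[] utf8 = new byte[length];
        in.readFully(utf8);
        return new String(utf8, "UTF-8");
    }

    // Serialize to an in-memory buffer and read the value back.
    static String roundTrip(String s) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            writeString(new DataOutputStream(buffer), s);
            byte[] bytes = buffer.toByteArray();
            return readString(new DataInputStream(new ByteArrayInputStream(bytes)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // 70000 ASCII characters encode to 70000 UTF-8 bytes, above the writeUTF() limit.
        String big = new String(new char[70000]).replace('\0', 'x');
        System.out.println(roundTrip(big).equals(big));
    }
}
```

Note that readFully() is used rather than read(), since read() may return 
fewer bytes than requested.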

> UTFDataFormatException (encoded string too long) is thrown when storing 
> strings > 65536 bytes (in UTF8 form) using BinStorage()
> -------------------------------------------------------------------------------------------------------------------------------
>                 Key: PIG-560
>                 URL: https://issues.apache.org/jira/browse/PIG-560
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: types_branch
>            Reporter: Pradeep Kamath
>             Fix For: types_branch
>         Attachments: utf-limit-patch.diff
> BinStorage() uses the Java DataOutput.writeUTF() and DataInput.readUTF() API 
> to write out Strings as (modified) UTF-8 bytes and to read them back. From 
> the Javadoc: "First, the total number of bytes needed to represent all the 
> characters of s is calculated. If this number is larger than 65535, then a 
> UTFDataFormatException is thrown." (This is because the writeUTF() API uses 
> 2 bytes to represent the number of bytes.) A way to get around this would be 
> to not use writeUTF()/readUTF() and instead convert the string by hand to the 
> corresponding UTF-8 byte[] (using String.getBytes("UTF-8")) and then write 
> the length of the byte array as an int. Because a Java int is signed, this 
> will allow a size of up to 2^31 - 1 bytes (about 2 GB).
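
The limit described in the issue can be reproduced with a few lines of Java. 
This is only a sketch for illustration: the class name WriteUtfLimit and the 
helper writeUtfSucceeds() are made up for this example, and it uses ASCII-only 
strings so that one character encodes to exactly one byte.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.io.UncheckedIOException;

// Hypothetical demo class; not part of Pig.
public class WriteUtfLimit {

    // Returns true if writeUTF() accepts a string of the given number of ASCII characters.
    static boolean writeUtfSucceeds(int asciiLength) {
        String s = new String(new char[asciiLength]).replace('\0', 'x');
        try {
            new DataOutputStream(new ByteArrayOutputStream()).writeUTF(s);
            return true;
        } catch (UTFDataFormatException e) {
            // Thrown when the encoded form needs more than 65535 bytes.
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("65535 bytes: " + writeUtfSucceeds(65535));
        System.out.println("65536 bytes: " + writeUtfSucceeds(65536));
    }
}
```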

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
