[ 
https://issues.apache.org/jira/browse/PIG-560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669097#action_12669097
 ] 

Laukik Chitnis commented on PIG-560:
------------------------------------

In the current patch, when the length is < 65536, the string-to-UTF-8 
conversion happens twice: once in String.getBytes() and once in 
DataOutput.writeUTF().

To avoid that, instead of writeUTF(), how about using writeShort() followed by 
write(byte[]), since we would already have the length and the UTF-8 bytes? 
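
A minimal sketch of that suggestion, not the code from the attached patch. The 
class and method names are hypothetical. It uses DataOutput.write(byte[]) for 
the payload because DataOutput.writeBytes() takes a String rather than a byte 
array. One caveat: writeUTF()/readUTF() use Java's "modified UTF-8", so bytes 
written this way match what readUTF() expects only for strings that contain 
neither U+0000 nor supplementary (non-BMP) characters. The sketch therefore 
pairs the writer with its own reader.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class ShortPrefixedUtf8 {

    // Encodes the string once, then writes a 2-byte length followed by the bytes.
    static void writeShortString(DataOutput out, String s) throws IOException {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8); // the only conversion
        if (utf8.length > 65535) {
            throw new IllegalArgumentException("too long for a 2-byte length: " + utf8.length);
        }
        out.writeShort(utf8.length); // low 16 bits, same prefix layout as writeUTF()
        out.write(utf8);             // raw bytes, no second encoding pass
    }

    // Reads back a string written by writeShortString().
    static String readShortString(DataInput in) throws IOException {
        int len = in.readUnsignedShort();
        byte[] utf8 = new byte[len];
        in.readFully(utf8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        writeShortString(new DataOutputStream(buf), "h\u00e9llo");
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(buf.toByteArray()));
        System.out.println(readShortString(in).equals("h\u00e9llo"));
    }
}
```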


> UTFDataFormatException (encoded string too long) is thrown when storing 
> strings > 65536 bytes (in UTF8 form) using BinStorage()
> -------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-560
>                 URL: https://issues.apache.org/jira/browse/PIG-560
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: types_branch
>            Reporter: Pradeep Kamath
>             Fix For: types_branch
>
>         Attachments: PIG-560.patch, utf-limit-patch.diff
>
>
> BinStorage() uses the DataOutput.writeUTF() and DataInput.readUTF() Java APIs 
> to write Strings out as UTF-8 bytes and to read them back. From the Javadoc: 
> "First, the total number of bytes needed to represent all the characters of s 
> is calculated. If this number is larger than 65535, then a 
> UTFDataFormatException is thrown." (This is because writeUTF() uses 2 bytes 
> to store the byte count.) A way around this would be to avoid 
> writeUTF()/readUTF() and instead convert the string by hand to the 
> corresponding UTF-8 byte[] (using String.getBytes("UTF-8")), then write the 
> length of the byte array as an int. Since a Java int is signed, this allows 
> sizes up to 2^31 - 1 bytes.
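
The workaround described in the issue can be sketched roughly as follows. This 
is an illustration under assumptions, not BinStorage's actual code: the class 
and method names are hypothetical, and the on-disk layout (a 4-byte int length 
followed by standard UTF-8 bytes) is only what the description implies.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class IntPrefixedUtf8 {

    // Writes a 4-byte length followed by the UTF-8 bytes, avoiding writeUTF()'s 65535-byte cap.
    static void writeLongString(DataOutput out, String s) throws IOException {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        out.writeInt(utf8.length);
        out.write(utf8);
    }

    // Reads back a string written by writeLongString().
    static String readLongString(DataInput in) throws IOException {
        int len = in.readInt();
        if (len < 0) {
            throw new IOException("negative length prefix: " + len);
        }
        byte[] utf8 = new byte[len];
        in.readFully(utf8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Build a string whose UTF-8 form exceeds the writeUTF() limit.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 70000; i++) {
            sb.append('a');
        }
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        writeLongString(new DataOutputStream(buf), sb.toString());
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(buf.toByteArray()));
        System.out.println(readLongString(in).length());
    }
}
```

Data written this way cannot be read with readUTF(), so the writer and the 
reader have to change together.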

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
