[ https://issues.apache.org/jira/browse/PIG-560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669097#action_12669097 ]
Laukik Chitnis commented on PIG-560:
------------------------------------

In the current patch, when the length is < 65536, the string-to-UTF-8 conversion happens twice: once with String::getBytes() and once with DataOutput::writeUTF(). To avoid that, instead of writeUTF(), how about using writeShort() followed by writeBytes(), since we would already have the length and the UTF-8 bytes?

> UTFDataFormatException (encoded string too long) is thrown when storing strings > 65536 bytes (in UTF8 form) using BinStorage()
> -------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-560
>                 URL: https://issues.apache.org/jira/browse/PIG-560
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: types_branch
>            Reporter: Pradeep Kamath
>             Fix For: types_branch
>
>         Attachments: PIG-560.patch, utf-limit-patch.diff
>
>
> BinStorage() uses the DataOutput.writeUTF() and DataInput.readUTF() Java APIs to
> write out Strings as UTF-8 bytes and to read them back. From the Javadoc -
> "First, the total number of bytes needed to represent all the characters of s
> is calculated. If this number is larger than 65535, then a
> UTFDataFormatException is thrown. " (because the writeUTF() API uses 2 bytes
> to represent the number of bytes). A way to get around this would be to not
> use writeUTF()/readUTF() and instead hand convert the string to the
> corresponding UTF-8 byte[] (using String.getBytes("UTF-8")) and then write
> the length of the byte array as an int - this will allow a size of up to
> 2^31 - 1 bytes (the largest value of a Java int).

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
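
The workaround described in the issue (convert once with String.getBytes("UTF-8"), write the length as an int, then write the bytes) could look roughly like the sketch below. It is not taken from PIG-560.patch or utf-limit-patch.diff; the class LongStringCodec and its method names are hypothetical, chosen only for illustration. The bytes are written with DataOutput.write(byte[]), because DataOutput.writeBytes() takes a String rather than a byte array. Note also that writeUTF() emits Java's "modified UTF-8", so bytes produced by getBytes("UTF-8") are not guaranteed to be readable by readUTF() for every string (for example, strings containing U+0000 or supplementary characters); the sketch therefore pairs the writer with its own reader.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;

// Hypothetical helper for illustration; not the code from the attached patches.
final class LongStringCodec {
    private LongStringCodec() {}

    // Converts the string to UTF-8 exactly once, then writes a 4-byte
    // length prefix followed by the raw bytes.
    static void writeString(DataOutput out, String s) throws IOException {
        byte[] utf8 = s.getBytes("UTF-8");
        out.writeInt(utf8.length); // int prefix instead of writeUTF()'s 2-byte prefix
        out.write(utf8);           // write(byte[]), not writeBytes(String)
    }

    // Reads back a string written by writeString().
    static String readString(DataInput in) throws IOException {
        int length = in.readInt();
        if (length < 0) {
            throw new IOException("negative string length: " + length);
        }
        byte[] utf8 = new byte[length];
        in.readFully(utf8);
        return new String(utf8, "UTF-8");
    }

    // Writes and re-reads a string through an in-memory buffer.
    static String roundTrip(String s) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            writeString(new DataOutputStream(buffer), s);
            return readString(new DataInputStream(
                    new ByteArrayInputStream(buffer.toByteArray())));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns true if DataOutput.writeUTF() refuses the string because its
    // encoded form is longer than 65535 bytes.
    static boolean writeUtfRejects(String s) {
        try {
            new DataOutputStream(new ByteArrayOutputStream()).writeUTF(s);
            return false;
        } catch (UTFDataFormatException e) {
            return true;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With a string whose UTF-8 form exceeds 65535 bytes, writeUtfRejects() reports the failure described in this issue, while roundTrip() returns the original string. Changing the length prefix from a short to an int changes the on-disk format, so data written this way would need to be read by the matching reader rather than by readUTF().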