[ https://issues.apache.org/jira/browse/PIG-560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668710#action_12668710 ]
Olga Natkovich commented on PIG-560:
------------------------------------

We know that since we put these changes into production, only one other person has complained, so we are pretty certain it is a very rare case. I agree with Alan that we should only pay the penalty on long strings.

> UTFDataFormatException (encoded string too long) is thrown when storing strings > 65536 bytes (in UTF8 form) using BinStorage()
> -------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-560
>                 URL: https://issues.apache.org/jira/browse/PIG-560
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: types_branch
>            Reporter: Pradeep Kamath
>             Fix For: types_branch
>
>         Attachments: utf-limit-patch.diff
>
>
> BinStorage() uses the DataOutput.writeUTF() and DataInput.readUTF() Java APIs to
> write out Strings as UTF-8 bytes and to read them back. From the Javadoc -
> "First, the total number of bytes needed to represent all the characters of s
> is calculated. If this number is larger than 65535, then a
> UTFDataFormatException is thrown." (This is because the writeUTF() API uses 2 bytes
> to represent the number of bytes.) A way to get around this would be to not
> use writeUTF()/readUTF() and instead hand-convert the string to the
> corresponding UTF-8 byte[] (using String.getBytes("UTF-8")) and then write
> the length of the byte array as an int. Since a Java int is signed, this will
> allow a size of up to 2^31 - 1 bytes.
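
The workaround described in the issue, combined with the suggestion to pay the penalty only on long strings, could be sketched roughly as follows. This is a minimal illustration and not the contents of utf-limit-patch.diff: the class name, the method names, and the two marker bytes are hypothetical, and the 3-bytes-per-char threshold is an assumption chosen so that writeUTF() can never exceed its 65535-byte limit.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

class LongStringCodec {
    // Hypothetical marker bytes for this sketch; not Pig's actual type codes.
    static final byte SHORT_STRING = 0;
    static final byte LONG_STRING = 1;

    // writeUTF() emits at most 3 bytes per Java char, so any string of
    // at most 65535 / 3 chars is guaranteed to fit its 2-byte length field.
    static final int MAX_SAFE_CHARS = 65535 / 3;

    static void writeString(DataOutput out, String s) throws IOException {
        if (s.length() <= MAX_SAFE_CHARS) {
            // Common case: keep the compact writeUTF() encoding.
            out.writeByte(SHORT_STRING);
            out.writeUTF(s);
        } else {
            // Rare case: encode by hand and prefix the length as a 4-byte int.
            byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
            out.writeByte(LONG_STRING);
            out.writeInt(utf8.length);
            out.write(utf8);
        }
    }

    static String readString(DataInput in) throws IOException {
        byte marker = in.readByte();
        if (marker == SHORT_STRING) {
            return in.readUTF();
        }
        if (marker != LONG_STRING) {
            throw new IOException("Unexpected string marker: " + marker);
        }
        byte[] utf8 = new byte[in.readInt()];
        in.readFully(utf8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Build a string well past the writeUTF() limit.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 100000; i++) {
            sb.append('x');
        }
        String big = sb.toString();

        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        writeString(new DataOutputStream(buf), big);

        DataInputStream in =
            new DataInputStream(new ByteArrayInputStream(buf.toByteArray()));
        System.out.println(big.equals(readString(in)));
    }
}
```

The length check is deliberately conservative: some strings longer than MAX_SAFE_CHARS would still fit within writeUTF(), but testing the char count avoids encoding the string twice just to measure it. The extra marker byte is the cost of letting the reader tell the two encodings apart.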