Jason Howes created NUTCH-1591:
----------------------------------

             Summary: Incorrect conversion of ByteBuffer to string
                 Key: NUTCH-1591
                 URL: https://issues.apache.org/jira/browse/NUTCH-1591
             Project: Nutch
          Issue Type: Bug
          Components: crawldb, indexer, parser, storage
    Affects Versions: 2.2
         Environment: Mac O/S 10.8.4, JDK 1.6.0_51
            Reporter: Jason Howes
            Priority: Critical
             Fix For: 2.3
         Attachments: NUTCH-1591.zip

There are many occurrences of the following ByteBuffer-to-String conversion 
throughout the Nutch codebase:
{code}
ByteBuffer buf = ...;
return new String(buf.array());
{code}
This approach assumes that the ByteBuffer and its underlying array are aligned 
(i.e. ByteBuffer.arrayOffset() is equal to 0 and the length of the underlying 
array is the same as ByteBuffer.remaining()). That assumption often does not 
hold. The correct way to convert a ByteBuffer to a String (or stream thereof) 
is the following:
{code}
ByteBuffer buf = ...;
return new String(buf.array(), buf.arrayOffset() + buf.position(), 
buf.remaining());
{code}
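To make the difference concrete, here is a minimal, self-contained sketch that is not taken from the attached patch. The class name ByteBufferStrings, the method names naive and windowed, and the sample row "key|title|body" are hypothetical and chosen only for illustration. The explicit UTF-8 charset is also an assumption; the snippets above rely on the platform default charset.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

// Hypothetical helper class, for illustration only (not part of Nutch).
class ByteBufferStrings {
    // Assumption: the bytes are UTF-8 text.
    private static final Charset UTF8 = Charset.forName("UTF-8");

    // The problematic pattern: decodes the ENTIRE backing array,
    // ignoring the buffer's offset, position and limit.
    static String naive(ByteBuffer buf) {
        return new String(buf.array(), UTF8);
    }

    // The corrected pattern: decodes only the bytes between
    // position and limit, shifted by the array offset.
    static String windowed(ByteBuffer buf) {
        return new String(buf.array(),
                          buf.arrayOffset() + buf.position(),
                          buf.remaining(),
                          UTF8);
    }

    public static void main(String[] args) {
        // One large byte[] holding several "columns", similar in spirit to a
        // client library handing out views on top of a shared array.
        byte[] row = "key|title|body".getBytes(UTF8);

        // A view covering only the 5 bytes of "title" (offset 4, length 5).
        // slice() yields a buffer with arrayOffset() == 4 and position() == 0.
        ByteBuffer view = ByteBuffer.wrap(row, 4, 5).slice();

        System.out.println(naive(view));    // prints "key|title|body"
        System.out.println(windowed(view)); // prints "title"
    }
}
```

Both methods call array(), which throws for direct or read-only buffers, so code that may receive such buffers should check hasArray() first.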
I noticed this bug when using Nutch with Cassandra. In most cases, the parsed 
content contains data from other columns (as well as garbage content) since the 
Cassandra client library returns ByteBuffers that are views on top of a larger 
byte[]. It also seems that others have hit this as well:

http://grokbase.com/p/nutch/user/132jnq8s4r/slow-parse-on-hadoop

I've attached a patch based on the release-2.2 tag of the 2.x branch on GitHub:

https://github.com/apache/nutch/tree/release-2.2

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
