Jason Howes created NUTCH-1591:
----------------------------------
Summary: Incorrect conversion of ByteBuffer to string
Key: NUTCH-1591
URL: https://issues.apache.org/jira/browse/NUTCH-1591
Project: Nutch
Issue Type: Bug
Components: crawldb, indexer, parser, storage
Affects Versions: 2.2
Environment: Mac O/S 10.8.4, JDK 1.6.0_51
Reporter: Jason Howes
Priority: Critical
Fix For: 2.3
Attachments: NUTCH-1591.zip
There are many occurrences of the following ByteBuffer-to-String conversion
throughout the Nutch codebase:
{code}
ByteBuffer buf = ...;
return new String(buf.array());
{code}
This approach assumes that the ByteBuffer and its underlying array are aligned
(i.e. ByteBuffer.arrayOffset() is equal to 0 and the length of the underlying
array is the same as ByteBuffer.remaining()). That assumption often does not
hold. The correct way to convert a ByteBuffer to a String (or stream thereof)
is the following:
{code}
ByteBuffer buf = ...;
return new String(buf.array(), buf.arrayOffset() + buf.position(),
buf.remaining());
{code}
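To make the difference concrete, here is a minimal, self-contained sketch. The class and method names (ByteBufferStrings, toStringNaive, toStringSafe) are hypothetical and are not taken from the attached patch. The explicit UTF-8 charset and the hasArray() fallback are my own assumptions, added so the output does not depend on the platform default encoding and so buffers without an accessible backing array are still handled.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

public class ByteBufferStrings {

    // Assumption: content is UTF-8. The snippets in the report use the
    // platform default charset instead.
    private static final Charset UTF8 = Charset.forName("UTF-8");

    // The problematic pattern: decodes the whole backing array and ignores
    // arrayOffset(), position() and limit().
    public static String toStringNaive(ByteBuffer buf) {
        return new String(buf.array(), UTF8);
    }

    // Hypothetical helper: decodes only the bytes between position() and
    // limit(), without changing the buffer's position.
    public static String toStringSafe(ByteBuffer buf) {
        if (buf.hasArray()) {
            return new String(buf.array(), buf.arrayOffset() + buf.position(),
                    buf.remaining(), UTF8);
        }
        // Direct and read-only buffers expose no backing array, so copy the
        // remaining bytes through a duplicate that has its own position.
        byte[] bytes = new byte[buf.remaining()];
        buf.duplicate().get(bytes);
        return new String(bytes, UTF8);
    }

    public static void main(String[] args) {
        // Simulates a client library handing out a view on a larger byte[].
        byte[] row = "title|body|other".getBytes(UTF8);
        ByteBuffer view = ByteBuffer.wrap(row, 6, 4); // position 6, limit 10

        System.out.println(toStringNaive(view)); // title|body|other
        System.out.println(toStringSafe(view));  // body
    }
}
```

The naive version returns the contents of the neighbouring "columns" along with the intended value, which matches the symptom described in this report. The safe version returns only the four bytes the buffer actually exposes.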
I noticed this bug when using Nutch with Cassandra. In most cases, the parsed
content contains data from other columns (as well as garbage content) since the
Cassandra client library returns ByteBuffers that are views on top of a larger
byte[]. Others appear to have hit this too:
http://grokbase.com/p/nutch/user/132jnq8s4r/slow-parse-on-hadoop
I've attached a patch based on the release-2.2 tag of the 2.x branch on GitHub:
https://github.com/apache/nutch/tree/release-2.2