[ https://issues.apache.org/jira/browse/NUTCH-1591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lewis John McGibbney resolved NUTCH-1591.
-----------------------------------------
Resolution: Fixed
Committed at revision 1497447 in 2.x HEAD.
Thank you very much for the patch and the insight, Jason.
> Incorrect conversion of ByteBuffer to String
> --------------------------------------------
>
> Key: NUTCH-1591
> URL: https://issues.apache.org/jira/browse/NUTCH-1591
> Project: Nutch
> Issue Type: Bug
> Components: crawldb, indexer, parser, storage
> Affects Versions: 2.2
> Environment: Mac O/S 10.8.4, JDK 1.6.0_51
> Reporter: Jason Howes
> Priority: Critical
> Fix For: 2.2.1
>
> Attachments: NUTCH-1591.patch, Nutch1591Test.java, NUTCH-1591.zip
>
>
> There are many occurrences of the following ByteBuffer-to-String conversion
> throughout the Nutch codebase:
> {code}
> ByteBuffer buf = ...;
> return new String(buf.array());
> {code}
> This approach assumes that the ByteBuffer and its underlying array are
> aligned (i.e. ByteBuffer.arrayOffset() is equal to 0 and the length of the
> underlying array is the same as ByteBuffer.remaining()). This often does
> not hold. The correct way to convert a ByteBuffer to a String (or
> stream thereof) is the following:
> {code}
> ByteBuffer buf = ...;
> return new String(buf.array(), buf.arrayOffset() + buf.position(),
> buf.remaining());
> {code}
> I noticed this bug when using Nutch with Cassandra. In most cases, the parsed
> content contains data from other columns (as well as garbage content) since
> the Cassandra client library returns ByteBuffers that are views on top of a
> larger byte[]. Others appear to have hit this as well:
> http://grokbase.com/p/nutch/user/132jnq8s4r/slow-parse-on-hadoop
> I've attached a patch based on the release-2.2 tag of the 2.x branch on
> GitHub:
> https://github.com/apache/nutch/tree/release-2.2
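
The misalignment described in the report can be reproduced without Cassandra. The following is a minimal sketch, not code from the patch: the class name `ByteBufferToString`, the helper names `naive` and `windowed`, and the sample bytes are hypothetical, and an explicit UTF-8 charset is assumed here (the snippets in the report use the platform default) so that the output is deterministic.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class ByteBufferToString {

    // Problematic pattern: decodes the whole backing array and ignores
    // arrayOffset(), position() and remaining().
    static String naive(ByteBuffer buf) {
        return new String(buf.array(), StandardCharsets.UTF_8);
    }

    // Decodes only the bytes between position() and limit().
    static String windowed(ByteBuffer buf) {
        return new String(buf.array(), buf.arrayOffset() + buf.position(),
                buf.remaining(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // One backing array holding three adjacent "columns" (made-up data).
        byte[] row = "titlecontentextra".getBytes(StandardCharsets.UTF_8);

        // View over bytes 5..11: position() == 5, remaining() == 7.
        ByteBuffer view = ByteBuffer.wrap(row, 5, 7);
        // slice() gives the same window with arrayOffset() == 5, position() == 0.
        ByteBuffer slice = view.slice();

        System.out.println(naive(view));     // prints "titlecontentextra"
        System.out.println(windowed(view));  // prints "content"
        System.out.println(windowed(slice)); // prints "content"
    }
}
```

Both helpers call `array()`, which throws for buffers without an accessible backing array (direct or read-only buffers), so a caller that cannot rule those out would need to check `hasArray()` first.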