[ 
https://issues.apache.org/jira/browse/NUTCH-1591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-1591.
-----------------------------------------

    Resolution: Fixed

Committed @revision 1497447 in 2.x head
Thank you very much for the patch and the insight, Jason.

                
> Incorrect conversion of ByteBuffer to String
> --------------------------------------------
>
>                 Key: NUTCH-1591
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1591
>             Project: Nutch
>          Issue Type: Bug
>          Components: crawldb, indexer, parser, storage
>    Affects Versions: 2.2
>         Environment: Mac O/S 10.8.4, JDK 1.6.0_51
>            Reporter: Jason Howes
>            Priority: Critical
>             Fix For: 2.2.1
>
>         Attachments: NUTCH-1591.patch, Nutch1591Test.java, NUTCH-1591.zip
>
>
> There are many occurrences of the following ByteBuffer-to-String conversion 
> throughout the Nutch codebase:
> {code}
> ByteBuffer buf = ...;
> return new String(buf.array());
> {code}
> This approach assumes that the ByteBuffer and its underlying array are 
> aligned (i.e. ByteBuffer.arrayOffset() and ByteBuffer.position() are both 0 
> and the length of the underlying array is the same as 
> ByteBuffer.remaining()). In many cases this does not hold. The correct way 
> to convert a ByteBuffer to a String (or stream thereof) is the following:
> {code}
> ByteBuffer buf = ...;
> return new String(buf.array(), buf.arrayOffset() + buf.position(), 
> buf.remaining());
> {code}
> I noticed this bug when using Nutch with Cassandra. In most cases, the parsed 
> content contains data from other columns (as well as garbage content) since 
> the Cassandra client library returns ByteBuffers that are views on top of a 
> larger byte[]. It also seems that others have hit this as well:
> http://grokbase.com/p/nutch/user/132jnq8s4r/slow-parse-on-hadoop
> I've attached a patch based on the release-2.2 tag of the 2.x branch on 
> GitHub:
> https://github.com/apache/nutch/tree/release-2.2
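
For illustration, here is a minimal sketch of the failure mode described in the issue. It is not taken from the attached patch. The class name ByteBufferToString, the method names naive and windowed, and the sample bytes are all hypothetical. The explicit UTF-8 charset is also an assumption: the snippets in the report use the platform default charset, and StandardCharsets needs Java 7 or later (on JDK 1.6, Charset.forName("UTF-8") could be used instead).

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class ByteBufferToString {

    // Pattern reported as buggy: decodes the whole backing array and
    // ignores arrayOffset(), position() and limit().
    static String naive(ByteBuffer buf) {
        return new String(buf.array(), StandardCharsets.UTF_8);
    }

    // Pattern proposed in the report: decodes only the readable window.
    // UTF-8 is an assumption made for this sketch.
    static String windowed(ByteBuffer buf) {
        return new String(buf.array(), buf.arrayOffset() + buf.position(),
                buf.remaining(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for one large byte[] holding several columns.
        byte[] row = "titlecontentmeta".getBytes(StandardCharsets.UTF_8);

        // A view over the 7 bytes of "content"; after slice() the buffer has
        // arrayOffset() == 5, position() == 0 and remaining() == 7.
        ByteBuffer view = ByteBuffer.wrap(row, 5, 7).slice();

        System.out.println(naive(view));    // prints titlecontentmeta
        System.out.println(windowed(view)); // prints content
    }
}
```

Both methods call array(), so they only work for array-backed, writable buffers; hasArray() can be checked first where that is not guaranteed.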

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
