[ 
https://issues.apache.org/jira/browse/NUTCH-1591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13694431#comment-13694431
 ] 

Jason Howes commented on NUTCH-1591:
------------------------------------

Interesting. I guess it comes down to how and when the metadata value in 
question is used. It seems to me there was a conscious decision to leave 
the deserialization (for lack of a better word) of various WebPage attributes 
up to the consumer. This makes sense to me, as it allows one to simply "pass 
through" the binary values until they need to be interpreted, thus reducing 
overall CPU utilization. I'm sure I'm missing some context as I've just started 
using Nutch, but I assume the fix for NUTCH-1511 is to decode the CASH_KEY 
value using Bytes.toFloat() after retrieving it from the Gora store?
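
For illustration only, here is a JDK-only sketch of decoding a float from a 
ByteBuffer view. It does not use Nutch's Bytes.toFloat(), and it assumes the 
value was stored as a 4-byte big-endian IEEE 754 float, which may not match 
how the value is actually written. The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class MetadataFloat {

    /**
     * Decodes the first four remaining bytes of buf as a big-endian float.
     * Works on a duplicate, so the caller's position and limit are untouched.
     * ASSUMPTION: the stored value is a 4-byte big-endian IEEE 754 float.
     */
    public static float toFloat(ByteBuffer buf) {
        if (buf.remaining() < 4) {
            throw new IllegalArgumentException("need 4 bytes, got " + buf.remaining());
        }
        // The relative get on the duplicate honours position and arrayOffset.
        return buf.duplicate().order(ByteOrder.BIG_ENDIAN).getFloat();
    }

    public static void main(String[] args) {
        // Simulate a store that hands back a view into a larger byte[].
        byte[] backing = new byte[12];
        ByteBuffer.wrap(backing).putFloat(4, 1.5f);
        ByteBuffer view = ByteBuffer.wrap(backing, 4, 4).slice();

        System.out.println(toFloat(view));
    }
}
```

Reading through the buffer rather than through buf.array() means the offset 
of the view inside the backing array never has to be handled by hand.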
                
> Incorrect conversion of ByteBuffer to String
> --------------------------------------------
>
>                 Key: NUTCH-1591
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1591
>             Project: Nutch
>          Issue Type: Bug
>          Components: crawldb, indexer, parser, storage
>    Affects Versions: 2.2
>         Environment: Mac O/S 10.8.4, JDK 1.6.0_51
>            Reporter: Jason Howes
>            Priority: Critical
>             Fix For: 2.3
>
>         Attachments: NUTCH-1591.patch, Nutch1591Test.java, NUTCH-1591.zip
>
>
> There are many occurrences of the following ByteBuffer-to-String conversion 
> throughout the Nutch codebase:
> {code}
> ByteBuffer buf = ...;
> return new String(buf.array());
> {code}
> This approach assumes that the ByteBuffer and its underlying array are 
> aligned (i.e. ByteBuffer.arrayOffset() is equal to 0 and the length of the 
> underlying array is the same as ByteBuffer.remaining()). In many cases this 
> assumption does not hold. The correct way to convert a ByteBuffer to a 
> String (or stream thereof) is the following:
> {code}
> ByteBuffer buf = ...;
> return new String(buf.array(), buf.arrayOffset() + buf.position(), 
> buf.remaining());
> {code}
> I noticed this bug when using Nutch with Cassandra. In most cases, the parsed 
> content contains data from other columns (as well as garbage content) since 
> the Cassandra client library returns ByteBuffers that are views on top of a 
> larger byte[]. It also seems that others have hit this as well:
> http://grokbase.com/p/nutch/user/132jnq8s4r/slow-parse-on-hadoop
> I've attached a patch based on the release-2.2 tag of the 2.x branch on 
> GitHub:
> https://github.com/apache/nutch/tree/release-2.2
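
The difference between the two conversions described in the issue can be 
reproduced with the JDK alone. The following is a minimal sketch, not the 
attached patch: the class and method names are hypothetical, the sample bytes 
are invented, and UTF-8 is an assumption (the snippets in the issue rely on 
the platform default charset).

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class ByteBufferStrings {

    /** Naive conversion: decodes the whole backing array, ignoring offset and limit. */
    public static String naive(ByteBuffer buf) {
        return new String(buf.array(), StandardCharsets.UTF_8);
    }

    /** Decodes only the remaining bytes; the buffer's position is left unchanged. */
    public static String remainingToString(ByteBuffer buf) {
        if (buf.hasArray()) {
            return new String(buf.array(), buf.arrayOffset() + buf.position(),
                    buf.remaining(), StandardCharsets.UTF_8);
        }
        // Direct or read-only buffers expose no backing array: copy via a duplicate.
        byte[] bytes = new byte[buf.remaining()];
        buf.duplicate().get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Simulate a client library returning a view on top of a larger byte[].
        byte[] row = "col1|hello|col3".getBytes(StandardCharsets.UTF_8);
        ByteBuffer view = ByteBuffer.wrap(row, 5, 5).slice();

        System.out.println(naive(view));             // includes neighbouring data
        System.out.println(remainingToString(view)); // only the viewed bytes
    }
}
```

The hasArray() check is an addition beyond the snippet in the issue: array() 
throws for direct and read-only buffers, so the fallback copies the bytes 
through a duplicate instead.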

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
