[ 
https://issues.apache.org/jira/browse/NUTCH-1591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13694434#comment-13694434
 ] 

Lewis John McGibbney commented on NUTCH-1591:
---------------------------------------------

In Gora, with all data stores, we store everything as Bytes. This was not 
always the case, and I hope that this provides some more context. Let's just 
say that in 'some' data stores, we previously stored data types 'as they were' 
instead of converting everything to Bytes.
As you said, we saw the light and now everything is persisted down into Bytes 
within Gora. This directly delegates the responsibility and task of dealing 
with the Bytes to the client code. 
If he wishes ;) Renato can chime in here with some more commentary. He 
explained this very well at Cassandra Summit a couple of weeks back when 
discussing the changes to gora-cassandra which are now in 0.3.
I would like to commit this patch tomorrow. I've been running it since last 
Friday and I have no qualms. 
Thanks.
                
> Incorrect conversion of ByteBuffer to String
> --------------------------------------------
>
>                 Key: NUTCH-1591
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1591
>             Project: Nutch
>          Issue Type: Bug
>          Components: crawldb, indexer, parser, storage
>    Affects Versions: 2.2
>         Environment: Mac O/S 10.8.4, JDK 1.6.0_51
>            Reporter: Jason Howes
>            Priority: Critical
>             Fix For: 2.3
>
>         Attachments: NUTCH-1591.patch, Nutch1591Test.java, NUTCH-1591.zip
>
>
> There are many occurrences of the following ByteBuffer-to-String conversion 
> throughout the Nutch codebase:
> {code}
> ByteBuffer buf = ...;
> return new String(buf.array());
> {code}
> This approach assumes that the ByteBuffer and its underlying array are 
> aligned (i.e. ByteBuffer.arrayOffset() is equal to 0 and the length of the 
> underlying array is the same as ByteBuffer.remaining()). In many cases this 
> does not hold. The correct way to convert a ByteBuffer to a String (or 
> stream thereof) is the following:
> {code}
> ByteBuffer buf = ...;
> return new String(buf.array(), buf.arrayOffset() + buf.position(), 
> buf.remaining());
> {code}
> I noticed this bug when using Nutch with Cassandra. In most cases, the parsed 
> content contains data from other columns (as well as garbage content) since 
> the Cassandra client library returns ByteBuffers that are views on top of a 
> larger byte[]. It also seems that others have hit this as well:
> http://grokbase.com/p/nutch/user/132jnq8s4r/slow-parse-on-hadoop
> I've attached a patch based on the release-2.2 tag of the 2.x branch on 
> GitHub:
> https://github.com/apache/nutch/tree/release-2.2
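
The difference between the two conversions quoted above can be seen in a 
minimal, self-contained sketch. The class name ByteBufferStrings, the helper 
decode(), and the sample row bytes are hypothetical and are not taken from the 
attached patch. UTF-8 is an assumed charset (the snippets in the issue rely on 
the platform default), and StandardCharsets requires Java 7 or later.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration only; not part of Nutch or Gora.
class ByteBufferStrings {

    // Decodes only the readable window [position, limit) of the buffer.
    // The buffer's own position is left unchanged.
    static String decode(ByteBuffer buf) {
        if (buf.hasArray()) {
            // Array-backed buffer: honour arrayOffset(), position() and remaining().
            return new String(buf.array(),
                              buf.arrayOffset() + buf.position(),
                              buf.remaining(),
                              StandardCharsets.UTF_8); // assumed charset
        }
        // Direct or read-only buffer: copy the window via a duplicate.
        byte[] bytes = new byte[buf.remaining()];
        buf.duplicate().get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Simulates a client library handing back a view on a larger byte[].
        byte[] row = "keyAAAvalueBBB".getBytes(StandardCharsets.UTF_8);
        ByteBuffer view = ByteBuffer.wrap(row, 6, 5); // position 6, 5 readable bytes

        // Naive conversion: decodes the whole backing array.
        System.out.println(new String(view.array(), StandardCharsets.UTF_8)); // keyAAAvalueBBB

        // Window-aware conversion: decodes only the readable bytes.
        System.out.println(decode(view)); // value
    }
}
```

The hasArray() check is an addition for buffers that expose no backing array; 
for array-backed buffers the helper reduces to the three-argument form given 
in the issue description.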

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
