[ https://issues.apache.org/jira/browse/NUTCH-1591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13694932#comment-13694932 ]

Hudson commented on NUTCH-1591:
-------------------------------

Integrated in Nutch-nutchgora #664 (See [https://builds.apache.org/job/Nutch-nutchgora/664/])
    NUTCH-1591 Incorrect conversion of ByteBuffer to String (Revision 1497447)

     Result = SUCCESS
lewismc : http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1497447
Files : 
* /nutch/branches/2.x/CHANGES.txt
* /nutch/branches/2.x/src/java/org/apache/nutch/api/DbReader.java
* /nutch/branches/2.x/src/java/org/apache/nutch/crawl/DbUpdateReducer.java
* /nutch/branches/2.x/src/java/org/apache/nutch/crawl/MD5Signature.java
* /nutch/branches/2.x/src/java/org/apache/nutch/crawl/SignatureComparator.java
* /nutch/branches/2.x/src/java/org/apache/nutch/crawl/WebTableReader.java
* /nutch/branches/2.x/src/java/org/apache/nutch/indexer/IndexUtil.java
* /nutch/branches/2.x/src/java/org/apache/nutch/parse/ParserChecker.java
* /nutch/branches/2.x/src/java/org/apache/nutch/storage/Host.java
* /nutch/branches/2.x/src/java/org/apache/nutch/util/Bytes.java
* /nutch/branches/2.x/src/java/org/apache/nutch/util/EncodingDetector.java
* /nutch/branches/2.x/src/java/org/apache/nutch/util/StringUtil.java
* /nutch/branches/2.x/src/plugin/creativecommons/src/java/org/creativecommons/nutch/CCIndexingFilter.java
* /nutch/branches/2.x/src/plugin/creativecommons/src/test/org/creativecommons/nutch/TestCCParseFilter.java
* /nutch/branches/2.x/src/plugin/index-basic/src/java/org/apache/nutch/indexer/basic/BasicIndexingFilter.java
* /nutch/branches/2.x/src/plugin/language-identifier/src/java/org/apache/nutch/analysis/lang/HTMLLanguageParser.java
* /nutch/branches/2.x/src/plugin/language-identifier/src/java/org/apache/nutch/analysis/lang/LanguageIndexingFilter.java
* /nutch/branches/2.x/src/plugin/language-identifier/src/test/org/apache/nutch/analysis/lang/TestHTMLLanguageParser.java
* /nutch/branches/2.x/src/plugin/microformats-reltag/src/java/org/apache/nutch/microformats/reltag/RelTagIndexingFilter.java
* /nutch/branches/2.x/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java
* /nutch/branches/2.x/src/plugin/parse-js/src/java/org/apache/nutch/parse/js/JSParseFilter.java
* /nutch/branches/2.x/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java
* /nutch/branches/2.x/src/plugin/scoring-opic/src/java/org/apache/nutch/scoring/opic/OPICScoringFilter.java
* /nutch/branches/2.x/src/test/org/apache/nutch/crawl/TestInjector.java
* /nutch/branches/2.x/src/test/org/apache/nutch/fetcher/TestFetcher.java

                
> Incorrect conversion of ByteBuffer to String
> --------------------------------------------
>
>                 Key: NUTCH-1591
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1591
>             Project: Nutch
>          Issue Type: Bug
>          Components: crawldb, indexer, parser, storage
>    Affects Versions: 2.2
>         Environment: Mac O/S 10.8.4, JDK 1.6.0_51
>            Reporter: Jason Howes
>            Priority: Critical
>             Fix For: 2.2.1
>
>         Attachments: NUTCH-1591.patch, Nutch1591Test.java, NUTCH-1591.zip
>
>
> There are many occurrences of the following ByteBuffer-to-String conversion 
> throughout the Nutch codebase:
> {code}
> ByteBuffer buf = ...;
> return new String(buf.array());
> {code}
> This approach assumes that the ByteBuffer and its underlying array are 
> aligned (i.e. ByteBuffer.arrayOffset() is equal to 0 and the length of the 
> underlying array is the same as ByteBuffer.remaining()). This often does 
> not hold. The correct way to convert a ByteBuffer to a String (or 
> stream thereof) is the following:
> {code}
> ByteBuffer buf = ...;
> return new String(buf.array(), buf.arrayOffset() + buf.position(), 
> buf.remaining());
> {code}
> I noticed this bug when using Nutch with Cassandra. In most cases, the parsed 
> content contains data from other columns (as well as garbage content) since 
> the Cassandra client library returns ByteBuffers that are views on top of a 
> larger byte[]. Others appear to have hit this as well:
> http://grokbase.com/p/nutch/user/132jnq8s4r/slow-parse-on-hadoop
> I've attached a patch based on the release-2.2 tag of the 2.x branch on 
> GitHub:
> https://github.com/apache/nutch/tree/release-2.2
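
As a rough illustration of the description above (this is not code from the 
attached patch), the sketch below builds a ByteBuffer that is a view onto a 
larger byte[] and converts it both ways. The class name, the method names and 
the sample row are hypothetical. The explicit UTF-8 charset is an assumption 
as well, since the snippets in the report use the platform default. It also 
assumes an array-backed buffer: array() throws for direct or read-only 
buffers, so real code may want to check hasArray() first.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

public class ByteBufferToString {

  // Assumption: UTF-8; the snippets in the report rely on the default charset.
  private static final Charset UTF8 = Charset.forName("UTF-8");

  // Pattern reported as incorrect: decodes the whole backing array and
  // ignores arrayOffset(), position() and remaining().
  public static String naive(ByteBuffer buf) {
    return new String(buf.array(), UTF8);
  }

  // Pattern proposed in the report: decodes only the readable window.
  public static String windowed(ByteBuffer buf) {
    return new String(buf.array(), buf.arrayOffset() + buf.position(),
        buf.remaining(), UTF8);
  }

  public static void main(String[] args) {
    // Hypothetical "row" holding two columns in one backing array.
    byte[] row = "col1=foo;col2=bar".getBytes(UTF8);

    // A view over the 3 bytes of "foo"; slice() gives it a non-zero
    // arrayOffset() while it still shares the full backing array.
    ByteBuffer view = ByteBuffer.wrap(row, 5, 3).slice();

    System.out.println(naive(view));    // prints "col1=foo;col2=bar"
    System.out.println(windowed(view)); // prints "foo"
  }
}
```

The naive version leaks the neighbouring column into the result, which 
matches the symptom described for the Cassandra-backed store.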

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
