[ 
https://issues.apache.org/jira/browse/NUTCH-2696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775641#comment-16775641
 ] 

Sebastian Nagel commented on NUTCH-2696:
----------------------------------------

Hi [~lhervaud], thanks for the bug report! This issue is probably related to 
NUTCH-1807: there are still a couple of places where we call methods that 
depend on the system locale, and the dump code of SegmentReader is one of 
them. I'll open a PR with a fix: it simply opens all streams with UTF-8 as 
the encoding, which is exactly what the code called by {{-get}} does. Your 
proposal to use the write() method of Writable (Text) works only partially: 
the Text class holds the data as a UTF-8-encoded byte array, so writing the 
bytes to a binary stream (FSDataOutputStream) results in proper UTF-8. 
However, write() is meant for serialization and usually writes some binary 
data as well, here the length of the byte array as a binary VInt.
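
To illustrate both points, here is a minimal sketch that uses only the JDK 
and no Hadoop or Nutch classes. It is not the actual SegmentReader code or 
the planned patch. The class and method names are made up for this example, 
US-ASCII is only an assumed stand-in for whatever default charset a cluster 
node's locale yields, and the single length byte is a simplified stand-in 
for the VInt prefix (as far as I know, a VInt takes one byte for lengths up 
to 127).

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical demo class, not part of Nutch. */
public class Utf8DumpSketch {

  /** Writes text through a Writer with the given charset, like a text dump. */
  static byte[] dump(String text, Charset charset) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (Writer writer = new OutputStreamWriter(bytes, charset)) {
      writer.write(text);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  /** Mimics serialization: a length prefix followed by the UTF-8 bytes. */
  static byte[] serializeLike(String text) {
    byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      // Simplified stand-in for the VInt length prefix (assumes length <= 127).
      out.writeByte(utf8.length);
      out.write(utf8);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  public static void main(String[] args) {
    // "Wikipedia en francais" with e-acute and c-cedilla, written as escapes
    // so the source file itself stays pure ASCII.
    String text = "Wikip\u00e9dia en fran\u00e7ais";

    // A charset that cannot encode the accents replaces them with '?'.
    byte[] ascii = dump(text, StandardCharsets.US_ASCII);
    System.out.println(new String(ascii, StandardCharsets.US_ASCII));

    // An explicit UTF-8 writer round-trips the text unchanged.
    byte[] utf8 = dump(text, StandardCharsets.UTF_8);
    System.out.println(text.equals(new String(utf8, StandardCharsets.UTF_8)));

    // The serialization-style output carries extra leading binary data.
    System.out.println(serializeLike(text).length - utf8.length);
  }
}
```

The point of the first method is that the charset is always passed 
explicitly, so the result no longer depends on the locale of the machine 
running the task. The second method shows why raw serialization output is 
not a clean text dump even though the character bytes themselves are fine.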

Btw., I'm also running Nutch on CDH 6.1 - everything works smoothly and there 
is no need to build Nutch with the Hadoop 3.0 / CDH 6.1 dependencies.

 

> Nutch SegmentReader does not dump non-ASCII characters with Hadoop 3.x
> ----------------------------------------------------------------------
>
>                 Key: NUTCH-2696
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2696
>             Project: Nutch
>          Issue Type: Bug
>          Components: segment
>         Environment: Hadoop version : 3.0.0 (CDH 6.1)
> Nutch : 1.15
> Mode : distributed mode
>            Reporter: Laurent Hervaud
>            Priority: Major
>
> All Nutch tasks work properly with Hadoop 3.x, except SegmentReader.
>  SegmentReader with the -get option works fine.
>  SegmentReader with the -dump option replaces non-ASCII characters with ?
> Example URL: [http://www.wikipedia.fr/index.php]
>  
> {code:java}
> command : ./runtime/deploy/bin/nutch readseg -dump 
> /user/nutch/crawl1.15/segments/20190221093756 /tmp/dump1.15 -nocontent 
> -nogenerate -noparse -noparsedata
> ParseText::
>  Wikipedia.fr - Portail de recherche sur les projets Wikim?dia
>  Chercher sur Wikip?dia en fran?ais
>  L?encyclop?die librement r?utilisable que chacun peut am?liorer.
> {code}
>  
>  
> {code:java}
> command : ./runtime/deploy/bin/nutch readseg -get 
> /user/nutch/crawl1.15/segments/20190221093756 
> http://www.wikipedia.fr/index.php -nocontent -nogenerate -noparse -noparsedata
> ParseText::
>  Wikipedia.fr - Portail de recherche sur les projets Wikimédia
>  Chercher sur Wikipédia en français
>  L’encyclopédie librement réutilisable que chacun peut améliorer.
> {code}
>  
> I tried building with the Hadoop 3.0.0 dependencies in ivy.xml, but I get the 
> same result.
> It works fine in local mode.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
