[
https://issues.apache.org/jira/browse/NUTCH-2696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16774088#comment-16774088
]
Laurent Hervaud commented on NUTCH-2696:
----------------------------------------
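The problem comes from the PrintStream object in the TextOutputFormat class
(SegmentReader.java, line 118). It is created without an explicit charset, so
it encodes its output with the JVM default charset.
Solution: use FSDataOutputStream in place of PrintStream.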
{code:java}
// final PrintStream printStream = new PrintStream(fs.create(segmentDumpFile));
FSDataOutputStream fsOutStream = fs.create(segmentDumpFile);
return new RecordWriter<WritableComparable<?>, Writable>() {
  public synchronized void write(WritableComparable<?> key, Writable value)
      throws IOException {
    // printStream.println(value);
    value.write(fsOutStream);
  }

  public synchronized void close(TaskAttemptContext context)
      throws IOException {
    // printStream.close();
    fsOutStream.close();
  }
};
{code}
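To illustrate the suspected mechanism without a Hadoop cluster, here is a minimal, self-contained sketch. It assumes (this is not confirmed in the issue) that the task JVMs in distributed mode run with an ASCII-like default charset, which would also explain why local mode is unaffected. The class name PrintStreamCharsetDemo and the helper roundTrip are hypothetical and are not part of Nutch. The sketch pushes the same string through a PrintStream with an ASCII charset and with UTF-8, then decodes the bytes as UTF-8, as a reader of the dump file would.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Nutch or Hadoop.
final class PrintStreamCharsetDemo {

  // Writes text through a PrintStream that encodes with the named charset,
  // then decodes the produced bytes as UTF-8.
  public static String roundTrip(String text, String charsetName) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (PrintStream out = new PrintStream(bytes, false, charsetName)) {
      out.print(text);
    } catch (UnsupportedEncodingException e) {
      throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
    }
    return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    // Unicode escapes keep the source file itself pure ASCII.
    String sample = "Wikip\u00e9dia en fran\u00e7ais";
    // Stand-in for an ASCII default charset: unmappable characters become '?'.
    System.out.println(roundTrip(sample, "US-ASCII"));
    // Explicit UTF-8: the accented characters survive.
    System.out.println(roundTrip(sample, "UTF-8"));
  }
}
```

Note that Writable.write emits the value's serialized form rather than toString() followed by a newline, so the dump file contents may differ from what println produced. If plain text output has to be kept, a PrintStream constructed with an explicit "UTF-8" charset name, as in the sketch, may be an alternative worth testing.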
> Nutch SegmentReader does not dump non-ASCII characters with Hadoop 3.x
> ----------------------------------------------------------------------
>
> Key: NUTCH-2696
> URL: https://issues.apache.org/jira/browse/NUTCH-2696
> Project: Nutch
> Issue Type: Bug
> Components: segment
> Environment: Hadoop version : 3.0.0 (CDH 6.1)
> Nutch : 1.15
> Mode : distributed mode
> Reporter: Laurent Hervaud
> Priority: Major
>
> All Nutch tasks work properly with Hadoop 3.x, except SegmentReader.
> SegmentReader with the -get option works fine.
> SegmentReader with the -dump option replaces non-ASCII characters with ?
> Example URL: [http://www.wikipedia.fr/index.php]
>
> {code:java}
> command : ./runtime/deploy/bin/nutch readseg -dump
> /user/nutch/crawl1.15/segments/20190221093756 /tmp/dump1.15 -nocontent
> -nogenerate -noparse -noparsedata
> ParseText::
> Wikipedia.fr - Portail de recherche sur les projets Wikim?dia
> Chercher sur Wikip?dia en fran?ais
> L?encyclop?die librement r?utilisable que chacun peut am?liorer.
> {code}
>
>
> {code:java}
> command : ./runtime/deploy/bin/nutch readseg -get
> /user/nutch/crawl1.15/segments/20190221093756
> http://www.wikipedia.fr/index.php -nocontent -nogenerate -noparse -noparsedata
> ParseText::
> Wikipedia.fr - Portail de recherche sur les projets Wikimédia
> Chercher sur Wikipédia en français
> L’encyclopédie librement réutilisable que chacun peut améliorer.
> {code}
>
> I tried building with Hadoop 3.0.0 dependencies in ivy.xml, but I get the same
> result.
> It works fine in local mode.
>