[
https://issues.apache.org/jira/browse/NUTCH-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16395312#comment-16395312
]
ASF GitHub Bot commented on NUTCH-1541:
---------------------------------------
okedoki commented on issue #294: NUTCH-1541 Indexer plugin to write CSV
URL: https://github.com/apache/nutch/pull/294#issuecomment-372329755
Hi @sebastian-nagel,
I get the following exception; I'm not sure whether it is related to
distributed mode:
Error: java.lang.StringIndexOutOfBoundsException: String index out of range: -2827
    at java.lang.String.substring(String.java:1967)
    at org.apache.nutch.indexwriter.csv.CSVIndexWriter.writeEscaped(CSVIndexWriter.java:357)
    at org.apache.nutch.indexwriter.csv.CSVIndexWriter.writeQuoted(CSVIndexWriter.java:320)
    at org.apache.nutch.indexwriter.csv.CSVIndexWriter.write(CSVIndexWriter.java:248)
    at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:87)
    at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:50)
    at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:41)
    at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.write(ReduceTask.java:493)
    at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:422)
    at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:369)
    at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:57)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:170)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:164)
It happens when the writer tries to escape characters in a string like this:
6 GT
Année 2017-2018 Deuxième degré (orientation) :
Quatrième année de l’enseignement
de transition de qualification
général technique technique professionnel
Classe 4 GT 4 GT 4 GT 4 TT 4 TQA 4 TQR 4 PR 4 PV
Option principale
S c ie
n c e s
S c ie
n c e s
é c o
n o
m iq
u e s
S c ie
n c e s
s o
c ia
le s
E d
u c a ti
o n
p
h y s iq
u e
T e c h
n iq
This is rather strange, because in my version of Nutch all punctuation
characters should have been removed.
Other lines look like this and are parsed correctly:
atletica de sports aerobics club voor jou atletica de sports aerobics club
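For illustration only, here is a minimal sketch of quote escaping in which the substring bounds cannot cross. It is not the CSVIndexWriter code: the class `CsvEscapeSketch` and the method `escape` are hypothetical names, and the assumption is that escaping means doubling each occurrence of the quote string. A negative value in a `String.substring` exception like the one above usually suggests that the end index was smaller than the start index, which the loop below avoids by always searching from the current start position.

```java
class CsvEscapeSketch {

    /**
     * Hypothetical helper: returns value with every occurrence of quote doubled.
     * The search always begins at 'start', so 'next' can never be smaller than
     * 'start' and substring(start, end) never receives crossed indices.
     */
    static String escape(String value, String quote) {
        if (quote.isEmpty()) {
            return value; // nothing to escape
        }
        StringBuilder out = new StringBuilder(value.length() + 8);
        int start = 0;
        int next = value.indexOf(quote, start);
        while (next >= 0) {
            int end = next + quote.length();
            // copy everything up to and including the quote found at 'next'
            out.append(value.substring(start, end));
            // write the quote a second time to escape it
            out.append(quote);
            start = end;
            next = value.indexOf(quote, start);
        }
        // copy the remainder after the last quote (possibly the whole string)
        out.append(value.substring(start));
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("Classe 4 \"GT\" 4 TT", "\""));
    }
}
```

Whether the real writer fails because of truncation, a multi-character quote, or something else cannot be told from the stack trace alone; the sketch only shows the invariant worth checking.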
> Indexer plugin to write CSV
> ---------------------------
>
> Key: NUTCH-1541
> URL: https://issues.apache.org/jira/browse/NUTCH-1541
> Project: Nutch
> Issue Type: New Feature
> Components: indexer
> Affects Versions: 1.7
> Reporter: Sebastian Nagel
> Priority: Minor
> Attachments: NUTCH-1541-v1.patch, NUTCH-1541-v2.patch
>
>
> With the new pluggable indexer a simple plugin would be handy to write
> configurable fields into a CSV file - for further analysis or just for export.