[ https://issues.apache.org/jira/browse/NUTCH-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16395312#comment-16395312 ]
ASF GitHub Bot commented on NUTCH-1541:
---------------------------------------

okedoki commented on issue #294: NUTCH-1541 Indexer plugin to write CSV
URL: https://github.com/apache/nutch/pull/294#issuecomment-372329755

Hi @sebastian-nagel,

I get the following exception. I'm not sure whether it is related to distributed mode:

Error: java.lang.StringIndexOutOfBoundsException: String index out of range: -2827
    at java.lang.String.substring(String.java:1967)
    at org.apache.nutch.indexwriter.csv.CSVIndexWriter.writeEscaped(CSVIndexWriter.java:357)
    at org.apache.nutch.indexwriter.csv.CSVIndexWriter.writeQuoted(CSVIndexWriter.java:320)
    at org.apache.nutch.indexwriter.csv.CSVIndexWriter.write(CSVIndexWriter.java:248)
    at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:87)
    at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:50)
    at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:41)
    at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.write(ReduceTask.java:493)
    at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:422)
    at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:369)
    at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:57)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:170)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:164)

It happens when the writer tries to escape characters in a string like this one:

6 GT Année 2017-2018 Deuxième degré (orientation) : Quatrième année de l’enseignement de transition de qualification général technique technique professionnel Classe 4 GT 4 GT
4 GT 4 TT 4 TQA 4 TQR 4 PR 4 PV Option principale S c ie n c e s S c ie n c e s é c o n o m iq u e s S c ie n c e s s o c ia le s E d u c a ti o n p h y s iq u e T e c h n iq

This is kind of strange, because in my version of Nutch all punctuation characters should be deleted. The other lines look like this and are parsed correctly:

atletica de sports aerobics club voor jou atletica de sports aerobics club

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

> Indexer plugin to write CSV
> ---------------------------
>
>                 Key: NUTCH-1541
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1541
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>    Affects Versions: 1.7
>            Reporter: Sebastian Nagel
>            Priority: Minor
>         Attachments: NUTCH-1541-v1.patch, NUTCH-1541-v2.patch
>
>
> With the new pluggable indexer a simple plugin would be handy to write
> configurable fields into a CSV file - for further analysis or just for export.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
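
Editorial note on the exception reported in the comment above: on Java 8-era JDKs, a negative number in a StringIndexOutOfBoundsException thrown by String.substring generally means that a computed index or length went negative, for example an end index smaller than the begin index. The sketch below is not the actual CSVIndexWriter code. The class CsvEscapeSketch, the method escape and its parameters (quote, escapeChar, maxLength) are hypothetical names chosen for illustration. It only shows one way to escape quote characters while clamping every index into the valid range, so that no negative bound can reach the copy step.

```java
/**
 * Hypothetical sketch, NOT the Nutch CSVIndexWriter implementation:
 * escapes quote characters in a value and truncates the input to
 * maxLength characters, with all indices clamped to a valid range.
 */
final class CsvEscapeSketch {

  static String escape(String value, char quote, char escapeChar, int maxLength) {
    // Clamp the end index into [0, value.length()].
    int end = Math.max(0, Math.min(value.length(), maxLength));
    StringBuilder out = new StringBuilder(end + 8);
    int start = 0;
    int pos;
    // Only consider quote characters that lie before the clamped end.
    while ((pos = value.indexOf(quote, start)) >= 0 && pos < end) {
      out.append(value, start, pos);         // text before the quote
      out.append(escapeChar).append(quote);  // escaped quote
      start = pos + 1;
    }
    // Copy the remainder only if start has not passed the end index.
    if (start < end) {
      out.append(value, start, end);
    }
    return out.toString();
  }

  public static void main(String[] args) {
    // An end index smaller than the begin index fails in substring().
    try {
      "abc".substring(2, 1);
    } catch (StringIndexOutOfBoundsException e) {
      System.out.println("substring failed: " + e.getMessage());
    }
    // The clamped version handles truncation and quotes without throwing.
    System.out.println(escape("say \"hi\" now", '"', '"', 8));
  }
}
```

Whether the real writeEscaped fails for this reason cannot be determined from the stack trace alone. The sketch only illustrates the general class of error and a defensive pattern against it.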