[ 
https://issues.apache.org/jira/browse/NUTCH-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16395312#comment-16395312
 ] 

ASF GitHub Bot commented on NUTCH-1541:
---------------------------------------

okedoki commented on issue #294: NUTCH-1541 Indexer plugin to write CSV
URL: https://github.com/apache/nutch/pull/294#issuecomment-372329755
 
 
   Hi @sebastian-nagel,
   I get the following exception; I'm not sure whether it is related to 
distributed mode.
   
   Error: java.lang.StringIndexOutOfBoundsException: String index out of range: -2827
       at java.lang.String.substring(String.java:1967)
       at org.apache.nutch.indexwriter.csv.CSVIndexWriter.writeEscaped(CSVIndexWriter.java:357)
       at org.apache.nutch.indexwriter.csv.CSVIndexWriter.writeQuoted(CSVIndexWriter.java:320)
       at org.apache.nutch.indexwriter.csv.CSVIndexWriter.write(CSVIndexWriter.java:248)
       at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:87)
       at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:50)
       at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:41)
       at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.write(ReduceTask.java:493)
       at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:422)
       at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:369)
       at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:57)
       at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
       at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
       at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:170)
       at java.security.AccessController.doPrivileged(Native Method)
       at javax.security.auth.Subject.doAs(Subject.java:422)
       at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
       at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:164)
   
   
   It happens when the writer tries to escape characters in a string like this:
   6 GT
   Année 2017-2018 Deuxième degré (orientation) :
   Quatrième année de l’enseignement
   de transition de qualification
   général technique technique professionnel
   Classe 4 GT 4 GT 4 GT 4 TT 4 TQA 4 TQR 4 PR 4 PV
   Option principale
   S c ie
   n c e s
   S c ie
   n c e s
   é c o
   n o
   m iq
   u e s
   S c ie
   n c e s
   s o
   c ia
   le s
   E d
   u c a ti
   o n
   p
   h y s iq
   u e
   T e c h
   n iq
   
   This is somewhat strange, because in my version of Nutch all punctuation 
characters should have been removed.
   The other lines look like this and are parsed correctly:
   atletica de sports aerobics club voor jou atletica de sports aerobics club
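
For reference, a negative "String index out of range" value from `String.substring` means that a negative start index or an end index smaller than the start index was passed in. The following is a minimal sketch of quote escaping that keeps every range non-negative. It is not the actual `CSVIndexWriter` code: the class `CsvEscapeSketch`, the method `escapeQuotes`, and the `maxLength` parameter are hypothetical names used only for illustration, and it assumes that truncation (if any) is applied before searching for quote characters.

```java
class CsvEscapeSketch {

    /**
     * Doubles every occurrence of the quote character in value.
     * Hypothetical helper for illustration, not part of Nutch.
     * A negative maxLength means "do not truncate".
     */
    static String escapeQuotes(String value, char quote, int maxLength) {
        // Truncate first, so that all later indices refer to the same string.
        String s = (maxLength >= 0 && value.length() > maxLength)
                ? value.substring(0, maxLength)
                : value;
        StringBuilder out = new StringBuilder(s.length() + 8);
        int start = 0;
        int next;
        while ((next = s.indexOf(quote, start)) >= 0) {
            // indexOf never returns a match before 'start', so next >= start
            // and the copied range [start, next) is never negative.
            out.append(s, start, next).append(quote).append(quote);
            start = next + 1;
        }
        // Copy the remainder after the last quote (possibly empty).
        out.append(s, start, s.length());
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: say ""hi""
        System.out.println(escapeQuotes("say \"hi\"", '"', -1));
    }
}
```

Multi-line values such as the one quoted above pass through unchanged when they contain no quote character, so the line breaks themselves should not matter to a loop of this shape.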
   
   
   



> Indexer plugin to write CSV
> ---------------------------
>
>                 Key: NUTCH-1541
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1541
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>    Affects Versions: 1.7
>            Reporter: Sebastian Nagel
>            Priority: Minor
>         Attachments: NUTCH-1541-v1.patch, NUTCH-1541-v2.patch
>
>
> With the new pluggable indexer a simple plugin would be handy to write 
> configurable fields into a CSV file - for further analysis or just for export.



