Are you using MappingCharFilter? It unfortunately has known bugs which require controversial API changes to fix: https://issues.apache.org/jira/browse/LUCENE-6595
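
Separately, Uwe's question about BufferedReader (quoted below) is easy to check outside Lucene. Here is a minimal sketch, assuming the content is rebuilt from readLine() output before it is handed to the indexer; the class and method names are made up for illustration and are not part of Lucene. readLine() returns each line without its terminator, so "\r\n" is already gone before any analyzer sees the text:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class LineEndingOffsets {

    /**
     * Hypothetical helper: rebuilds the text the way a per-line reading
     * loop often does. readLine() strips "\n", "\r" and "\r\n", and the
     * lines are re-joined here with a single "\n".
     */
    public static String readPerLine(String raw) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new StringReader(raw))) {
            String line;
            boolean first = true;
            while ((line = reader.readLine()) != null) {
                if (!first) {
                    sb.append('\n'); // one char where the original had two
                }
                sb.append(line);
                first = false;
            }
        } catch (IOException e) {
            // Not expected for an in-memory StringReader.
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String raw = "first line\r\nsecond line\r\nthird";
        String rebuilt = readPerLine(raw);

        // Offsets computed on the rebuilt text no longer match the raw text.
        System.out.println("offset of 'third' in raw text:     " + raw.indexOf("third"));
        System.out.println("offset of 'third' in rebuilt text: " + rebuilt.indexOf("third"));
        System.out.println("rebuilt text contains \\r: " + rebuilt.contains("\r"));
    }
}
```

If the two offsets differ by one per preceding line break, the "\r" was most likely lost while reading the file, not inside Lucene.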

Mike McCandless

http://blog.mikemccandless.com

On Sat, Oct 3, 2015 at 6:02 PM, Uwe Schindler <u...@thetaphi.de> wrote:
> Hi,
>
> Lucene does not remove the \r\n while indexing or storing fields. The
> Analyzer just splits, e.g., at whitespace (depends on the Analyzer). So if
> your original data has \r\n, then the offsets would be according to that
> (it counts 2 chars).
>
> Could it be that you read it using a BufferedReader per line and pass the
> lines as Strings?
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>
>> -----Original Message-----
>> From: Ziqi Zhang [mailto:ziqi.zh...@sheffield.ac.uk]
>> Sent: Saturday, October 03, 2015 5:01 PM
>> To: java-user@lucene.apache.org
>> Subject: lucene deliberately removes \r (windows carriage char)
>>
>> Hi
>>
>> I am trying to pinpoint a mismatch between the offsets produced by the
>> Lucene indexing process when I use the offsets to substring from the
>> original document content.
>>
>> I tried to debug as far as I could, but I lost track of Lucene at line 298
>> of DefaultIndexingChain (lucene 5.3.0):
>>
>> for (IndexableField field : docState.doc) {
>>   fieldCount = processField(field, fieldGen, fieldCount);
>> }
>>
>> Basically, at this point I can see that the content field (one of the
>> IndexableFields) I am interested in has already had all "\r" removed from
>> the "\r\n" newline characters (windows) in the content. But I am unable to
>> trace how these IndexableFields are generated, and how the raw content is
>> passed to them.
>>
>> I can be certain that my program did pass strings with lots of "\r\n".
>>
>> So the question is: is this (i.e., removing \r) deliberate?
>>
>> Thanks
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org