I played with this possibility on the extremely experimental https://issues.apache.org/jira/browse/LUCENE-5012 branch, which I haven't gotten back to in a long time...
The changes on that branch add the idea of a "deleted token": a new DeletedAttribute marks whether the token is deleted, while all other token attributes remain visible as normal. I.e., tokens are deleted the way documents are deleted in Lucene (marked with a bit but not actually removed until "later"). E.g. StopFilter (on that branch) just sets that attribute to true, instead of removing the token and leaving a hole. The branch also had an InsertDeletedPunctuationTokenStage that would detect when the tokenizer had dropped punctuation and then insert [deleted] punctuation tokens. This way IndexWriter could still ignore such tokens (since they are marked as deleted), but other token filters would still see the deleted tokens and be able to make decisions based on them...

Anyway, the branch is far, far away from committing, but maybe we could just pull from it the idea of a "deleted bit" that we set on a given token to tell IndexWriter not to index it, while subsequent token filters would still be able to see it ...

Mike McCandless

http://blog.mikemccandless.com

On Wed, Oct 1, 2014 at 3:08 AM, Dawid Weiss <dawid.we...@gmail.com> wrote:
> Hi Steve,
>
> I have to admit I also find it frequently useful to include
> punctuation as tokens (even if it's filtered out by subsequent token
> filters for indexing, it's a useful to-have for other NLP tasks). Do
> you think it'd be possible (read: relatively easy) to create an
> analyzer (or a modification of the standard one's lexer) so that
> punctuation is returned as a separate token type?
>
> Dawid
>
> On Wed, Oct 1, 2014 at 7:01 AM, Steve Rowe <sar...@gmail.com> wrote:
>> Hi Paul,
>>
>> StandardTokenizer implements the Word Boundaries rules in the Unicode Text
>> Segmentation Standard Annex UAX#29 - here's the relevant section for Unicode
>> 6.1.0, which is the version supported by Lucene 4.1.0:
>> <http://www.unicode.org/reports/tr29/tr29-19.html#Word_Boundaries>.
>>
>> Only those sequences between boundaries that contain letters and/or digits
>> are returned as tokens; all other sequences between boundaries are skipped
>> over and not returned as tokens.
>>
>> Steve
>>
>> On Sep 30, 2014, at 3:54 PM, Paul Taylor <paul_t...@fastmail.fm> wrote:
>>
>>> Does StandardTokenizer remove punctuation (in Lucene 4.1)?
>>>
>>> I'm just trying to move back to StandardTokenizer from my own old custom
>>> implementation, because the newer version seems to have much better support
>>> for Asian languages.
>>>
>>> However, this code excerpt fails on incrementToken(), implying that the !!!
>>> are removed from the output. Yet looking at the jflex classes I can't see
>>> anything to indicate punctuation is removed. Is it removed, and if so can I
>>> prevent that?
>>>
>>> Tokenizer tokenizer = new StandardTokenizer(LuceneVersion.LUCENE_VERSION,
>>>     new StringReader("!!!"));
>>> assertNotNull(tokenizer);
>>> tokenizer.reset();
>>> assertTrue(tokenizer.incrementToken());
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
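[Editor's note: the behavior Steve describes - segmentation finds boundaries around punctuation too, but only segments containing letters or digits are emitted as tokens - can be illustrated without Lucene, since the JDK's java.text.BreakIterator also implements UAX#29 word boundaries (possibly a different Unicode version than Lucene 4.1's). This is a hedged sketch, not Lucene code; the class and method names are invented for illustration.]

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Sketch: UAX#29 word segmentation via the JDK's BreakIterator.
// Boundaries are found around "!!!" as well, but a word tokenizer only
// emits segments containing at least one letter or digit - which is why
// a punctuation-only input produces zero tokens.
public class WordBoundaryDemo {

    static List<String> wordTokens(String text) {
        List<String> tokens = new ArrayList<>();
        BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
        words.setText(text);
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE;
                start = end, end = words.next()) {
            String segment = text.substring(start, end);
            // Keep only segments with at least one letter or digit,
            // mirroring the rule Steve quotes above.
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                tokens.add(segment);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(wordTokens("foo!!!bar"));  // [foo, bar]
        System.out.println(wordTokens("!!!"));        // []
    }
}
```

This also explains why Paul's test fails: for the input "!!!" there is a segment, but it contains no letters or digits, so no token is ever returned and incrementToken() reports false.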
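[Editor's note: Mike's "deleted bit" proposal can be sketched in plain Java. This is a toy model, not the LUCENE-5012 branch's actual API - Token, markStopwords, and indexedTerms are invented stand-ins for the DeletedAttribute, StopFilter, and IndexWriter roles he describes.]

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Toy sketch of the "deleted bit" idea: a stop filter marks tokens as
// deleted instead of removing them, so downstream filters still see the
// full token stream while the indexer skips the marked tokens.
public class DeletedBitSketch {

    // Stand-in for a token carrying a DeletedAttribute-style flag.
    static final class Token {
        final String text;
        boolean deleted;          // the "deleted bit"
        Token(String text) { this.text = text; }
    }

    // StopFilter analogue: sets the deleted bit rather than dropping tokens.
    static List<Token> markStopwords(List<Token> in, Set<String> stops) {
        for (Token t : in) {
            if (stops.contains(t.text)) t.deleted = true;  // mark, don't remove
        }
        return in;                // every token survives; no holes
    }

    // IndexWriter analogue: indexes only tokens whose deleted bit is clear.
    static List<String> indexedTerms(List<Token> in) {
        List<String> out = new ArrayList<>();
        for (Token t : in) {
            if (!t.deleted) out.add(t.text);
        }
        return out;
    }

    static List<Token> tokens(String... texts) {
        List<Token> out = new ArrayList<>();
        for (String s : texts) out.add(new Token(s));
        return out;
    }

    public static void main(String[] args) {
        List<Token> stream = markStopwords(tokens("the", "quick", "fox"),
                                           Set.of("the"));
        // A downstream filter still sees all three tokens...
        System.out.println(stream.size());          // 3
        // ...but only the non-deleted ones reach the index.
        System.out.println(indexedTerms(stream));   // [quick, fox]
    }
}
```

The same marking scheme is what would let an InsertDeletedPunctuationTokenStage hand punctuation to NLP-oriented filters while keeping it out of the index.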