[ https://issues.apache.org/jira/browse/LUCENE-2413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12868038#action_12868038 ]
Robert Muir commented on LUCENE-2413:
-------------------------------------

bq. May this be much faster than CharArraySet?

I ran indexing tests a while ago (Reuters) with CharArraySet itself implemented with a DFA, and it was slightly faster, but not by much. I think this is because English words are usually not very long (average length = 5).

For other languages this technique might save some CPU time, but I can imagine some "problems":

# Building an automaton from a list of words is more expensive, although Dawid Weiss has implemented an addition to automaton that does this fast.
# In general, building the automaton, the RunAutomaton, etc. is more "heavy", I would think, but Mike McCandless hacked away a lot of this heaviness when we converted to UTF-32.
# The CharacterRunAutomaton is not optimized right now: we disabled the classmap[] for chars because it consumes more RAM. I think if we were to care about performance on char[], we should make the classmap cover 0x0-0xffff and binary search the rest, or something similar. Currently it binary searches on each input character.

Somewhat related: a while ago I tested this with CharArraySet as a DFA and opened LUCENE-2227. But obviously that is not the only way, as this example shows filtering on the DFA itself (and not using CharArraySet at all).

So in general I have those concerns right now, but maybe in the future, once some things are addressed, we could at least make an optional StopFilter impl or something similar.

One thing I personally like about this filter is that rejected terms always get (optionally) the posInc increased... I do not think our existing KeepWord or LengthFilters do this, but maybe I am wrong.
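To make the third point above concrete, here is a rough sketch of the "classmap for 0x0-0xffff, binary search the rest" idea. It is not Lucene code: the class CharClassLookup, its fields, and its method names are all hypothetical, and it assumes the character classes are given as a strictly increasing array of interval start points beginning at 0, with inputs that are valid code points (0 to 0x10FFFF).

```java
import java.util.Arrays;

/** Hypothetical sketch only; not the actual RunAutomaton implementation. */
final class CharClassLookup {
    // One table entry per 16-bit value: 0x0..0xFFFF.
    private static final int TABLE_SIZE = 0x10000;

    // Assumed input: strictly increasing interval start points, startPoints[0] == 0.
    private final int[] startPoints;

    // Precomputed class index for every value below TABLE_SIZE.
    // An int[] costs 256 KB per instance; this is the RAM trade-off.
    private final int[] table;

    CharClassLookup(int[] sortedStartPoints) {
        if (sortedStartPoints.length == 0 || sortedStartPoints[0] != 0) {
            throw new IllegalArgumentException("start points must begin at 0");
        }
        for (int i = 1; i < sortedStartPoints.length; i++) {
            if (sortedStartPoints[i] <= sortedStartPoints[i - 1]) {
                throw new IllegalArgumentException("start points must be strictly increasing");
            }
        }
        this.startPoints = sortedStartPoints.clone();
        this.table = new int[TABLE_SIZE];
        int cls = 0;
        for (int c = 0; c < TABLE_SIZE; c++) {
            // Advance to the last interval whose start point is <= c.
            while (cls + 1 < startPoints.length && startPoints[cls + 1] <= c) {
                cls++;
            }
            table[c] = cls;
        }
    }

    /** Table lookup for 0x0..0xFFFF, binary search for everything above. */
    int classOf(int codePoint) {
        if (codePoint < TABLE_SIZE) {
            return table[codePoint];
        }
        return binarySearchClass(codePoint);
    }

    /** Binary search on every input, for comparison with the table path. */
    int binarySearchClass(int codePoint) {
        int idx = Arrays.binarySearch(startPoints, codePoint);
        // On a miss, binarySearch returns -(insertionPoint) - 1, and the
        // containing interval is the one just before the insertion point.
        return idx >= 0 ? idx : -idx - 2;
    }
}
```

Whether this is worth it depends on how many automata are live at once, since each instance pays for the table up front. A short[] or char[] table would halve the cost when the number of classes allows it.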
> Consolidate all (Solr's & Lucene's) analyzers into modules/analysis
> -------------------------------------------------------------------
>
> Key: LUCENE-2413
> URL: https://issues.apache.org/jira/browse/LUCENE-2413
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/analyzers
> Reporter: Michael McCandless
> Assignee: Robert Muir
> Fix For: 4.0
>
> Attachments: LUCENE-2413-charfilter.patch, LUCENE-2413-PFAW+LF.patch,
> LUCENE-2413_commongrams.patch, LUCENE-2413_folding.patch,
> LUCENE-2413_htmlstrip.patch, LUCENE-2413_keep_hyphen_trim.patch,
> LUCENE-2413_mockfilter.patch, LUCENE-2413_mockfilter.patch,
> LUCENE-2413_pattern.patch, LUCENE-2413_porter.patch,
> LUCENE-2413_removeDups.patch, LUCENE-2413_synonym.patch,
> LUCENE-2413_teesink.patch, LUCENE-2413_testanalyzer.patch,
> LUCENE-2413_testanalyzer.patch, LUCENE-2413_tests2.patch,
> LUCENE-2413_wdf.patch
>
>
> We've been wanting to do this for quite some time now... I think, now that
> Solr/Lucene are merged, and we're looking at opening an unstable line of
> development for Solr/Lucene, now is the right time to do it.
> A standalone module for all analyzers also empowers apps to separately
> version the analyzers from which version of Solr/Lucene they use, possibly
> enabling us to remove Version entirely from the analyzers.
> We should also do LUCENE-2309 (decouple, as much as possible, indexer from
> the analysis API), but I don't think that issue needs to block this
> consolidation.
> Once we do this, there is one place where our users can find all the
> analyzers that Solr/Lucene provide.