[jira] Commented: (LUCENE-2413) Consolidate all (Solr's & Lucene's) analyzers into modules/analysis

Robert Muir (JIRA) Sun, 16 May 2010 13:49:08 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-2413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12868038#action_12868038
 ]


Robert Muir commented on LUCENE-2413:
-------------------------------------

bq. Maybe this is much faster than CharArraySet

I ran indexing tests a while ago (Reuters) with CharArraySet itself implemented 
with a DFA, and it was slightly faster, but not much. I think this is because 
English words are usually not very long (average length=5). For other languages 
this technique might save some CPU time, but I can imagine some "problems":

# building an automaton from a list of words is more expensive, although Dawid 
Weiss has implemented an addition to automaton that does this fast.
# in general, building the automaton, runautomaton, etc. is more "heavy" I would 
think, but Mike McCandless hacked away a lot of this heaviness when we 
converted to UTF-32.
# the CharacterRunAutomaton is not optimized right now: we disabled the 
classmap[] for chars because it consumes more RAM. I think if we were to care 
about performance on char[] we should make the classmap cover 0x0-0xffff and 
binary search the rest, or something similar. Currently it binary searches on 
each input character.
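
The third point (a direct classmap for 0x0-0xffff, binary search for the rest) 
could look roughly like the sketch below. This is only an illustration under 
assumptions: CharClassLookup, classOf and binarySearchClass are hypothetical 
names, not Lucene's RunAutomaton API, and the "character class" is taken to be 
the index of the interval that a code point falls into, given the sorted 
interval start points.

```java
import java.util.Arrays;

/** Hypothetical sketch, not Lucene code: maps a code point to a character class. */
final class CharClassLookup {
    // Sorted start points of the character intervals; points[0] must be 0.
    private final int[] points;
    // Direct lookup table for code points 0x0..0xFFFF (the char range).
    private final int[] classmap = new int[0x10000];

    CharClassLookup(int[] sortedStartPoints) {
        if (sortedStartPoints.length == 0 || sortedStartPoints[0] != 0) {
            throw new IllegalArgumentException("start points must begin with 0");
        }
        this.points = sortedStartPoints.clone();
        // Fill the table once, walking the intervals in order.
        int cls = 0;
        for (int cp = 0; cp < classmap.length; cp++) {
            while (cls + 1 < points.length && points[cls + 1] <= cp) {
                cls++;
            }
            classmap[cp] = cls;
        }
    }

    /** Table lookup for the char range, binary search above it. */
    int classOf(int codePoint) {
        if (codePoint < classmap.length) {
            return classmap[codePoint];
        }
        return binarySearchClass(codePoint);
    }

    /** Binary search over all start points (the slow path for every input). */
    int binarySearchClass(int codePoint) {
        int idx = Arrays.binarySearch(points, codePoint);
        // Exact hit: idx is the class. Miss: the class is (insertion point - 1).
        return idx >= 0 ? idx : -idx - 2;
    }
}
```

The trade-off is the RAM mentioned above: an int[] covering the char range is 
65536 entries (256 KiB) per automaton in this sketch, which buys a single array 
read per char instead of a binary search.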

Somewhat related: a while ago I tested this with CharArraySet as a DFA, and 
opened this issue: LUCENE-2227. But obviously this is not the only way, as this 
example shows filtering on the DFA itself (and not using CharArraySet at all).
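
To make "filtering on the DFA itself" concrete, here is a minimal sketch that 
matches a term's char[] buffer against a word list without any hash set. It is 
a stand-in under assumptions: StopWordDfa is a hypothetical name, and the DFA 
is a plain unminimized trie built with HashMap transitions, not Lucene's 
automaton package or the filter from the patch.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch, not Lucene code: a trie used as a DFA over chars. */
final class StopWordDfa {
    private static final class State {
        final Map<Character, State> next = new HashMap<>();
        boolean accept; // true if some word ends in this state
    }

    private final State start = new State();

    StopWordDfa(String... words) {
        // Building costs one pass over every character of every word.
        for (String word : words) {
            State s = start;
            for (int i = 0; i < word.length(); i++) {
                s = s.next.computeIfAbsent(word.charAt(i), c -> new State());
            }
            s.accept = true;
        }
    }

    /** Runs buffer[offset .. offset+length) through the DFA. */
    boolean accepts(char[] buffer, int offset, int length) {
        State s = start;
        for (int i = offset; i < offset + length; i++) {
            s = s.next.get(buffer[i]);
            if (s == null) {
                return false; // dead state: no word has this prefix
            }
        }
        return s.accept;
    }
}
```

A non-matching term is rejected as soon as it leaves the trie, which is where 
longer words might benefit more than short English ones.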

So in general, I have those concerns right now, but maybe in the future, once 
some things are addressed, we could at least make an optional StopFilter impl 
or something similar.

One thing I like about this filter personally is that rejected terms always 
(optionally) increase the posInc... I do not think our existing KeepWord 
or LengthFilters do this, but maybe I am wrong.
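
For readers unfamiliar with the posInc behavior, the idea can be sketched in 
plain Java as follows. This is an assumption-laden illustration, not the 
filter from the patch and not Lucene's TokenFilter API: PosIncSketch, Token and 
the enablePosInc flag are hypothetical names. When a term is rejected, its 
position increment is carried over to the next term that is kept, so the gap 
stays visible to phrase queries.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch, not Lucene code: carries posInc across rejected terms. */
final class PosIncSketch {
    static final class Token {
        final String term;
        final int posInc;

        Token(String term, int posInc) {
            this.term = term;
            this.posInc = posInc;
        }
    }

    static List<Token> filter(List<Token> in, Set<String> rejected, boolean enablePosInc) {
        List<Token> out = new ArrayList<>();
        int skipped = 0; // increments of rejected terms not yet handed on
        for (Token t : in) {
            if (rejected.contains(t.term)) {
                if (enablePosInc) {
                    skipped += t.posInc;
                }
                continue; // drop the term itself
            }
            // The kept term absorbs the increments of the terms dropped before it.
            out.add(new Token(t.term, t.posInc + skipped));
            skipped = 0;
        }
        return out;
    }
}
```

With the flag off, the surviving terms keep their original increments and the 
removed terms leave no trace in the positions.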


> Consolidate all (Solr's & Lucene's) analyzers into modules/analysis
> -------------------------------------------------------------------
>
>                 Key: LUCENE-2413
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2413
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Michael McCandless
>            Assignee: Robert Muir
>             Fix For: 4.0
>
>         Attachments: LUCENE-2413-charfilter.patch, LUCENE-2413-PFAW+LF.patch, 
> LUCENE-2413_commongrams.patch, LUCENE-2413_folding.patch, 
> LUCENE-2413_htmlstrip.patch, LUCENE-2413_keep_hyphen_trim.patch, 
> LUCENE-2413_mockfilter.patch, LUCENE-2413_mockfilter.patch, 
> LUCENE-2413_pattern.patch, LUCENE-2413_porter.patch, 
> LUCENE-2413_removeDups.patch, LUCENE-2413_synonym.patch, 
> LUCENE-2413_teesink.patch, LUCENE-2413_testanalyzer.patch, 
> LUCENE-2413_testanalyzer.patch, LUCENE-2413_tests2.patch, 
> LUCENE-2413_wdf.patch
>
>
> We've been wanting to do this for quite some time now...  I think, now that 
> Solr/Lucene are merged, and we're looking at opening an unstable line of 
> development for Solr/Lucene, now is the right time to do it.
> A standalone module for all analyzers also empowers apps to separately 
> version the analyzers from which version of Solr/Lucene they use, possibly 
> enabling us to remove Version entirely from the analyzers.
> We should also do LUCENE-2309 (decouple, as much as possible, indexer from 
> the analysis API), but I don't think that issue needs to block this 
> consolidation.
> Once we do this, there is one place where our users can find all the 
> analyzers that Solr/Lucene provide.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
