Re: Should ASCIIFoldingFilter be deprecated?

2011-02-08 Thread Robert Muir
On Mon, Feb 7, 2011 at 10:51 PM, Steven A Rowe sar...@syr.edu wrote:
 I haven't done any benchmarking, but I'm pretty sure that ASCIIFoldingFilter 
 can achieve a significantly higher throughput rate than MappingCharFilter, 
 and given that, it probably makes sense to keep both, to allow people to make 
 the choice about the tradeoff between the flexibility provided by the 
 human-readable (and editable) mapping file and the speed provided by 
 ASCIIFoldingFilter.

I agree... have you seen http://bugs.icu-project.org/trac/ticket/7743 ?

Hopefully something along those lines would allow us to support the
flexibility in a factory or whatever (even better as described, when
you just want a small tweak) but still with good performance.

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



RE: Should ASCIIFoldingFilter be deprecated?

2011-02-08 Thread David Smiley (@MITRE.org)


Chris Hostetter-3 wrote:
 
 CharFilters and TokenFilters have different purposes though...
 
 http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#When_To_use_a_CharFilter_vs_a_TokenFilter
 
 (ie: If you use MappingCharFilter, you can't then tokenize on some of the 
 characters you filtered away)
 

Right, but it’s hard to imagine wanting to tokenize on an accent character
or some other modification specified in these particular mapping files.


Steven A Rowe wrote:
 
 AFAIK, ISOLatin1AccentFilter was deprecated because ASCIIFoldingFilter
 provides a superset of its mappings.
 

*If* that is the case then this file should also be removed:
solr/example/solr/conf/mapping-ISOLatin1Accent.txt


Steven A Rowe wrote:
 
 I haven't done any benchmarking, but I'm pretty sure that
 ASCIIFoldingFilter can achieve a significantly higher throughput rate than
 MappingCharFilter, and given that, it probably makes sense to keep both,
 to allow people to make the choice about the tradeoff between the
 flexibility provided by the human-readable (and editable) mapping file and
 the speed provided by ASCIIFoldingFilter.
 

I'm skeptical that the difference, whatever it is, matters in the scheme of
things. The cost of keeping it is confusion for users and more code to
maintain.

~ David Smiley

-
 Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Should-ASCIIFoldingFilter-be-deprecated-tp2448919p2451504.html
Sent from the Solr - Dev mailing list archive at Nabble.com.




Re: Should ASCIIFoldingFilter be deprecated?

2011-02-08 Thread Robert Muir
On Tue, Feb 8, 2011 at 9:12 AM, David Smiley (@MITRE.org)
dsmi...@mitre.org wrote:

 I'm skeptical that the difference, whatever it is, matters in the scheme of
 things. The cost of keeping it is confusion for users and more code to
 maintain.


it's pretty significant. charfilters are not reusable, and they box every
character and look it up in a hashmap (i made a patch to fix the
reusability, but no one has commented):
https://issues.apache.org/jira/browse/LUCENE-2788

asciifoldingfilter does a huge switch (which still isn't optimal), but
it's way, way faster than mappingcharfilter, especially since it's a
no-op for chars <= 0x7F.

icufoldingfilter precompiles a recursively decomposed trie, so its
lookup is a unicode folded trie
(icu-project.org/docs/papers/foldedtrie_iuc21.ppt). I think it's a tad
slower than asciifoldingfilter, but it also incorporates case folding
and unicode normalization: neither asciifoldingfilter nor
mappingcharfilter will properly fold
http://www.geonames.org/search.html?q=Ab%C5%AB+Z%CC%A7aby&country=
because there is no composed form for Z + combining cedilla, but
icufoldingfilter will.
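
The point about Z + combining cedilla can be checked with a small sketch. It is written in Python with the standard `unicodedata` module rather than Lucene or ICU, and it does not reproduce how icufoldingfilter works internally. The `PRECOMPOSED_FOLD` table and the two helper functions are made up for illustration; the table stands in for any folding scheme keyed on single precomposed characters.

```python
import unicodedata

u_macron = "u\u0304"    # u + COMBINING MACRON
z_cedilla = "Z\u0327"   # Z + COMBINING CEDILLA

# NFC composes u + macron into the single code point U+016B ...
composed_u = unicodedata.normalize("NFC", u_macron)
# ... but leaves Z + cedilla as two code points, since no composed form exists.
composed_z = unicodedata.normalize("NFC", z_cedilla)

# Hypothetical folding table keyed on precomposed characters only.
PRECOMPOSED_FOLD = {"\u016b": "u"}


def table_fold(text: str) -> str:
    """Fold one code point at a time using the table above."""
    return "".join(PRECOMPOSED_FOLD.get(ch, ch) for ch in text)


def strip_marks(text: str) -> str:
    """Decompose first, then drop combining marks (a crude stand-in for folding)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


name = "Ab\u016b Z\u0327aby"
print(ascii(table_fold(name)))   # the combining cedilla survives
print(ascii(strip_marks(name)))  # the cedilla is removed
```

A table of precomposed characters has no key that matches the two-code-point sequence, so only an approach that works on decomposed text handles it.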




Re: Should ASCIIFoldingFilter be deprecated?

2011-02-08 Thread David Smiley (@MITRE.org)


Robert Muir wrote:
 
 On Tue, Feb 8, 2011 at 9:12 AM, David Smiley (@MITRE.org)
 dsmi...@mitre.org wrote:
 
 I'm skeptical that the difference, whatever it is, matters in the scheme of
 things. The cost of keeping it is confusion for users and more code to
 maintain.

 
 it's pretty significant. charfilters are not reusable, and they box every
 character and look it up in a hashmap (i made a patch to fix the
 reusability, but no one has commented):
 https://issues.apache.org/jira/browse/LUCENE-2788
 
 asciifoldingfilter does a huge switch (which still isn't optimal), but
 it's way, way faster than mappingcharfilter, especially since it's a
 no-op for chars <= 0x7F.
 

Well then I see a path forward to speed up MappingCharFilter substantially.
There's your LUCENE-2788, and then you could easily add the same no-op
optimization for any char below the smallest char value in the HashMap.
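
A minimal sketch of that idea, in Python rather than Lucene's Java and assuming a table with single-character keys only. `MAPPING`, `MIN_KEY` and `apply_mapping` are hypothetical names, not part of MappingCharFilter.

```python
# Hypothetical single-character mapping table (illustration only).
MAPPING = {"\u00e9": "e", "\u00fc": "u", "\u00df": "ss"}

# Smallest code point that appears as a key; anything below it cannot match.
MIN_KEY = min(ord(key) for key in MAPPING)


def apply_mapping(text: str) -> str:
    """Map characters, skipping the hash lookup for chars below MIN_KEY."""
    out = []
    for ch in text:
        if ord(ch) < MIN_KEY:
            out.append(ch)                   # fast path: no lookup at all
        else:
            out.append(MAPPING.get(ch, ch))  # slow path: hash lookup
    return "".join(out)


print(apply_mapping("Stra\u00dfe caf\u00e9"))
```

With an accent-folding table the smallest key sits above 0x7F, so plain ASCII text never touches the map.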




Re: Should ASCIIFoldingFilter be deprecated?

2011-02-08 Thread Robert Muir
On Tue, Feb 8, 2011 at 10:05 AM, David Smiley (@MITRE.org)
dsmi...@mitre.org wrote:

 Well then I see a path forward to speed up MappingCharFilter substantially.
 There's your LUCENE-2788, and then you could easily add the same no-op
 optimization for any char below the smallest char value in the HashMap.

only for chars below the smallest starter, and mappingcharfilter still has
to maintain an array of any offset changes (this is now binary searched)
for correctOffset.
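
To make the offset bookkeeping concrete, here is a simplified model in Python. It is a sketch under stated assumptions, not Lucene's actual implementation: it handles single-character keys only, and `map_with_offsets` and `correct_offset` are hypothetical names. Each time a replacement changes the text length, it records the output position and the cumulative shift, and the lookup binary searches that array.

```python
from bisect import bisect_right


def map_with_offsets(text, mapping):
    """Apply the mapping and record where output offsets drift from input offsets."""
    out = []
    offsets = []  # output positions from which a new cumulative shift applies
    diffs = []    # cumulative (input length - output length) at that position
    out_len = 0
    shift = 0
    for ch in text:
        repl = mapping.get(ch, ch)
        out.append(repl)
        out_len += len(repl)
        if len(repl) != 1:
            shift += 1 - len(repl)
            offsets.append(out_len)
            diffs.append(shift)
    return "".join(out), offsets, diffs


def correct_offset(off, offsets, diffs):
    """Translate an offset in the mapped text back to the original text."""
    i = bisect_right(offsets, off) - 1  # last recorded change at or before off
    return off if i < 0 else off + diffs[i]


mapped, offsets, diffs = map_with_offsets("Stra\u00dfe", {"\u00df": "ss"})
print(mapped, offsets, diffs)
print(correct_offset(6, offsets, diffs))  # position of the final "e"
```

Even with a fast path for unmapped characters, this array has to be kept up to date and searched whenever an offset is corrected, which is the extra cost noted above.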




Re: Should ASCIIFoldingFilter be deprecated?

2011-02-08 Thread Robert Zotter

unsubscribe




RE: Should ASCIIFoldingFilter be deprecated?

2011-02-07 Thread Steven A Rowe
AFAIK, ISOLatin1AccentFilter was deprecated because ASCIIFoldingFilter provides 
a superset of its mappings.

I haven't done any benchmarking, but I'm pretty sure that ASCIIFoldingFilter 
can achieve a significantly higher throughput rate than MappingCharFilter, and 
given that, it probably makes sense to keep both, to allow people to make the 
choice about the tradeoff between the flexibility provided by the 
human-readable (and editable) mapping file and the speed provided by 
ASCIIFoldingFilter.

Steve

 -Original Message-
 From: David Smiley (@MITRE.org) [mailto:dsmi...@mitre.org]
 Sent: Monday, February 07, 2011 10:34 PM
 To: solr-...@lucene.apache.org
 Subject: Should ASCIIFoldingFilter be deprecated?
 
 
 ISOLatin1AccentFilter is deprecated, presumably because you can (and should)
 use MappingCharFilter configured with mapping-ISOLatin1Accent.txt.  By that
 same reasoning, shouldn't ASCIIFoldingFilter be deprecated in favor of using
 mapping-FoldToASCII.txt ?
 
 ~ David Smiley
 



Re: Should ASCIIFoldingFilter be deprecated?

2011-02-07 Thread Chris Hostetter
: 
: ISOLatin1AccentFilter is deprecated, presumably because you can (and should)
: use MappingCharFilter configured with mapping-ISOLatin1Accent.txt.  By that
: same reasoning, shouldn't ASCIIFoldingFilter be deprecated in favor of using
: mapping-FoldToASCII.txt ?

CharFilters and TokenFilters have different purposes though...

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#When_To_use_a_CharFilter_vs_a_TokenFilter

(i.e., if you use MappingCharFilter, you can't then tokenize on some of the 
characters you filtered away)
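
A toy illustration of the ordering difference, in Python with made-up stand-ins (`char_filter`, `tokenize`, `token_filter`) rather than the real Lucene/Solr classes. The hyphen mapping is an assumption chosen to make the effect visible; it is not taken from the shipped mapping files.

```python
import re


def char_filter(text, mapping):
    """Runs before the tokenizer, on the raw character stream."""
    return "".join(mapping.get(ch, ch) for ch in text)


def tokenize(text):
    """Split on anything that is not an ASCII letter or digit."""
    return [tok for tok in re.split(r"[^0-9A-Za-z]+", text) if tok]


def token_filter(tokens, mapping):
    """Runs after the tokenizer, on each token separately."""
    return ["".join(mapping.get(ch, ch) for ch in tok) for tok in tokens]


mapping = {"-": ""}  # hypothetical rule: delete hyphens

# Char filter first: the hyphen is gone before the tokenizer can split on it.
char_first = tokenize(char_filter("wi-fi router", mapping))

# Token filter after: the tokenizer still sees the hyphen and splits there.
token_after = token_filter(tokenize("wi-fi router"), mapping)

print(char_first)
print(token_after)
```

The same mapping yields different tokens depending on which side of the tokenizer it runs on.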



: 
: ~ David Smiley
: 

-Hoss
