Re: Should ASCIIFoldingFilter be deprecated?
On Mon, Feb 7, 2011 at 10:51 PM, Steven A Rowe sar...@syr.edu wrote:

> I haven't done any benchmarking, but I'm pretty sure that ASCIIFoldingFilter can achieve a significantly higher throughput rate than MappingCharFilter, and given that, it probably makes sense to keep both, to allow people to make the choice about the tradeoff between the flexibility provided by the human-readable (and editable) mapping file and the speed provided by ASCIIFoldingFilter.

I agree... have you seen http://bugs.icu-project.org/trac/ticket/7743 ? Hopefully something along those lines would allow us to support the flexibility in a factory or whatever (even better as described, when you just want a small tweak) but still with good performance.

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
RE: Should ASCIIFoldingFilter be deprecated?
Chris Hostetter-3 wrote:

> CharFilters and TokenFilters have different purposes though...
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#When_To_use_a_CharFilter_vs_a_TokenFilter
> (ie: If you use MappingCharFilter, you can't then tokenize on some of the characters you filtered away)

Right, but it’s hard to imagine wanting to tokenize on an accent character or some other modification specified in these particular mapping files.

Steven A Rowe wrote:

> AFAIK, ISOLatin1AccentFilter was deprecated because ASCIIFoldingFilter provides a superset of its mappings.

*If* that is the case then this file should also be removed: solr/example/solr/conf/mapping-ISOLatin1Accent.txt

Steven A Rowe wrote:

> I haven't done any benchmarking, but I'm pretty sure that ASCIIFoldingFilter can achieve a significantly higher throughput rate than MappingCharFilter, and given that, it probably makes sense to keep both, to allow people to make the choice about the tradeoff between the flexibility provided by the human-readable (and editable) mapping file and the speed provided by ASCIIFoldingFilter.

I'm skeptical that the difference, whatever it is, is relevant in the scheme of things. The cost of keeping it is confusion for users and more code to maintain.

~ David Smiley

Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book
--
View this message in context: http://lucene.472066.n3.nabble.com/Should-ASCIIFoldingFilter-be-deprecated-tp2448919p2451504.html
Sent from the Solr - Dev mailing list archive at Nabble.com.
Re: Should ASCIIFoldingFilter be deprecated?
On Tue, Feb 8, 2011 at 9:12 AM, David Smiley (@MITRE.org) dsmi...@mitre.org wrote:

> I'm skeptical that the difference, whatever it is, is relevant in the scheme of things. The cost of keeping it is confusion for users and more code to maintain.

it's pretty significant. charfilters are not reusable, and they box every character and look it up in a hashmap (i made a patch to fix the reusability, but no one has commented): https://issues.apache.org/jira/browse/LUCENE-2788

asciifoldingfilter does a huge switch (which still isn't optimal), but it's way way faster than mappingcharfilter, especially since it's a no-op for chars < 0x7F.

icufoldingfilter precompiles a recursively decomposed trie, so its lookup is a unicode folded trie (icu-project.org/docs/papers/foldedtrie_iuc21.ppt). I think it's a tad slower than asciifoldingfilter, but it also incorporates case folding and unicode normalization: neither asciifoldingfilter nor mappingcharfilter will properly fold http://www.geonames.org/search.html?q=Ab%C5%AB+Z%CC%A7aby&country=, because there is no composed form for Z + combining cedilla, but icufoldingfilter will.
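The point about Z + combining cedilla can be checked with the JDK alone. The following is a minimal sketch, assuming only `java.text.Normalizer`; the class and method names are made up for illustration, and it is not how ICUFoldingFilter works internally (the message says that filter uses a precompiled trie). It shows that NFC composes u + combining macron into a single character but leaves Z + combining cedilla as two characters, so a table keyed on single precomposed characters never sees a "Z with cedilla" entry to fold.

```java
import java.text.Normalizer;

// Hypothetical demo class; not part of Lucene, Solr, or ICU.
class CedillaDemo {
    // Canonical composition (NFC).
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // Decompose (NFD), then drop all combining marks (Unicode category M).
    static String stripMarks(String s) {
        String nfd = Normalizer.normalize(s, Normalizer.Form.NFD);
        return nfd.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // u + COMBINING MACRON (U+0304) has a precomposed form, U+016B.
        System.out.println(nfc("u\u0304").length());   // prints 1
        // Z + COMBINING CEDILLA (U+0327) has no precomposed form.
        System.out.println(nfc("Z\u0327").length());   // prints 2
        // Decomposing first handles both cases uniformly.
        System.out.println(stripMarks("Ab\u016B Z\u0327aby")); // prints Abu Zaby
    }
}
```

Stripping every combining mark is a much blunter operation than the folding discussed in the thread; it is used here only to make the composed versus decomposed distinction visible.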
Re: Should ASCIIFoldingFilter be deprecated?
Robert Muir wrote:

> On Tue, Feb 8, 2011 at 9:12 AM, David Smiley (@MITRE.org) dsmi...@mitre.org wrote:
>
> > I'm skeptical that the difference, whatever it is, is relevant in the scheme of things. The cost of keeping it is confusion for users and more code to maintain.
>
> it's pretty significant. charfilters are not reusable, and they box every character and look it up in a hashmap (i made a patch to fix the reusability, but no one has commented): https://issues.apache.org/jira/browse/LUCENE-2788
>
> asciifoldingfilter does a huge switch (which still isn't optimal), but it's way way faster than mappingcharfilter, especially since it's a no-op for chars < 0x7F.

Well then I see a path forward to speed up MappingCharFilter substantially. There's your LUCENE-2788, and then you could easily add the same no-op optimization for any char below the smallest char value in the HashMap.
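The two lookup strategies compared above can be sketched side by side. This is an illustrative sketch only, not the Lucene source: the class name, the handful of mappings, and the `MIN_KEY` field are assumptions made for the example. It shows a switch with an ASCII fast path next to a `HashMap<Character, String>` lookup (which boxes each `char`), with the proposed "skip anything below the smallest key" check added to the map version.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; the real filters cover far more characters.
class FoldingSketch {
    // Switch-based folding with a fast path for the ASCII range.
    static char foldSwitch(char c) {
        if (c < 0x80) return c; // ASCII: nothing to fold
        switch (c) {
            case '\u00E0': case '\u00E1': case '\u00E2': return 'a';
            case '\u00E8': case '\u00E9': case '\u00EA': return 'e';
            case '\u00FC': return 'u';
            default: return c;
        }
    }

    // Map-based folding: every lookup boxes the char into a Character.
    static final Map<Character, String> MAPPING = new HashMap<>();
    static final char MIN_KEY; // smallest mapped char (assumed optimization)
    static {
        MAPPING.put('\u00E0', "a");
        MAPPING.put('\u00E9', "e");
        MAPPING.put('\u00FC', "u");
        MIN_KEY = Collections.min(MAPPING.keySet());
    }

    static String foldMap(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            if (c < MIN_KEY) {      // proposed no-op path: cannot be a key
                out.append(c);
                continue;
            }
            String repl = MAPPING.get(c); // boxes c, then hashes
            if (repl != null) out.append(repl); else out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(foldSwitch('\u00E9'));
        System.out.println(foldMap("caf\u00E9"));
    }
}
```

The sketch keys the map on single characters for simplicity. A mapping file can map multi-character sequences, in which case the check would have to apply to the first character of each source sequence.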
Re: Should ASCIIFoldingFilter be deprecated?
On Tue, Feb 8, 2011 at 10:05 AM, David Smiley (@MITRE.org) dsmi...@mitre.org wrote:

> Well then I see a path forward to speed up MappingCharFilter substantially. There's your LUCENE-2788, and then you could easily add the same no-op optimization for any char below the smallest char value in the HashMap.

only for the smallest starter, and mappingcharfilter still has to maintain an array of any offset changes (this is now binary searched) for correctOffset.
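The offset bookkeeping mentioned above can be illustrated roughly as follows. This is a sketch under stated assumptions, not the Lucene implementation: the class `OffsetCorrector` and its methods `add` and `correct` are hypothetical names. The idea is that whenever a mapping changes the text length, the filter records the output offset at which the change takes effect together with the cumulative difference, and translating an output offset back to an input offset is a binary search for the last recorded entry at or before it.

```java
import java.util.Arrays;

// Hypothetical helper; names and layout are assumptions for illustration.
class OffsetCorrector {
    private int[] offsets = new int[8]; // output offsets, strictly increasing
    private int[] diffs = new int[8];   // cumulative (input - output) from that offset on
    private int size = 0;

    // Record that from output offset `off` onward, input = output + cumulativeDiff.
    public void add(int off, int cumulativeDiff) {
        if (size == offsets.length) {
            offsets = Arrays.copyOf(offsets, size * 2);
            diffs = Arrays.copyOf(diffs, size * 2);
        }
        offsets[size] = off;
        diffs[size] = cumulativeDiff;
        size++;
    }

    // Map an offset in the filtered text back to the original text.
    public int correct(int off) {
        int lo = 0, hi = size - 1, found = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (offsets[mid] <= off) {
                found = mid;   // candidate: last entry at or before off
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return found < 0 ? off : off + diffs[found];
    }
}
```

For example, if a two-character sequence at input offsets 0 and 1 is mapped to a single character, every output offset from 1 onward corresponds to an input offset one higher, which would be recorded as `add(1, 1)`.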
RE: Should ASCIIFoldingFilter be deprecated?
AFAIK, ISOLatin1AccentFilter was deprecated because ASCIIFoldingFilter provides a superset of its mappings.

I haven't done any benchmarking, but I'm pretty sure that ASCIIFoldingFilter can achieve a significantly higher throughput rate than MappingCharFilter, and given that, it probably makes sense to keep both, to allow people to make the choice about the tradeoff between the flexibility provided by the human-readable (and editable) mapping file and the speed provided by ASCIIFoldingFilter.

Steve

-----Original Message-----
From: David Smiley (@MITRE.org) [mailto:dsmi...@mitre.org]
Sent: Monday, February 07, 2011 10:34 PM
To: solr-...@lucene.apache.org
Subject: Should ASCIIFoldingFilter be deprecated?

ISOLatin1AccentFilter is deprecated, presumably because you can (and should) use MappingCharFilter configured with mapping-ISOLatin1Accent.txt. By that same reasoning, shouldn't ASCIIFoldingFilter be deprecated in favor of using mapping-FoldToASCII.txt?

~ David Smiley
Re: Should ASCIIFoldingFilter be deprecated?
: ISOLatin1AccentFilter is deprecated, presumably because you can (and should)
: use MappingCharFilter configured with mapping-ISOLatin1Accent.txt. By that
: same reasoning, shouldn't ASCIIFoldingFilter be deprecated in favor of using
: mapping-FoldToASCII.txt ?

CharFilters and TokenFilters have different purposes though...

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#When_To_use_a_CharFilter_vs_a_TokenFilter

(ie: If you use MappingCharFilter, you can't then tokenize on some of the characters you filtered away)

-Hoss
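The ordering difference described above can be shown without any Lucene classes. The following is a plain-Java sketch under stated assumptions: the tokenizer (split on anything that is not a letter or digit) and the one-entry mapping (delete the middle dot, U+00B7) are invented for illustration and do not come from the shipped mapping files. Applying the mapping to the raw text before tokenizing removes a character the tokenizer would otherwise have split on, while applying it per token afterwards does not.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo of char-level mapping before vs. after tokenization.
class FilterOrderDemo {
    // Assumed mapping: delete the middle dot (U+00B7).
    static String mapChars(String s) {
        return s.replace("\u00B7", "");
    }

    // Assumed tokenizer: split on runs of non-letter, non-digit characters.
    static List<String> tokenize(String s) {
        List<String> tokens = new ArrayList<>();
        for (String t : s.split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // CharFilter-style: map the raw text, then tokenize.
    static List<String> charFilterThenTokenize(String s) {
        return tokenize(mapChars(s));
    }

    // TokenFilter-style: tokenize, then map each token.
    static List<String> tokenizeThenTokenFilter(String s) {
        List<String> out = new ArrayList<>();
        for (String t : tokenize(s)) out.add(mapChars(t));
        return out;
    }

    public static void main(String[] args) {
        String text = "col\u00B7lega";
        System.out.println(charFilterThenTokenize(text));  // prints [collega]
        System.out.println(tokenizeThenTokenFilter(text)); // prints [col, lega]
    }
}
```

Whether this matters for accent folding specifically is the question raised elsewhere in the thread; the sketch only illustrates the general mechanism.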