[ https://issues.apache.org/jira/browse/LUCENE-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516762 ]
Stanislaw Osinski commented on LUCENE-966:
------------------------------------------

Thanks for spotting the differences, I'll add them to the unit tests and will correct the tokenizer accordingly.

One doubt I have is about the filename-like tokens, e.g.:

OLD: (2004.jpg,34461,34469,type=<HOST>)
NEW: (2004.jpg,34461,34469,type=<NUM>)

To be honest, both variants seem "almost" correct. If you try 2007.org -- this is a correct domain name (and has a funny website on it :), so, given the fact that we don't check for typical suffixes, such as ".com", <HOST> doesn't seem wrong. On the other hand, 2004.jpg may well have been some sort of numerical code or a product number, so <NUM> is not totally irrelevant either.

For the JFlex-based tokenizer, I put the <NUM> rule matching first, as it gives some nice performance benefits. We can put HOST first, and then we'll get compliance with the old version. Another option we might consider is:

* adding a new token type for file names (get a list of common extensions or even assume that an extension is simply three alphanumerical characters)
* checking for common domain names in hosts (something along the lines of: "mil" | "info" | "gov" | "edu" | "biz" | "com" | "org" | "net" | "arpa" | {LETTER}{2})

I'm not sure how this will affect the performance of the tokenizer, but my rough guess is that if we don't come up with very complex/backtracking-prone rules there should not be too much of a difference. On the other hand, if 100% compatibility with the old tokenizer is a priority, adding new token types is not a good idea, I guess.

Finally, when it comes to the initialization time of the new tokenizer -- according to the JFlex documentation, some time is required to unpack the transition tables. But the unpacking takes place during the initialization of static fields, so once the class is loaded the overhead should be negligible.
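To make the rule-ordering point concrete, here is a minimal sketch in plain Java, not the actual JFlex grammar from the patch. A JFlex scanner takes the longest match and, when two rules match the same length, the one listed first in the specification; the sketch imitates only that tie-break by trying whole-token patterns in order. The class name, the three regular expressions and the "<OTHER>" fallback are assumptions made up for illustration; only the suffix list is taken from the comment above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class RuleOrderSketch {
    // Hypothetical, simplified stand-ins for the grammar rules (assumptions, not the patch).
    // HOST: alphanumeric segments separated by dots, with no check on the suffix.
    static final Pattern HOST = Pattern.compile("[A-Za-z0-9]+(\\.[A-Za-z0-9]+)+");
    // NUM: punctuation-separated alphanumeric segments containing at least one digit.
    static final Pattern NUM =
        Pattern.compile("(?=.*[0-9])[A-Za-z0-9]+([._,/-][A-Za-z0-9]+)+");
    // HOST restricted to the suffixes suggested in the comment, or any two letters.
    static final Pattern HOST_TLD = Pattern.compile(
        "[A-Za-z0-9]+(\\.[A-Za-z0-9]+)*"
        + "\\.(mil|info|gov|edu|biz|com|org|net|arpa|[A-Za-z]{2})");

    static final Map<String, Pattern> NUM_FIRST = rules("<NUM>", NUM, "<HOST>", HOST);
    static final Map<String, Pattern> HOST_FIRST = rules("<HOST>", HOST, "<NUM>", NUM);
    static final Map<String, Pattern> TLD_HOST_FIRST = rules("<HOST>", HOST_TLD, "<NUM>", NUM);

    // LinkedHashMap keeps insertion order, which plays the role of rule order here.
    static Map<String, Pattern> rules(String t1, Pattern p1, String t2, Pattern p2) {
        Map<String, Pattern> ordered = new LinkedHashMap<>();
        ordered.put(t1, p1);
        ordered.put(t2, p2);
        return ordered;
    }

    // Returns the type of the first rule that matches the whole token.
    static String classify(String token, Map<String, Pattern> orderedRules) {
        for (Map.Entry<String, Pattern> rule : orderedRules.entrySet()) {
            if (rule.getValue().matcher(token).matches()) {
                return rule.getKey();
            }
        }
        return "<OTHER>"; // placeholder type for tokens no sketch rule covers
    }

    public static void main(String[] args) {
        System.out.println(classify("2004.jpg", NUM_FIRST));      // <NUM>
        System.out.println(classify("2004.jpg", HOST_FIRST));     // <HOST>
        System.out.println(classify("2004.jpg", TLD_HOST_FIRST)); // <NUM>
        System.out.println(classify("2007.org", TLD_HOST_FIRST)); // <HOST>
    }
}
```

With the unrestricted HOST pattern, swapping the order is all it takes to flip 2004.jpg between <NUM> and <HOST>. With the suffix check, HOST can go first (matching the old output for 2007.org) while 2004.jpg still falls through to <NUM>, which would differ from the old tokenizer's output for that token.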
> A faster JFlex-based replacement for StandardAnalyzer
> -----------------------------------------------------
>
>          Key: LUCENE-966
>          URL: https://issues.apache.org/jira/browse/LUCENE-966
>      Project: Lucene - Java
>   Issue Type: Improvement
>   Components: Analysis
>     Reporter: Stanislaw Osinski
>      Fix For: 2.3
>
>  Attachments: AnalyzerBenchmark.java, jflex-analyzer-patch.txt,
>               jflex-analyzer-r560135-patch.txt, jflex-analyzer-r561292-patch.txt
>
>
> JFlex (http://www.jflex.de/) can be used to generate a faster (up to several
> times) replacement for StandardAnalyzer. Will add a patch and a simple
> benchmark code in a while.