[jira] Commented: (LUCENE-966) A faster JFlex-based replacement for StandardAnalyzer

Stanislaw Osinski (JIRA) Tue, 31 Jul 2007 11:15:22 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516762
 ]


Stanislaw Osinski commented on LUCENE-966:
------------------------------------------

Thanks for spotting the differences. I'll add them to the unit tests and 
correct the tokenizer accordingly.

One doubt I have is about the filename-like tokens, e.g.:

      OLD: (2004.jpg,34461,34469,type=<HOST>)
      NEW: (2004.jpg,34461,34469,type=<NUM>) 

To be honest, both variants seem "almost" correct. Take 2007.org: it is a valid 
domain name (and has a funny website on it :), so, given that we don't check 
for typical suffixes such as ".com", <HOST> doesn't seem wrong. On the other 
hand, 2004.jpg may well be some sort of numerical code or a product number, so 
<NUM> is not unreasonable either.

For the JFlex-based tokenizer, I put the <NUM> rule first, as it gives some 
nice performance benefits. We can put the <HOST> rule first instead, and then 
the output will match the old version for tokens like these.
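
To illustrate why the rule order decides the type here: when two rules match 
the same text, a JFlex scanner picks the rule listed first. Below is a minimal 
sketch of that tie-breaking using plain java.util.regex. The class name, the 
two patterns and the <ALPHANUM> fallback are simplified assumptions made for 
this illustration only; they are not the rules from the patch.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

class RuleOrderSketch {
    // Simplified stand-ins for the grammar rules (assumptions, not the real definitions).
    static final Pattern HOST = Pattern.compile("[A-Za-z0-9]+(\\.[A-Za-z0-9]+)+");
    static final Pattern NUM =
        Pattern.compile("[0-9]+\\.[A-Za-z0-9]+|[A-Za-z0-9]+\\.[0-9]+");

    // Builds the rule list in one of the two orders discussed above.
    static Map<String, Pattern> rules(boolean numFirst) {
        Map<String, Pattern> rules = new LinkedHashMap<>();
        if (numFirst) {
            rules.put("<NUM>", NUM);
            rules.put("<HOST>", HOST);
        } else {
            rules.put("<HOST>", HOST);
            rules.put("<NUM>", NUM);
        }
        return rules;
    }

    // Returns the type of the first rule that matches the whole token.
    static String classify(String token, Map<String, Pattern> rules) {
        for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
            if (rule.getValue().matcher(token).matches()) {
                return rule.getKey();
            }
        }
        return "<ALPHANUM>"; // hypothetical fallback for this sketch
    }

    public static void main(String[] args) {
        System.out.println(classify("2004.jpg", rules(true)));
        System.out.println(classify("2004.jpg", rules(false)));
    }
}
```

With these patterns both rules match 2004.jpg in full, so only the order 
determines the reported type; a token such as example.org matches the HOST 
stand-in alone and is unaffected by the order.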

Another option we might consider is:

* adding a new token type for file names (get a list of common extensions or 
even assume that an extension is simply three alphanumeric characters) 
* checking for common top-level domains in hosts (something along the lines of: 
"mil" | "info" | "gov" | "edu" | "biz" | "com" | "org" | "net" |  "arpa" | 
{LETTER}{2})
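
As a rough sketch of how the two ideas above could combine, here is a small 
suffix-based classifier in plain Java. Everything in it is an assumption made 
for illustration: the class name, the <FILE> type name, the <ALPHANUM> 
fallback, restricting {LETTER} to ASCII letters, and checking the domain list 
before the three-character extension rule (needed because "com" or "org" would 
otherwise also look like an extension).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class SuffixTypeSketch {
    // Suffix list taken from the proposal above.
    static final Set<String> KNOWN_TLDS = new HashSet<>(Arrays.asList(
        "mil", "info", "gov", "edu", "biz", "com", "org", "net", "arpa"));

    static String classify(String token) {
        int dot = token.lastIndexOf('.');
        if (dot <= 0 || dot == token.length() - 1) {
            return "<ALPHANUM>"; // no usable suffix; hypothetical fallback
        }
        String suffix = token.substring(dot + 1).toLowerCase(Locale.ROOT);
        // Known domain suffix, or any two letters (country-code style).
        if (KNOWN_TLDS.contains(suffix) || suffix.matches("[a-z]{2}")) {
            return "<HOST>";
        }
        // Otherwise treat any three alphanumeric characters as a file extension.
        if (suffix.matches("[a-z0-9]{3}")) {
            return "<FILE>"; // hypothetical new token type
        }
        return "<ALPHANUM>";
    }

    public static void main(String[] args) {
        System.out.println(classify("2004.jpg"));
        System.out.println(classify("2007.org"));
    }
}
```

Under these assumptions 2004.jpg comes out as a file name and 2007.org as a 
host, which resolves the ambiguous case above without relying on rule order.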

I'm not sure how this will affect the performance of the tokenizer, but my 
rough guess is that as long as we avoid very complex or backtracking-prone 
rules, there should not be much of a difference. On the other hand, if 100% 
compatibility with the old tokenizer is a priority, adding new token types is 
probably not a good idea.

Finally, when it comes to the initialization time of the new tokenizer -- 
according to the JFlex documentation, some time is required to unpack the 
transition tables. But the unpacking takes place during the initialization of 
static fields, so once the class is loaded the overhead should be negligible.
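
The effect can be sketched in a few lines of plain Java. This is not the code 
JFlex generates; the class name, the run-length packing format and the counter 
are assumptions chosen only to show that a table unpacked in a static 
initializer is built once per class load, no matter how many instances are 
created afterwards.

```java
class PackedTableSketch {
    // Counts calls to unpack(); declared before TABLE so it is initialized first.
    static int unpackCount = 0;

    // Hypothetical packed form: (count, value) pairs stored as chars.
    private static final String PACKED = "\u0003\u0007\u0002\u0001\u0004\u0000";

    // Unpacked once, when the class is initialized.
    static final int[] TABLE = unpack(PACKED);

    private static int[] unpack(String packed) {
        unpackCount++;
        int size = 0;
        for (int i = 0; i < packed.length(); i += 2) {
            size += packed.charAt(i);
        }
        int[] result = new int[size];
        int j = 0;
        for (int i = 0; i < packed.length(); i += 2) {
            int count = packed.charAt(i);
            int value = packed.charAt(i + 1);
            while (count-- > 0) {
                result[j++] = value;
            }
        }
        return result;
    }

    final int state;

    // Creating an instance only reads the shared table.
    PackedTableSketch(int index) {
        this.state = TABLE[index];
    }
}
```

Creating many PackedTableSketch instances leaves unpackCount at 1, which is 
the behaviour described above: the cost is paid at class load, not per 
tokenizer instance.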

> A faster JFlex-based replacement for StandardAnalyzer
> -----------------------------------------------------
>
>                 Key: LUCENE-966
>                 URL: https://issues.apache.org/jira/browse/LUCENE-966
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Stanislaw Osinski
>             Fix For: 2.3
>
>         Attachments: AnalyzerBenchmark.java, jflex-analyzer-patch.txt, 
> jflex-analyzer-r560135-patch.txt, jflex-analyzer-r561292-patch.txt
>
>
> JFlex (http://www.jflex.de/) can be used to generate a faster (up to several 
> times) replacement for StandardAnalyzer. Will add a patch and a simple 
> benchmark code in a while.


