On Sep 22, 2005, at 4:36 AM, Endre Stølsvik wrote:


| The StandardTokenizer is the most sophisticated one built into Lucene. You
| can see the types of tokens it emits by looking at the javadoc here:
| <http://lucene.apache.org/java/docs/api/org/apache/lucene/analysis/standard/StandardTokenizer.html>
|
| It recognizes e-mail addresses, interior apostrophe words (like o'clock),
| hostnames/IP addresses (like lucene.apache.org), acronyms, and CJK characters.

It would be great if it also split "UpperCamelCase" and
"lowerCamelCase" words into their component words, while also keeping the
whole thing as one long word. A run of several uppercase letters followed
by lowercase would most probably be best handled like HTTPUnit -> http unit.

This is of course due to, for my part, Java language influence. But I
believe it is customary in many programming languages to use lowerCamelCase
for e.g. variables. Filenames too.

I strongly disagree. It would not be good at all for StandardTokenizer to do this. It would be easy to write a CamelCaseSplitFilter that could be used in conjunction with any tokenizer.
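
As a rough illustration of the splitting logic such a filter might wrap, here is a minimal sketch in plain Java. The class and method names (CamelCaseSplitter, split) are hypothetical and not part of Lucene; a real filter would put this logic inside a TokenFilter subclass. Lowercasing the parts and appending the whole word as a final token are assumptions taken from the "HTTPUnit -> http unit" suggestion above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not a Lucene class: splits one token on camel-case
// boundaries. Wiring it into a TokenFilter is left out of this sketch.
final class CamelCaseSplitter {

    // Returns the lowercased parts of the token; if the token was actually
    // split, the lowercased whole word is appended as the last element.
    static List<String> split(String token) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        for (int i = 1; i < token.length(); i++) {
            char prev = token.charAt(i - 1);
            char cur = token.charAt(i);
            // Boundary such as "r|C" in "lowerCamel".
            boolean lowerToUpper =
                    Character.isLowerCase(prev) && Character.isUpperCase(cur);
            // Boundary such as "P|Un" in "HTTPUnit": the last capital of an
            // uppercase run starts the next word when a lowercase letter follows.
            boolean acronymEnd =
                    Character.isUpperCase(prev) && Character.isUpperCase(cur)
                    && i + 1 < token.length()
                    && Character.isLowerCase(token.charAt(i + 1));
            if (lowerToUpper || acronymEnd) {
                parts.add(token.substring(start, i).toLowerCase(Locale.ROOT));
                start = i;
            }
        }
        parts.add(token.substring(start).toLowerCase(Locale.ROOT));
        if (parts.size() > 1) {
            // Assumption: also keep the unsplit word so either form can match.
            parts.add(token.toLowerCase(Locale.ROOT));
        }
        return parts;
    }
}
```

With this sketch, CamelCaseSplitter.split("HTTPUnit") yields http, unit, httpunit, and a token with no camel-case boundary comes back as a single lowercased element. Keeping the logic in its own small class means it can sit behind any tokenizer rather than being baked into one.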

It is important to design filters and tokenizers in the most single-purpose way to allow them to be combined for various scenarios.

If such a filter is contributed, I'd happily add it to contrib/analyzers - seems useful to have around.

    Erik

