Re: Splitting of words

Erik Hatcher Thu, 22 Sep 2005 05:50:52 -0700


On Sep 22, 2005, at 4:36 AM, Endre Stølsvik wrote:

| The StandardTokenizer is the most sophisticated one built into Lucene. You
| can see the types of tokens it emits by looking at the javadoc here:
| <http://lucene.apache.org/java/docs/api/org/apache/lucene/analysis/standard/StandardTokenizer.html>
|
| It recognizes e-mail addresses, interior apostrophe words (like o'clock),
| hostnames/IP addresses (like lucene.apache.org), acronyms, and CJK characters.
It would be great if it also separated "UpperCamelCase" and
"lowerCamelCase" words into both the different words, and one long word.
Several uppercase, followed by lowercase, would most probably be best done
like HTTPUnit -> http unit.
This is of course due to, for my part, Java language influence. But I
believe it is custom in many programming languages to use lowerCamelCase
for e.g. variables. Filenames too.

I strongly disagree. It would not be good at all for StandardTokenizer to do this. It would be easy to write a CamelCaseSplitFilter that could be used in conjunction with any tokenizer.
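
As a rough illustration of the splitting rule such a filter might apply, here is a minimal sketch in plain Java with no Lucene dependency. The class name CamelCaseSplitter and the method split are hypothetical names chosen for this example; this is not an existing Lucene class, and wiring the logic into an actual TokenFilter is left out. It assumes two boundary rules: split where a lowercase letter is followed by an uppercase one, and split before the last letter of an uppercase run when that run is followed by a lowercase letter (so HTTPUnit becomes http, unit).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene: splits one token at camel-case
// boundaries and lowercases each part.
public class CamelCaseSplitter {

    public static List<String> split(String token) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        for (int i = 1; i < token.length(); i++) {
            char prev = token.charAt(i - 1);
            char cur = token.charAt(i);
            // Boundary such as "lower|Camel": lowercase followed by uppercase.
            boolean lowerToUpper =
                    Character.isLowerCase(prev) && Character.isUpperCase(cur);
            // Boundary such as "HTTP|Unit": an uppercase run ends where the
            // current uppercase letter is followed by a lowercase letter.
            boolean acronymEnd =
                    Character.isUpperCase(prev) && Character.isUpperCase(cur)
                    && i + 1 < token.length()
                    && Character.isLowerCase(token.charAt(i + 1));
            if (lowerToUpper || acronymEnd) {
                parts.add(token.substring(start, i).toLowerCase(Locale.ROOT));
                start = i;
            }
        }
        // Whatever remains after the last boundary is the final part.
        parts.add(token.substring(start).toLowerCase(Locale.ROOT));
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("HTTPUnit"));       // [http, unit]
        System.out.println(split("lowerCamelCase")); // [lower, camel, case]
    }
}
```

A filter built on this could emit the parts alongside the original token, which would cover the "one long word" case from the request while leaving the tokenizer itself untouched.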

It is important to design filters and tokenizers in the most single-purpose way to allow them to be combined for various scenarios.

If such a filter is contributed, I'd happily add it to contrib/analyzers - seems useful to have around.


    Erik


