On Sep 27, 2005, at 6:29 AM, Endre Stølsvik wrote:
On Thu, 22 Sep 2005, Erik Hatcher wrote:
|
| On Sep 22, 2005, at 4:36 AM, Endre Stølsvik wrote:
|
| >
| > | The StandardTokenizer is the most sophisticated one built
into Lucene.
| > You
| > | can see the types of tokens it emits by looking at the
javadoc here:
| > |
| > <http://lucene.apache.org/java/docs/api/org/apache/lucene/
analysis/standard/StandardTokenizer.html>
| > |
| > | It recognizes e-mail addresses, interior apostrophe words
(like o'clock),
| > | hostnames/IP addresses (like lucene.apache.org), acronyms,
and CJK
| > characters.
| >
| > It would be great if it also separated "UpperCamelCase" and
| > "lowerCamelCase" words into both the different words, and one
long word.
| > Several uppercase, followed by lowercase, would most probably
be best done
| > like HTTPUnit -> http unit.
| > This is of course due to, for my part, java language
influence. But I
| > believe it is custom in many programming languages to use
lowerCamelCase
| > for e.g. variables. Filenames too.
|
| I strongly disagree. It would not be good at all for
StandardTokenizer to do
| this.
...
|
| It is important to design filters and tokenizers in the most
single-purpose
| way to allow them to be combined for various scenarios.
Okay. Why? Just wondering what the reasoning behind this is? What
is the
logic behind the StandardTokenizer as it stands? (Note: There are
strong
reasons to believe that I'm just not quite up to speed on how this all
fits together..!)
The StandardTokenizer is a general purpose tokenizer designed to
split text not just at whitespace boundaries, but also to keep CJK
characters separate, e-mail addresses as a unit, and to deal with
common things like part numbers where alphabetic and numeric
characters are mixed. It's a good tokenizer to start with and evolve
from there.
The StandardTokenizer does more than certain scenarios demand, and
less than other situations - it's a nice happy medium.
Erik
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]