On Jun 13, 2005, at 10:55 AM, Andy Roberts wrote:
On Monday 13 Jun 2005 13:18, Markus Wiederkehr wrote:
I see, the list of exceptions makes this a lot more complicated
than I
thought... Thanks a lot, Erik!
I expect you'll need to do some pre-processing. Read in your text
into a
buffer, line-by-line. If a given line ends with a hyphen, you can
manipulate
the buffer to merge the hyphenated tokens.
The problem I encountered when indexing "Lucene in Action" was that I
couldn't just blindly concatenate two tokens because the first ends
with a hyphen. Some lines ended with a hyphen because it was a dash,
not a hyphenated word.
I'm sure other more clever implementations could do this better, by
looking up the concatenated word in a dictionary for instance.
Erik
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]