On Jun 13, 2005, at 10:55 AM, Andy Roberts wrote:

On Monday 13 Jun 2005 13:18, Markus Wiederkehr wrote:

I see, the list of exceptions makes this a lot more complicated than I
thought... Thanks a lot, Erik!



I expect you'll need to do some pre-processing. Read in your text into a buffer, line-by-line. If a given line ends with a hyphen, you can manipulate
the buffer to merge the hyphenated tokens.

The problem I encountered when indexing "Lucene in Action" was that I couldn't just blindly concatenate two tokens because the first ends with a hyphen. Some lines ended with a hyphen because it was a dash, not a hyphenated word.

I'm sure other more clever implementations could do this better, by looking up the concatenated word in a dictionary for instance.

    Erik


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to