On Fri, Jan 15, 2010 at 9:30 AM, Grant Ingersoll <[email protected]> wrote:
>
> Yeah, I've even found using Java's BreakIterator (there's one for Sentences
> and it is supposedly Locale aware) plus some simple edge case modifications
> does quite well. I've got an implementation/demo in Taming Text, I think,
> but may also have one laying around somewhere else.
>
> Only tricky thing is you have to buffer the tokens in Lucene, which is
> slightly annoying with the new incrementToken API, but not horrible. Then,
> once you find the break, just output a special token. Maybe also
> consider increasing the position increment.
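If I understand you correctly, that buffering would look roughly like the
filter below. This is only an untested sketch against the 3.0-style attribute
API; the class name, the "_SENT_BREAK_" marker, and isSentenceBreak() are all
placeholders for whatever boundary check actually gets plugged in:

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public final class SentenceBreakFilter extends TokenFilter {
  private final TermAttribute termAtt = addAttribute(TermAttribute.class);
  private final PositionIncrementAttribute posIncrAtt =
      addAttribute(PositionIncrementAttribute.class);
  private State buffered; // real token held back while the marker is emitted

  public SentenceBreakFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (buffered != null) {
      // flush the token we held back on the previous call
      restoreState(buffered);
      buffered = null;
      return true;
    }
    if (!input.incrementToken()) {
      return false;
    }
    if (isSentenceBreak()) {
      // remember the real token, emit the marker first with a bumped
      // position increment, then restore the real token next time around
      buffered = captureState();
      termAtt.setTermBuffer("_SENT_BREAK_");
      posIncrAtt.setPositionIncrement(posIncrAtt.getPositionIncrement() + 1);
    }
    return true;
  }

  private boolean isSentenceBreak() {
    // placeholder: compare token offsets against BreakIterator boundaries, etc.
    return false;
  }
}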
In this case it sounds like it might be useful to do sentence chunking
prior to even getting the Analyzer involved. The BreakIterator returns
offsets which can be used in a substring call to create a StringReader
which then gets passed to the Analyzer. substring operates on the
char[] of the original string, so the only overhead would be the
allocation of the StringReaders.
E.g., something like

  breakIterator.setText(input);
  int start = breakIterator.first();
  for (int end = breakIterator.next(); end != BreakIterator.DONE;
       start = end, end = breakIterator.next()) {
    StringReader r = new StringReader(input.substring(start, end));
    TokenStream ts = a.tokenStream(null, r);
    // [..generate and collect ngrams here..]
    r.close();
  }
On second thought, however, it would probably be more convenient to package
the sentence boundary detection into the Analyzer itself, so that the
behavior can be easily changed by the end user. This would also cover the
way in which I use the ShingleFilter to generate n-grams, which is currently
external to the Analyzer that gets plugged in.
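Roughly what I have in mind, again as an untested sketch (assuming Lucene 3.0;
the class name is made up, and SentenceBreakFilter stands in for whatever
boundary-detecting filter comes out of the above):

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public class SentenceShingleAnalyzer extends Analyzer {
  private final int maxShingleSize;

  public SentenceShingleAnalyzer(int maxShingleSize) {
    this.maxShingleSize = maxShingleSize;
  }

  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream ts = new StandardTokenizer(Version.LUCENE_30, reader);
    // sentence boundary detection would slot in here, before the
    // shingles are generated
    ts = new SentenceBreakFilter(ts);          // placeholder filter
    ts = new ShingleFilter(ts, maxShingleSize);
    return ts;
  }
}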
Any idea what sort of edge cases I need to look for when using BreakIterator?
At this point, I'm thinking it is probably worth implementing something
self-contained for this relatively straightforward need rather than
pulling in something like OpenNLP or Gate.
Drew