I'm looking at SentenceDetector from ctakes-core.  It has a surprising
idea of what counts as a "sentence".  Before I delve any deeper,
I wanted to ask whether there is a reason for what it's doing, and in
particular whether anything in the clinical pipeline depends specifically
on its current behavior.

The main problem I have is that it's splitting on characters like colon and
semicolon, which aren't usually considered sentence separators, with the
result that it often ends up tagging phrases rather than whole sentences.
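
To illustrate the effect, here is a toy sketch in plain Java.  It is not
the cTAKES or OpenNLP code: the class EosDemo, its split method, and the
sample note are all made up for this example, and it breaks at every
candidate character unconditionally, whereas the real detector asks the
model at each candidate.  So it exaggerates, but it shows how the extra
separators turn one sentence into phrases.

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for a sentence splitter; NOT the cTAKES implementation.
final class EosDemo {
    // Breaks after every character of text that appears in eosChars.
    public static List<String> split(String text, String eosChars) {
        List<String> out = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            if (eosChars.indexOf(text.charAt(i)) >= 0) {
                String piece = text.substring(start, i + 1).trim();
                if (!piece.isEmpty()) {
                    out.add(piece);
                }
                start = i + 1;
            }
        }
        // Keep any trailing text that has no terminator.
        String tail = text.substring(start).trim();
        if (!tail.isEmpty()) {
            out.add(tail);
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical clinical-style text, invented for this example.
        String note = "Medications: aspirin; metoprolol. Follow up in two weeks.";
        System.out.println(split(note, ".?!"));   // conventional terminators
        System.out.println(split(note, ".?!:;")); // with colon and semicolon added
    }
}
```

With the conventional terminators the note comes out as two sentences;
with colon and semicolon added it comes out as four fragments
("Medications:", "aspirin;", "metoprolol.", "Follow up in two weeks.").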

It's using SentenceDetectorCtakes and EndOfSentenceScannerImpl, which seem
to be derived from equivalents in OpenNLP, but with changes that I can't
track (they date from the original edu.mayo import as far as I can tell).
Other than the additional separator characters, I can't tell whether these
classes are doing anything important that you wouldn't equally get from
OpenNLP's SentenceDetectorME, so I don't know why they're being used.

SentenceDetector is also splitting on newlines after passing the text through
the max entropy sentence model.  I don't see the point in this: if you're
going to split on newlines anyway, then why not do that before passing the
text through the model?  Or just have newline as one of the potential
EOS characters and treat it as a possible break point rather than a definite
one?
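
To make the "possible rather than definite" idea concrete, here is another
toy sketch in plain Java.  Again, none of this is cTAKES or OpenNLP code:
NewlineDemo, hardNewline and softNewline are hypothetical names, and the
"next line starts with an uppercase letter" rule is only a crude stand-in
for whatever the model would decide at a newline candidate.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

// Toy comparison of newline handling; NOT the cTAKES implementation.
final class NewlineDemo {
    // Breaks after index i whenever decide.test(text, i) returns true.
    public static List<String> split(String text, BiPredicate<String, Integer> decide) {
        List<String> out = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            if (decide.test(text, i)) {
                add(out, text.substring(start, i + 1));
                start = i + 1;
            }
        }
        add(out, text.substring(start));
        return out;
    }

    // Trims the piece, collapses internal whitespace, and skips empty pieces.
    private static void add(List<String> out, String piece) {
        String s = piece.trim().replaceAll("\\s+", " ");
        if (!s.isEmpty()) {
            out.add(s);
        }
    }

    // Newline as a definite break: every '.' and every '\n' ends a sentence.
    public static boolean hardNewline(String t, int i) {
        char c = t.charAt(i);
        return c == '.' || c == '\n';
    }

    // Newline as a candidate only. Assumed rule standing in for the model:
    // break at '\n' only if the next character is an uppercase letter.
    public static boolean softNewline(String t, int i) {
        char c = t.charAt(i);
        if (c == '.') {
            return true;
        }
        if (c != '\n') {
            return false;
        }
        return i + 1 < t.length() && Character.isUpperCase(t.charAt(i + 1));
    }

    public static void main(String[] args) {
        // Hypothetical hard-wrapped note with a heading line.
        String note = "HISTORY\nPatient reports chest pain that\n"
                + "radiates to the left arm.\nNo fever.";
        System.out.println(split(note, NewlineDemo::hardNewline));
        System.out.println(split(note, NewlineDemo::softNewline));
    }
}
```

With the definite break, the hard-wrapped sentence is cut in two at "that";
with newline as a candidate, the heading is still separated but the wrapped
sentence stays whole.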

Any insight would be welcome.

Thanks,

Ewan.
