On Wed, Feb 8, 2012 at 5:52 PM, Katrin Tomanek <katrin.toma...@averbis.com> wrote:
> Hi everybody,
>
> I was just evaluating the OpenNLP sentence detector trained on some of our
> data (using the Evaluator class provided with OpenNLP). It did not perform
> very well, and when I checked the misclassified sentences and debugged a
> little bit, I realized that only these EOS (end of sentence) characters are
> currently supported:
>
> '.', '!', '?'
>
> However, in our case we have many other EOS characters (":" being one of
> the most common ones).
>
> As I understand it, the EOS characters are defined in
> DefaultSDContextGenerator.java, which is called from
> SentenceDetectorME.train(...).
>
> If I got it correctly, there is currently no way to configure the EOS
> characters (as a parameter or similar). Right?
>
> Of course, I could write my own train method and do things differently,
> but then I would not be able to use the Evaluator and CrossValidator
> classes, which I find very handy.
>
> Did I misunderstand anything, and is there a way to configure which EOS
> characters should be used? If not, do you think it would be a good thing
> to have, and if so, how can I contribute at this point?

You are absolutely right, we should have this option. William just started a
thread on the dev list to discuss this.

Our current idea to solve it is that you can pass in the name of a factory
class which can put the SentenceDetector together the way you need it. But
now that I think about it, maybe we should define a properties file which can
contain custom configuration for a component. In this file we could have a
property for a custom factory class and maybe a property which contains the
EOS chars for the sentence detector.

Anyway, help is always very welcome. We should decide how to implement this
in the thread on the dev list, and then we can open a few Jira issues to
actually do the work. This way you should be able to contribute easily.

Jörn
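
To make the properties-file idea a bit more concrete, here is a minimal sketch in plain Java. It is only an illustration of the proposal, not existing OpenNLP code: the property keys `sentdetect.eosChars` and `sentdetect.factory` and the class `EosConfigSketch` are made-up names, and the sketch does not touch SentenceDetectorME at all. It just shows how a custom EOS set could be read from such a file, falling back to the three characters mentioned above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

final class EosConfigSketch {

    // Hypothetical property keys; OpenNLP does not define these names.
    static final String EOS_KEY = "sentdetect.eosChars";
    static final String FACTORY_KEY = "sentdetect.factory";

    // The EOS characters named in the thread as the currently supported ones.
    private static final char[] DEFAULT_EOS = {'.', '!', '?'};

    // Returns the configured EOS characters, or the defaults if none are set.
    static char[] eosChars(Properties props) {
        String value = props.getProperty(EOS_KEY);
        if (value == null || value.isEmpty()) {
            return DEFAULT_EOS.clone();
        }
        return value.toCharArray();
    }

    // Returns the configured factory class name, or null if none is set.
    static String factoryClassName(Properties props) {
        return props.getProperty(FACTORY_KEY);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a properties file on disk; adds ':' as an EOS character.
        Properties props = new Properties();
        props.load(new StringReader("sentdetect.eosChars=.!?:\n"));

        System.out.println(new String(eosChars(props)));
        System.out.println(factoryClassName(props));
    }
}
```

Whether the EOS characters end up as a property like this, inside a custom factory, or both is exactly what the dev list thread would need to settle.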