Problems training my own sentence splitter, with dictionary

Riccardo Tasso Tue, 27 Sep 2011 06:38:01 -0700

I'm trying to use OpenNLP to train a sentence splitter model, following the manual, and everything is OK with it.

I've noticed that the train method supports a Dictionary of abbreviations. Abbreviations typically give me problems with my sentence splitter, so I wanted to try this strategy.

1) If I train a sentence splitter model passing a Dictionary I've built, and I serialize it (just as I serialize a simpler model), I have problems loading the model from file:

SentenceModel model = SentenceDetectorME.train(language, sampleStream, true, abbreviations, 5, 100);

modelOut = new BufferedOutputStream(new FileOutputStream("/home/lib/apache-opennlp-1.5.1-incubating/models/it/it-sent.bin"));

model.serialize(modelOut);
[...]

modelIn = new FileInputStream("/home/lib/apache-opennlp-1.5.1-incubating/models/it/it-sent.bin");

final SentenceModel sentenceModel = new SentenceModel(modelIn);

The last instruction (the SentenceModel constructor) throws the following exception:
java.io.IOException: Stream closed
    at java.util.zip.ZipInputStream.ensureOpen(ZipInputStream.java:61)
    at java.util.zip.ZipInputStream.closeEntry(ZipInputStream.java:108)
    at opennlp.tools.util.model.BaseModel.<init>(BaseModel.java:137)
    at opennlp.tools.sentdetect.SentenceModel.<init>(SentenceModel.java:77)
    at SentenceDetector.main(SentenceDetector.java:18)

2) I tried to skip the serialization/de-serialization phase, to go on with my tests:

SentenceModel model = SentenceDetectorME.train(language, sampleStream, true, abbreviations, 5, 100);

String[] sentences = sentenceDetector.sentDetect(document);

However, sentences are also split on abbreviations which I declared in my Dictionary, which isn't what I expected. E.g.: "Sono Mr. Brown" is incorrectly split into "Sono Mr." and "Brown.".
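
To make the expected behaviour concrete, here is a minimal sketch in plain Java. It does not use OpenNLP and says nothing about how SentenceDetectorME handles the Dictionary internally; the class AbbreviationAwareSplitter and its split method are hypothetical names made up for this illustration. It only shows the rule I had in mind: a sentence-ending character does not end a sentence when the token it closes is a listed abbreviation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only: not part of OpenNLP.
public class AbbreviationAwareSplitter {

    // Splits after '.', '!' or '?' when followed by whitespace or end of text,
    // unless the whitespace-delimited token ending there is a known abbreviation.
    public static List<String> split(String text, Set<String> abbreviations) {
        List<String> sentences = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c != '.' && c != '!' && c != '?') {
                continue;
            }
            boolean atEnd = (i + 1 == text.length());
            if (!atEnd && !Character.isWhitespace(text.charAt(i + 1))) {
                continue; // e.g. a dot inside "1.5": not a candidate boundary
            }
            // Walk back to the beginning of the token that ends at position i.
            int tokenStart = i;
            while (tokenStart > start
                    && !Character.isWhitespace(text.charAt(tokenStart - 1))) {
                tokenStart--;
            }
            String token = text.substring(tokenStart, i + 1);
            if (abbreviations.contains(token)) {
                continue; // listed abbreviation: do not split here
            }
            String sentence = text.substring(start, i + 1).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
            start = i + 1;
        }
        // Keep any trailing text that has no final punctuation.
        String rest = text.substring(start).trim();
        if (!rest.isEmpty()) {
            sentences.add(rest);
        }
        return sentences;
    }

    public static void main(String[] args) {
        System.out.println(split("Sono Mr. Brown. Come stai?", Set.of("Mr.")));
    }
}
```

With "Mr." in the abbreviation set, "Sono Mr. Brown." stays a single sentence, which is the result I expected from the trained model.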

Can you help me with these two problems? They seem to be different, but may share the same cause.


Thanks,
    Riccardo
