I'm using OpenNLP to train a sentence splitter model, following the
manual, and the basic training works fine.
I've noticed that the train method accepts a Dictionary of
abbreviations. Abbreviations typically cause problems for my sentence
splitter, so I wanted to try this approach.
1) If I train a sentence splitter model with a Dictionary I've built
and serialize it (exactly as I serialize a model trained without one),
loading the model back from the file fails:
    SentenceModel model = SentenceDetectorME.train(language, sampleStream, true, abbreviations, 5, 100);
    modelOut = new BufferedOutputStream(
        new FileOutputStream("/home/lib/apache-opennlp-1.5.1-incubating/models/it/it-sent.bin"));
    model.serialize(modelOut);
    [...]
    modelIn = new FileInputStream("/home/lib/apache-opennlp-1.5.1-incubating/models/it/it-sent.bin");
    final SentenceModel sentenceModel = new SentenceModel(modelIn);
The last statement throws the following exception:
java.io.IOException: Stream closed
at java.util.zip.ZipInputStream.ensureOpen(ZipInputStream.java:61)
at java.util.zip.ZipInputStream.closeEntry(ZipInputStream.java:108)
at opennlp.tools.util.model.BaseModel.<init>(BaseModel.java:137)
at opennlp.tools.sentdetect.SentenceModel.<init>(SentenceModel.java:77)
at SentenceDetector.main(SentenceDetector.java:18)
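For what it's worth, the same message can be reproduced with plain java.util.zip by calling closeEntry() on a ZipInputStream that has already been closed. The sketch below is only an illustration of the symptom, not a confirmed diagnosis: the class name, the method names and the "entry.xml" entry are made up for the example, and it does not use OpenNLP at all. My guess (an assumption) is that something closes the stream while the model is still being read.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class StreamClosedDemo {

    // Builds a tiny in-memory zip archive with a single (hypothetical) entry.
    static byte[] buildZip() {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
                zip.putNextEntry(new ZipEntry("entry.xml"));
                zip.write("<dictionary/>".getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns the message of the IOException raised by closeEntry() after
    // the stream has been closed, or null if no exception is raised.
    static String messageAfterPrematureClose() {
        try {
            ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(buildZip()));
            zip.getNextEntry();
            zip.close();      // stands in for whatever closes the stream too early
            zip.closeEntry(); // the same call that appears in the stack trace
            return null;
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(messageAfterPrematureClose());
    }
}
```

If that guess is right, the stream would be getting closed somewhere between reading an entry and the closeEntry() call in BaseModel, but I have not verified this in the OpenNLP sources.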
2) I tried to skip the serialization/de-serialization phase, to go on
with my tests:
    SentenceModel model = SentenceDetectorME.train(language, sampleStream, true, abbreviations, 5, 100);
    SentenceDetectorME sentenceDetector = new SentenceDetectorME(model);
    String[] sentences = sentenceDetector.sentDetect(document);
However, sentences are still split at the abbreviations I declared in
my Dictionary, which isn't what I expected. E.g.: "Sono Mr.
Brown" is incorrectly split into "Sono Mr." and "Brown.".
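To make the expectation concrete, here is a plain-Java sketch of the behaviour I was hoping for. It is not OpenNLP code and not how the trained model works internally: the class ExpectedSplitDemo, its split method and the rule "never split after a token listed as an abbreviation" are all hypothetical and only illustrate the desired output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ExpectedSplitDemo {

    // Ends a sentence at any whitespace-separated token that finishes with
    // '.', '!' or '?', unless that token is listed as an abbreviation.
    static List<String> split(String text, Set<String> abbreviations) {
        List<String> sentences = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String token : text.trim().split("\\s+")) {
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(token);
            boolean endsSentence =
                token.endsWith(".") || token.endsWith("!") || token.endsWith("?");
            if (endsSentence && !abbreviations.contains(token)) {
                sentences.add(current.toString());
                current.setLength(0);
            }
        }
        // Keep any trailing text that has no final punctuation.
        if (current.length() > 0) {
            sentences.add(current.toString());
        }
        return sentences;
    }

    public static void main(String[] args) {
        Set<String> abbreviations = new HashSet<>(Arrays.asList("Mr.", "Dr."));
        System.out.println(split("Sono Mr. Brown. Piacere.", abbreviations));
    }
}
```

With "Mr." in the abbreviation set, "Sono Mr. Brown." stays in one piece, which is the result I expected from passing the Dictionary to train.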
Can you help me with these two problems? They seem different, but may
share the same cause.
Thanks,
Riccardo