Re: Problems training my own sentence splitter, with dictionary

william.co...@gmail.com Tue, 27 Sep 2011 07:18:45 -0700

Hi, Riccardo,

We slightly improved how the abbreviation dictionary is handled in OpenNLP
1.5.2. You should try it with the current release candidate:
http://people.apache.org/~joern/releases/opennlp-1.5.2-incubating/rc1/


1.5.2 includes a command line tool named DictionaryBuilder:

$ bin/opennlp DictionaryBuilder
Usage: opennlp DictionaryBuilder -inputFile in -outputFile out [-encoding charsetName]

Arguments description:
-inputFile in
 Plain file with one entry per line
-outputFile out
 The dictionary file.
-encoding charsetName
 specifies the encoding which should be used for reading and writing text.
 If not specified the system default will be used.

This tool should be used to create an abbreviation dictionary in XML format
as expected by the SentenceDetector and Tokenizer tools.
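
As a minimal sketch of preparing the input for this tool, the Java snippet
below writes a list of abbreviations to a plain file, one entry per line, in
an explicit encoding. The class name AbbreviationFile, the file names
abbreviations.txt and abbreviations.xml, and the sample abbreviations are
hypothetical and chosen only for illustration; only the -inputFile,
-outputFile and -encoding options come from the usage text above.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of OpenNLP: prepares the plain input file
// that DictionaryBuilder reads (one entry per line).
class AbbreviationFile {

    // Writes each non-blank abbreviation on its own line using the given charset.
    static void write(Path target, List<String> abbreviations, Charset charset)
            throws IOException {
        List<String> cleaned = abbreviations.stream()
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toList());
        Files.write(target, cleaned, charset);
    }

    public static void main(String[] args) throws IOException {
        // Sample entries and file names are assumptions for illustration.
        Path input = Paths.get("abbreviations.txt");
        write(input, Arrays.asList("Mr.", "Dr.", "Sig."), StandardCharsets.UTF_8);

        // Print the command to run next; pass the same encoding used above.
        System.out.println("bin/opennlp DictionaryBuilder -inputFile " + input
                + " -outputFile abbreviations.xml -encoding UTF-8");
    }
}
```

Passing -encoding explicitly, with the same charset used to write the file,
avoids depending on the system default mentioned in the usage text.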

Regards,
William


On Tue, Sep 27, 2011 at 10:37 AM, Riccardo Tasso
<riccardo.ta...@gmail.com> wrote:

> I'm trying to use OpenNLP to train a sentence splitter model, following the
> manual, and everything is ok with it.
>
> I've noticed that the train method supports a Dictionary of abbreviations.
> Abbreviations typically cause problems for my sentence splitter, so I wanted
> to try this strategy.
>
> 1) If I train a sentence splitter model, passing a Dictionary I've built, and
> serialize it (just as I serialize a simpler model), I have problems
> loading the model from file:
>
> SentenceModel model = SentenceDetectorME.train(language, sampleStream,
> true, abbreviations, 5, 100);
> modelOut = new BufferedOutputStream(new FileOutputStream(
> "/home/lib/apache-opennlp-1.5.1-incubating/models/it/it-sent.bin"));
> model.serialize(modelOut);
> [...]
> modelIn = new FileInputStream(
> "/home/lib/apache-opennlp-1.5.1-incubating/models/it/it-sent.bin");
> final SentenceModel sentenceModel = new SentenceModel(modelIn);
>
> That is, the last statement throws the following exception:
> java.io.IOException: Stream closed
>    at java.util.zip.ZipInputStream.ensureOpen(ZipInputStream.java:61)
>    at java.util.zip.ZipInputStream.closeEntry(ZipInputStream.java:108)
>    at opennlp.tools.util.model.BaseModel.<init>(BaseModel.java:137)
>    at opennlp.tools.sentdetect.SentenceModel.<init>(SentenceModel.java:77)
>    at SentenceDetector.main(SentenceDetector.java:18)
>
> 2) I tried to skip the serialization/de-serialization phase, to go on with
> my tests:
> SentenceModel model = SentenceDetectorME.train(language, sampleStream,
> true, abbreviations, 5, 100);
> String[] sentences = sentenceDetector.sentDetect(document);
>
> However, sentences are still split on abbreviations that I declared in my
> Dictionary, which isn't what I expected. E.g., "Sono Mr. Brown" is
> incorrectly split into "Sono Mr." and "Brown.".
>
> Can you help me with these two problems? They seem to be different, but
> may share the same cause.
>
> Thanks,
>    Riccardo
>
