Re: Problems training my own sentence splitter, with dictionary

Riccardo Tasso Tue, 27 Sep 2011 15:07:35 -0700

In the meantime, I think the simplest way to implement my own
dictionary is to extend the original one.


Thanks,
   Riccardo
On 28 Sep 2011 00:03, "william.co...@gmail.com" <william.co...@gmail.com> wrote:
> Hi, Riccardo,
>
> On Tue, Sep 27, 2011 at 1:07 PM, Riccardo Tasso <riccardo.ta...@gmail.com> wrote:
>
>> Hi William,
>> the upgrade solved my problem with the serialization of the model, thank
>> you.
>>
>> Another question about dictionaries: I'm interested in implementing my own
>> Dictionary classes (especially for POSDictionary) with their own backend
>> (e.g. Redis instead of memory). Wouldn't it be better if the train method
>> took a Dictionary interface, instead of a class, as a parameter?
>>
>
> +1 to including a Dictionary interface, but we will have to wait for the
> next major release because we want 1.5.2 to be backward compatible with
> 1.5.x.
>
> We already have the TagDictionary interface, and POSDictionary implements
> it. You can use this interface to implement your own tag dictionary. We
> have other constructors in POSTaggerME that take a Dictionary, but those
> are the n-gram dictionaries. N-gram dictionaries in the POSTagger are an
> experimental feature, and we were discussing whether they should be
> removed.
>
> William
>
>
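
As a rough illustration of the TagDictionary route described above, here is a minimal sketch of a tag dictionary with a pluggable backend. To stay self-contained it declares a local stand-in interface with the same single method as opennlp.tools.postag.TagDictionary in 1.5.x (String[] getTags(String word)); in real use you would implement the OpenNLP interface instead. TagBackend, BackedTagDictionary and the in-memory map standing in for Redis are hypothetical names for this example, not part of OpenNLP.

```java
import java.util.HashMap;
import java.util.Map;

public class TagDictionarySketch {

    /** Local stand-in mirroring the single method of OpenNLP's TagDictionary. */
    interface TagDictionary {
        String[] getTags(String word);
    }

    /** Hypothetical backend abstraction; it could wrap Redis, a database, etc. */
    interface TagBackend {
        /** Returns space-separated tags for the key, or null if unknown. */
        String lookup(String key);
    }

    /** Tag dictionary that delegates every lookup to the backend. */
    static class BackedTagDictionary implements TagDictionary {
        private final TagBackend backend;

        BackedTagDictionary(TagBackend backend) {
            this.backend = backend;
        }

        @Override
        public String[] getTags(String word) {
            String stored = backend.lookup(word);
            // Assumption: null signals "no information for this word".
            return stored == null ? null : stored.split(" ");
        }
    }

    public static void main(String[] args) {
        // An in-memory map plays the role of the external store here.
        Map<String, String> store = new HashMap<>();
        store.put("Mr.", "NNP");
        store.put("run", "NN VB");

        TagDictionary dict = new BackedTagDictionary(store::get);
        System.out.println(String.join(",", dict.getTags("run")));
        System.out.println(dict.getTags("unknown") == null);
    }
}
```

Keeping the backend behind its own small interface means the store can be swapped without touching the dictionary class.
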
>>
>> Thank you,
>> Riccardo
>>
>>
>> On 27/09/2011 16:17, william.co...@gmail.com wrote:
>>
>>> Hi, Riccardo,
>>>
>>> We slightly improved how the abbreviation dictionary is handled in
>>> OpenNLP 1.5.2. You should try it using the current release candidate:
>>> http://people.apache.org/~joern/releases/opennlp-1.5.2-incubating/rc1/
>>>
>>>
>>> 1.5.2 includes a command line tool named DictionaryBuilder:
>>>
>>> $ bin/opennlp DictionaryBuilder
>>> Usage: opennlp DictionaryBuilder -inputFile in -outputFile out [-encoding charsetName]
>>>
>>> Arguments description:
>>> -inputFile in
>>>     Plain file with one entry per line
>>> -outputFile out
>>>     The dictionary file.
>>> -encoding charsetName
>>>     specifies the encoding which should be used for reading and writing text.
>>>     If not specified the system default will be used.
>>>
>>> This tool should be used to create an abbreviation dictionary in the XML
>>> format expected by the SentenceDetector and Tokenizer tools.
>>>
>>> Regards,
>>> William
>>>
>>>
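
To make the input format concrete: the usage text above says the tool reads a plain file with one entry per line. The sketch below writes such a file from Java. The class name, the file name abbreviations.txt and the sample abbreviations are assumptions chosen purely for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class AbbreviationFileSketch {

    /** Writes one abbreviation per line, encoded as UTF-8. */
    static Path writeEntries(Path target, List<String> abbreviations) throws IOException {
        return Files.write(target, abbreviations, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Sample entries only; replace with the abbreviations of your language.
        List<String> abbreviations = Arrays.asList("Mr.", "Sig.", "Dott.");
        Path out = writeEntries(Paths.get("abbreviations.txt"), abbreviations);
        System.out.println("Wrote " + abbreviations.size() + " entries to " + out);
    }
}
```

The resulting file could then be passed to the tool with something like `-inputFile abbreviations.txt -outputFile abbreviations.xml -encoding UTF-8`, where the output file name is again just an example.
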
>>> On Tue, Sep 27, 2011 at 10:37 AM, Riccardo Tasso
>>> <riccardo.ta...@gmail.com> wrote:
>>>
>>>> I'm trying to use OpenNLP to train a sentence splitter model, following
>>>> the manual, and everything works fine so far.
>>>>
>>>> I've noticed that the train method supports a Dictionary of
>>>> abbreviations. Abbreviations typically cause problems for my sentence
>>>> splitter, so I wanted to try this strategy.
>>>>
>>>> 1) If I train a sentence splitter model, passing a Dictionary I've built,
>>>> and serialize it (just as I serialize a simpler model), I have problems
>>>> loading the model from file:
>>>>
>>>> SentenceModel model = SentenceDetectorME.train(language, sampleStream,
>>>>     true, abbreviations, 5, 100);
>>>> modelOut = new BufferedOutputStream(new FileOutputStream(
>>>>     "/home/lib/apache-opennlp-1.5.1-incubating/models/it/it-sent.bin"));
>>>> model.serialize(modelOut);
>>>> [...]
>>>> modelIn = new FileInputStream(
>>>>     "/home/lib/apache-opennlp-1.5.1-incubating/models/it/it-sent.bin");
>>>> final SentenceModel sentenceModel = new SentenceModel(modelIn);
>>>>
>>>> i.e. the last instruction gives me the following exception:
>>>> java.io.IOException: Stream closed
>>>>     at java.util.zip.ZipInputStream.ensureOpen(ZipInputStream.java:61)
>>>>     at java.util.zip.ZipInputStream.closeEntry(ZipInputStream.java:108)
>>>>     at opennlp.tools.util.model.BaseModel.<init>(BaseModel.java:137)
>>>>     at opennlp.tools.sentdetect.SentenceModel.<init>(SentenceModel.java:77)
>>>>     at SentenceDetector.main(SentenceDetector.java:18)
>>>>
>>>> 2) I tried skipping the serialization/deserialization step to continue
>>>> with my tests:
>>>> SentenceModel model = SentenceDetectorME.train(language, sampleStream,
>>>>     true, abbreviations, 5, 100);
>>>> String[] sentences = sentenceDetector.sentDetect(document);
>>>>
>>>> However, sentences are also split on abbreviations that I declared in my
>>>> Dictionary, which isn't what I expected. E.g. "Sono Mr. Brown" is
>>>> incorrectly split into "Sono Mr." and "Brown.".
>>>>
>>>> Can you help me with these two problems? They seem different, but may
>>>> share the same cause.
>>>>
>>>> Thanks,
>>>> Riccardo
>>>>
>>>>
>>
