Hi, Riccardo,

On Tue, Sep 27, 2011 at 1:07 PM, Riccardo Tasso <riccardo.ta...@gmail.com> wrote:
> Hi William,
> the upgrade solved my problem with the serialization of the model, thank
> you.
>
> Another question about dictionaries: I'm interested in implementing my own
> Dictionary classes (especially for POSDictionary) with their own backend
> (e.g. Redis instead of memory). Wouldn't it be better if the train method
> took a Dictionary interface, instead of a class, as a parameter?
>

+1 to include a Dictionary interface, but we will have to wait for the next
major release because we want 1.5.2 to be backward compatible with 1.5.x.

We already have the TagDictionary interface, and POSDictionary implements it.
You can use this interface to implement your own tag dictionary.

POSTaggerME has other constructors that take a Dictionary, but those are the
n-gram dictionaries. N-gram dictionaries in the POSTagger are an experimental
feature, and we have been discussing whether they should be removed.

William

>
> Thank you,
> Riccardo
>
>
> On 27/09/2011 16:17, william.co...@gmail.com wrote:
>
>> Hi, Riccardo,
>>
>> We improved a little how the abbreviation dictionary is handled in OpenNLP
>> 1.5.2. You should try it using the current release candidate:
>> http://people.apache.org/~joern/releases/opennlp-1.5.2-incubating/rc1/
>>
>> 1.5.2 includes a command line tool named DictionaryBuilder:
>>
>> $ bin/opennlp DictionaryBuilder
>> Usage: opennlp DictionaryBuilder -inputFile in -outputFile out [-encoding charsetName]
>>
>> Arguments description:
>>     -inputFile in
>>         Plain file with one entry per line
>>     -outputFile out
>>         The dictionary file.
>>     -encoding charsetName
>>         specifies the encoding which should be used for reading and writing text.
>>         If not specified the system default will be used.
>>
>> This tool should be used to create an abbreviation dictionary in XML
>> format, as expected by the SentenceDetector and Tokenizer tools.
>>
>> Regards,
>> William
>>
>>
>> On Tue, Sep 27, 2011 at 10:37 AM, Riccardo Tasso
>> <riccardo.ta...@gmail.com> wrote:
>>
>>> I'm trying to use OpenNLP to train a sentence splitter model, following
>>> the manual, and everything is OK with it.
>>>
>>> I've noticed that the train method supports a Dictionary of
>>> abbreviations, which typically give me problems with my sentence
>>> splitter, and I wanted to try this strategy.
>>>
>>> 1) If I train a sentence splitter model passing a Dictionary I've built
>>> and serialize it (just as I serialize a simpler model), I have problems
>>> loading the model from file:
>>>
>>> SentenceModel model = SentenceDetectorME.train(language, sampleStream,
>>>     true, abbreviations, 5, 100);
>>> modelOut = new BufferedOutputStream(new FileOutputStream(
>>>     "/home/lib/apache-opennlp-1.5.1-incubating/models/it/it-sent.bin"));
>>> model.serialize(modelOut);
>>> [...]
>>> modelIn = new FileInputStream(
>>>     "/home/lib/apache-opennlp-1.5.1-incubating/models/it/it-sent.bin");
>>> final SentenceModel sentenceModel = new SentenceModel(modelIn);
>>>
>>> i.e.
>>> the last instruction gives me the following exception:
>>>
>>> java.io.IOException: Stream closed
>>>     at java.util.zip.ZipInputStream.ensureOpen(ZipInputStream.java:61)
>>>     at java.util.zip.ZipInputStream.closeEntry(ZipInputStream.java:108)
>>>     at opennlp.tools.util.model.BaseModel.<init>(BaseModel.java:137)
>>>     at opennlp.tools.sentdetect.SentenceModel.<init>(SentenceModel.java:77)
>>>     at SentenceDetector.main(SentenceDetector.java:18)
>>>
>>> 2) I tried to skip the serialization/de-serialization phase, to go on
>>> with my tests:
>>>
>>> SentenceModel model = SentenceDetectorME.train(language, sampleStream,
>>>     true, abbreviations, 5, 100);
>>> String[] sentences = sentenceDetector.sentDetect(document);
>>>
>>> However, sentences are also split on abbreviations which I declared in
>>> my Dictionary, which isn't exactly what I expected. E.g.: "Sono Mr. Brown"
>>> is incorrectly split into "Sono Mr." and "Brown.".
>>>
>>> Can you help me with these two problems, which seem to be different but
>>> may share the same cause?
>>>
>>> Thanks,
>>> Riccardo
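
As an illustration of the TagDictionary suggestion at the top of this thread, here is a rough sketch of a tag dictionary whose storage sits behind a small backend interface. It is only a sketch under stated assumptions: `TagLookup` is a local stand-in that mirrors what I understand to be the shape of `opennlp.tools.postag.TagDictionary` (a single `String[] getTags(String word)` method), and `KeyValueStore`, `InMemoryStore` and `StoreBackedTagDictionary` are hypothetical names invented for this example. In real code the class would implement the OpenNLP interface directly, and a Redis client could sit behind `KeyValueStore` in place of the in-memory map.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Local stand-in for opennlp.tools.postag.TagDictionary (assumed shape:
// one method returning the allowed tags for a word).
interface TagLookup {
    String[] getTags(String word);
}

// Hypothetical minimal backend; a Redis client could be wrapped behind it.
interface KeyValueStore {
    String get(String key);
    void put(String key, String value);
}

// In-memory backend, used here only so the example runs without a server.
class InMemoryStore implements KeyValueStore {
    private final Map<String, String> data = new HashMap<String, String>();

    public String get(String key) {
        return data.get(key);
    }

    public void put(String key, String value) {
        data.put(key, value);
    }
}

// Tag dictionary that delegates all storage to the backend.
// The tags of a word are stored as one space-separated string.
class StoreBackedTagDictionary implements TagLookup {
    private final KeyValueStore store;
    private final boolean caseSensitive;

    StoreBackedTagDictionary(KeyValueStore store, boolean caseSensitive) {
        this.store = store;
        this.caseSensitive = caseSensitive;
    }

    void addTags(String word, String... tags) {
        store.put(normalize(word), String.join(" ", tags));
    }

    // Returns null for unknown words (an assumption of this sketch).
    public String[] getTags(String word) {
        String value = store.get(normalize(word));
        return (value == null || value.isEmpty()) ? null : value.split(" ");
    }

    private String normalize(String word) {
        return caseSensitive ? word : word.toLowerCase(Locale.ROOT);
    }
}

class TagDictionaryDemo {
    public static void main(String[] args) {
        StoreBackedTagDictionary dict =
            new StoreBackedTagDictionary(new InMemoryStore(), false);
        dict.addTags("casa", "NOUN", "VERB");
        System.out.println(Arrays.toString(dict.getTags("Casa")));
        System.out.println(dict.getTags("sconosciuto") == null);
    }
}
```

Keeping the backend behind its own interface means the dictionary logic (case handling, tag encoding) does not change when the store is swapped, which seems to be the point of the original request.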