In the meantime, I think the simplest way to implement my own dictionary is to extend the original one.
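
A minimal sketch of that approach, assuming nothing about the real OpenNLP classes beyond a `getTags(String)` lookup: `InMemoryTagDictionary` below is only a stand-in for `POSDictionary`, and `BackendTagDictionary` with its `backend` function are hypothetical names introduced for illustration. The subclass asks an external store (Redis, say) first and falls back to the inherited in-memory entries.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Stand-in for the library's dictionary class (NOT the OpenNLP implementation):
// only the word -> tags lookup is modelled here.
class InMemoryTagDictionary {
    private final Map<String, String[]> entries = new HashMap<>();

    void put(String word, String... tags) {
        entries.put(word, tags);
    }

    // Returns the tags stored for a word, or null when the word is unknown.
    String[] getTags(String word) {
        return entries.get(word);
    }
}

// Hypothetical subclass: consults an external backend first, then the
// inherited in-memory entries. The backend is any word -> tags function.
class BackendTagDictionary extends InMemoryTagDictionary {
    private final Function<String, String[]> backend;

    BackendTagDictionary(Function<String, String[]> backend) {
        this.backend = backend;
    }

    @Override
    String[] getTags(String word) {
        String[] tags = backend.apply(word);
        return tags != null ? tags : super.getTags(word);
    }
}
```

Because the subclass keeps the parent type, it could be passed wherever the original class is expected, which is the point of extending instead of waiting for an interface; a real Redis client would replace the `Function` used here.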

Thanks,
Riccardo

On 28 Sep 2011 00:03, "william.co...@gmail.com" <william.co...@gmail.com> wrote:
> Hi, Riccardo,
>
> On Tue, Sep 27, 2011 at 1:07 PM, Riccardo Tasso <riccardo.ta...@gmail.com> wrote:
>
>> Hi William,
>> the upgrade solved my problem with the serialization of the model, thank
>> you.
>>
>> Another question about dictionaries: I'm interested in implementing my own
>> Dictionary classes (especially for POSDictionary) with their own backend
>> (e.g. Redis instead of memory). Wouldn't it be better if the train method
>> took a Dictionary interface, instead of a class, as a parameter?
>>
>
> +1 to including a Dictionary interface, but we will have to wait for the next
> major release because we want 1.5.2 to be backward compatible with 1.5.x.
>
> We already have the TagDictionary interface, and POSDictionary implements
> it. You can use this interface to implement your own tag dictionary. We have
> other constructors in POSTaggerME that take a Dictionary, but those are the
> n-gram dictionaries. N-gram dictionaries in the POSTagger are an experimental
> feature, and we were discussing whether it should be removed.
>
> William
>
>
>>
>> Thank you,
>> Riccardo
>>
>>
>> On 27/09/2011 16:17, william.co...@gmail.com wrote:
>>
>>> Hi, Riccardo,
>>>
>>> We improved a little how the abbreviation dictionary is handled in OpenNLP
>>> 1.5.2.
>>> You should try it using the current release candidate:
>>> http://people.apache.org/~joern/releases/opennlp-1.5.2-incubating/rc1/
>>>
>>> 1.5.2 includes a command line tool named DictionaryBuilder:
>>>
>>> $ bin/opennlp DictionaryBuilder
>>> Usage: opennlp DictionaryBuilder -inputFile in -outputFile out [-encoding charsetName]
>>>
>>> Arguments description:
>>>         -inputFile in
>>>                 Plain file with one entry per line
>>>         -outputFile out
>>>                 The dictionary file.
>>>         -encoding charsetName
>>>                 specifies the encoding which should be used for reading and writing text.
>>>                 If not specified the system default will be used.
>>>
>>> This tool should be used to create an abbreviation dictionary in XML format,
>>> as expected by the SentenceDetector and Tokenizer tools.
>>>
>>> Regards,
>>> William
>>>
>>>
>>> On Tue, Sep 27, 2011 at 10:37 AM, Riccardo Tasso
>>> <riccardo.ta...@gmail.com> wrote:
>>>
>>>> I'm trying to use OpenNLP to train a sentence splitter model, following the
>>>> manual, and everything is OK with it.
>>>>
>>>> I've noticed that the train method supports a Dictionary of abbreviations,
>>>> which typically give me problems with my sentence splitter, and I wanted to
>>>> try this strategy.
>>>>
>>>> 1) If I train a sentence splitter model passing a Dictionary I've built and
>>>> serialize it (just as I serialize a simpler model), I have problems
>>>> loading the model from file:
>>>>
>>>> SentenceModel model = SentenceDetectorME.train(language, sampleStream,
>>>>         true, abbreviations, 5, 100);
>>>> modelOut = new BufferedOutputStream(new FileOutputStream(
>>>>         "/home/lib/apache-opennlp-1.5.1-incubating/models/it/it-sent.bin"));
>>>> model.serialize(modelOut);
>>>> [...]
>>>> modelIn = new FileInputStream(
>>>>         "/home/lib/apache-opennlp-1.5.1-incubating/models/it/it-sent.bin");
>>>> final SentenceModel sentenceModel = new SentenceModel(modelIn);
>>>>
>>>> i.e. the last instruction gives me the following exception:
>>>> java.io.IOException: Stream closed
>>>>     at java.util.zip.ZipInputStream.ensureOpen(ZipInputStream.java:61)
>>>>     at java.util.zip.ZipInputStream.closeEntry(ZipInputStream.java:108)
>>>>     at opennlp.tools.util.model.BaseModel.<init>(BaseModel.java:137)
>>>>     at opennlp.tools.sentdetect.SentenceModel.<init>(SentenceModel.java:77)
>>>>     at SentenceDetector.main(SentenceDetector.java:18)
>>>>
>>>> 2) I tried to skip the serialization/de-serialization phase, to go on with
>>>> my tests:
>>>> SentenceModel model = SentenceDetectorME.train(language, sampleStream,
>>>>         true, abbreviations, 5, 100);
>>>> String[] sentences = sentenceDetector.sentDetect(document);
>>>>
>>>> However, sentences are also split on abbreviations which I declared in my
>>>> Dictionary, which isn't exactly what I expected. E.g.: "Sono Mr. Brown" is
>>>> incorrectly split into "Sono Mr." and "Brown.".
>>>>
>>>> Can you help me with these two problems, which seem to be different but
>>>> may share the same cause?
>>>>
>>>> Thanks,
>>>> Riccardo