Hi,

I have a large tagger dictionary (~600k words), and it takes a long time (~45 seconds) to load with the current implementation. Part of the time is spent loading the XML into memory, and the rest is spent validating the tags, checking whether they are known by the model.
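To make the cost concrete: the validation is essentially a full pass over every entry on every load. A rough sketch of that per-load work (hypothetical code, not the actual OpenNLP implementation) would be:

```java
import java.util.*;

// Hypothetical sketch of the per-load validation cost, not actual
// OpenNLP code: every load re-checks every tag of every entry against
// the model's known tag set, i.e. O(entries x tags) work for a
// ~600k-word dictionary.
class DictionaryValidator {
    static void validate(Map<String, String[]> entries, Set<String> knownTags) {
        for (Map.Entry<String, String[]> e : entries.entrySet()) {
            for (String tag : e.getValue()) {
                if (!knownTags.contains(tag)) {
                    throw new IllegalArgumentException(
                        "Unknown tag '" + tag + "' for word '" + e.getKey() + "'");
                }
            }
        }
    }
}
```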
Maybe we should change the implementation to avoid re-validating the tags of a dictionary that is already bundled inside a model. This validation needs to be done only once, while building the model. What do you think?

I have also been searching for an alternative to XML, and I found a very nice one: the FSA dictionary implementation included in a tool called Morfologik (BSD license):

http://sourceforge.net/projects/morfologik/
http://languagetool.wikidot.com/developing-a-tagger-dictionary

I tried it already, and it dramatically reduced the size of the dictionary, from something like 10 MB (zipped) to 300 KB. There is also almost no loading time, and it requires very little memory. I ran a simple benchmark to evaluate the runtime impact, and the access time looks the same as with the XML dictionary.

Maybe we should create an optional package that would allow using these FSA dictionaries with OpenNLP. What do you think?

Thanks,
William
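P.S. In case it helps explain the flat access time: an FSA dictionary answers a lookup by walking one transition per character, so lookup cost depends on the word's length, not on how many words are stored. Here is a deliberately naive prefix-sharing trie (my own illustration, not Morfologik's implementation) that shows that half of the idea; a minimized FSA additionally shares common suffixes, which is where most of the size reduction comes from:

```java
import java.util.*;

// Naive illustration of the FSA idea (not Morfologik's implementation):
// a trie shares common prefixes, and lookup walks one edge per
// character, so access time is O(word length) regardless of dictionary
// size. A minimized FSA also shares common suffixes, shrinking the
// structure much further.
class TrieDictionary {
    private final Map<Character, TrieDictionary> children = new HashMap<>();
    private String tags; // POS tags attached to word-final states

    void add(String word, String tags) {
        TrieDictionary node = this;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new TrieDictionary());
        }
        node.tags = tags;
    }

    String lookup(String word) {
        TrieDictionary node = this;
        for (char c : word.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return null; // word not in the dictionary
            }
        }
        return node.tags;
    }
}
```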