Hi,

I have a large tagger dictionary (~600k words), and it takes a long time
(~45 seconds) to load with the current implementation. Part of the time is
spent loading the XML into memory, and the rest is spent validating that
the tags are known to the model.

Maybe we should change this implementation to avoid always validating the
tags of a dictionary that is already bundled inside the model. This
validation should be done only once, while building the model. What do you
think?
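To sketch what I mean (all names here are hypothetical, not the current OpenNLP API): validation against the model's tag set would be an explicit method called once at model build time, and never during loading of a bundled dictionary.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: a tagger dictionary whose tag validation is a
// separate step, run once while building the model. Loading a dictionary
// that is already bundled in a model would skip validateAgainst() entirely.
public class TagDictionary {
    private final Map<String, String[]> entries = new HashMap<>();

    public void put(String word, String... tags) {
        entries.put(word, tags);
    }

    public String[] getTags(String word) {
        return entries.get(word);
    }

    // Called once at model build time; never at load time.
    public void validateAgainst(Set<String> knownTags) {
        for (Map.Entry<String, String[]> e : entries.entrySet()) {
            for (String tag : e.getValue()) {
                if (!knownTags.contains(tag)) {
                    throw new IllegalArgumentException(
                        "Unknown tag '" + tag + "' for word '" + e.getKey() + "'");
                }
            }
        }
    }
}
```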

I have also been looking for an alternative to the XML format, and I found
a very nice one: the FSA dictionary implementation included in a tool
called Morfologik (BSD license):
http://sourceforge.net/projects/morfologik/
http://languagetool.wikidot.com/developing-a-tagger-dictionary

I already tried it and was able to dramatically reduce the size of the
dictionary, from about 10 MB (zipped) to 300 KB. There is also almost no
loading time, and it requires very little memory. I ran a simple benchmark
to evaluate the runtime impact, and the access time looks about the same
as with the XML dictionary.

Maybe we should create an optional package that would allow using these
FSA dictionaries with OpenNLP. What do you think?
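To illustrate why an FSA dictionary is so compact and fast to query (a toy sketch only; Morfologik's actual FSA is a highly compressed byte-level automaton that also shares suffixes, not this class): words are stored as transitions between states, so shared prefixes are stored once and lookup is a single walk over the automaton with no parsing at all.

```java
import java.util.HashMap;
import java.util.Map;

// Toy finite-state lookup: shared prefixes live in shared states, and a
// word is accepted by following one transition per character. This is the
// lookup idea behind FSA dictionaries, in a deliberately simple form.
public class FsaDemo {
    static final class State {
        final Map<Character, State> next = new HashMap<>();
        boolean accepting;
    }

    private final State root = new State();

    public void add(String word) {
        State s = root;
        for (char c : word.toCharArray()) {
            s = s.next.computeIfAbsent(c, k -> new State());
        }
        s.accepting = true;
    }

    public boolean contains(String word) {
        State s = root;
        for (char c : word.toCharArray()) {
            s = s.next.get(c);
            if (s == null) {
                return false;
            }
        }
        return s.accepting;
    }
}
```

Since lookup touches only one state per character, access time stays flat regardless of how many words the dictionary holds, which matches what I saw in the benchmark.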

Thanks,
William
