On 07/08/2012 10:28 PM, William Colen wrote:
Hi,
I have a large tagger dictionary (~600k words), and it is taking a long
time (~45 seconds) to load with the current implementation. Part of the
time is spent loading the XML into memory, and the rest is spent
validating that the tags are known to the model.
Maybe we should change the implementation so that a dictionary already
bundled inside a model is not re-validated on every load. This validation
only needs to happen once, while building the model. What do you think?
+1, checking this only at training time sounds safe to me. Or is
there a way to speed it up? Maybe our implementation is just inefficient.
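To make the idea concrete, here is a minimal sketch of what a
validate-once-at-training-time helper could look like. The class and
method names are hypothetical, not existing OpenNLP API: the model's tag
set is checked against every dictionary entry once, when the model is
built, so loading a dictionary bundled inside a model could skip the
check entirely.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: validate a tag dictionary against the model's
// known tag set once, at training time. A dictionary bundled inside a
// model would then be trusted at load time and skip this pass.
public class TagDictionaryValidator {

    // Returns the words whose tag lists contain a tag unknown to the model.
    public static List<String> findInvalidEntries(
            Map<String, List<String>> dictionary, Set<String> modelTags) {
        List<String> invalid = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : dictionary.entrySet()) {
            for (String tag : entry.getValue()) {
                if (!modelTags.contains(tag)) {
                    invalid.add(entry.getKey());
                    break;
                }
            }
        }
        return invalid;
    }

    public static void main(String[] args) {
        Map<String, List<String>> dict = Map.of(
                "house", List.of("NN"),
                "runs", List.of("VBZ", "XYZ")); // XYZ is not a model tag
        Set<String> tags = Set.of("NN", "VBZ");
        System.out.println(findInvalidEntries(dict, tags)); // prints [runs]
    }
}
```

At training time a non-empty result would abort model building; at load
time the pass is simply not run for bundled dictionaries.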
I have also been searching for an alternative to the XML format, and I
found a very nice one: the FSA dictionary implementation included in a tool
called Morfologik (BSD license): http://sourceforge.net/projects/morfologik/
http://languagetool.wikidot.com/developing-a-tagger-dictionary
I have already tried it, and I could dramatically reduce the size of the
dictionary, from roughly 10 MB (zipped) to 300 KB. There is also almost
no loading time, and it requires very little memory. I ran a simple
benchmark to evaluate the impact at runtime, and the access time looks
about the same as with the XML dictionary.
Maybe we should create an optional package that would allow using these
FSA dictionaries with OpenNLP. What do you think?
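For anyone wondering why the FSA format is so much smaller: words that
share prefixes share automaton states (and a full FSA, as in Morfologik,
also merges equivalent suffix states), so 600k entries collapse into a
compact graph. A toy self-contained sketch of the prefix-sharing half of
that idea, using a plain trie rather than Morfologik's actual
implementation:

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration of why automaton-based dictionaries are compact:
// words sharing prefixes share nodes. A real FSA (as in Morfologik)
// additionally merges equivalent suffix states, shrinking it further.
public class TrieDictionary {
    private static class Node {
        Map<Character, Node> next = new HashMap<>();
        String tag; // non-null marks the end of a word, storing its POS tag
    }

    private final Node root = new Node();
    private int nodeCount = 1;

    public void put(String word, String tag) {
        Node n = root;
        for (char c : word.toCharArray()) {
            Node child = n.next.get(c);
            if (child == null) {
                child = new Node();
                n.next.put(c, child);
                nodeCount++;
            }
            n = child;
        }
        n.tag = tag;
    }

    public String lookup(String word) {
        Node n = root;
        for (char c : word.toCharArray()) {
            n = n.next.get(c);
            if (n == null) {
                return null;
            }
        }
        return n.tag;
    }

    public int size() {
        return nodeCount;
    }

    public static void main(String[] args) {
        TrieDictionary d = new TrieDictionary();
        d.put("run", "VB");
        d.put("runs", "VBZ");
        d.put("running", "VBG");
        // 3 words, 14 characters in total, but only 9 nodes, thanks to
        // the shared "run" prefix.
        System.out.println(d.lookup("runs")); // prints VBZ
        System.out.println(d.lookup("ran"));  // prints null
        System.out.println(d.size());         // prints 9
    }
}
```

Lookup cost is proportional to word length, independent of dictionary
size, which would explain why the benchmarked access time matches the
in-memory XML dictionary.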
+1, sounds good. I think it is nice to have a basic core which can be
extended with other, more specialized libraries for certain tasks. I
personally would love to work on support for different machine learning
libraries such as MALLET or libsvm.
Jörn