Great stuff, William.

I have been using Morfologik stemming for a long time, and when we
included it we made it an addon. I assume the reason was its
license, but reading the Morfologik license, it is not clear to me why it
is not Apache compatible.

If it is, it would be nice to include it directly in OpenNLP.

Can anyone shed any light on this?

Thanks,

R

On Fri, Jul 15, 2016 at 12:02 AM, William Colen <william.co...@gmail.com> wrote:
> Hello,
>
> A while back we started working on a Morfologik Addon.
>
> http://svn.apache.org/viewvc/opennlp/addons/
>
> I checked it out last week and noticed it was outdated, especially because it
> was not using the latest Morfologik version. It was also missing
> documentation.
>
> You can find more about Morfologik here:
> https://github.com/morfologik/morfologik-stemming
>
> Morfologik provides tools for finite state automata (FSA) construction and
> dictionary-based morphological dictionaries.
>
> The Morfologik Addon implements some OpenNLP interfaces and extends some
> classes to make it easier to use Morfologik FSA dictionaries:
>
>    - opennlp.morfologik.tagdict.MorfologikPOSTaggerFactory
>       - Extends: opennlp.tools.postag.POSTaggerFactory
>       - Helps create a POSTagger model with an embedded TagDictionary
>       based on FSA
>    - opennlp.morfologik.tagdict.MorfologikTagDictionary
>       - Implements: opennlp.tools.postag.TagDictionary
>       - A TagDictionary based on FSA is much smaller than the default
>       XML-based one and consumes less memory.
>    - opennlp.morfologik.lemmatizer.MorfologikLemmatizer
>       - Implements: opennlp.tools.lemmatizer.DictionaryLemmatizer
>       - A dictionary-based lemmatizer that uses an FSA dictionary.
>
> It also provides the following command line tools:
>
>    - MorfologikDictionaryBuilder
>       - builds a binary POS dictionary using Morfologik
>    - XMLDictionaryToTable
>       - reads an OpenNLP XML tag dictionary and writes it to a
>       tab-separated file that can be built into an FSA dictionary
>
>
> In a project I developed, it was of great help. The tag dictionary for the
> POS tagger was huge (something like 50 MB), requiring a lot of memory.
> Migrating it to an FSA dictionary not only gave me a smaller model, but also
> let me use the model without increasing the JVM memory.
>
> More here:
> https://cwiki.apache.org/confluence/display/OPENNLP/FSA+Dictionary+with+morfologik-addon
>
> Hope it will be helpful.
>
> William
