[ https://issues.apache.org/jira/browse/OPENNLP-760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15286187#comment-15286187 ]
Rodrigo Agerri commented on OPENNLP-760:
----------------------------------------
There is no default Lemmatizer model. To train one you need a three-column,
tab-separated corpus in the form:
token\tpostag\tlemma
with one token per row and one blank line between sentences.
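As a rough illustration of the format described above, the following sketch reads such a file into sentences and rejects malformed rows. The function name read_lemma_corpus, the sample sentences, and the Penn Treebank style tags are my own illustrative assumptions; they are not part of OpenNLP, and the tagset will depend on your corpus.

```python
from typing import Iterable, List, Tuple

Row = Tuple[str, str, str]  # (token, postag, lemma)


def read_lemma_corpus(lines: Iterable[str]) -> List[List[Row]]:
    """Group tab-separated token/postag/lemma rows into sentences.

    A blank line closes the current sentence. Hypothetical helper,
    not an OpenNLP API.
    """
    sentences: List[List[Row]] = []
    current: List[Row] = []
    for lineno, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            # Blank line: sentence boundary (ignore repeated blanks).
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        if len(fields) != 3 or not all(fields):
            raise ValueError(
                f"line {lineno}: expected token<TAB>postag<TAB>lemma, got {line!r}"
            )
        current.append((fields[0], fields[1], fields[2]))
    if current:
        # Last sentence may not be followed by a blank line.
        sentences.append(current)
    return sentences


# Illustrative sample only: two short sentences separated by a blank line.
SAMPLE_LINES = [
    "\t".join(("He", "PRP", "he")),
    "\t".join(("reckons", "VBZ", "reckon")),
    "\t".join((".", ".", ".")),
    "",
    "\t".join(("Dogs", "NNS", "dog")),
    "\t".join(("bark", "VBP", "bark")),
]

sentences = read_lemma_corpus(SAMPLE_LINES)
```

Running a check like this over the corpus before training catches rows with missing columns or spaces used in place of tabs.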
You can find suitable corpora in the CoNLL 2009 shared task data and on the
Universal Dependencies site.
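Universal Dependencies treebanks are distributed in CoNLL-U, which has ten tab-separated columns (ID, FORM, LEMMA, UPOS, XPOS, ...), so they need to be reduced to the three columns above. A minimal sketch of that conversion follows, assuming you want UPOS as the tag by default; the function name conllu_to_lemma_rows and the use_xpos switch are hypothetical, and the Spanish sample is made up for illustration.

```python
from typing import Iterable, Iterator


def conllu_to_lemma_rows(lines: Iterable[str], use_xpos: bool = False) -> Iterator[str]:
    """Yield token<TAB>postag<TAB>lemma rows from CoNLL-U lines.

    Sentences are separated by a single empty string in the output.
    Hypothetical helper, not an OpenNLP API.
    """
    pending_blank = False
    emitted_any = False
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("#"):
            continue  # sentence-level comment such as "# text = ..."
        if not line.strip():
            # Remember the boundary; emit it only before the next token.
            pending_blank = emitted_any
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 CoNLL-U columns, got {len(cols)}: {line!r}")
        tok_id, form, lemma, upos, xpos = cols[:5]
        if "-" in tok_id or "." in tok_id:
            continue  # multiword token range (e.g. 2-3) or empty node (e.g. 1.1)
        if pending_blank:
            yield ""
            pending_blank = False
        yield "\t".join((form, xpos if use_xpos else upos, lemma))
        emitted_any = True


def _row(*cols: str) -> str:
    return "\t".join(cols)


# Illustrative CoNLL-U input with one multiword token ("del" = "de" + "el").
CONLLU_SAMPLE = [
    "# text = Vengo del mar",
    _row("1", "Vengo", "venir", "VERB", "_", "_", "0", "root", "_", "_"),
    _row("2-3", "del", "_", "_", "_", "_", "_", "_", "_", "_"),
    _row("2", "de", "de", "ADP", "_", "_", "4", "case", "_", "_"),
    _row("3", "el", "el", "DET", "_", "_", "4", "det", "_", "_"),
    _row("4", "mar", "mar", "NOUN", "_", "_", "1", "obl", "_", "_"),
    "",
    "# text = Llueve",
    _row("1", "Llueve", "llover", "VERB", "_", "_", "0", "root", "_", "_"),
]

converted = list(conllu_to_lemma_rows(CONLLU_SAMPLE))
```

The sketch keeps the syntactic words and drops the multiword surface form; whether that is the right choice for training depends on how you tokenize at prediction time.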
HTH,
R
> probabilistic lemmatizer
> ------------------------
>
> Key: OPENNLP-760
> URL: https://issues.apache.org/jira/browse/OPENNLP-760
> Project: OpenNLP
> Issue Type: New Feature
> Components: Lemmatizer
> Reporter: Rodrigo Agerri
> Assignee: Rodrigo Agerri
> Priority: Minor
> Fix For: 1.6.1
>
>
> The current SimpleLemmatizer is dictionary-based. A probabilistic lemmatizer
> works better for unknown words and can be combined with dictionaries.
> The method we will implement here is based on:
> Grzegorz Chrupała. 2008. Towards a Machine-Learning Architecture for Lexical
> Functional Grammar Parsing. PhD dissertation, Dublin City University.
> http://grzegorz.chrupala.me/papers/phd-single.pdf
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)