[
https://issues.apache.org/jira/browse/OPENNLP-701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029059#comment-14029059
]
Chris Krol / IBM edited comment on OPENNLP-701 at 6/12/14 12:45 PM:
--------------------------------------------------------------------
Thanks for your response.
Could you point me to the important packages or interfaces that would have to
be implemented in order for the added support to fit well into the general
OpenNLP design?
My current idea is a dedicated opennlp.tools.lang.polish Parser class for the
corpus's native format, which would lead to implementing a corpus loader class
designed to work with this particular corpus. Would it be okay to add a
dedicated corpus loader (with searching, indexing, etc.) for the Polish
National Corpus? That corpus is the gold standard (manually annotated) for
Polish, so it would be advisable to include such support.
I would still be contributing at least the sentence detector and tokenizer,
because they were created using a huge plaintext data set that's free to use
and that doesn't require any pre-processing.
was (Author: kris.chris):
Thanks for your response.
Could you point me to the important packages or interfaces that would have to
be implemented in order for the added support to fit well into the general
OpenNLP design?
My current idea is a dedicated opennlp.tools.lang.polish Parser class for the
corpus's native format.
I would still be contributing at least the sentence detector and tokenizer,
because they were created using a huge plaintext data set that's free to use
and that doesn't require any pre-processing.
> Polish language support - Maxent binaries
> -----------------------------------------
>
> Key: OPENNLP-701
> URL: https://issues.apache.org/jira/browse/OPENNLP-701
> Project: OpenNLP
> Issue Type: New Feature
> Reporter: Chris Krol / IBM
> Priority: Minor
>
> Hi,
> I currently work at IBM Poland, and my manager has approved the idea of
> contributing various Maxent binaries for the Polish language (sentence
> splitting, sentence detection, POS tagging and morphological analysis, NER).
> You could possibly put them on your download page.
> We trained them using the gold-standard, human-annotated Polish National
> Corpus (GPL 3.0).
> Would it also be possible to give some credit for the fact that the work
> was done at IBM?
> I've already sent an email to the devs, but I haven't received a response
> in two weeks.
--
This message was sent by Atlassian JIRA
(v6.2#6252)