[jira] [Comment Edited] (OPENNLP-701) Polish language support - Maxent binaries

Chris Krol / IBM (JIRA) Thu, 12 Jun 2014 05:47:20 -0700

    [ https://issues.apache.org/jira/browse/OPENNLP-701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029059#comment-14029059 ]


Chris Krol / IBM edited comment on OPENNLP-701 at 6/12/14 12:46 PM:
--------------------------------------------------------------------

Thanks for your response. 

Could you point me to the important packages or interfaces that would have to 
be implemented in order for the added support to fit well into the general 
OpenNLP design? 

My current idea is a dedicated Parser class under opennlp.tools.lang.polish 
for the corpus's native format. This would mean implementing a corpus loader 
class designed to work with this particular corpus. Would it be okay to add a 
dedicated corpus loader (with searching, indexing, etc.) for the Polish 
National Corpus? It is the gold standard (manually annotated) corpus for 
Polish, so it would be advisable to include such support. 

I would still contribute at least the sentence detection and tokenizer 
binaries, because they were created using a huge plain-text data set that is 
free to use and requires no pre-processing. 


was (Author: kris.chris):
Thanks for your response. 

Could you point me to the important packages or interfaces that would have to 
be implemented in order for the added support to fit well into the general 
OpenNLP design? 

My current idea is a dedicated Parser class under opennlp.tools.lang.polish 
for the corpus's native format. This would mean implementing a corpus loader 
class designed to work with this particular corpus. Would it be okay to add a 
dedicated corpus loader (with searching, indexing, etc.) for the Polish 
National Corpus? It is the gold standard (manually annotated) corpus for 
Polish, so it would be advisable to include such support. 

I would still contribute at least the sentence detection and tokenizer, 
because they were created using a huge plain-text data set that is free to use 
and requires no pre-processing. 

> Polish language support - Maxent binaries
> -----------------------------------------
>
>                 Key: OPENNLP-701
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-701
>             Project: OpenNLP
>          Issue Type: New Feature
>            Reporter: Chris Krol / IBM
>            Priority: Minor
>
> Hi, 
> I currently work at IBM Poland, and my manager has approved the idea of 
> contributing various Maxent binaries for the Polish language (sentence split, 
> sentence detection, POS tagging and morphological analysis, NER). 
> You could possibly put them on your download page. 
> We trained them using the gold-standard, human-annotated Polish National 
> Corpus (GPL 3.0). 
> Would it also be possible to give some credit for the fact that the work 
> was done at IBM?
> I've already sent a mail to the devs, but haven't seen any response for two 
> weeks now. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)
