[ 
https://issues.apache.org/jira/browse/OPENNLP-715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14353594#comment-14353594
 ] 

Joern Kottmann commented on OPENNLP-715:
----------------------------------------

+1 to the suggested refactorings. Using the resource name as a prefix could 
consume quite a bit of memory, since the name is repeated for every entry in 
the dictionary. Is there some other common format for word clusterings besides 
the one w2v uses?
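
To illustrate the memory concern, here is a minimal sketch of the alternative: 
keep the prefix once per resource and prepend it only when a feature is 
emitted, so the stored entries stay unprefixed. This is not OpenNLP code. The 
class and method names (PrefixedClusterFeatures, ClusterLexicon, featuresFor, 
demo), the "prefix=clusterId" feature layout, and the sample entries are all 
assumptions made up for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PrefixedClusterFeatures {

    /** Hypothetical token -> cluster id table; entries carry no prefix. */
    public static final class ClusterLexicon {
        private final String prefix; // stored once per resource, not per entry
        private final Map<String, String> tokenToCluster = new HashMap<>();

        public ClusterLexicon(String prefix) {
            this.prefix = prefix;
        }

        public void put(String token, String clusterId) {
            tokenToCluster.put(token, clusterId);
        }

        /** Appends "prefix=clusterId" to features if the token is known. */
        public void createFeatures(List<String> features, String token) {
            String clusterId = tokenToCluster.get(token);
            if (clusterId != null) {
                // The prefixed string is built only at lookup time.
                features.add(prefix + "=" + clusterId);
            }
        }
    }

    /** Collects the features of one token from several cluster resources. */
    public static List<String> featuresFor(String token, ClusterLexicon... lexicons) {
        List<String> features = new ArrayList<>();
        for (ClusterLexicon lexicon : lexicons) {
            lexicon.createFeatures(features, token);
        }
        return features;
    }

    /** Small demo with made-up entries for two differently prefixed resources. */
    public static List<String> demo() {
        ClusterLexicon clark = new ClusterLexicon("clark");
        clark.put("Madrid", "42");
        ClusterLexicon w2v = new ClusterLexicon("w2v");
        w2v.put("Madrid", "7");
        return featuresFor("Madrid", clark, w2v);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

With this layout the two resources still produce distinct features for the 
same token, while each dictionary entry costs only the token and its cluster id.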

It would be nice to get the release out, but once it is released we can no 
longer make your proposed changes in that manner.
If it doesn't take too long to do those changes it would be great to get them 
in.

> Clark clusters NameFinder features
> ----------------------------------
>
>                 Key: OPENNLP-715
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-715
>             Project: OpenNLP
>          Issue Type: New Feature
>          Components: Name Finder
>    Affects Versions: 1.6.0
>            Reporter: Rodrigo Agerri
>            Assignee: Rodrigo Agerri
>            Priority: Minor
>             Fix For: 1.6.0
>
>
> Add token-based features from Clark clusters (Clark 2003). This feature is 
> actually the same as the one implemented in the WordClusterFeatureGenerator, 
> but we should somehow keep them separate (perhaps by implementing a dynamic 
> prefix id for each one, as in the dictionary features), as it has been shown 
> that the combination of these clustering-based features improves results. 
> Clark clusters can be generated using this tool: 
> https://github.com/ninjin/clark_pos_induction



