2011/6/22 Jörn Kottmann <[email protected]>:
> On 6/11/11 5:06 PM, Olivier Grisel wrote:
>>
>> 2011/6/11 Grant Ingersoll<[email protected]>:
>>>
>>> I signed off on the BR, but a couple of questions:
>>>
>>> What do we need to do on the IP front? Is that really a blocker for
>>> graduation?
>>>
>>> Also, I don't think the regression tests are a blocker for graduation.
>>>
>>> I did add that we need to find some more candidates for
>>> contributions/committership, which I do think is a blocker.
>>
>> I am willing to be a new candidate for committership if the opennlp
>> devs judge that the corpus-refiner tooling introduced in the other
>> thread would fit somewhere in the project (probably as a new
>> maven artifact).
>>
>> I assume that Hannes might be interested as well.
>
> Nice, let's open a new thread and speak a bit more about this contribution,
> is it something new you want to work on, or do you speak about contributing
> an existing code base?

There is indeed a proof of concept here:

https://github.com/ogrisel/bbuzz-semantic-hackathon/tree/master/corpus-refiner/

Currently there is only a basic command line interface. I plan to work
on a Swing version too, and Hannes has started to work on an HTML /
JavaScript frontend.

I think the existing corpus refiner needs to be able to store the
validations / corrections in a separate file or database (e.g. Derby),
plus another tool that takes an OpenNLP formatted corpus and a
validation DB and generates a new version of the corpus file.

I could also contribute my Pig scripts and UDFs from pignlproc [1], but
I feel that Spark [2] will soon be mature enough to rewrite them in
Scala. Spark just lacks an efficient JOIN/COGROUP operation to be able
to do so, but that will probably be addressed soon [3]. So I suggest
that we wait before considering a contribution of the pignlproc code
base to opennlp.

[1] https://github.com/ogrisel/pignlproc
[2] http://www.spark-project.org/
[3] https://github.com/mesos/spark/issues/4

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
