Re: Rule based sentence detector

2021-01-21 Thread William Colen
Hi Alan,

Do you have a PR for the implementation?

Thank you,
William

On Tue, Jan 19, 2021 at 23:52, Alan Wang wrote:

> Hi all,
>
> I created a rule-based sentence detector for OpenNLP.
> There are two kinds of rules:
>
> 1. break rules: specifying the sentence break
> 2. no-break rules: disallowing the sentence break
>
> All rules have two parts:
>
> Before the break
> After the break
>
> The algorithm idea:
>
> Retrieve the candidate break positions using the break rules.
> If none of the no-break rules matches at a break position, the text
> is split there and a new segment is created.
>
> Features:
>
> Text Cleanup and Preprocessing
> Easy to extend to other languages
>
> Reference:
>
> This library uses the "Golden Rule" tests of pragmatic_segmenter
> 
>
> Currently, the pass rate of test cases is 92.31%. The following test cases
> fail: 39, 50, 53, 52
> For details, see the attachment.
>
>
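
The break / no-break idea described in the quoted message can be sketched
roughly as follows. This is only an illustration in Python, not Alan's
implementation: the Rule class, the example rules, and split_sentences are
all hypothetical names, and the patterns are made up for the example.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Rule:
    """A rule has two parts: a pattern before the break and one after it."""
    before: str  # regex that must match the text ending at the position
    after: str   # regex that must match the text starting at the position

    def matches(self, text: str, pos: int) -> bool:
        ends_with = re.search(rf"(?:{self.before})\Z", text[:pos])
        starts_with = re.match(self.after, text[pos:])
        return ends_with is not None and starts_with is not None


# Hypothetical example rules, not taken from the actual implementation.
BREAK_RULES = [Rule(before=r"[.!?]", after=r"\s+\S")]
NO_BREAK_RULES = [
    Rule(before=r"\b(?:Mr|Mrs|Dr|Prof)\.", after=r"\s+"),  # titles
    Rule(before=r"\b[A-Z]\.", after=r"\s+[A-Z]"),          # initials
]


def split_sentences(text: str) -> List[str]:
    """Split at every break-rule position that no no-break rule forbids."""
    segments, start = [], 0
    for pos in range(1, len(text)):
        is_break = any(r.matches(text, pos) for r in BREAK_RULES)
        blocked = any(r.matches(text, pos) for r in NO_BREAK_RULES)
        if is_break and not blocked:
            segments.append(text[start:pos].strip())
            start = pos
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments
```

Checking every character position keeps the sketch short; a real
implementation would probably locate candidate positions with the break
rules first and only then consult the no-break rules.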


Re: OpenNLP UD models

2021-01-18 Thread William Colen
Hello Jeff! Nice work!!

Did you store the evaluation results somewhere?

Does UD have Named Entity annotation? Do you have any reference to share?

Why did you select only these languages? Any restrictions?

Thank you
William

On Sun, Jan 17, 2021 at 21:15, Jeff Zemerick wrote:

> Thanks, Bruno.
>
> If there aren't any major concerns I will kick off a VOTE thread for
> releasing these models.
>
> The overall plan is to:
>
> 1. Release these models by making them available for download on the
> website.
> 2. Submit the pull request to enable automatic downloading for the
> tokenizer, sentence, and POS tagger models.
> 3. Update user's guide and release new version.
> 4. Get NameFinder models trained and available.
> 5. Establish a more automated and documented process for training the
> models.
>
> Always open to suggestions and comments! Otherwise watch for a VOTE
> thread over the next few days.
>
> Thanks,
> Jeff
>
>
> On Wed, Jan 6, 2021 at 7:24 PM Bruno P. Kinoshita 
> wrote:
>
> >  Hi Jeff,
> >
> > Cannot comment much on the process or direction, except that it looks
> > good to me.
> >
> > >While decent performance is always beneficial, the primary purpose
> > of this task is to provide working OpenNLP models the project can
> > distribute. Having these models will help reduce the barrier to entry for
> > users new to OpenNLP.
> >
> > +1! Had a read on the UD page, and looks well maintained, and even
> > includes a pt-br dataset.
> >
> > Thanks!
> > Bruno
> >
> >
> >
> >
> >
> > On Wednesday, 6 January 2021, 11:31:32 am NZDT, Jeff Zemerick <
> > jzemer...@apache.org> wrote:
> >
> >  Hi all,
> >
> > I have created a script [1] to train OpenNLP models from Universal
> > Dependencies [2] data to give OpenNLP models that can be distributed
> > under the Apache license.
> >
> > The script automates the training of tokenizer, sentence, and POS models
> > for English, Dutch, French, German, and Italian. (The NameFinder does not
> > currently support the input annotation format so those models will come
> > later.) While decent performance is always beneficial, the primary
> > purpose of this task is to provide working OpenNLP models the project can
> > distribute. Having these models will help reduce the barrier to entry for
> > users new to OpenNLP.
> >
> > Once voted and approved, the trained models will be pushed to Subversion
> > alongside the current OpenNLP language detection model. From there, the
> > models can be made available for download on the OpenNLP website and
> > programmatically through OPENNLP-1318 [3]. The script to train the models
> > and instructions will be added to the OpenNLP repository.
> >
> > To use the script:
> >
> > 1. Download and extract UD.
> > 2. Download and extract OpenNLP.
> > 3. Create a directory to store the trained models.
> > 4. Modify the ud-train.sh script to set the paths to those three
> > directories.
> > 5. Execute the ud-train.sh script.
> >
> > The training log, evaluation output, and model files will be saved to the
> > $OUTPUT_MODELS directory. Models and the output files I trained using
> > this script can be viewed on Dropbox [4].
> >
> > Before calling a vote to release the models, I would like to see if there
> > is any feedback on the process or direction. If you have any comments,
> > please feel free to share them.
> >
> > Thanks,
> > Jeff
> >
> > [1]
> > https://github.com/jzonthemtn/opennlp/blob/ud-models/scripts/ud-train.sh
> > [2] https://universaldependencies.org/
> > [3] https://issues.apache.org/jira/browse/OPENNLP-1318
> > [4] https://www.dropbox.com/sh/p8focuz0qwvw84b/AAC6GqO8mqZn_xkAqHZsVAsoa?dl=0
> >
>


Re: [VOTE] Apache OpenNLP 1.9.2 Release Candidate

2019-12-23 Thread William Colen
+1 binding

Tried it with DKPro and other projects.

William Colen

On Mon, Dec 23, 2019 at 12:04, Suneel Marthi wrote:

> +1 binding
>
>
> On Mon, Dec 23, 2019 at 9:28 AM Tommaso Teofili wrote:
>
> > +1 (binding)
> >
> > tag build succeeds (jdk 8), signatures ok.
> >
> > Regards,
> > Tommaso
> >
> > On Mon, 23 Dec 2019 at 13:32, Jeff Zemerick wrote:
> >
> > > +1 binding
> > >
> > > verified signatures
> > > built and tested from opennlp-1.9.2 tag using openjdk 8
> > >
> > > On Fri, Dec 20, 2019 at 11:07 AM Jeff Zemerick 
> > > wrote:
> > >
> > > > Hi folks,
> > > >
> > > > > I have posted a 1st release candidate for the Apache OpenNLP 1.9.2
> > > > > release and it is ready for testing.
> > > > >
> > > > > The distributables can be downloaded from:
> > > > >
> > > > > https://repository.apache.org/content/repositories/orgapacheopennlp-1026/org/apache/opennlp/opennlp-distr/1.9.2/
> > > > >
> > > > > The release was made from the Apache OpenNLP 1.9.2 tag at:
> > > > > https://github.com/apache/opennlp/tree/opennlp-1.9.2
> > > > >
> > > > > To use it in a maven build set the version for opennlp-tools or
> > > > > opennlp-uima to 1.9.2 and add the following URL to your settings.xml
> > > > > file:
> > > > >
> > > > > https://repository.apache.org/content/repositories/orgapacheopennlp-1026
> > > > >
> > > > > The release was made using the OpenNLP release process, documented on
> > > > > the website:
> > > > > https://opennlp.apache.org/release.html
> > > > >
> > > > > Please vote on releasing these packages as Apache OpenNLP 1.9.2. The
> > > > > vote is open for at least the next 72 hours.
> > > > >
> > > > > Only votes from OpenNLP PMC are binding, but everyone is welcome to
> > > > > check the release candidate and vote.
> > > > The vote passes if at least three binding +1 votes are cast.
> > > >
> > > > [ ] +1 Release the packages as Apache OpenNLP 1.9.2
> > > > [ ] -1 Do not release the packages because...
> > > >
> > > > Thanks!
> > > >
> > > > Jeff
> > > >
> > >
> >
>


Re: [VOTE] Apache OpenNLP 1.9.0 Release Candidate 2

2018-07-02 Thread William Colen
+1

William Colen

On Mon, Jul 2, 2018 at 08:07, Jeff Zemerick wrote:

> +1
>
> Jeff
>
>
> On Mon, Jul 2, 2018 at 5:22 AM Tommaso Teofili 
> wrote:
>
> > +1
> > On Mon, Jul 2, 2018 at 10:34, Rodrigo Agerri wrote:
> >
> > > +1
> > >
> > > Rodrigo
> > >
> > > On Sun, Jul 1, 2018 at 12:42 AM, Koji Sekiguchi
> > >  wrote:
> > > > I tested mvn install and some Eval tests (OntoNotes4NameFinderEval,
> > > > Conll02NameFinderEval, OntoNotes4PosTaggerEval) which use
> > > > FeatureGeneratorUtil.
> > > >
> > > > +1
> > > >
> > > > Koji
> > > >
> > > >
> > > >
> > > > On 2018/06/29 20:45, Jeff Zemerick wrote:
> > > >>
> > > >> Hi folks,
> > > >>
> > > > >> I have posted a 2nd release candidate for the Apache OpenNLP 1.9.0
> > > > >> release and it is ready for testing.
> > > > >>
> > > > >> The distributables can be downloaded from:
> > > > >>
> > > > >> https://repository.apache.org/content/repositories/orgapacheopennlp-1022/org/apache/opennlp/opennlp-distr/1.9.0/
> > > > >>
> > > > >> The release was made from the Apache OpenNLP 1.9.0 RC2 tag at:
> > > > >> https://github.com/apache/opennlp/tree/opennlp-1.9.0-rc2
> > > > >>
> > > > >> To use it in a maven build set the version for opennlp-tools or
> > > > >> opennlp-uima to 1.9.0 and add the following URL to your settings.xml
> > > > >> file:
> > > > >>
> > > > >> https://repository.apache.org/content/repositories/orgapacheopennlp-1022
> > > > >>
> > > > >> The release was made using the OpenNLP release process, documented
> > > > >> on the website:
> > > > >> https://opennlp.apache.org/release.html
> > > > >>
> > > > >> Please vote on releasing these packages as Apache OpenNLP 1.9.0. The
> > > > >> vote is open for at least the next 72 hours.
> > > > >>
> > > > >> Only votes from OpenNLP PMC are binding, but everyone is welcome to
> > > > >> check the release candidate and vote.
> > > >> The vote passes if at least three binding +1 votes are cast.
> > > >>
> > > >> [ ] +1 Release the packages as Apache OpenNLP 1.9.0
> > > >> [ ] -1 Do not release the packages because...
> > > >>
> > > >> Thanks!
> > > >> Jeff
> > > >>
> > > >
> > >
> >
>


Re: [VOTE] Apache OpenNLP 1.8.4 Release Candidate

2017-12-22 Thread William Colen
+1 binding

Executed eval-tests suite.

--
William

2017-12-22 6:05 GMT-02:00 Rodrigo Agerri :

> +1 binding
>
> R
>
> On Fri, Dec 22, 2017 at 2:02 AM, Koji Sekiguchi
>  wrote:
> > +1
> >
> > I checked files included in the -src package, built successfully, etc.
> >
> > Koji
> >
> >
> > On 2017/12/21 23:44, Jeff Zemerick wrote:
> >>
> >> Hi Folks,
> >>
> >> I have posted a first release candidate for the Apache OpenNLP 1.8.4
> >> release and it is ready for testing.
> >>
> >> The RC1 distributables can be downloaded from here:
> >>
> >> https://repository.apache.org/content/repositories/orgapacheopennlp-1020/org/apache/opennlp/opennlp-distr/1.8.4
> >>
> >> The release was made from the Apache OpenNLP 1.8.4 tag at
> >> https://github.com/apache/opennlp/tree/opennlp-1.8.4
> >>
> >> To use it in a maven build set the version for opennlp-tools or
> >> opennlp-uima to 1.8.4 and add the following URL to your settings.xml
> >> file:
> >> https://repository.apache.org/content/repositories/orgapacheopennlp-1020
> >>
> >> The release was made using the OpenNLP release process, documented on
> >> the Wiki here:
> >> https://cwiki.apache.org/confluence/display/OPENNLP/Release+Process
> >>
> >> The release contains quite some changes, please refer to the contained
> >> issue list for details.
> >>
> >> Please vote on releasing these packages as Apache OpenNLP 1.8.4. The
> >> vote is open for at least the next 72 hours.
> >>
> >> Only votes from OpenNLP PMC are binding, but folks are welcome to check
> >> the release candidate and voice their approval or disapproval. The vote
> >> passes if at least three binding +1 votes are cast.
> >>
> >> [ ] +1 Release the packages as Apache OpenNLP 
> >> [ ] -1 Do not release the packages because...
> >>
> >> Thanks!
> >> Jeff Zemerick
> >>
> >
>


Re: [VOTE] Language Detector model for Apache OpenNLP 1.8.3 Release Candidate 3

2017-11-02 Thread William Colen
Thanks for the Votes - we are past the 72 hrs and this vote is now closed.

Results:

7 +1 binding

This vote now passes, will send notice out once the release is finalized.
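
For reference, the passing rule stated in the call for votes ("the vote
passes if at least three binding +1 votes are cast") can be written down as
a small check. This is only a sketch in Python; vote_passes and its
(value, is_binding) input format are hypothetical, and it encodes just that
one sentence, ignoring -1 votes and the 72 hour window.

```python
from typing import Iterable, Tuple


def vote_passes(votes: Iterable[Tuple[int, bool]], min_binding: int = 3) -> bool:
    """Return True if at least `min_binding` binding +1 votes were cast.

    Each vote is a (value, is_binding) pair, where value is +1 or -1.
    Assumption: only the rule quoted in the vote e-mail is modelled.
    """
    binding_plus_ones = sum(1 for value, is_binding in votes
                            if is_binding and value == 1)
    return binding_plus_ones >= min_binding
```

With the result reported above (7 binding +1 votes), vote_passes returns
True.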

2017-11-02 0:06 GMT-02:00 <william.co...@gmail.com>:

> +1 binding
>
> 2017-10-31 10:37 GMT-02:00 Tommaso Teofili <tommaso.teof...@gmail.com>:
>
>> +1 (binding)
>>
>> Tommaso
>>
>> On Tue, Oct 31, 2017 at 10:45, Suneel Marthi <
>> suneel.mar...@gmail.com> wrote:
>>
>> > +1 binding
>> >
>> > Sent from my iPhone
>> >
>> > > On Oct 31, 2017, at 3:04 PM, Rodrigo Agerri <rodrigo.age...@ehu.eus>
>> > wrote:
>> > >
>> > > +1
>> > >
>> > > Rodrigo
>> > >
>> > > On Tue, Oct 31, 2017 at 2:37 AM, Koji Sekiguchi
>> > > <koji.sekigu...@rondhuit.com> wrote:
>> > >> +1
>> > >>
>> > >> - checked text files in the zipped model file
>> > >> - verified signatures
>> > >> - executed LanguageDetector using the model file
>> > >>
>> > >> Koji
>> > >>
>> > >>
>> > >>> On 2017/10/30 22:30, William Colen wrote:
>> > >>>
>> > >>> The Apache OpenNLP PMC would like to call for a Vote on the Language
>> > >>> Detector model for Apache OpenNLP 1.8.3 Release Candidate 3.
>> > >>>
>> > >>> The Release artifacts can be downloaded from:
>> > >>>
>> > >>> http://people.apache.org/~colen/models/langdetect-183/rc3/
>> > >>>
>> > >>> The model was built with Apache OpenNLP 1.8.3 release, trained with a
>> > >>> portion of the Leipzig corpus, which can be found under this tag:
>> > >>>
>> > >>> https://svn.apache.org/repos/bigdata/opennlp/tags/langdetect-183_RC3
>> > >>>
>> > >>> The model binary includes the NOTICE, LICENSE and also a README with
>> > >>> details of supported languages, how the Leipzig corpus was created and
>> > >>> the model was trained. For your convenience the README is available here:
>> > >>>
>> > >>> https://svn.apache.org/repos/bigdata/opennlp/tags/langdetect-183_RC3/leipzig/resources/README.txt
>> > >>>
>> > >>> A detailed evaluation report is available here:
>> > >>>
>> > >>> http://people.apache.org/~colen/models/langdetect-183/rc3/langdetect-183.bin.report.txt
>> > >>>
>> > >>> To use Language Detector, please follow the documentation here:
>> > >>>
>> > >>> http://opennlp.apache.org/docs/1.8.3/manual/opennlp.html#tools.langdetect
>> > >>>
>> > >>> It is important to note that this model is trained for and works well
>> > >>> with longer texts that have at least 2 sentences or more from the same
>> > >>> language.
>> > >>>
>> > >>> The artifacts have been signed with the Key - 524A9649 found at
>> > >>>
>> > >>> http://people.apache.org/keys/group/opennlp.asc
>> > >>>
>> > >>> Please vote on releasing the model as Apache OpenNLP Language Detector
>> > >>> Model 1.8.3. The vote is open for either the next 72 hours or a
>> > >>> minimum of 3 +1 PMC binding votes, whichever happens earlier.
>> > >>>
>> > >>> Only votes from OpenNLP PMC are binding, but folks are welcome to check
>> > >>> the release candidate and voice their approval or disapproval. The vote
>> > >>> passes if at least three binding +1 votes are cast.
>> > >>>
>> > >>> [ ] +1 Release the packages as Apache OpenNLP Language Detector Model
>> > >>> 1.8.3
>> > >>>
>> > >>> [ ] -1 Do not release the packages because...
>> > >>>
>> > >>> Thanks again to all the committers and contributors for their work over
>> > >>> the past few weeks.
>> > >>>
>> > >>
>> >
>>
>
>


Re: [VOTE] Language Detector model for Apache OpenNLP 1.8.3 Release Candidate 3

2017-11-01 Thread William Colen
+1 binding

2017-10-31 10:37 GMT-02:00 Tommaso Teofili <tommaso.teof...@gmail.com>:

> +1 (binding)
>
> Tommaso
>
> On Tue, Oct 31, 2017 at 10:45, Suneel Marthi <
> suneel.mar...@gmail.com> wrote:
>
> > +1 binding
> >
> > Sent from my iPhone
> >
> > > On Oct 31, 2017, at 3:04 PM, Rodrigo Agerri <rodrigo.age...@ehu.eus>
> > wrote:
> > >
> > > +1
> > >
> > > Rodrigo
> > >
> > > On Tue, Oct 31, 2017 at 2:37 AM, Koji Sekiguchi
> > > <koji.sekigu...@rondhuit.com> wrote:
> > >> +1
> > >>
> > >> - checked text files in the zipped model file
> > >> - verified signatures
> > >> - executed LanguageDetector using the model file
> > >>
> > >> Koji
> > >>
> > >>
> > >>> On 2017/10/30 22:30, William Colen wrote:
> > >>>
> > >>> The Apache OpenNLP PMC would like to call for a Vote on the Language
> > >>> Detector model for Apache OpenNLP 1.8.3 Release Candidate 3.
> > >>>
> > >>> The Release artifacts can be downloaded from:
> > >>>
> > >>> http://people.apache.org/~colen/models/langdetect-183/rc3/
> > >>>
> > > >>> The model was built with Apache OpenNLP 1.8.3 release, trained with a
> > > >>> portion of the Leipzig corpus, which can be found under this tag:
> > > >>>
> > > >>> https://svn.apache.org/repos/bigdata/opennlp/tags/langdetect-183_RC3
> > > >>>
> > > >>> The model binary includes the NOTICE, LICENSE and also a README with
> > > >>> details of supported languages, how the Leipzig corpus was created and
> > > >>> the model was trained. For your convenience the README is available here:
> > > >>>
> > > >>> https://svn.apache.org/repos/bigdata/opennlp/tags/langdetect-183_RC3/leipzig/resources/README.txt
> > > >>>
> > > >>> A detailed evaluation report is available here:
> > > >>>
> > > >>> http://people.apache.org/~colen/models/langdetect-183/rc3/langdetect-183.bin.report.txt
> > > >>>
> > > >>> To use Language Detector, please follow the documentation here:
> > > >>>
> > > >>> http://opennlp.apache.org/docs/1.8.3/manual/opennlp.html#tools.langdetect
> > > >>>
> > > >>> It is important to note that this model is trained for and works well
> > > >>> with longer texts that have at least 2 sentences or more from the same
> > > >>> language.
> > > >>>
> > > >>> The artifacts have been signed with the Key - 524A9649 found at
> > > >>>
> > > >>> http://people.apache.org/keys/group/opennlp.asc
> > > >>>
> > > >>> Please vote on releasing the model as Apache OpenNLP Language Detector
> > > >>> Model 1.8.3. The vote is open for either the next 72 hours or a
> > > >>> minimum of 3 +1 PMC binding votes, whichever happens earlier.
> > > >>>
> > > >>> Only votes from OpenNLP PMC are binding, but folks are welcome to check
> > > >>> the release candidate and voice their approval or disapproval. The vote
> > > >>> passes if at least three binding +1 votes are cast.
> > > >>>
> > > >>> [ ] +1 Release the packages as Apache OpenNLP Language Detector Model
> > > >>> 1.8.3
> > > >>>
> > > >>> [ ] -1 Do not release the packages because...
> > > >>>
> > > >>> Thanks again to all the committers and contributors for their work over
> > > >>> the past few weeks.
> > >>>
> > >>
> >
>


[VOTE] Language Detector model for Apache OpenNLP 1.8.3 Release Candidate 3

2017-10-30 Thread William Colen
The Apache OpenNLP PMC would like to call for a Vote on the Language
Detector model for Apache OpenNLP 1.8.3 Release Candidate 3.

The Release artifacts can be downloaded from:

http://people.apache.org/~colen/models/langdetect-183/rc3/

The model was built with Apache OpenNLP 1.8.3 release, trained with a
portion of the Leipzig corpus, which can be found under this  tag:

https://svn.apache.org/repos/bigdata/opennlp/tags/langdetect-183_RC3

The model binary includes the NOTICE, LICENSE and also a README with
details of supported languages, how the Leipzig corpus was created and the
model was trained. For your convenience the README is available here:

https://svn.apache.org/repos/bigdata/opennlp/tags/langdetect-183_RC3/leipzig/resources/README.txt

A detailed evaluation report is available here:

http://people.apache.org/~colen/models/langdetect-183/rc3/langdetect-183.bin.report.txt

To use Language Detector, please follow the documentation here:

http://opennlp.apache.org/docs/1.8.3/manual/opennlp.html#tools.langdetect

It is important to note that this model is trained for and works well with
longer texts that have at least 2 sentences or more from the same language.

The artifacts have been signed with the Key - 524A9649 found at

http://people.apache.org/keys/group/opennlp.asc

Please vote on releasing the model as Apache OpenNLP Language Detector
Model 1.8.3. The vote is open for either the next 72 hours or a minimum of
3 +1 PMC binding votes, whichever happens earlier.

Only votes from OpenNLP PMC are binding, but folks are welcome to check the
release candidate and voice their approval or disapproval. The vote passes
if at least three binding +1 votes are cast.

[ ] +1 Release the packages as Apache OpenNLP Language Detector Model 1.8.3

[ ] -1 Do not release the packages because...

Thanks again to all the committers and contributors for their work over the
past few weeks.


Re: [VOTE] Language Detector model for Apache OpenNLP 1.8.3 Release Candidate 2

2017-10-30 Thread William Colen
Thank you, Koji.

Let's fix it and start another RC.

2017-10-30 6:36 GMT-02:00 Koji Sekiguchi <koji.sekigu...@rondhuit.com>:

> Hi,
>
> When I unzip langdetect-183.bin to read text files in it, the README.txt
> says that its version is 1.8.2 in this line:
>
> This is the release 1 of the Apache OpenNLP Language Detector model
> version 1.8.2.
>
> I'm not sure but shouldn't it be 1.8.3? I'm not sure because I don't
> understand very well this part "... the release *1* of the Apache OpenNLP
> ..." in the above line. If it is still 1.8.2, here's my +1 (verifying
> signatures, running LanguageDetector under OpenNLP 1.8.3, etc.)
>
> Thanks!
>
> Koji
>
>
>
> On 2017/10/29 11:48, William Colen wrote:
>
>> The Apache OpenNLP PMC would like to call for a Vote on the Language
>> Detector model for Apache OpenNLP 1.8.3 Release Candidate 2.
>>
>> The Release artifacts can be downloaded from:
>>
>> http://people.apache.org/~colen/models/langdetect-183/rc2/
>>
>> The model was built with Apache OpenNLP 1.8.3 release, trained with a
>> portion of the Leipzig corpus, which can be found under this  tag:
>>
>> https://svn.apache.org/repos/bigdata/opennlp/tags/langdetect-183_RC2
>>
>> The model binary includes the NOTICE, LICENSE and also a README with
>> details of supported languages, how the Leipzig corpus was created and the
>> model was trained. For your convenience the README is available here:
>>
>> https://svn.apache.org/repos/bigdata/opennlp/tags/langdetect-183_RC2/leipzig/resources/README.txt
>>
>> A detailed evaluation report is available here:
>>
>> http://people.apache.org/~colen/models/langdetect-183/rc2/langdetect-183.bin.report.txt
>>
>> To use Language Detector, please follow the documentation here:
>>
>> http://opennlp.apache.org/docs/1.8.3/manual/opennlp.html#tools.langdetect
>>
>> It is important to note that this model is trained for and works well with
>> longer texts that have at least 2 sentences or more from the same
>> language.
>>
>> The artifacts have been signed with the Key - 524A9649 found at
>>
>> http://people.apache.org/keys/group/opennlp.asc
>>
>> Please vote on releasing the model as Apache OpenNLP Language Detector
>> Model 1.8.3. The vote is open for either the next 72 hours or a minimum of
>> 3 +1 PMC binding votes, whichever happens earlier.
>>
>> Only votes from OpenNLP PMC are binding, but folks are welcome to check
>> the release candidate and voice their approval or disapproval. The vote
>> passes if at least three binding +1 votes are cast.
>>
>> [ ] +1 Release the packages as Apache OpenNLP Language Detector Model
>> 1.8.3
>>
>> [ ] -1 Do not release the packages because...
>>
>> Thanks again to all the committers and contributors for their work over
>> the
>> past few weeks.
>>
>>


[VOTE] Language Detector model for Apache OpenNLP 1.8.3 Release Candidate 2

2017-10-28 Thread William Colen
The Apache OpenNLP PMC would like to call for a Vote on the Language
Detector model for Apache OpenNLP 1.8.3 Release Candidate 2.

The Release artifacts can be downloaded from:

http://people.apache.org/~colen/models/langdetect-183/rc2/

The model was built with Apache OpenNLP 1.8.3 release, trained with a
portion of the Leipzig corpus, which can be found under this  tag:

https://svn.apache.org/repos/bigdata/opennlp/tags/langdetect-183_RC2

The model binary includes the NOTICE, LICENSE and also a README with
details of supported languages, how the Leipzig corpus was created and the
model was trained. For your convenience the README is available here:

https://svn.apache.org/repos/bigdata/opennlp/tags/langdetect-183_RC2/leipzig/resources/README.txt

A detailed evaluation report is available here:

http://people.apache.org/~colen/models/langdetect-183/rc2/langdetect-183.bin.report.txt

To use Language Detector, please follow the documentation here:

http://opennlp.apache.org/docs/1.8.3/manual/opennlp.html#tools.langdetect

It is important to note that this model is trained for and works well with
longer texts that have at least 2 sentences or more from the same language.

The artifacts have been signed with the Key - 524A9649 found at

http://people.apache.org/keys/group/opennlp.asc

Please vote on releasing the model as Apache OpenNLP Language Detector
Model 1.8.3. The vote is open for either the next 72 hours or a minimum of
3 +1 PMC binding votes, whichever happens earlier.

Only votes from OpenNLP PMC are binding, but folks are welcome to check the
release candidate and voice their approval or disapproval. The vote passes
if at least three binding +1 votes are cast.

[ ] +1 Release the packages as Apache OpenNLP Language Detector Model 1.8.3

[ ] -1 Do not release the packages because...

Thanks again to all the committers and contributors for their work over the
past few weeks.


Re: [VOTE] Language Detector model for Apache OpenNLP 1.8.3 Release Candidate

2017-10-28 Thread William Colen
The RC1 vote is cancelled due to an issue in the model.
It included an unnecessary 3 MB report. The report will be removed from
the model and made available as an optional download.

2017-10-27 22:13 GMT-02:00 William Colen <co...@apache.org>:

> The Apache OpenNLP PMC would like to call for a Vote on the Language
> Detector model for Apache OpenNLP 1.8.3 Release Candidate.
>
> The Release artifacts can be downloaded from:
>
> http://people.apache.org/~colen/models/langdetect-183/rc1/
>
> The model was built with Apache OpenNLP 1.8.3 release, trained with a
> portion of the Leipzig corpus, which can be found under this  tag:
>
> https://svn.apache.org/repos/bigdata/opennlp/tags/langdetect-183_RC1
>
> The model binary includes the NOTICE, LICENSE and also a README with
> details of supported languages, how the Leipzig corpus was created and the
> model was trained. For your convenience the README is available here:
>
> https://svn.apache.org/repos/bigdata/opennlp/tags/langdetect-183_RC1/leipzig/resources/README.txt
>
> A detailed evaluation report is available here:
>
> http://people.apache.org/~colen/models/langdetect-183/rc1/langdetect-183.bin.report.txt
>
> To use Language Detector, please follow the documentation here:
>
> http://opennlp.apache.org/docs/1.8.3/manual/opennlp.html#tools.langdetect
>
> It is important to note that this model is trained for and works well with
> longer texts that have at least 2 sentences or more from the same language.
>
> The artifacts have been signed with the Key - 524A9649 found at
>
> http://people.apache.org/keys/group/opennlp.asc
>
> Please vote on releasing the model as Apache OpenNLP Language Detector
> Model 1.8.3. The vote is open for either the next 72 hours or a minimum of
> 3 +1 PMC binding votes, whichever happens earlier.
>
> Only votes from OpenNLP PMC are binding, but folks are welcome to check
> the release candidate and voice their approval or disapproval. The vote
> passes if at least three binding +1 votes are cast.
>
> [ ] +1 Release the packages as Apache OpenNLP Language Detector Model 1.8.3
>
> [ ] -1 Do not release the packages because...
>
> Thanks again to all the committers and contributors for their work over
> the past few weeks.
>
>


[VOTE] Language Detector model for Apache OpenNLP 1.8.3 Release Candidate

2017-10-27 Thread William Colen
The Apache OpenNLP PMC would like to call for a Vote on the Language
Detector model for Apache OpenNLP 1.8.3 Release Candidate.

The Release artifacts can be downloaded from:

http://people.apache.org/~colen/models/langdetect-183/rc1/

The model was built with Apache OpenNLP 1.8.3 release, trained with a
portion of the Leipzig corpus, which can be found under this  tag:

https://svn.apache.org/repos/bigdata/opennlp/tags/langdetect-183_RC1

The model binary includes the NOTICE, LICENSE and also a README with
details of supported languages, how the Leipzig corpus was created and the
model was trained. For your convenience the README is available here:

https://svn.apache.org/repos/bigdata/opennlp/tags/langdetect-183_RC1/leipzig/resources/README.txt

A detailed evaluation report is available here:

http://people.apache.org/~colen/models/langdetect-183/rc1/langdetect-183.bin.report.txt

To use Language Detector, please follow the documentation here:

http://opennlp.apache.org/docs/1.8.3/manual/opennlp.html#tools.langdetect

It is important to note that this model is trained for and works well with
longer texts that have at least 2 sentences or more from the same language.

The artifacts have been signed with the Key - 524A9649 found at

http://people.apache.org/keys/group/opennlp.asc

Please vote on releasing the model as Apache OpenNLP Language Detector
Model 1.8.3. The vote is open for either the next 72 hours or a minimum of
3 +1 PMC binding votes, whichever happens earlier.

Only votes from OpenNLP PMC are binding, but folks are welcome to check the
release candidate and voice their approval or disapproval. The vote passes
if at least three binding +1 votes are cast.

[ ] +1 Release the packages as Apache OpenNLP Language Detector Model 1.8.3

[ ] -1 Do not release the packages because...

Thanks again to all the committers and contributors for their work over the
past few weeks.


Re: [VOTE] Apache OpenNLP 1.8.3 Release Candidate

2017-10-25 Thread William Colen
+1 binding

- eval tests ok
- unit test ok
- build from tag ok
- distribution execution ok
- distribution ok



2017-10-25 14:46 GMT-02:00 Tommaso Teofili :

> +1 (binding)
>
> - source build from tag ok
> - sigs and checks ok
>
> On Wed, Oct 25, 2017 at 18:09, Steve Blackmon <
> sblack...@apache.org> wrote:
>
> >  +1 non-binding
> >
> > - source builds, tests pass
> > - verified checksums and signatures
> >
> > Steve Blackmon
> > sblack...@apache.org
> >
> > On Oct 25, 2017 at 10:17 AM, Dan Russ  wrote:
> >
> >
> > +1 burrito
> >
> > ran unit tests on my downstream code that uses opennlp-tools.
> >
> > On Oct 25, 2017, at 6:58 AM, Suneel Marthi  wrote:
> >
> > +1 binding
> >
> > 1. Verified Sigs and hashes
> > 2. Ran a clean build from {src} * {zip, tar}
> > 3. All unit tests pass
> >
> > On Wed, Oct 25, 2017 at 3:08 PM, Bruno P. Kinoshita <
> > brunodepau...@yahoo.com.br.invalid> wrote:
> >
> > [ X ] +1 Release the packages as Apache OpenNLP 1.8.3
> >
> > `mvn clean test install` working fine, checked artefacts signatures,
> > matching with what was in the vote e-mail.
> >
> > Currently on tag 1.8.3, commit b317159cb9857dc509c08a31a98dc61209f39bff
> >
> > Thanks for preparing this release.
> >
> > Cheers
> > Bruno
> >
> >
> >
> > 
> > From: Suneel Marthi 
> > To: dev@opennlp.apache.org; us...@opennlp.apache.org
> > Sent: Tuesday, 24 October 2017 10:29 PM
> > Subject: [VOTE] Apache OpenNLP 1.8.3 Release Candidate
> >
> >
> >
> > The Apache OpenNLP PMC would like to call for a Vote on Apache OpenNLP
> > 1.8.3 Release Candidate.
> >
> > The Release artifacts can be downloaded from:
> >
> > https://repository.apache.org/content/repositories/orgapacheopennlp-1010/org/apache/opennlp/opennlp-distr/1.7.2/
> >
> > The release was made from the Apache OpenNLP 1.8.3 tag at
> >
> > https://github.com/apache/opennlp/tree/opennlp-1.8.3
> >
> > To use it in a maven build set the version for opennlp-tools or
> > opennlp-uima to 1.8.3 and add the following URL to your settings.xml file:
> >
> > https://repository.apache.org/content/repositories/orgapacheopennlp-1019/org/apache/opennlp/opennlp-distr/1.8.3/
> >
> > The artifacts have been signed with the Key - D3541808 found at
> >
> > http://people.apache.org/keys/group/opennlp.asc
> >
> > Please vote on releasing these packages as Apache OpenNLP 1.8.3. The vote
> > is open for either the next 72 hours or a minimum of 3 +1 PMC binding
> > votes, whichever happens earlier.
> >
> > Only votes from OpenNLP PMC are binding, but folks are welcome to check the
> > release candidate and voice their approval or disapproval. The vote passes
> > if at least three binding +1 votes are cast.
> >
> > [ ] +1 Release the packages as Apache OpenNLP 1.8.3
> >
> > [ ] -1 Do not release the packages because...
> >
> > Thanks again to all the committers and contributors for their work
> > over the past few weeks.
> >
>


Re: [VOTE] Apache OpenNLP 1.8.2 Release Candidate 2

2017-09-13 Thread William Colen
Evaluation tests OK
LD with Leipzig OK

+1 (binding)

2017-09-12 17:52 GMT-03:00 Richard Eckart de Castilho :

> On 11.09.2017, at 09:12, Joern Kottmann  wrote:
> >
> > I have posted a second release candidate for the Apache OpenNLP 1.8.2
> > release and it is ready for testing.
>
> +1 (non-binding)
>
> -- Richard
>


Re: Sentence Detector

2017-08-25 Thread William Colen
The writer made a mistake by not adding a space after the dot. The sentence
detector model will not know how to deal with it, because sentence-splitting
dots are rarely written without a following space.

This is very common on social networks. I apply some regexes to check that it is
not a URL, email or number, then add the missing space.
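
A minimal sketch of that kind of preprocessing, under stated assumptions: the
class name MissingSpaceFixer and all three patterns are made up for this
illustration and are rough heuristics, not the regexes I actually use and not
part of OpenNLP. A period is only treated as "glued" when it sits between a
lowercase letter or digit and an uppercase letter, so plain numbers such as
3.14 are left alone; tokens that look like URLs or emails are skipped.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MissingSpaceFixer {

  // Rough heuristics for tokens that may contain a dot with no space after it.
  private static final Pattern URL = Pattern.compile("(?i)^(https?://|www\\.)\\S+$");
  private static final Pattern EMAIL = Pattern.compile("^\\S+@\\S+\\.\\S+$");

  // A period directly between a lowercase letter or digit and an uppercase letter.
  private static final Pattern GLUED =
      Pattern.compile("(?<=[\\p{Ll}\\p{Nd}])\\.(?=\\p{Lu})");

  // Splits the text into runs of non-whitespace and whitespace so spacing is preserved.
  private static final Pattern TOKEN = Pattern.compile("\\S+|\\s+");

  static String fix(String text) {
    StringBuilder out = new StringBuilder();
    Matcher m = TOKEN.matcher(text);
    while (m.find()) {
      String token = m.group();
      boolean protectedToken =
          URL.matcher(token).matches() || EMAIL.matcher(token).matches();
      // Insert the missing space only in tokens that are not URLs or emails.
      out.append(protectedToken ? token : GLUED.matcher(token).replaceAll(". "));
    }
    return out.toString();
  }

  public static void main(String[] args) {
    System.out.println(fix("on Thursday 25 November 2010.The Centre offers a new focus"));
  }
}
```

The repaired text can then be passed to the sentence detector as usual.
Abbreviations and other edge cases would need extra rules.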



2017-08-25 9:31 GMT-03:00 Manoj B. Narayanan :

> Hi,
>
> The OpenNLP sentence detector detects sentences when the period at the end
> of a sentence and the next word are separated by a space. If there is no
> space in between it doesn't split them. Is there a way that could help me
> solve this?
>
> *Example.1*
>
> It is with great pleasure that I write to invite you to the launch of the
> University of Reading’s Centre for Food Security on Thursday 25 November
> 2010. The Centre offers a new focus for research on the challenges
> of meeting global demands for food in a sustainable way.
>
> *Output1*
> It is with great pleasure that I write to invite you to the launch of the
> University of Reading’s Centre for Food Security on Thursday 25 November
> 2010.
> The Centre offers a new focus for research on the challenges of meeting
> global demands for food in a sustainable way.
>
>
> *Example.2*
>
> It is with great pleasure that I write to invite you to the launch of the
> University of Reading’s Centre for Food Security on Thursday 25 November
> 2010.The Centre offers a new focus for research on the challenges of
> meeting global demands for food in a sustainable way.
>
> *Output2*
> It is with great pleasure that I write to invite you to the launch of the
> University of Reading’s Centre for Food Security on Thursday 25 November
> 2010.The Centre offers a new focus for research on the challenges of
> meeting global demands for food in a sustainable way.
>
> Thanks,
> Manoj.
>


Re: Releasing a Language Detection Model

2017-07-11 Thread William Colen
ant
> >> a sensible
> >> / delivered internal classpath default and the ability for run-time, non
> >> zipped up/messing
> >> with JAR file override. Think about people who are using OpenNLP in both
> >> Java/Python
> >> environments as an example.
> >>
> >> Cheers,
> >> Chris
> >>
> >>
> >>
> >>
> >> On 7/11/17, 3:25 AM, "Joern Kottmann" <kottm...@gmail.com> wrote:
> >>
> >> I would not change the CLI to load models from jar files. I never
> used
> >> or saw a command line tool that expects a file as an input and would
> >> then also load it from inside a jar file. It will be hard to
> >> communicate how that works precisely in the CLI usage texts and this
> >> is not a feature anyone would expect to be there. The intention of
> the
> >> CLI is to give users the ability to quickly test OpenNLP before they
> >> integrate it into their software and to train and evaluate models
> >>
> >> Users who for some reason have a jar file with a model inside can
> just
> >> write "unzip model.jar".
> >>
> >> After all I think this is quite  a bit of complexity we would need
> to
> >> add for it and it will have very limited use.
> >>
> >> The use case of publishing jar files is to make the models easily
> >> available to people who have a build system with dependency
> >> management, they won't have to download models manually, and when
> they
> >> update OpenNLP then can also update the models with a version string
> >> change.
> >>
> >> For the command line "quick start" use case we should offer the
> models
> >> on a download page as we do today. This page could list both, the
> >> download link and the maven dependency.
> >>
> >> Jörn
> >>
> >> On Mon, Jul 10, 2017 at 8:50 PM, William Colen <co...@apache.org>
> >> wrote:
> >> > We need to address things such as sharing the evaluation results
> and
> >> how to
> >> > reproduce the training.
> >> >
> >> > There are several possibilities for that, but there are points to
> >> consider:
> >> >
> >> > Will we store the model itself in a SCM repository or only the
> code
> >> that
> >> > can build it?
> >> > Will we deploy the models to a Maven Central repository? It is
> good
> >> for
> >> > people using the Java API but not for command line interface,
> should
> >> we
> >> > change the CLI to handle models in the classpath?
> >> > Should we keep a copy of the training model or always download
> from
> >> the
> >> > original provider? We can't guarantee that the corpus will be
> there
> >> > forever, not only because it changed license, but simple because
> the
> >> > provider is not keeping the server up anymore.
> >> >
> >> > William
> >> >
> >> >
> >> >
> >> > 2017-07-10 14:52 GMT-03:00 Joern Kottmann <kottm...@gmail.com>:
> >> >
> >> >> Hello all,
> >> >>
> >> >> since Apache OpenNLP 1.8.1 we have a new language detection
> >> component
> >> >> which like all our components has to be trained. I think we
> should
> >> >> release a pre-build model for it trained on the Leipzig corpus.
> This
> >> >> will allow the majority of our users to get started very quickly
> >> with
> >> >> language detection without the need to figure out on how to train
> >> it.
> >> >>
> >> >> How should this project release models?
> >> >>
> >> >> Jörn
> >> >>
> >>
> >>
> >>
> >>
>


Re: Releasing a Language Detection Model

2017-07-10 Thread William Colen
We need to address things such as sharing the evaluation results and how to
reproduce the training.

There are several possibilities for that, but there are points to consider:

Will we store the model itself in a SCM repository or only the code that
can build it?
Will we deploy the models to a Maven Central repository? That is good for
people using the Java API but not for the command line interface. Should we
change the CLI to handle models in the classpath?
Should we keep a copy of the training corpus or always download it from the
original provider? We can't guarantee that the corpus will be there
forever, not only because its license may change, but simply because the
provider may stop keeping the server up.
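
To make the classpath question concrete, here is a small sketch of one
possible lookup order: an explicit file wins if it exists, otherwise the
model is read from a classpath resource (for example one shipped in a jar).
The class name ModelStreams and the method open are hypothetical and use
only the JDK; this is not existing OpenNLP code, and the returned stream
would still have to be handed to the relevant model constructor.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

final class ModelStreams {

  // Opens the override file when one is given and exists,
  // otherwise falls back to a resource on the classpath.
  static InputStream open(Path override, String classpathResource) throws IOException {
    if (override != null && Files.isRegularFile(override)) {
      return Files.newInputStream(override);
    }
    InputStream in =
        ModelStreams.class.getClassLoader().getResourceAsStream(classpathResource);
    if (in == null) {
      throw new IOException("Model not found on classpath: " + classpathResource);
    }
    return in;
  }
}
```

Whether the CLI should do anything like this is exactly the open question;
the sketch only shows that the lookup itself is small.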

William



2017-07-10 14:52 GMT-03:00 Joern Kottmann :

> Hello all,
>
> since Apache OpenNLP 1.8.1 we have a new language detection component
> which, like all our components, has to be trained. I think we should
> release a pre-built model for it trained on the Leipzig corpus. This
> will allow the majority of our users to get started very quickly with
> language detection without the need to figure out how to train it.
>
> How should this project release models?
>
> Jörn
>


Re: [VOTE] Apache OpenNLP 1.8.1 Release Candidate 3

2017-07-07 Thread William Colen
+1 - Tested with multiple other projects. Tested language detector.

2017-07-07 10:52 GMT-03:00 Joern Kottmann :

> +1 I ran the eval tests and they passed
>
> Jörn
>
> On Fri, Jul 7, 2017 at 1:06 PM, Bruno P. Kinoshita
>  wrote:
> > Build passing OK with the following environment:
> > Apache Maven 3.3.9 (bb52d8502b132ec0a5a3f4c09453c07478323dc5;
> 2015-11-11T05:41:47+13:00)
> > Maven home: /opt/maven
> > Java version: 1.8.0_131, vendor: Oracle Corporation
> > Java home: /usr/lib/jvm/java-8-oracle/jre
> > Default locale: en_US, platform encoding: UTF-8
> > OS name: "linux", version: "4.4.0-83-generic", arch: "amd64", family:
> "unix"
> >
> > Had a look at simple reports (findbugs, pmd), all looking good.
> > [ X ] +1 Release the packages as Apache OpenNLP 1.8.1
> >
> > Thanks
> > Bruno
> > 
> > On Thursday, 6 July 2017, 1:21:32 AM NZST, Suneel Marthi <
> smar...@apache.org> wrote:
> >
> >
> > The Apache OpenNLP PMC would like to call for a Vote on Apache OpenNLP
> 1.8.1
> > Release Candidate 3.
> >
> > The Release artifacts can be downloaded from:
> >
> > https://repository.apache.org/content/repositories/
> orgapacheopennlp-1016/org/apache/opennlp/opennlp-distr/1.8.1/
> >
> > The release was made from the Apache OpenNLP 1.8.1 tag at
> >
> > https://github.com/apache/opennlp/tree/opennlp-1.8.1
> >
> > To use it in a maven build set the version for opennlp-tools or
> opennlp-uima
> > to 1.8.1
> >
> > and add the following URL to your settings.xml file:
> >
> > https://repository.apache.org/content/repositories/
> orgapacheopennlp-1016/
> >
> > The artifacts have been signed with the Key - D3541808 found at
> >
> > http://people.apache.org/keys/group/opennlp.asc
> >
> > Please vote on releasing these packages as Apache OpenNLP 1.8.1. The
> vote is
> >
> > open for the next 72 hours *ending on Saturday, July 8AM EST *.
> >
> > Only votes from OpenNLP PMC are binding, but folks are welcome to check
> the
> >
> > release candidate and voice their approval or disapproval. The vote
> passes
> >
> > if at least three binding +1 votes are cast.
> >
> > [ ] +1 Release the packages as Apache OpenNLP 1.8.1
> >
> > [ ] -1 Do not release the packages because...
> >
> > Thanks again to all the committers and contributors for their work
> > over the past
> > few weeks.
>


Re: [GitHub] opennlp pull request #143: OPENNLP-788: Add initial langdetect implementatio...

2017-05-18 Thread William Colen
+1 to thinking about how to do it. Polyglot already does it.



2017-05-18 22:28 GMT-03:00 :

> Can the language detector find when the language changes?  I have data in
> French and English, and I would love to be able to separate the two.
> Daniel
>
>
> > On May 17, 2017, at 12:38 PM, wcolen  wrote:
> >
> > Github user wcolen closed the pull request at:
> >
> >https://github.com/apache/opennlp/pull/143
> >
> >
> > ---
> > If your project is set up for it, you can reply to this email and have
> your
> > reply appear on GitHub as well. If your project does not have this
> feature
> > enabled and wishes so, or if the feature is enabled but not working,
> please
> > contact infrastructure at infrastruct...@apache.org or file a JIRA
> ticket
> > with INFRA.
> > ---
>
>


Re: [VOTE] Apache OpenNLP 1.8.0 Release Candidate 3

2017-05-18 Thread William Colen
+1 (binding)

Successfully executed complete evaluation tests in source deliverable.
Tried it with DKPro and, after updating the Lemmatizer and Chunker usage,
there were two test failures that we could trace back to issues fixed in
OPENNLP-125 and OPENNLP-989 that would affect evaluation results.



2017-05-18 10:08 GMT-03:00 Tommaso Teofili :

> +1 (binding)
>
> Regards,
> Tommaso
>
> p.s.:
>
> +1 also to Bruno's side comments
>
> Il giorno gio 18 mag 2017 alle ore 12:43 Bruno P. Kinoshita
>  ha scritto:
>
> >
> > [ X ] +1 Release the packages as Apache OpenNLP 1.8.0
> >
> > Not binding
> >
> > Side note: would be nice later to start fixing some issues found via
> > FindBugs. Running `mvn clean findbugs:findbugs findbugs:gui` shows
> several
> > errors, some seem important, like using equals() for array objects (which
> > will always be false).
> >
> > See
> >
> >
> > https://github.com/apache/opennlp/blob/73c8e5b9d8e055fefb53f7f3c2487d
> 05c9788c6a/opennlp-tools/src/main/java/opennlp/tools/util/
> TokenTag.java#L85
> >
> > And
> >
> >
> >
> > https://github.com/apache/opennlp/blob/73c8e5b9d8e055fefb53f7f3c2487d
> 05c9788c6a/opennlp-tools/src/main/java/opennlp/tools/util/featuregen/
> POSTaggerNameFeatureGenerator.java#L59
> > Plus other NullPointerException's that can be prevented, and other minor
> > issues. Not blockers for the release though, IMO.
> >
> > Cheers
> > Bruno
> >
> >
> > 
> > From: Joern Kottmann 
> > To: dev@opennlp.apache.org
> > Sent: Thursday, 18 May 2017 9:49 AM
> > Subject: [VOTE] Apache OpenNLP 1.8.0 Release Candidate 3
> >
> >
> >
> > The Apache OpenNLP PMC would like to call for a Vote on Apache OpenNLP
> >
> > 1.8.0 Release Candidate 3.
> >
> >
> > The RC 3 distributables can be downloaded from here:
> >
> > https://repository.apache.org/content/repositories/orgapacheopennlp-101
> >
> > 3/org/apache/opennlp/opennlp-distr/1.8.0/
> >
> >
> > The release was made from the Apache OpenNLP 1.8.0 tag at
> >
> > https://github.com/apache/opennlp/tree/opennlp-1.8.0
> >
> >
> >
> > To use it in a maven build set the version for opennlp-tools or
> >
> > opennlp-uima to 1.8.0 and add the following URL to your settings.xml
> >
> > file:
> >
> > https://repository.apache.org/content/repositories/orgapacheopennlp-101
> >
> > 3
> >
> >
> >
> > The release was made using the OpenNLP release process, documented on
> >
> > the Wiki here:
> >
> > https://cwiki.apache.org/confluence/display/OPENNLP/Release+Process
> >
> >
> >
> > The release contains quite some changes, please refer to the contained
> >
> > issue list for details.
> >
> >
> >
> > Please vote on releasing these packages as Apache OpenNLP 1.8.0. The
> >
> > vote is open for at least the next 72 hours.
> >
> >
> >
> > Only votes from OpenNLP PMC are binding, but folks are welcome to check
> >
> > the release candidate and voice their approval or disapproval. The vote
> >
> > passes if at least three binding +1 votes are cast.
> >
> >
> >
> > [ ] +1 Release the packages as Apache OpenNLP 1.8.0
> >
> > [ ] -1 Do not release the packages because...
> >
> >
> >
> >
> >
> > Thanks!
> >
> >
> > Jörn
> >
> >
> > P.S. Here is my +1.
> >
>
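
For readers unfamiliar with the FindBugs warning Bruno mentions in the quoted
thread (equals() on array objects), here is a small stand-alone illustration.
It is generic Java, not the code in TokenTag or POSTaggerNameFeatureGenerator;
the class and method names are invented for the example.

```java
import java.util.Arrays;

final class ArrayEqualsSketch {

  // Object.equals on arrays compares references, not contents.
  static boolean sameByEquals(String[] a, String[] b) {
    return a.equals(b);
  }

  // Arrays.equals compares length and elements.
  static boolean sameByContent(String[] a, String[] b) {
    return Arrays.equals(a, b);
  }

  public static void main(String[] args) {
    String[] first = {"NN", "VB"};
    String[] second = {"NN", "VB"};
    // Two distinct arrays with the same contents.
    System.out.println(sameByEquals(first, second));
    System.out.println(sameByContent(first, second));
  }
}
```

Calling equals() on an array is only true when both references point to the
same array instance, which is why a content comparison usually needs
Arrays.equals (or Arrays.deepEquals for nested arrays).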


Re: [VOTE] Apache OpenNLP 1.8.0 Release Candidate 2

2017-05-17 Thread William Colen
Would be a pleasure. Let's prepare the next OpenNLP RC and I will create a PR
with the update.



2017-05-16 14:36 GMT-03:00 Richard Eckart de Castilho <r...@apache.org>:

> Hi William,
>
> > On 16.05.2017, at 14:35, William Colen <william.co...@gmail.com> wrote:
> >
> > I cloned DKPro code and tried Rodrigo proposed changes. Your test passes
> > with it.
>
> cool :)
>
> Would you like to contribute the changes to DKPro Core?
>
> Cheers,
>
> -- Richard
>


Re: [VOTE] Apache OpenNLP 1.8.0 Release Candidate 2

2017-05-16 Thread William Colen
Hi Richard,

I cloned the DKPro code and tried Rodrigo's proposed changes. Your test passes
with them.

Thank you
William

2017-05-15 18:51 GMT-03:00 Rodrigo Agerri :

> Hello Richard,
>
> I have tried with various corpora, including GUM, but I cannot reproduce
> that error.
>
> https://github.com/apache/opennlp/commit/8a3b3b537a30b14c4ffb5eb32ffa41
> d5027bddad
>
> Please note that commit O-904 changed (broke) the lemmatizer API
> substantially to make it uniform between DictionaryLemmatizer and the
> LemmatizerME (e.g., doing the decoding of lemmas internally and so on) so
> that this line for tagging with the LemmatizerME is not required:
>
> https://github.com/dkpro/dkpro-core/blob/89f144a63b214cd584b3cd0e6c499d
> ff6cbcd9ca/dkpro-core-opennlp-asl/src/main/java/de/
> tudarmstadt/ukp/dkpro/core/opennlp/OpenNlpLemmatizer.java#L135
>
> Also, that commit changed the LemmaSampleStream and LemmaSample classes, so
> it is possible that is affecting this class:
>
> https://github.com/dkpro/dkpro-core/blob/89f144a63b214cd584b3cd0e6c499d
> ff6cbcd9ca/dkpro-core-opennlp-asl/src/main/java/de/
> tudarmstadt/ukp/dkpro/core/opennlp/internal/CasLemmaSampleStream.java
>
> If I understand the logic of this class correctly, as it stands it will take an
> already encoded SES and try to encode it again?
>
> Could you please take a look and see if that could be the problem?
>
> Cheers,
>
> Rodrigo
>
> On Mon, May 15, 2017 at 6:21 PM, Richard Eckart de Castilho <
> r...@apache.org>
> wrote:
>
> > > On 15.05.2017, at 16:35, Joern Kottmann  wrote:
> > >
> > > Richard, I believe I found the problem with the parser, would you mind
> to
> > > take a look?
> > >
> > > This PR should fix it:
> > > https://github.com/apache/opennlp/pull/199
> >
> > The parser test works nicely with the PR.
> >
> > The lemmatizer test still behaves strangely.
> >
> > Cheers,
> >
> > -- Richard
> >
> >
>


Re: [VOTE] Apache OpenNLP 1.8.0 Release Candidate 2

2017-05-13 Thread William Colen
With the issues reported by Richard we should cancel the vote and roll back
the release.

I change my vote to -1 (binding)

2017-05-13 19:08 GMT-03:00 Richard Eckart de Castilho :

>
> > On 13.05.2017, at 22:35, Richard Eckart de Castilho 
> wrote:
> >
> > Should OpenNLP 1.8.0 yield identical results as 1.7.2 when the same
> > training data is used during training?
> >
> > I have a test that trains a lemmatizer model on GUM 3.0.0. With 1.7.2,
> > this model reached an f-score of ~0.96. With 1.8.0, I only get ~0.84.
>
> Also, this test which trains and evaluates a lemmatizer model
> takes ~8 sec with 1.7.2 and ~170 sec with 1.8.0. Even when only
> considering the training phase (no evaluation), the test runs
> much faster with 1.7.2 than with 1.8.0.
>
> Here are some details on the training phase.
>
> It seems odd that the events, outcomes, and predicates change that much.
>
> === 1.7.2
>
> done. 50697 events
> Indexing...  done.
> Sorting and merging events... done. Reduced 50697 events to 12675.
> Done indexing.
> Incorporating indexed data for training...
> done.
> Number of Event Tokens: 12675
> Number of Outcomes: 389
>   Number of Predicates: 13488
> ...done.
> Computing model parameters ...
> Performing 10 iterations.
>   1:  ... loglikelihood=-302335.58198350534 0.8420616604532812
>   2:  ... loglikelihood=-61602.20311717376  0.9492672150225852
>   3:  ... loglikelihood=-30747.954089148297 0.9769217113438665
>   4:  ... loglikelihood=-19986.853691639506 0.9850484249561118
>   5:  ... loglikelihood=-14672.523462458894 0.9881255301102629
>   6:  ... loglikelihood=-11572.587093608756 0.9893879322247865
>   7:  ... loglikelihood=-9571.242700030467  0.9900783083811665
>   8:  ... loglikelihood=-8185.39402892  0.9906897844053889
>   9:  ... loglikelihood=-7174.66904253965   0.9912223602974535
>  10:  ... loglikelihood=-6407.4278143846    0.9917746612225575
>
>
> === 1.8.0
>
> done. 50697 events
> Indexing...  done.
> Sorting and merging events... done. Reduced 50697 events to 26026.
> Done indexing.
> Incorporating indexed data for training...
> done.
> Number of Event Tokens: 26026
> Number of Outcomes: 7668
>   Number of Predicates: 15279
> ...done.
> Computing model parameters ...
> Performing 10 iterations.
>   1:  ... loglikelihood=-453475.08854769287 1.972503303943034E-5
>   2:  ... loglikelihood=-165718.68620632993 0.9509241177978973
>   3:  ... loglikelihood=-85388.42871190465  0.9761327100222893
>   4:  ... loglikelihood=-56404.00400621838  0.9892104069274316
>   5:  ... loglikelihood=-41004.08840359108  0.9938457896916977
>   6:  ... loglikelihood=-31539.64788603799  0.9955421425330887
>   7:  ... loglikelihood=-25264.889481438582 0.9964889441189814
>   8:  ... loglikelihood=-20883.72059438774  0.9972384953744797
>   9:  ... loglikelihood=-17699.228362701586 0.9977710712665444
>  10:  ... loglikelihood=-15306.654021266759 0.9980669467621358
>
>
> I also get some differences in f-score for other tests that train models,
> but not as significant as when training a lemmatizer model.
>
> -- Richard
>


Re: [VOTE] Apache OpenNLP 1.8.0 Release Candidate 2

2017-05-12 Thread William Colen
+1 binding
Executed the complete evaluation suite, both in source distribution and the
git tag. Integrated and tested with other tools.


2017-05-12 9:48 GMT-03:00 Joern Kottmann :

> The vote is still open and we won't close it before the entire active PMC
> voted or the time passed.
>
> Jörn
>
> On Fri, May 12, 2017 at 2:29 PM, Daniel Russ  wrote:
>
> > Even though we have enough binding votes to release, can I have a few
> hours
> > to complete testing of my code with 1.8.0RC2 before release.
> > Daniel
> >
> > On May 11, 2017 12:38 PM, "Joern Kottmann"  wrote:
> >
> > > The Apache OpenNLP PMC would like to call for a Vote on Apache OpenNLP
> > > 1.8.0 Release Candidate 2.
> > >
> > > The RC 2 distributables can be downloaded from here:
> > > https://repository.apache.org/content/repositories/
> orgapacheopennlp-101
> > > 2/org/apache/opennlp/opennlp-distr/1.8.0/
> > >
> > > The release was made from the Apache OpenNLP 1.8.0 tag at
> > > https://github.com/apache/opennlp/tree/opennlp-1.8.0
> > >
> > > To use it in a maven build set the version for opennlp-tools or
> > > opennlp-uima to 1.8.0 and add the following URL to your settings.xml
> > > file:
> > > https://repository.apache.org/content/repositories/
> orgapacheopennlp-101
> > > 2
> > >
> > > The release was made using the OpenNLP release process, documented on
> > > the Wiki here:
> > > https://cwiki.apache.org/confluence/display/OPENNLP/Release+Process
> > >
> > > The release contains quite some changes, please refer to the contained
> > > issue list for details.
> > >
> > > Please vote on releasing these packages as Apache OpenNLP 1.8.0. The
> > > vote is open for at least the next 72 hours.
> > >
> > > Only votes from OpenNLP PMC are binding, but folks are welcome to check
> > > the release candidate and voice their approval or disapproval. The vote
> > > passes if at least three binding +1 votes are cast.
> > >
> > > [ ] +1 Release the packages as Apache OpenNLP 1.8.0
> > > [ ] -1 Do not release the packages because...
> > >
> > >
> > > Thanks!
> > >
> > > Jörn
> > >
> > > P.S. Here is my +1.
> > >
> >
>


Re: closeQuietly() for stream try/catch

2017-04-16 Thread William Colen
We try to avoid external dependencies, including Apache Commons.

Please check whether it is possible to use a try-with-resources statement instead.
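
A minimal sketch of the idea, assuming the stream type implements
AutoCloseable. TrackingStream is a hypothetical stand-in for the sample
stream, written only so the example is self-contained; it is not an OpenNLP
class.

```java
final class TryWithResourcesSketch {

  // Hypothetical stand-in for a sample stream; it only records whether close() ran.
  static final class TrackingStream implements AutoCloseable {
    boolean closed;

    String read() {
      return "sample";
    }

    @Override
    public void close() {
      closed = true;
    }
  }

  // The resource is closed automatically when the block exits,
  // whether it returns normally or throws.
  static String readOne(TrackingStream stream) {
    try (TrackingStream s = stream) {
      return s.read();
    }
  }

  public static void main(String[] args) {
    TrackingStream stream = new TrackingStream();
    System.out.println(readOne(stream));
    System.out.println(stream.closed);
  }
}
```

No extra dependency and no empty catch block in a finally are needed. If
close() declares IOException, the enclosing method has to declare or handle
it instead of silently swallowing it.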

Thank you,
William

2017-04-16 8:46 GMT-03:00 Jeff Zemerick :

> In cases of code like this when closing a stream:
>
> finally {
>   try {
>     sampleStream.close();
>   } catch (IOException e) {
>     // sorry that this can fail
>   }
> }
>
> I thought it might be a bit cleaner looking to replace the empty try/catch
> with Apache Commons IO's IOUtils.closeQuietly(). I noticed that the Apache
> Commons IO dependency's scope is currently set to test. If
> you agree that the code change would be cleaner, is there any problem with
> changing that dependency to a compile dependency instead of test?
>
> Thanks,
> Jeff
>


Re: Update web site layout

2017-03-03 Thread William Colen
Hi, Bruno,

What do you think about using Jekyll + GitHub instead of the Maven site?
That way we don't need to deploy the site and the documentation separately.

Thank you
William

2017-03-03 10:03 GMT-03:00 Bruno P. Kinoshita <
brunodepau...@yahoo.com.br.invalid>:

> Hi all,
>
> Didn't find an issue for that, so thought about asking here before
> creating a ticket in JIRA.
>
>
> Normally I find what I need using the current site (normally models,
> version, and manual) but thought maybe an update in the web site layout
> could be a good idea?
>
> I thought combined with OPENNLP-6 and OPENNLP-504 (and maybe later
> OPENNLP-48 as well) this could attract more users / developers.
>
> Here's an example, using the Maven Site plug-in, with the Fluid skin:
> https://kinow.github.io/opennlp/
>
> Cheers
> Bruno
>


Re: Hardcoded length in prefix and suffix feature generators

2017-02-09 Thread William Colen
Looks good! Thanks for the unit tests.
Please open a Jira, squash your commits and open the PR.

2017-02-09 19:55 GMT-02:00 Jeffrey Zemerick :

> Hi,
>
> I noticed that the length is hardcoded to 4 in the PrefixFeatureGenerator
> and the SuffixFeatureGenerator. I made this value configurable in the XML
> for each feature generator. I also added a check for the length to keep
> duplicate prefixes or suffixes from being returned. (If the token is "yes" with
> a length of 4 there would be two "yes" features returned.) If a value is
> not provided in the XML it uses the default value of 4.
>
> You can preview the changes here:
> https://github.com/apache/opennlp/compare/master...
> jzonthemtn:prefixsuffix?expand=1
>
> If this is a change that's desired by the group I can make a JIRA and a
> pull request.
>
> Thanks,
> Jeff
>
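
For anyone reading the thread without the diff open, the behaviour Jeff
describes in the quoted message can be sketched roughly as follows. This is
an illustration only, not the code from the linked branch: the class
AffixSketch, its method names, and the use of a set are assumptions, and the
real generators emit named features rather than bare strings.

```java
import java.util.LinkedHashSet;
import java.util.Set;

final class AffixSketch {

  // Default used when no length is configured, as described in the message.
  static final int DEFAULT_LENGTH = 4;

  // Prefixes of length 1..maxLength, capped at the token length so that
  // short tokens do not yield the same prefix more than once.
  static Set<String> prefixes(String token, int maxLength) {
    Set<String> result = new LinkedHashSet<>();
    int limit = Math.min(maxLength, token.length());
    for (int i = 1; i <= limit; i++) {
      result.add(token.substring(0, i));
    }
    return result;
  }

  // Suffixes of length 1..maxLength, with the same cap.
  static Set<String> suffixes(String token, int maxLength) {
    Set<String> result = new LinkedHashSet<>();
    int limit = Math.min(maxLength, token.length());
    for (int i = 1; i <= limit; i++) {
      result.add(token.substring(token.length() - i));
    }
    return result;
  }

  public static void main(String[] args) {
    System.out.println(prefixes("yes", DEFAULT_LENGTH));
    System.out.println(suffixes("yes", DEFAULT_LENGTH));
  }
}
```

With a three-letter token and a length of 4, each call returns three entries,
so the full token appears only once.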


Re: [VOTE] Apache OpenNLP 1.7.2 Release Candidate

2017-02-03 Thread William Colen
+1 binding

I ran the eval tests and they all ran through, including the one that
needs more memory.

William

2017-02-03 13:35 GMT-02:00 Suneel Marthi :

> +1 binding
>
> Verified {src, bin} * {zip, tar} and all tests pass.
>
> On Fri, Feb 3, 2017 at 10:08 AM, Russ, Daniel (NIH/CIT) [E] <
> dr...@mail.nih.gov> wrote:
>
> > +1 (non-binding)  Have not run across problems with external code that
> > uses OpenNLP
> >
> > On 2/3/17, 9:57 AM, "Rodrigo Agerri"  wrote:
> >
> > +1 also pass tests
> >
> > On Fri, Feb 3, 2017 at 3:34 PM, Jeffrey Zemerick <
> jzemer...@apache.org
> > >
> > wrote:
> >
> > > +1 (non-binding) Build and tests pass with no issues.
> > >
> > >
> > >
> > > On Fri, Feb 3, 2017 at 4:15 AM, Joern Kottmann  >
> > wrote:
> > >
> > > > +1
> > > >
> > > > I did run the eval tests and they all run through except one test
> > which
> > > > needed more memory, that test case has to be adapted to run fast
> > and with
> > > > much less memory, we should do that for the 1.7.3 release.
> > > >
> > > > Jörn
> > > >
> > > > On Wed, Feb 1, 2017 at 5:52 PM, Suneel Marthi <
> smar...@apache.org>
> > > wrote:
> > > >
> > > > > The Apache OpenNLP PMC would like to call for a Vote on Apache
> > OpenNLP
> > > > > 1.7.2
> > > > > Release Candidate.
> > > > >
> > > > > The Release artifacts can be downloaded from:
> > > > >
> > > > > https://repository.apache.org/content/repositories/
> > > > > orgapacheopennlp-1010/org/apache/opennlp/opennlp-distr/1.7.2/
> > > > >
> > > > > The release was made from the Apache OpenNLP 1.7.2 tag at
> > > > >
> > > > > https://github.com/apache/opennlp/tree/opennlp-1.7.2
> > > > >
> > > > > To use it in a maven build set the version for opennlp-tools or
> > > > > opennlp-uima
> > > > > to 1.7.2
> > > > >
> > > > > and add the following URL to your settings.xml file:
> > > > >
> > > > > https://repository.apache.org/content/repositories/
> > > orgapacheopennlp-1010
> > > > >
> > > > > The artifacts have been signed with the Key - D3541808 found at
> > > > >
> > > > > http://people.apache.org/keys/group/opennlp.asc
> > > > >
> > > > > Please vote on releasing these packages as Apache OpenNLP
> 1.7.2.
> > The
> > > vote
> > > > > is
> > > > >
> > > > > open for either the next 72 hours or a minimum of 3 +1 PMC
> > binding
> > > votes
> > > > > whichever happens earlier.
> > > > >
> > > > > Only votes from OpenNLP PMC are binding, but folks are welcome
> > to check
> > > > the
> > > > >
> > > > > release candidate and voice their approval or disapproval. The
> > vote
> > > > passes
> > > > >
> > > > > if at least three binding +1 votes are cast.
> > > > >
> > > > > [ ] +1 Release the packages as Apache OpenNLP 1.7.2
> > > > >
> > > > > [ ] -1 Do not release the packages because...
> > > > >
> > > > > Thanks again to all the committers and contributors for their
> > work
> > > > > over the past
> > > > > few weeks.
> > > > >
> > > >
> > >
> >
> >
> >
>


Re: OpenNLP - Model version 1.6.0 not supported by this (1.5.3) version of OpenNLP

2017-01-13 Thread William Colen
Are you using Maven?


2017-01-13 5:32 GMT-02:00 David Samuel Lim :

> Oops, I meant *opennlp-tools-1.5.3.jar*. My bad.
>
> On Fri, Jan 13, 2017 at 3:29 PM, David Samuel Lim 
> wrote:
>
> > Hi Richard,
> >
> > Thanks for the reply. I've checked the classpath again and it only shows
> > the referenced 1.6.0 libraries.
> >
> > Though, when I initially faced the issue, one strategy I tried was to
> > reference the *opennlp-tools-1.5.0.jar* library. It didn't work, so I
> > removed it. None of the methods I've tried so far have given me a solid
> > answer, not even re-training the model using 1.6.0.
> >
> > *> Maybe OpenNLP classes are included in some non-OpenNLP JAR as well
> that
> > you use in your project?*
> >
> > Sorry, I'm personally not sure what you mean by this. Could you please
> > clarify?
> >
> > Regards,
> > David
> >
> > On Fri, Jan 13, 2017 at 3:14 PM, Richard Eckart de Castilho <
> > r...@apache.org> wrote:
> >
> >> On 13.01.2017, at 02:51, David Samuel Lim 
> wrote:
> >> >
> >> > *To sum up: Unable to load custom trained OpenNLP Name Finder model in
> >> code
> >> > due to apparent OpenNLP version incompatibility. Model was trained in
> >> > OpenNLP 1.6.0, which my project also uses. Other projects also using
> >> 1.6.0
> >> > were able to load the model successfully.*
> >>
> >> Maybe you have OpenNLP twice on the classpath for some reason, once in
> >> 1.5.3
> >> and once in 1.6.0. Maybe OpenNLP classes are included in some
> non-OpenNLP
> >> JAR
> >> as well that you use in your project?
> >>
> >> Cheers,
> >>
> >> -- Richard
> >>
> >
> >
>
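
One way to check Richard's theory from the quoted thread (OpenNLP present
twice on the classpath) is to ask the class loader where a given class file
can be found. The sketch below uses only JDK calls; the class name
ClasspathProbe is made up, and the default resource path assumes the
opennlp.tools.util.Version class is what you want to look for.

```java
import java.io.IOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

final class ClasspathProbe {

  // Returns every location from which the given resource can be loaded.
  static List<URL> locations(String resourcePath) throws IOException {
    ClassLoader loader = Thread.currentThread().getContextClassLoader();
    if (loader == null) {
      loader = ClassLoader.getSystemClassLoader();
    }
    return Collections.list(loader.getResources(resourcePath));
  }

  public static void main(String[] args) throws IOException {
    // Assumed resource path for an OpenNLP class; pass another path as an argument.
    String resource = args.length > 0 ? args[0] : "opennlp/tools/util/Version.class";
    for (URL url : locations(resource)) {
      System.out.println(url);
    }
  }
}
```

If this prints more than one URL when run with the project's classpath, two
copies of the class are visible, and the jar names in the URLs should show
where the second one comes from.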


Re: Commit message style

2017-01-09 Thread William Colen
+1 for the OPENNLP-xxx: commit message.
Fast to find a commit.
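
If we ever want to check this automatically, a subject-line test could look
roughly like the sketch below. The class name and the exact pattern are my
assumptions; nothing like this exists in the build today, and it covers only
the common case where the issue number fits in the subject line.

```java
import java.util.regex.Pattern;

final class CommitSubjectCheck {

  // Issue key first, then a colon, a space, and a non-empty message.
  private static final Pattern SUBJECT = Pattern.compile("^OPENNLP-\\d+: \\S.*$");

  static boolean isValid(String subject) {
    return SUBJECT.matcher(subject).matches();
  }

  public static void main(String[] args) {
    System.out.println(isValid("OPENNLP-788: Add initial langdetect implementation"));
    System.out.println(isValid("Add initial langdetect implementation (OPENNLP-788)"));
  }
}
```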


2017-01-09 21:24 GMT-02:00 Joern Kottmann :

> On Mon, 2017-01-09 at 17:02 -0500, Jeffrey Zemerick wrote:
> > I'm personally a fan of the issue number being the first thing on the
> > subject line, like "OPENNLP-xxx: commit message." For me it gives a
> > consistent place to look for the issue without having to read the
> > full
> > message. (That way you can also see the issue number in GitHub's
> > commit
> > list without having to expand the commit.)
>
>
> Yes, it is also faster to write like that, on the other hand if the
> subject line is then too short to write something meaningful it is
> probably better to write it in the body instead.
>
> +1 to write it first thing in the subject line in all cases where it is
> possible, for very rare cases where it doesn't work it can still be in
> the body
>
> Jörn
>


Re: [DISCUSS] 1.7.0 release process issues (was Re: OpenNLP 1.7.0 RC 2 is ready for testing)

2017-01-04 Thread William Colen
https://issues.apache.org/jira/browse/OPENNLP-916

Created a Jira task for it. OPENNLP-916: Create a Release Process page

Thank you

2017-01-04 15:23 GMT-02:00 Chris Mattmann :

> I just wanted to put this on the thread.
>
> The process to do a release:
>
> 1. [VOTE] thread with subject line indicating release version
> 2. Leave VOTE thread open for 72 hours
> 3. If more +1s than -1s and at least 3 PMC +1s, then release.
> 4. Send [RESULT] [VOTE] thread, then update the mirrors, etc.
>
> Please have a look at the tika release process here as an example:
>
> http://wiki.apache.org/tika/ReleaseProcess
>
> This release was closed way too fast.
>
> Please in the future let’s get our process documented.
>
> Cheers,
> Chris
>
>
>
>
> On 1/1/17, 5:20 AM, "Tommaso Teofili"  wrote:
>
> +1
>
> Source build ok
> Sigs ok
> License & co ok
> Il giorno dom 1 gen 2017 alle 03:02 Richard Eckart de Castilho <
> r...@apache.org> ha scritto:
>
> > On 01.01.2017, at 02:41, Suneel Marthi  wrote:
> > >
> > > The release has been finalized - please find the 1.7.0 release
> artifacts
> > at
> > > http://www.apache.org/dist/opennlp/opennlp-1.7.0/
> >
> > Hm, I only saw two binding votes instead of the usual three ones [1].
> >
> >   Jörn: +1
> >   William: +1
> >   Suneel: +1 (non-binding)
> >
> > Did I miss a vote?
> >
> > I also checked the mailing list archive for additional votes [2].
> >
> > Cheers,
> >
> > -- Richard
> >
> > [1] http://apache.org/foundation/voting.html
> > [2]
> > http://mail-archives.apache.org/mod_mbox/opennlp-dev/
> 201612.mbox/thread
>
>
>
>


Re: OpenNLP 1.7.0 RC 2 is ready for testing

2017-01-01 Thread William Colen
Thank you to the voters. We have 3 binding votes and 1 non-binding.
The vote is now closed.

2017-01-01 11:20 GMT-02:00 Tommaso Teofili :

> +1
>
> Source build ok
> Sigs ok
> License & co ok
> Il giorno dom 1 gen 2017 alle 03:02 Richard Eckart de Castilho <
> r...@apache.org> ha scritto:
>
> > On 01.01.2017, at 02:41, Suneel Marthi  wrote:
> > >
> > > The release has been finalized - please find the 1.7.0 release
> artifacts
> > at
> > > http://www.apache.org/dist/opennlp/opennlp-1.7.0/
> >
> > Hm, I only saw two binding votes instead of the usual three ones [1].
> >
> >   Jörn: +1
> >   William: +1
> >   Suneel: +1 (non-binding)
> >
> > Did I miss a vote?
> >
> > I also checked the mailing list archive for additional votes [2].
> >
> > Cheers,
> >
> > -- Richard
> >
> > [1] http://apache.org/foundation/voting.html
> > [2]
> > http://mail-archives.apache.org/mod_mbox/opennlp-dev/201612.mbox/thread
>


Re: OpenNLP 1.7.0 RC 2 is ready for testing

2016-12-31 Thread William Colen
+1


2016-12-31 19:01 GMT-02:00 Suneel Marthi <smar...@apache.org>:

> +1 non-binding
>
> 1. Verified Sigs and Hashes
> 2. Ran clean build from Source and all tests pass
> 3. Verified RAT check
>
> On Sat, Dec 31, 2016 at 3:58 PM, Joern Kottmann <kottm...@gmail.com>
> wrote:
>
> > +1, looks good
> >
> > Jörn
> >
> > On Dec 31, 2016 8:54 PM, "William Colen" <co...@apache.org> wrote:
> >
> > > Hi all,
> > >
> > > Apache OpenNLP 1.7.0 RC 2 is ready for testing. The RC 1 failed due to
> > > missing files and it failed to run 1.6.0 models. There are no new
> > > features since RC 1.
> > >
> > > The RC 2 can be downloaded from here:
> > > http://people.apache.org/~colen/releases/opennlp-1.7.0/rc2/
> > >
> > > To use it in a maven build set the version for opennlp-tools or
> > > opennlp-uima to 1.7.0 and add the following URL to your settings.xml
> > file:
> > > https://repository.apache.org/content/repositories/orgapacheopennlp-1007
> > >
> > > The current test plan can be found here:
> > > https://cwiki.apache.org/confluence/display/OPENNLP/TestPlan1.7.0
> > >
> > > The release artifacts were signed by KEY - 524A9649.
> > >
> > > Please sign up for tasks in the test plan.
> > >
> > > The release plan can be found here:
> > > https://cwiki.apache.org/confluence/display/OPENNLP/ReleasePlanAndTasks1.7.0
> > >
> > > The release contains quite some changes, please refer to the contained
> > > issue list for details.
> > >
> > > For your convenience, a copy of the issue list, as well as the release
> > > notes and the readme, can be found in the following link:
> > >
> > > http://people.apache.org/~colen/releases/opennlp-1.7.0/rc2/RELEASE_NOTES.html
> > >
> > >
> > > Thank you,
> > > William
> > >
> >
>


Re: OpenNLP 1.7.0 RC 1 is ready for testing

2016-12-31 Thread William Colen
Richard,

We fixed it for 1.7.0 RC 2.

Thank you
William


2016-12-31 17:17 GMT-02:00 Richard Eckart de Castilho <r...@apache.org>:

> On 31.12.2016, at 19:12, William Colen <co...@apache.org> wrote:
> >
> > Can you please open a Jira? I am already looking how to fix it.
>
> Here you go: https://issues.apache.org/jira/browse/OPENNLP-906
>
> Cheers,
>
> -- Richard
>


OpenNLP 1.7.0 RC 2 is ready for testing

2016-12-31 Thread William Colen
Hi all,

Apache OpenNLP 1.7.0 RC 2 is ready for testing. The RC 1 failed due to
missing files and it failed to run 1.6.0 models. There are no new features
since RC 1.

The RC 2 can be downloaded from here:
http://people.apache.org/~colen/releases/opennlp-1.7.0/rc2/

To use it in a maven build set the version for opennlp-tools or
opennlp-uima to 1.7.0 and add the following URL to your settings.xml file:
https://repository.apache.org/content/repositories/orgapacheopennlp-1007

The current test plan can be found here:
https://cwiki.apache.org/confluence/display/OPENNLP/TestPlan1.7.0

The release artifacts were signed by KEY - 524A9649.

Please sign up for tasks in the test plan.

The release plan can be found here:
https://cwiki.apache.org/confluence/display/OPENNLP/ReleasePlanAndTasks1.7.0

The release contains quite some changes, please refer to the contained
issue list for details.

For your convenience, a copy of the issue list, as well as the release
notes and the readme, can be found in the following link:

http://people.apache.org/~colen/releases/opennlp-1.7.0/rc2/RELEASE_NOTES.html


Thank you,
William


Re: OpenNLP 1.7.0 RC 1 is ready for testing

2016-12-31 Thread William Colen
Thank you, Richard,

Can you please open a Jira? I am already looking into how to fix it.

Thank you,
William

2016-12-31 15:50 GMT-02:00 Richard Eckart de Castilho <r...@apache.org>:

> Hi William,
>
> thanks for the RC. I have tried upgrading DKPro Core to the RC1 and most of
> the tests work, however, in one case I get this message:
>
> Caused by: opennlp.tools.util.InvalidFormatException: Model version 1.6.0
> is not supported by this (1.7.0) version of OpenNLP!
> at opennlp.tools.util.model.BaseModel.validateArtifactMap(
> BaseModel.java:428)
> at opennlp.tools.chunker.ChunkerModel.validateArtifactMap(
> ChunkerModel.java:88)
> at opennlp.tools.util.model.BaseModel.checkArtifactMap(
> BaseModel.java:493)
> ... 48 more
>
> The problematic model is a third-party chunker model:
>
> http://ixa2.si.ehu.es/ixa-pipes/models/chunk-models-1.1.0.tar.gz
> file: en-perceptron-conll00.bin
>
> I believe that 1.6.0 models should still work for 1.7.0, right?
>
> Cheers,
>
> - Richard
>
> > On 31.12.2016, at 03:33, William Colen <co...@apache.org> wrote:
> >
> > Important note: the release artifacts were signed by KEY -  524A9649
> >
> > 2016-12-31 0:24 GMT-02:00 William Colen <co...@apache.org>:
> >
> >> Hi all,
> >>
> >> Apache OpenNLP 1.7.0 RC 1 is ready for testing.
> >>
> >> The RC 1 can be downloaded from here:
> >> http://people.apache.org/~colen/releases/opennlp-1.7.0/rc1/
> >>
> >> To use it in a maven build set the version for opennlp-tools or
> >> opennlp-uima to 1.7.0 and add the following URL to your settings.xml
> file:
> >> https://repository.apache.org/content/repositories/
> orgapacheopennlp-1006
> >>
> >> The current test plan can be found here:
> >> https://cwiki.apache.org/confluence/display/OPENNLP/TestPlan1.7.0
> >>
> >> Please sign up for tasks in the test plan.
> >>
> >> The release plan can be found here:
> >> https://cwiki.apache.org/confluence/display/OPENNLP/
> >> ReleasePlanAndTasks1.7.0
> >>
> >> The release contains quite some changes, please refer to the contained
> >> issue list for details.
> >>
> >> For your convenience, a copy of the issue list, as well as the release
> >> notes and the readme, can be found in the following link:
> >>
> >> http://people.apache.org/~colen/releases/opennlp-1.7.0/
> >> rc1/RELEASE_NOTES.html
> >>
> >> Thank you,
> >> William
>
>


Re: OpenNLP 1.7.0 RC 1 is ready for testing

2016-12-31 Thread William Colen
The RC 1 is canceled due to missing LICENSE and NOTICE.

We will prepare a new release candidate.

Thank you,
William

2016-12-31 14:25 GMT-02:00 Joern Kottmann <kottm...@gmail.com>:

> We are missing the LICENSE and NOTICE files in the binary distribution
> and should make a RC2 for the release.
>
> All the manual tests are now automatic, so we don't need this long
> test plan anymore.
>
> Jörn
>
> On Sat, 2016-12-31 at 00:24 -0200, William Colen wrote:
> > Hi all,
> >
> > Apache OpenNLP 1.7.0 RC 1 is ready for testing.
> >
> > The RC 1 can be downloaded from here:
> > http://people.apache.org/~colen/releases/opennlp-1.7.0/rc1/
> >
> > To use it in a maven build set the version for opennlp-tools or
> > opennlp-uima to 1.7.0 and add the following URL to your settings.xml
> > file:
> > https://repository.apache.org/content/repositories/orgapacheopennlp-1
> > 006
> >
> > The current test plan can be found here:
> > https://cwiki.apache.org/confluence/display/OPENNLP/TestPlan1.7.0
> >
> > Please sign up for tasks in the test plan.
> >
> > The release plan can be found here:
> > https://cwiki.apache.org/confluence/display/OPENNLP/ReleasePlanAndTas
> > ks1.7.0
> >
> > The release contains quite some changes, please refer to the
> > contained
> > issue list for details.
> >
> > For your convenience, a copy of the issue list, as well as the
> > release
> > notes and the readme, can be found in the following link:
> >
> > http://people.apache.org/~colen/releases/opennlp-1.7.0/rc1/RELEASE_NO
> > TES.html
> >
> >
> > Thank you,
> > William
>


Re: OpenNLP 1.7.0 RC 1 is ready for testing

2016-12-30 Thread William Colen
Important note: the release artifacts were signed by KEY -  524A9649

2016-12-31 0:24 GMT-02:00 William Colen <co...@apache.org>:

> Hi all,
>
> Apache OpenNLP 1.7.0 RC 1 is ready for testing.
>
> The RC 1 can be downloaded from here:
> http://people.apache.org/~colen/releases/opennlp-1.7.0/rc1/
>
> To use it in a maven build set the version for opennlp-tools or
> opennlp-uima to 1.7.0 and add the following URL to your settings.xml file:
> https://repository.apache.org/content/repositories/orgapacheopennlp-1006
>
> The current test plan can be found here:
> https://cwiki.apache.org/confluence/display/OPENNLP/TestPlan1.7.0
>
> Please sign up for tasks in the test plan.
>
> The release plan can be found here:
> https://cwiki.apache.org/confluence/display/OPENNLP/ReleasePlanAndTasks1.7.0
>
> The release contains quite some changes, please refer to the contained
> issue list for details.
>
> For your convenience, a copy of the issue list, as well as the release
> notes and the readme, can be found in the following link:
>
> http://people.apache.org/~colen/releases/opennlp-1.7.0/rc1/RELEASE_NOTES.html
>
>
> Thank you,
> William
>


OpenNLP 1.7.0 RC 1 is ready for testing

2016-12-30 Thread William Colen
Hi all,

Apache OpenNLP 1.7.0 RC 1 is ready for testing.

The RC 1 can be downloaded from here:
http://people.apache.org/~colen/releases/opennlp-1.7.0/rc1/

To use it in a maven build set the version for opennlp-tools or
opennlp-uima to 1.7.0 and add the following URL to your settings.xml file:
https://repository.apache.org/content/repositories/orgapacheopennlp-1006

The current test plan can be found here:
https://cwiki.apache.org/confluence/display/OPENNLP/TestPlan1.7.0

Please sign up for tasks in the test plan.

The release plan can be found here:
https://cwiki.apache.org/confluence/display/OPENNLP/ReleasePlanAndTasks1.7.0

The release contains quite some changes, please refer to the contained
issue list for details.

For your convenience, a copy of the issue list, as well as the release
notes and the readme, can be found in the following link:

http://people.apache.org/~colen/releases/opennlp-1.7.0/rc1/RELEASE_NOTES.html


Thank you,
William


Re: Update to Java 8

2016-12-19 Thread William Colen
+1

2016-12-19 21:22 GMT-02:00 Joern Kottmann :

> +1 from me as well
>
> Jörn
>
> On Tue, Dec 20, 2016 at 12:02 AM, Tommaso Teofili <
> tommaso.teof...@gmail.com
> > wrote:
>
> > +1
> >
> > Tommaso
> >
> > Il giorno lun 19 dic 2016 alle ore 22:27 ARUN Thundyill Saseendran <
> > ats0...@gmail.com> ha scritto:
> >
> > > +1 to move to 1.8
> > >
> > > On Tue, Dec 20, 2016 at 2:51 AM, Suneel Marthi <
> > > suneel_mar...@yahoo.com.invalid> wrote:
> > >
> > > > +1 to move to Java 8
> > > >
> > > >
> > > >   From: Joern Kottmann 
> > > >  To: "dev@opennlp.apache.org" 
> > > >  Sent: Monday, December 19, 2016 8:45 AM
> > > >  Subject: Update to Java 8
> > > >
> > > > Hello all,
> > > >
> > > > Java 7 is already EOL.
> > > >
> > > > Should we update OpenNLP to Java 8 for the 1.7.0 release, any
> opinions?
> > > >
> > > > Jörn
> > > >
> > > >
> > >
> > >
> > >
> > >
> > > --
> > >
> >
>


Re: Chunker - proposal to change API (break compatibility)

2016-11-10 Thread William Colen
I tried that, but we have an issue with the factories we created. To
customize, we extend the factory, but the method we need to override doesn't
allow a generic type.

  public SequenceValidator<String> getSequenceValidator() {
    return new DefaultChunkerSequenceValidator();
  }

I tried to change String to ?, but it breaks a lot of code. I am not sure
it is a simple change anymore.

Thank you
William

2016-11-10 10:23 GMT-02:00 Joern Kottmann <kottm...@gmail.com>:

> The sequence we have today is usually of type String, but it is generic so
> it could also be about a wrapper object which has the token and tag, e.g.
> TokenWithPos.
> On such a sequence we should be able to use most of the existing interfaces
> without too much change, right?
>
> Jörn
>
> On Thu, Nov 10, 2016 at 10:33 AM, William Colen <william.co...@gmail.com>
> wrote:
>
> > Hi,
> >
> > Today the Chunker sequence is the sentence's POS tags.
> >
> > Although we use both the tokens and tags in the context generator, in the
> > current API we cannot use the token in the sequence validator, because we
> > do not have access to it.
> >
> > In Portuguese, I know there will never be some combinations of word + tag
> > in a specific kind of phrase. Today I can not set a rule with this filter
> > to the sequence validator.
> >
> > I know maybe it is better to train the model so it will learn, but the
> hack
> > of adding this rule to the sequence validator is helpful.
> >
> > Do you think we can change it for the release 1.7.0? I already tried this
> > change in a local branch for a personal project and it works (although it
> > was OpenNLP 1.5.3).
> >
> > This would break API backward compatibility, but the existing models would
> > not be affected.
> >
> > Thank you
> > William
> >
>
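
Note: the idea discussed in this thread (a validator that sees a wrapper
object holding both token and tag) could look roughly like the sketch
below. It is not the OpenNLP code. The SequenceValidator interface here is
a local stand-in that mirrors the shape of
opennlp.tools.util.SequenceValidator<T>, TokenWithPos is the hypothetical
wrapper Jörn mentions, and the word + tag rule is invented purely for
illustration.

```java
public class TokenAwareValidatorSketch {

    /** Local stand-in mirroring the shape of opennlp.tools.util.SequenceValidator<T>. */
    interface SequenceValidator<T> {
        boolean validSequence(int i, T[] inputSequence, String[] outcomesSequence, String outcome);
    }

    /** Hypothetical wrapper carrying both the token and its POS tag. */
    static final class TokenWithPos {
        final String token;
        final String pos;

        TokenWithPos(String token, String pos) {
            this.token = token;
            this.pos = pos;
        }
    }

    /** Checks chunk continuity and applies one invented word + tag rule. */
    static final class TokenAwareValidator implements SequenceValidator<TokenWithPos> {
        @Override
        public boolean validSequence(int i, TokenWithPos[] input, String[] outcomes, String outcome) {
            // An I-X outcome is only valid directly after B-X or I-X.
            if (outcome.startsWith("I-")) {
                if (i == 0) {
                    return false;
                }
                String type = outcome.substring(2);
                String previous = outcomes[i - 1];
                if (!previous.equals("B-" + type) && !previous.equals("I-" + type)) {
                    return false;
                }
            }
            // Invented rule: a comma tagged "punc" may never be part of an NP chunk.
            TokenWithPos current = input[i];
            if (current.token.equals(",") && current.pos.equals("punc") && outcome.endsWith("-NP")) {
                return false;
            }
            return true;
        }
    }

    public static void main(String[] args) {
        TokenWithPos[] sentence = {
            new TokenWithPos("O", "art"),
            new TokenWithPos("gato", "n"),
            new TokenWithPos(",", "punc")
        };
        SequenceValidator<TokenWithPos> validator = new TokenAwareValidator();

        // Continuation after B-NP is allowed.
        System.out.println(validator.validSequence(1, sentence, new String[] {"B-NP"}, "I-NP"));
        // Continuation after O is rejected.
        System.out.println(validator.validSequence(1, sentence, new String[] {"O"}, "I-NP"));
        // The word + tag rule rejects the comma as the start of an NP.
        System.out.println(validator.validSequence(2, sentence, new String[] {"B-NP", "I-NP"}, "B-NP"));
    }
}
```

With a String-only sequence the last check is not possible, which is the
limitation described above.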


Re: Next release

2016-11-10 Thread William Colen
Cool. There are a lot of PlainTextByLineStream references in deprecated
methods, especially main methods. I will ignore them and you can remove the
main methods when you go through each tool.
I will focus on PlainTextByLineStream usages that are not inside deprecated
methods.


2016-11-10 6:39 GMT-02:00 Joern Kottmann <kottm...@gmail.com>:

> Ok, I created a couple of issues and will go through them rather quickly.
>
> Jörn
>
> On Thu, Nov 10, 2016 at 3:36 AM, William Colen <william.co...@gmail.com>
> wrote:
>
> > Jörn, I can help removing deprecated code. I started with
> > PlainTextByLineStream. It is used everywhere so there is a lot to change.
> >
> >
> > 2016-11-08 9:08 GMT-02:00 Joern Kottmann <kottm...@gmail.com>:
> >
> > > I suggest we remove more deprecated code, there is still a lot which
> > could
> > > be removed and is really old.
> > > It is a bit of a boring task, if anyone has some spare cycles help
> would
> > be
> > > welcome.
> > >
> > > Jörn
> > >
> > > On Tue, Nov 8, 2016 at 9:59 AM, Aliaksandr Autayeu <
> > aliaksa...@autayeu.com
> > > >
> > > wrote:
> > >
> > > > +1 for 1.7 (also due to lemmatized changes and removal of deprecated
> > > code).
> > > >
> > > > On 8 November 2016 at 09:48, Rodrigo Agerri <rage...@apache.org>
> > wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > +1 1.7.0 in next release and +1 for a yearly release
> > > > >
> > > > > Just to provide some info, the main changes in the lemmatizer have
> > > been:
> > > > >
> > > > > 1. Added a supervised statistical lemmatizer, usable from the CLI
> and
> > > > > API. The supervised lemmaitzer now provides a much better coverage
> > for
> > > > > unknown words with respect to the previously existing
> > dictionary-based
> > > > > one.
> > > > > 2. The lemmatizer component has been rewritten and the API
> therefore
> > > > > has substantially changed. Thus, the changes in the
> Dictionary-based
> > > > > lemmatizer are not backward compatible. In any case, I do not think
> > > > > that so many people was using it and the change at using the API is
> > > > > minor.
> > > > >
> > > > > The new statistical lemmatizer can support the Dictionary-based
> > > > > lemmatizers often used to provide features for components such as
> > Word
> > > > > Sense Disambiguation, Opinion Mining/Sentiment Analysis, etc. In
> this
> > > > > regard, it will be nice to aim at working on the development of
> those
> > > > > two components for their release. Maybe the next release is too
> > close,
> > > > > but definitely for the next one.
> > > > >
> > > > > Cheers,
> > > > >
> > > > > Rodrigo
> > > > >
> > > > > On Mon, Nov 7, 2016 at 7:01 PM, Russ, Daniel (NIH/CIT) [E]
> > > > > <dr...@mail.nih.gov> wrote:
> > > > > > Also the lemmatizer has significantly changed.  I vote 1.7
> > > > > >
> > > > > > On 11/7/16, 12:59 PM, "Joern Kottmann" <kottm...@gmail.com>
> wrote:
> > > > > >
> > > > > > Hello all,
> > > > > >
> > > > > > since our last release it has been a while and we received
> > quite
> > > a
> > > > > few
> > > > > > changes which would be nice to get released.
> > > > > >
> > > > > > There are still some open Jira issues, but mostly smaller
> > things
> > > > that
> > > > > > can be wrapped up rather quickly.
> > > > > >
> > > > > > Is there anything important missing which should go into the
> > next
> > > > > > release? Otherwise I think we should also aim for more
> frequent
> > > > > > released and just make one again early next year, with all
> the
> > > > stuff
> > > > > we
> > > > > > might miss out now.
> > > > > >
> > > > > > We took in a patch - as part of OPENNLP-830 - to replace our
> > > > > self-made
> > > > > > hash table with the java.util.HashMap. This change is not
> > > backward
> > > > > > compatible for folks who extend AbstractModel.
> > > > > >
> > > > > > Should we go with 1.6.1 as a next version or should we make
> > 1.7.0
> > > > to
> > > > > > reflect that?
> > > > > >
> > > > > > Previously we only had backward incompatible changes in
> > versions
> > > > > which
> > > > > > bumped by the second number. Maybe that is better choice. It
> > will
> > > > > > probably break some peoples code when they update.
> > > > > >
> > > > > > We also have lots of deprecated API still in OpenNLP, should
> we
> > > try
> > > > > to
> > > > > > remove as much as possible of it now?
> > > > > >
> > > > > > Jörn
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>


Re: Next release

2016-11-09 Thread William Colen
Jörn, I can help removing deprecated code. I started with
PlainTextByLineStream. It is used everywhere so there is a lot to change.


2016-11-08 9:08 GMT-02:00 Joern Kottmann :

> I suggest we remove more deprecated code, there is still a lot which could
> be removed and is really old.
> It is a bit of a boring task, if anyone has some spare cycles help would be
> welcome.
>
> Jörn
>
> On Tue, Nov 8, 2016 at 9:59 AM, Aliaksandr Autayeu  >
> wrote:
>
> > +1 for 1.7 (also due to lemmatized changes and removal of deprecated
> code).
> >
> > On 8 November 2016 at 09:48, Rodrigo Agerri  wrote:
> >
> > > Hello,
> > >
> > > +1 1.7.0 in next release and +1 for a yearly release
> > >
> > > Just to provide some info, the main changes in the lemmatizer have
> been:
> > >
> > > 1. Added a supervised statistical lemmatizer, usable from the CLI and
> > > API. The supervised lemmaitzer now provides a much better coverage for
> > > unknown words with respect to the previously existing dictionary-based
> > > one.
> > > 2. The lemmatizer component has been rewritten and the API therefore
> > > has substantially changed. Thus, the changes in the Dictionary-based
> > > lemmatizer are not backward compatible. In any case, I do not think
> > > that so many people was using it and the change at using the API is
> > > minor.
> > >
> > > The new statistical lemmatizer can support the Dictionary-based
> > > lemmatizers often used to provide features for components such as Word
> > > Sense Disambiguation, Opinion Mining/Sentiment Analysis, etc. In this
> > > regard, it will be nice to aim at working on the development of those
> > > two components for their release. Maybe the next release is too close,
> > > but definitely for the next one.
> > >
> > > Cheers,
> > >
> > > Rodrigo
> > >
> > > On Mon, Nov 7, 2016 at 7:01 PM, Russ, Daniel (NIH/CIT) [E]
> > >  wrote:
> > > > Also the lemmatizer has significantly changed.  I vote 1.7
> > > >
> > > > On 11/7/16, 12:59 PM, "Joern Kottmann"  wrote:
> > > >
> > > > Hello all,
> > > >
> > > > since our last release it has been a while and we received quite
> a
> > > few
> > > > changes which would be nice to get released.
> > > >
> > > > There are still some open Jira issues, but mostly smaller things
> > that
> > > > can be wrapped up rather quickly.
> > > >
> > > > Is there anything important missing which should go into the next
> > > > release? Otherwise I think we should also aim for more frequent
> > > > released and just make one again early next year, with all the
> > stuff
> > > we
> > > > might miss out now.
> > > >
> > > > We took in a patch - as part of OPENNLP-830 - to replace our
> > > self-made
> > > > hash table with the java.util.HashMap. This change is not
> backward
> > > > compatible for folks who extend AbstractModel.
> > > >
> > > > Should we go with 1.6.1 as a next version or should we make 1.7.0
> > to
> > > > reflect that?
> > > >
> > > > Previously we only had backward incompatible changes in versions
> > > which
> > > > bumped by the second number. Maybe that is better choice. It will
> > > > probably break some peoples code when they update.
> > > >
> > > > We also have lots of deprecated API still in OpenNLP, should we
> try
> > > to
> > > > remove as much as possible of it now?
> > > >
> > > > Jörn
> > > >
> > > >
> > >
> >
>


Re: Moving brat annotator to opennlp.git

2016-10-19 Thread William Colen
+1

Do you think we can later expand the annotator server to other tools?


2016-10-19 7:05 GMT-02:00 Madhawa Kasun Gunasekara :

> +1
>
> Madhawa
>
> On Wed, Oct 19, 2016 at 2:20 PM, "Shuo Xu"  wrote:
>
> > +1
> >
> >
> > On Wed, Oct 19, 2016 at 12:46 AM, Joern Kottmann 
> > wrote:
> >
> > > Hello all,
> > >
> > > what do you think about including the brat ner annotator in the 1.6.1
> > > release?
> > >
> > > I believe it is important that we include it to allow our users to
> easier
> > > run custom annotation projects, as part of the move we need to extend
> the
> > > documentation so everyone can easily get it up and running and
> understand
> > > how it is supposed to work.
> > >
> > > Jörn
> > >
> >
> >
> >
> > --
> > 徐硕 XU Shuo
> > 中国科学技术信息研究所  Institute of Scientific and Technical
> Information
> > of China (ISTIC)
> > 北京市海淀区复兴路15号  No. 15 Fuxing Rd., Haidian District, Beijing
> > 100038, P.R. China
> > 电话:+86-10-58882447(O)  Tel: +86-10-58882447 (O)
> > BLOG:http://blog.sciencenet.cn/u/xiaohai2008
> > E-mail: "XU Shuo"
> >"XU Shuo"
> >
>


Access to Git

2016-09-09 Thread William Colen
Hello,

Is the Git repository ready for use?
Do we need to wait for it to develop new stuff?

Thank you,
William


Re: Generators

2016-08-17 Thread William Colen
Features do not guarantee that the token will be marked as a NE. It is
like telling the model that, according to the dictionary, the token can be
a NE, but of course it will be evaluated together with the other features.
Remember it is machine learning. You can skip the machine learning by using
a DictionaryNameFinder.

http://opennlp.apache.org/documentation/1.6.0/apidocs/opennlp-tools/opennlp/tools/namefind/DictionaryNameFinder.html

Regards
William

2016-08-16 15:50 GMT-03:00 Damiano Porta :

> Hello,
>
> pardon guys for all these questions but i am trying to study OpenNLP
> deeply.
> I write a simple code, you can see it here:
> https://issues.apache.org/jira/browse/OPENNLP-859?jql=project%20%3D%20OPENNLP
> I am trying to understand what the generators are and what is their job.
> I know they add features on the tokens list, but what does it mean in
> simple words? (just adding simple codes on each token?) because for example
> i tried the DictionaryFeatureGenerator with a simple list of names but they
> are not recognized when i use the NameFinderME( see the link on jira )
>
> How can i read those features after the find() ?
>
> Thank you so much!
> Damiano
>
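
Note: to make the difference concrete, a pure dictionary lookup marks every
match, with no model involved. The sketch below is a self-contained toy
version of that idea and is not the OpenNLP DictionaryNameFinder; the class
name, the method and the [start, end) span convention are assumptions made
for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DictionaryLookupSketch {

    /**
     * Returns [start, end) token spans for dictionary entries found in the
     * tokens. At each position the longest matching entry wins.
     */
    static List<int[]> find(String[] tokens, Set<List<String>> dictionary, int maxEntryLength) {
        List<String> tokenList = Arrays.asList(tokens);
        List<int[]> spans = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            int matchedEnd = -1;
            int longest = Math.min(maxEntryLength, tokens.length - i);
            for (int len = longest; len >= 1; len--) {
                if (dictionary.contains(tokenList.subList(i, i + len))) {
                    matchedEnd = i + len;
                    break;
                }
            }
            if (matchedEnd > 0) {
                spans.add(new int[] {i, matchedEnd});
                i = matchedEnd; // continue after the match, so spans never overlap
            } else {
                i++;
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        Set<List<String>> names = new HashSet<>();
        names.add(Arrays.asList("Damiano", "Porta"));
        names.add(Arrays.asList("William"));

        String[] tokens = {"Damiano", "Porta", "wrote", "to", "William", "."};
        for (int[] span : find(tokens, names, 2)) {
            System.out.println(span[0] + ".." + span[1]);
        }
    }
}
```

A dictionary feature generator, by contrast, only adds a hint to the
feature set, and the trained model decides the outcome.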


Re: Why are you using complete sentences to train a model?

2016-08-12 Thread William Colen
You need to train with a corpus that is as close as possible to your
runtime corpus. If your runtime corpus is like that, I think it is OK.
Otherwise, the model can learn that entities occur too often. For example,
it may learn that there is an entity in the middle of every window.


2016-08-12 11:35 GMT-03:00 Damiano Porta <damianopo...@gmail.com>:

> Ok, but why not just ignore all the others tokens? i mean... when i write 2
> TOKENS + ENTITY + 2 TOKENS i am interested on finding the entity with this
> surrounding tokens so it should mean that other "cases" can be ignored. No?
>
> Why do i need to write all the other cases when those must be ignored.
>
> 2016-08-12 16:26 GMT+02:00 William Colen <william.co...@gmail.com>:
>
> > You also need examples of what is not entities.
> >
> >
> > 2016-08-12 11:21 GMT-03:00 Damiano Porta <damianopo...@gmail.com>:
> >
> > > Hello everyone,
> > > pardon for the stupid question but i really do not get the point about
> > > training a maxent model with complete sentences.
> > >
> > > For example:
> > >
> > > <START:person> Pierre Vinken <END> , 61 years old , will join the
> > > board as a nonexecutive director Nov. 29 .
> > >
> > > it has ~20 tokens.
> > > As described here:
> > > https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training.featuregen
> > > the default window should be 2 tokens on the left and 2 tokens on the
> > right
> > > of the entity. So, what's the point of writing the entire sentence if
> > there
> > > are no other entities ?
> > >
> > > As far i have understood it correctly, it should take into account the
> > > Pierre Vinken (as entity name) and "," "61" as the next 2 tokens. So,
> why
> > > do we need "*years old , will join the board as a nonexecutive*" ?
> > >
> > > Thank you in advance for the clarification!
> > >
> > > Best
> > > Damiano
> > >
> >
>
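
Note: one way to see why the whole sentence matters is that a window is
built around every token, not only around the entity, and each token
becomes one training event with its own outcome. The sketch below shows
only that idea. The feature names, the padding symbols and the outcome
labels are made up for this example, and the real default feature
generator produces more than plain window words.

```java
import java.util.ArrayList;
import java.util.List;

public class WindowFeatureSketch {

    /** Collects the words in a window of `left` tokens before and `right` tokens after `index`. */
    static List<String> windowFeatures(String[] tokens, int index, int left, int right) {
        List<String> features = new ArrayList<>();
        for (int offset = -left; offset <= right; offset++) {
            int j = index + offset;
            String word;
            if (j < 0) {
                word = "<bos>"; // padding before the sentence start
            } else if (j >= tokens.length) {
                word = "<eos>"; // padding after the sentence end
            } else {
                word = tokens[j];
            }
            String sign = offset > 0 ? "+" : "";
            features.add("w" + sign + offset + "=" + word);
        }
        return features;
    }

    public static void main(String[] args) {
        String[] tokens = {"Pierre", "Vinken", ",", "61", "years", "old", ",", "will", "join"};
        // Illustrative outcomes: only the first two tokens belong to the entity.
        String[] outcomes = {"person-start", "person-cont", "other", "other", "other",
                             "other", "other", "other", "other"};

        // Every token yields one event; most of them are negative ("other") examples.
        for (int i = 0; i < tokens.length; i++) {
            System.out.println(outcomes[i] + " " + windowFeatures(tokens, i, 2, 2));
        }
    }
}
```

If the training data contained only "2 tokens + entity + 2 tokens", almost
no events with the outcome "other" would remain.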


Re: Why are you using complete sentences to train a model?

2016-08-12 Thread William Colen
You also need examples of what is not an entity.


2016-08-12 11:21 GMT-03:00 Damiano Porta :

> Hello everyone,
> pardon for the stupid question but i really do not get the point about
> training a maxent model with complete sentences.
>
> For example:
>
> <START:person> Pierre Vinken <END> , 61 years old , will join the board as
> a nonexecutive director Nov. 29 .
>
> it has ~20 tokens.
> As described here:
> https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training.featuregen
> the default window should be 2 tokens on the left and 2 tokens on the right
> of the entity. So, what's the point of writing the entire sentence if there
> are no other entities ?
>
> As far i have understood it correctly, it should take into account the
> Pierre Vinken (as entity name) and "," "61" as the next 2 tokens. So, why
> do we need "*years old , will join the board as a nonexecutive*" ?
>
> Thank you in advance for the clarification!
>
> Best
> Damiano
>


Re: Morfologik Addon

2016-07-15 Thread William Colen
Not only licensing, but I also think we try to keep OpenNLP without
external dependencies. Morfologik also has some dependencies itself.


2016-07-15 4:55 GMT-03:00 Rodrigo Agerri <rage...@apache.org>:

> Great stuff, William.
>
> I have been using Morfologik stemming for a long time and when we
> included it we put it as an addon. I assume that the reason was its
> license, but reading Morfologik license it is not clear to me why is
> is not Apache compatible.
>
> If it is, it would be nice to include it directly in OpenNLP.
>
> Can anyone shed any light on this?
>
> Thanks,
>
> R
>
> On Fri, Jul 15, 2016 at 12:02 AM, William Colen <william.co...@gmail.com>
> wrote:
> > Hello,
> >
> > A while back we started working on a Morfologik Addon.
> >
> > http://svn.apache.org/viewvc/opennlp/addons/
> >
> > I checked it out last week and notice it was outdated, specially because
> it
> > was not using the latest Morfologik version. Also it was missing
> > documentation.
> >
> > You can find more about Morfologik here:
> > https://github.com/morfologik/morfologik-stemming
> >
> > Morfologik provides tools for finite state automata (FSA) construction
> and
> > dictionary-based morphological dictionaries.
> >
> > The Morfologik Addon implements some OpenNLP interfaces and extends some
> > classes to make it easier to use of FSA Morfologik dictionaries:
> >
> >- opennlp.morfologik.tagdict.MorfologikPOSTaggerFactory
> >   - Extends: opennlp.tools.postag.POSTaggerFactory
> >   - Helps creating a POSTagger model with an embedded TagDictionary
> >   based on FSA
> >- opennlp.morfologik.tagdict.MorfologikTagDictionary
> >- Implements: opennlp.tools.postag.TagDictionary
> >   - A TagDictionary based on FSA is much smaller than the defaul XML
> >   based, and consumes less memory.
> >- opennlp.morfologik.lemmatizer.MorfologikLemmatizer
> >- Implements: opennlp.tools.lemmatizer.DictionaryLemmatizer
> >   - A dictionary based lemmatizer that uses FSA dictionary.
> >
> > It also provides a command line interface that allows:
> >
> >- MorfologikDictionaryBuilder
> >   - builds a binary POS Dictionary using Morfologik
> >- XMLDictionaryToTable
> >   - reads an OpenNLP XML tag dictionary and outputs it in a tab
> >   separated file that can be built into a FSA dictionary
> >
> >
> > In a project I developed it was of great help. The TAG Dictionary for POS
> > Tag was huge (something like 50 MB), requiring a lot of memory.
> > Migrating it to a FSA dictionary allowed not only a smaller model, but
> also
> > I could use the model without the need to increase the JVM memory.
> >
> > More here:
> >
> https://cwiki.apache.org/confluence/display/OPENNLP/FSA+Dictionary+with+morfologik-addon
> >
> > Hope it will be helpful.
> >
> > William
>


Morfologik Addon

2016-07-14 Thread William Colen
Hello,

A while back we started working on a Morfologik Addon.

http://svn.apache.org/viewvc/opennlp/addons/

I checked it out last week and noticed it was outdated, especially because
it was not using the latest Morfologik version. It was also missing
documentation.

You can find more about Morfologik here:
https://github.com/morfologik/morfologik-stemming

Morfologik provides tools for finite state automata (FSA) construction and
dictionary-based morphological dictionaries.

The Morfologik Addon implements some OpenNLP interfaces and extends some
classes to make it easier to use FSA Morfologik dictionaries:

   - opennlp.morfologik.tagdict.MorfologikPOSTaggerFactory
      - Extends: opennlp.tools.postag.POSTaggerFactory
      - Helps create a POSTagger model with an embedded TagDictionary
        based on FSA
   - opennlp.morfologik.tagdict.MorfologikTagDictionary
      - Implements: opennlp.tools.postag.TagDictionary
      - A TagDictionary based on FSA is much smaller than the default
        XML-based one, and consumes less memory.
   - opennlp.morfologik.lemmatizer.MorfologikLemmatizer
      - Implements: opennlp.tools.lemmatizer.DictionaryLemmatizer
      - A dictionary-based lemmatizer that uses an FSA dictionary.

It also provides a command line interface that allows:

   - MorfologikDictionaryBuilder
      - builds a binary POS Dictionary using Morfologik
   - XMLDictionaryToTable
      - reads an OpenNLP XML tag dictionary and outputs it in a tab
        separated file that can be built into a FSA dictionary


In a project I developed it was of great help. The TAG Dictionary for POS
Tag was huge (something like 50 MB), requiring a lot of memory.
Migrating it to a FSA dictionary allowed not only a smaller model, but also
I could use the model without the need to increase the JVM memory.

More here:
https://cwiki.apache.org/confluence/display/OPENNLP/FSA+Dictionary+with+morfologik-addon

Hope it will be helpful.

William


Re: Reg. MaxENT and GIS.

2016-07-06 Thread William Colen
I updated the link to

http://repository.upenn.edu/ircs_reports/60/

Thank you for reporting the broken link.

2016-07-05 9:20 GMT-03:00 Damiano Porta <damianopo...@gmail.com>:

> Hi William,
> we need to update the link, it is pointing to a wrong page. It returns Not
> Found.
>
> 2016-07-05 13:19 GMT+02:00 William Colen <william.co...@gmail.com>:
>
> > It is not that easy. You could start from "Papers implemented by
> OpenNLP":
> >
> > https://cwiki.apache.org/confluence/display/OPENNLP/NLP+Papers
> >
> > I believe the most important regarding MAXENT is "Maximum entropy models
> > for natural language ambiguity resolution, by Adwait Ratnaparkhi"
> >
> >
> >
> > 2016-07-05 2:49 GMT-03:00 Rakesh P <rakeshbe...@gmail.com>:
> >
> > > Dear All,
> > >  Could anyone explain the MAXENT and GIS algorithm in detail.
> > > Please.
> > >
> >
>


Re: Migrate to Git?

2016-07-04 Thread William Colen
+1

2016-07-04 11:59 GMT-03:00 Tommaso Teofili :

> +1
>
> Il giorno lun 4 lug 2016 alle ore 16:41 Madhawa Kasun Gunasekara <
> madhaw...@gmail.com> ha scritto:
>
> > +1
> >
> > Madhawa
> >
> > On Mon, Jul 4, 2016 at 8:09 PM, Anthony Beylerian <
> > anthony.beyler...@gmail.com> wrote:
> >
> > > +1
> > >
> > > On Mon, Jul 4, 2016 at 11:36 PM, Joern Kottmann 
> > > wrote:
> > >
> > > > Hello all,
> > > >
> > > > do we still want to do this? Has been a while since we discussed it.
> > > > I am happy to get it done if we reach consensus on it again.
> > > >
> > > > My +1 again.
> > > >
> > > > Jörn
> > > >
> > > > On Thu, Dec 20, 2012 at 4:40 PM, Tommaso Teofili <
> > > > tommaso.teof...@gmail.com>
> > > > wrote:
> > > >
> > > > > in my opinion that would be good, +1
> > > > > Tommaso
> > > > >
> > > > >
> > > > > 2012/12/19 Jörn Kottmann 
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > I heard at ApacheCon Europe that it should be possible to migrate
> > > from
> > > > > > > Subversion to Git.
> > > > > >
> > > > > > Is there any interest in doing that? If we decide to do it I
> > suggest
> > > to
> > > > > > wait until the
> > > > > > 1.5.3 release is done so we have a bit time to also migrate our
> > build
> > > > > > process.
> > > > > >
> > > > > > > Do all committers have experience with Git?
> > > > > >
> > > > > > Jörn
> > > > > >
> > > > >
> > > >
> > >
> >
>


Re: Model to detect the gender

2016-06-29 Thread William Colen
To create a NER model OpenNLP extracts features from the context, things
such as: word prefix and suffix, next word, previous word, previous word
prefix and suffix, next word prefix and suffix etc.
When you don't configure the feature generator it will apply the default:
https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training.featuregen.api

Default feature generator:

AdaptiveFeatureGenerator featureGenerator = new CachedFeatureGenerator(
    new AdaptiveFeatureGenerator[] {
        new WindowFeatureGenerator(new TokenFeatureGenerator(), 2, 2),
        new WindowFeatureGenerator(new TokenClassFeatureGenerator(true), 2, 2),
        new OutcomePriorFeatureGenerator(),
        new PreviousMapFeatureGenerator(),
        new BigramNameFeatureGenerator(),
        new SentenceFeatureGenerator(true, false)
    });


These default features should work for most cases (especially English), but
they can of course be extended. If you do so, your model will take the new
features into account. So yes, you are putting the features in your model.
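
To make the idea of context features concrete, here is a minimal sketch in
plain Java. It does not use the OpenNLP classes: the class name, the method
and the feature labels (w=, pre=, suf=, p1w=, n1w=) are made up for
illustration, and the real feature generators produce different strings.

```java
import java.util.ArrayList;
import java.util.List;

class ContextFeatureSketch {

    /** Collects simple string features for the token at position i. */
    static List<String> features(String[] tokens, int i) {
        List<String> feats = new ArrayList<>();
        String w = tokens[i];
        feats.add("w=" + w);
        // Prefix and suffix of up to three characters.
        int n = Math.min(3, w.length());
        feats.add("pre=" + w.substring(0, n));
        feats.add("suf=" + w.substring(w.length() - n));
        // Previous and next token, with boundary markers at the sentence edges.
        feats.add("p1w=" + (i > 0 ? tokens[i - 1] : "<BOS>"));
        feats.add("n1w=" + (i < tokens.length - 1 ? tokens[i + 1] : "<EOS>"));
        return feats;
    }

    public static void main(String[] args) {
        String[] sentence = {"Mr", ".", "Vinken", "is", "chairman"};
        // Prints [w=Vinken, pre=Vin, suf=ken, p1w=., n1w=is]
        System.out.println(features(sentence, 2));
    }
}
```

Every string produced this way becomes one feature the trainer can weight,
which is why adding a generator changes what ends up inside the model file.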

Configuring custom features is not easy. I would start with the default,
use 10-fold cross-validation, and take notes of its effectiveness. Then
change or add a feature, evaluate, and take notes again. Sometimes a feature
that we are sure would help can destroy the model's effectiveness.
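
As a rough illustration of what 10-fold cross-validation does with the
corpus, the sketch below assigns every sample index to exactly one test
fold. It is only an assumption about one simple way to partition the data
(round-robin by index); the class and method names are hypothetical and this
is not the OpenNLP cross-validation code.

```java
import java.util.ArrayList;
import java.util.List;

class KFoldSketch {

    /** Returns the indices of the samples held out for testing in the given fold. */
    static List<Integer> testIndices(int sampleCount, int folds, int fold) {
        List<Integer> held = new ArrayList<>();
        // Round-robin assignment: sample i belongs to fold (i % folds).
        for (int i = fold; i < sampleCount; i += folds) {
            held.add(i);
        }
        return held;
    }

    public static void main(String[] args) {
        int samples = 25;
        int folds = 10;
        for (int f = 0; f < folds; f++) {
            // All remaining samples would be used for training in this round.
            System.out.println("fold " + f + " tests on " + testIndices(samples, folds, f));
        }
    }
}
```

Each round trains on the other nine folds and evaluates on the held-out one,
so a feature change is judged on data the model did not see during training.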

Regards
William


2016-06-29 7:00 GMT-03:00 Damiano Porta <damianopo...@gmail.com>:

> Thank you William! Really appreciated!
>
> I only do not get one point, when you said "You could increment your
> model using
> Custom Feature Generators" does it mean that i can "put" these features
> inside ONE *.bin* file (model) that implement different things, or, name
> finder is one thing and those feature generators other?
>
> Thank you in advance for the clarification.
>
> 2016-06-29 1:23 GMT+02:00 William Colen <william.co...@gmail.com>:
>
> > Not exactly. You would create a new NER model to replace yours.
> >
> > In this approach you would need a corpus like this:
> >
> > <START:personMale> Pierre Vinken <END> , 61 years old , will join the board
> > as a nonexecutive director Nov. 29 .
> > Mr . <START:personMale> Vinken <END> is chairman of Elsevier N.V. , the
> > Dutch publishing group . <START:personFemale> Jessie Robson <END> is
> > retiring , she was a board member for 5 years .
> >
> >
> > I am not an English native speaker, so I am not sure if the example is
> > clear enough. I tried to use Jessie as a neutral name and "she" as
> > disambiguation.
> >
> > With a corpus big enough maybe you could create a model that outputs both
> > classes, personMale and personFemale. To train a model you can follow
> >
> >
> https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training
> >
> > Let's say your results are not good enough. You could increment your
> model
> > using Custom Feature Generators (
> >
> >
> https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training.featuregen
> > and
> >
> >
> https://opennlp.apache.org/documentation/1.6.0/apidocs/opennlp-tools/opennlp/tools/util/featuregen/package-summary.html
> > ).
> >
> > One of the implemented featuregen can take a dictionary (
> >
> >
> https://opennlp.apache.org/documentation/1.6.0/apidocs/opennlp-tools/opennlp/tools/util/featuregen/DictionaryFeatureGenerator.html
> > ).
> > You can also implement other convenient FeatureGenerator, for instance
> > regex.
> >
> > Again, it is just a wild guess of how to implement it. I don't know if it
> > would perform well. I was only thinking how to implement a gender ML
> model
> > that uses the surrounding context.
> >
> > Hope I could clarify.
> >
> > William
> >
> > 2016-06-28 19:15 GMT-03:00 Damiano Porta <damianopo...@gmail.com>:
> >
> > > Hi William,
> > > Ok, so you are talking about a kind of pipe where we execute:
> > >
> > > 1. NER (personM for example)
> > > 2. Regex (filter to reduce false positives)
> > > 3. Plain dictionary (filter as above) ?
> > >
> > > Yes we can split out model in two for M and F, it is not a big problem,
> > we
> > > have a database grouped by gender.
> > >
> > > I only have a doubt regarding the use of a dictionary. Because if we
> use
> > a
> > > dictionary to create the model, we could only use it to detect names
> > > without using NER. No?
> > >
> > >
> > >
> > > 2016-06-29 0:10 GMT+02:00 William Colen <william.co...@gmail.com>:
> > >
> 

Re: Model to detect the gender

2016-06-28 Thread William Colen
Not exactly. You would create a new NER model to replace yours.

In this approach you would need a corpus like this:

<START:personMale> Pierre Vinken <END> , 61 years old , will join the board
as a nonexecutive director Nov. 29 .
Mr . <START:personMale> Vinken <END> is chairman of Elsevier N.V. , the
Dutch publishing group . <START:personFemale> Jessie Robson <END> is
retiring , she was a board member for 5 years .


I am not a native English speaker, so I am not sure if the example is
clear enough. I tried to use Jessie as a neutral name and "she" as
disambiguation.

With a corpus big enough maybe you could create a model that outputs both
classes, personMale and personFemale. To train a model you can follow
https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training
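
For a quick sanity check of such a corpus before training, something like
the sketch below could list the annotated names per class. It assumes the
corpus uses the name finder's <START:type> ... <END> markup with the two
classes mentioned above; the class name and the regular expression are
illustrative only and are not part of OpenNLP.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GenderCorpusSketch {

    // Matches one annotated span: <START:type> some tokens <END>
    private static final Pattern SPAN =
            Pattern.compile("<START:(\\w+)>\\s+(.+?)\\s+<END>");

    /** Returns "type=name" for every annotated span in the line. */
    static List<String> spans(String line) {
        List<String> result = new ArrayList<>();
        Matcher m = SPAN.matcher(line);
        while (m.find()) {
            result.add(m.group(1) + "=" + m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "Mr . <START:personMale> Vinken <END> is chairman . "
                + "<START:personFemale> Jessie Robson <END> is retiring , "
                + "she was a board member for 5 years .";
        // Prints [personMale=Vinken, personFemale=Jessie Robson]
        System.out.println(spans(line));
    }
}
```

Counting the spans per class this way gives a first idea of whether the two
classes are balanced enough to be worth training on.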

Let's say your results are not good enough. You could increment your model
using Custom Feature Generators (
https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training.featuregen
and
https://opennlp.apache.org/documentation/1.6.0/apidocs/opennlp-tools/opennlp/tools/util/featuregen/package-summary.html
).

One of the implemented featuregen can take a dictionary (
https://opennlp.apache.org/documentation/1.6.0/apidocs/opennlp-tools/opennlp/tools/util/featuregen/DictionaryFeatureGenerator.html
).
You can also implement another convenient FeatureGenerator, for instance a
regex-based one.

Again, it is just a wild guess of how to implement it. I don't know if it
would perform well. I was only thinking how to implement a gender ML model
that uses the surrounding context.

Hope I could clarify.

William

2016-06-28 19:15 GMT-03:00 Damiano Porta <damianopo...@gmail.com>:

> Hi William,
> Ok, so you are talking about a kind of pipe where we execute:
>
> 1. NER (personM for example)
> 2. Regex (filter to reduce false positives)
> 3. Plain dictionary (filter as above) ?
>
> Yes we can split out model in two for M and F, it is not a big problem, we
> have a database grouped by gender.
>
> I only have a doubt regarding the use of a dictionary. Because if we use a
> dictionary to create the model, we could only use it to detect names
> without using NER. No?
>
>
>
> 2016-06-29 0:10 GMT+02:00 William Colen <william.co...@gmail.com>:
>
> > Do you plan to use the surrounding context? If yes, maybe you could try
> to
> > split NER in two categories: PersonM and PersonF. Just an idea, never
> read
> > or tried anything like it. You would need a training corpus with these
> > classes.
> >
> > You could add both the plain dictionary and the regex as NER features as
> > well and check how it improves.
> >
> > 2016-06-28 18:56 GMT-03:00 Damiano Porta <damianopo...@gmail.com>:
> >
> > > Hello everybody,
> > >
> > > we built a NER model to find persons (name) inside our documents.
> > > We are looking for the best approach to understand if the name is
> > > male/female.
> > >
> > > Possible solutions:
> > > - Plain dictionary?
> > > - Regex to check the initial and/letters of the name?
> > > - Classifier? (naive bayes? Maxent?)
> > >
> > > Thanks
> > >
> >
>


Re: Model to detect the gender

2016-06-28 Thread William Colen
Do you plan to use the surrounding context? If yes, maybe you could try to
split NER in two categories: PersonM and PersonF. Just an idea, never read
or tried anything like it. You would need a training corpus with these
classes.

You could add both the plain dictionary and the regex as NER features as
well and check how it improves.

2016-06-28 18:56 GMT-03:00 Damiano Porta :

> Hello everybody,
>
> we built a NER model to find persons (name) inside our documents.
> We are looking for the best approach to understand if the name is
> male/female.
>
> Possible solutions:
> - Plain dictionary?
> - Regex to check the initial and/letters of the name?
> - Classifier? (naive bayes? Maxent?)
>
> Thanks
>


Re: DeepLearning4J as a ML for OpenNLP

2016-06-28 Thread William Colen
Thank you for the pointer, Prof. Chris. Could you point me to the exact
project at http://scispark.jpl.nasa.gov/ that I should look at? It is huge.

Thank you again.
William

William Colen

2016-06-28 18:26 GMT-03:00 Mattmann, Chris A (3980) <
chris.a.mattm...@jpl.nasa.gov>:

> Yep I think so - you may also look at SciSpark
> http://scispark.jpl.nasa.gov
> where we are using DL4J/ND4J and Breeze interchangeably here.
>
> Cheers,
> Chris
>
> ++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Director, Information Retrieval and Data Science Group (IRDS)
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> WWW: http://irds.usc.edu/
> ++++++++++
>
> On 6/28/16, 2:23 PM, "William Colen" <william.co...@gmail.com> wrote:
>
> >Hi,
> >
> >Do you think it would be possible to implement a ML based on DL4J?
> >
> >http://deeplearning4j.org/
> >
> >Thank you
> >William
>


Re: DeepLearning4J as a ML for OpenNLP

2016-06-28 Thread William Colen
Suneel,

I mean an implementation so we can use DL4J to train the OpenNLP models,
just like we already do in the opennlp.tools.ml package with Maxent,
Perceptron and NaiveBayes. I think it was Jörn who also did a few others that
are in the sandbox: Mallet and Mahout.

Thank you!
William

2016-06-28 18:27 GMT-03:00 Suneel Marthi <suneel_mar...@yahoo.com.invalid>:

> Are u looking at using ND4J (from Deeplearning4j project) as the Math
> backend for ML work? If so, yes.
>
>
>   From: William Colen <william.co...@gmail.com>
>  To: "dev@opennlp.apache.org" <dev@opennlp.apache.org>
>  Sent: Tuesday, June 28, 2016 5:23 PM
>  Subject: DeepLearning4J as a ML for OpenNLP
>
> Hi,
>
> Do you think it would be possible to implement a ML based on DL4J?
>
> http://deeplearning4j.org/
>
> Thank you
> William
>
>
>
>


DeepLearning4J as a ML for OpenNLP

2016-06-28 Thread William Colen
Hi,

Do you think it would be possible to implement a ML based on DL4J?

http://deeplearning4j.org/

Thank you
William


Re: Sentiment Analysis Parser updates

2016-06-28 Thread William Colen
Hi,

I tried your code. Very good work so far! Congratulations.

Is the examples/result file corrupted? It has only one line.

Do you plan to implement a simple CLI to use it interactively from command
line, similar to

bin/opennlp Doccat
bin/opennlp TokenNameFinder

?

Also, do you plan to add evaluation tools by extending
AbstractEvaluatorTool and AbstractCrossValidatorTool, as well as the
listener EvaluationErrorPrinter? I find these tools very useful while
developing new models and features; maybe you would find them useful as well.

You could also check the DoccatFineGrainedReportListener as a starting point
to create a confusion matrix (I think it would be easy because the Doccat data
structures are similar to yours).

The result would look like the following (this is a 300-entry Portuguese
corpus I am building from Facebook messages):


=== Evaluation summary ===
  Number of documents:    298
    Min sentence size:      1
    Max sentence size:    463
Average sentence size:  18,01
     Categories count:      4
             Accuracy: 61,41%

=== Detailed Accuracy By Tag ===

-------------------------------------------------------------------------
|      Tag | Errors |  Count |   % Err | Precision | Recall | F-Measure |
-------------------------------------------------------------------------
|  neutral |     46 |     56 | 0,821   | 0,588     | 0,179  | 0,274     |
| positive |     46 |     70 | 0,657   | 0,48      | 0,343  | 0,4       |
| negative |     18 |    167 | 0,108   | 0,651     | 0,892  | 0,753     |
|     spam |      5 |      5 | 1       | 0         | 0      | 0         |
-------------------------------------------------------------------------

=== Confusion matrix ===

     a      b      c      d | Accuracy | <-- classified as
 <149>     13      4      1 |   89,22% |   a = negative
    42   <24>      3      1 |   34,29% |   b = positive
    35     11   <10>      . |   17,86% |   c = neutral
     3      2      .    <.> |       0% |   d = spam
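
The per-tag numbers can be recomputed from the confusion matrix alone. The
sketch below is plain Java with hypothetical names, not the
DoccatFineGrainedReportListener code; it reads the matrix with rows as the
reference class and columns as the predicted class, in the order negative,
positive, neutral, spam, and treats "." as 0.

```java
import java.util.Locale;

class ConfusionMatrixSketch {

    /** Fraction of all documents that lie on the diagonal. */
    static double accuracy(int[][] m) {
        int correct = 0;
        int total = 0;
        for (int r = 0; r < m.length; r++) {
            for (int c = 0; c < m[r].length; c++) {
                total += m[r][c];
                if (r == c) {
                    correct += m[r][c];
                }
            }
        }
        return (double) correct / total;
    }

    /** Correct predictions of class k divided by everything predicted as k (column sum). */
    static double precision(int[][] m, int k) {
        int predicted = 0;
        for (int[] row : m) {
            predicted += row[k];
        }
        return predicted == 0 ? 0.0 : (double) m[k][k] / predicted;
    }

    /** Correct predictions of class k divided by everything that really is k (row sum). */
    static double recall(int[][] m, int k) {
        int actual = 0;
        for (int v : m[k]) {
            actual += v;
        }
        return actual == 0 ? 0.0 : (double) m[k][k] / actual;
    }

    /** Harmonic mean of precision and recall. */
    static double f1(int[][] m, int k) {
        double p = precision(m, k);
        double r = recall(m, k);
        return p + r == 0 ? 0.0 : 2 * p * r / (p + r);
    }

    public static void main(String[] args) {
        int[][] m = {
            {149, 13, 4, 1},   // negative
            {42, 24, 3, 1},    // positive
            {35, 11, 10, 0},   // neutral
            {3, 2, 0, 0}       // spam
        };
        System.out.printf(Locale.ROOT, "accuracy=%.4f%n", accuracy(m));
        System.out.printf(Locale.ROOT, "negative: P=%.3f R=%.3f F=%.3f%n",
                precision(m, 0), recall(m, 0), f1(m, 0));
    }
}
```

For the negative class this gives 149/229 for precision and 149/167 for
recall, which matches the 0,651 and 0,892 shown in the report above.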




Regards,
William

2016-06-23 2:11 GMT-03:00 Mattmann, Chris A (3980) <
chris.a.mattm...@jpl.nasa.gov>:

> Thank you Jason!
>
> ++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Director, Information Retrieval and Data Science Group (IRDS)
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> WWW: http://irds.usc.edu/
> ++
>
>
> On 6/22/16, 8:41 PM, "Jason Baldridge"  wrote:
>
> >Anastasija,
> >
> >There might be a few appropriate sentiment datasets listed in my homework
> >on Twitter sentiment analysis:
> >
> >https://github.com/utcompling/applied-nlp/wiki/Homework5
> >
> >There may also be some useful data sets in the Crowdflower Open Data
> >collection:
> >
> >https://www.crowdflower.com/data-for-everyone/
> >
> >Hope this helps!
> >
> >-Jason
> >
> >On Wed, 22 Jun 2016 at 15:59 Anastasija Mensikova <
> >mensikova.anastas...@gmail.com> wrote:
> >
> >> Hi everyone,
> >>
> >> Some updates on our Sentiment Analysis Parser work.
> >>
> >> You might have noticed, I have enhanced our website (the GH page)
> recently,
> >> polished it and made it more user-friendly. My next step will be
> sending a
> >> pull request to Tika. However, my main goal until the end of Google
> Summer
> >> of Code is to enhance the parser in a way that will allow it to work
> >> categorically (in other words, the sentiment determined won't be just
> >> positive or negative, it will have a few categories). This means that my
> >> next step is to look for a categorical open data set (which I will
> >> hopefully do by the end of the weekend the latest) and, of course,
> enhance
> >> my model and training. After that I will look into how the confidence
> >> levels can be increased.
> >>
> >> Have a great day/night!
> >>
> >> Thank you,
> >> Anastasija Mensikova.
> >>
>


Re: Usages of Adaptive features.

2016-06-28 Thread William Colen
You can also activate the monitor from the command line, using the
-misclassified and -detailedF options:

bin/opennlp TokenNameFinderCrossValidator
Usage: opennlp
TokenNameFinderCrossValidator[.ontonotes|.bionlp2004|.conll03|.conll02|.ad|.evalita|.muc6|.brat]
[-factory factoryName] [-resources resourcesDir] [-type modelType]
[-featuregen featuregenFile] [-nameTypes types] [-sequenceCodec codec]
[-params paramsFile] -lang language [-misclassified true|false] [-folds
num] [-detailedF true|false] -data sampleData [-encoding charsetName]

Arguments description:
    -factory factoryName
        A sub-class of TokenNameFinderFactory
    -resources resourcesDir
        The resources directory
    -type modelType
        The type of the token name finder model
    -featuregen featuregenFile
        The feature generator descriptor file
    -nameTypes types
        name types to use for training
    -sequenceCodec codec
        sequence codec used to code name spans
    -params paramsFile
        training parameters file.
    -lang language
        language which is being processed.
    -misclassified true|false
        if true will print false negatives and false positives.
    -folds num
        number of folds, default is 10.
    -detailedF true|false
        if true will print detailed FMeasure results.
    -data sampleData
        data to be used, usually a file name.
    -encoding charsetName
        encoding for reading and writing text, if absent the system default is used.

William Colen

2016-06-28 11:04 GMT-03:00 William Colen <william.co...@gmail.com>:

>
> https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training.featuregen
>
> Do you have a specific question?
>
> You can try the default feature generator and check how your model
> performs in terms of precision and recall. You can take a look at the kind
> of errors it makes (use an EvaluationMonitor
> https://opennlp.apache.org/documentation/1.6.0/apidocs/opennlp-tools/opennlp/tools/util/eval/EvaluationMonitor.html)
> and try to figure out which missing features would help it perform
> better.
> Add the features and check precision and recall again.
>
> 2016-06-21 13:45 GMT-03:00 <rakeshbe...@gmail.com>:
>
>> Please share the usages of Adaptive features that are used in NER tagging?
>>
>> Regards,
>> Rakesh.P
>>
>
>


Re: [VOTE] Release OpenNLP 1.6.0 RC 6

2015-07-09 Thread William Colen
Thank you all the voters.

The release vote is closed and we have enough +1 to proceed with the
release.

You can follow the release tasks here:
https://cwiki.apache.org/confluence/display/OPENNLP/ReleasePlanAndTasks1.6.0

Thank you,
William

2015-07-08 18:26 GMT-03:00 Rodrigo Agerri rage...@apache.org:

 +1 for the release. It all looks good to me and I think it will be
 nice to have the 1.6.0 out.

 Rodrigo

 On Wed, Jul 1, 2015 at 2:37 PM, William Colen co...@apache.org wrote:
  +1 for the release
 
  I repeated a few tests and used the distributables in another project.
 
  2015-06-30 9:20 GMT-03:00 Joern Kottmann kottm...@gmail.com:
 
  +1 in addition to the other tests I verified all the hashes and
 signatures.
  They are all good.
 
  Jörn
  On Jun 16, 2015 4:51 PM, William Colen co...@apache.org wrote:
 
   Hello,
  
   Let's vote to release RC 6 as OpenNLP 1.6.0.
  
   The testing of it is documented here:
   https://cwiki.apache.org/confluence/display/OPENNLP/TestPlan1.6.0
  
   The RC can be downloaded here:
   http://people.apache.org/~colen/releases/opennlp-1.6.0/rc6
  
   The release notes can be found here:
  
  
 
 http://people.apache.org/~colen/releases/opennlp-1.6.0/rc6/RELEASE_NOTES.html
  
   Please vote to approve this release:
   [ ] +1 Approve the release
   [ ] -1 Veto the release (please provide specific comments)
   [ ] 0   Don't care
  
   Please report any problems you may find.
  
 



Re: [VOTE] Release OpenNLP 1.6.0 RC 6

2015-07-01 Thread William Colen
+1 for the release

I repeated a few tests and used the distributables in another project.

2015-06-30 9:20 GMT-03:00 Joern Kottmann kottm...@gmail.com:

 +1 in addition to the other tests I verified all the hashes and signatures.
 They are all good.

 Jörn
 On Jun 16, 2015 4:51 PM, William Colen co...@apache.org wrote:

  Hello,
 
  Let's vote to release RC 6 as OpenNLP 1.6.0.
 
  The testing of it is documented here:
  https://cwiki.apache.org/confluence/display/OPENNLP/TestPlan1.6.0
 
  The RC can be downloaded here:
  http://people.apache.org/~colen/releases/opennlp-1.6.0/rc6
 
  The release notes can be found here:
 
 
 http://people.apache.org/~colen/releases/opennlp-1.6.0/rc6/RELEASE_NOTES.html
 
  Please vote to approve this release:
  [ ] +1 Approve the release
  [ ] -1 Veto the release (please provide specific comments)
  [ ] 0   Don't care
 
  Please report any problems you may find.
 



Re: OpenNLP: Named Entity Recognition ( Token Name Finder )

2015-06-17 Thread William Colen
I can't remember if the iterations parameter is used in PERCEPTRON.
From my experience with other tools, you should use a cutoff of 0. Perceptron
takes advantage of every feature you add.
Did you try the evaluation tools to compute F1?
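
To show what the cutoff does to the feature set, here is a small sketch in
plain Java. It assumes the cutoff means "keep a feature only if it was seen
at least that many times in the training events"; the class and method names
are hypothetical and this is not the OpenNLP indexer.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CutoffSketch {

    /** Counts the observed features and keeps those seen at least cutoff times. */
    static Map<String, Integer> applyCutoff(List<String> observed, int cutoff) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String feature : observed) {
            counts.merge(feature, 1, Integer::sum);
        }
        // Drop every feature whose count is below the cutoff.
        counts.values().removeIf(count -> count < cutoff);
        return counts;
    }

    public static void main(String[] args) {
        List<String> observed = Arrays.asList(
                "w=Vinken", "suf=ken", "suf=ken", "p1w=Mr", "p1w=Mr", "p1w=Mr");
        // Prints {p1w=Mr=3, suf=ken=2}: the feature seen only once is dropped.
        System.out.println(applyCutoff(observed, 2));
        // Prints {p1w=Mr=3, suf=ken=2, w=Vinken=1}: a cutoff of 0 keeps everything.
        System.out.println(applyCutoff(observed, 0));
    }
}
```

With a cutoff of 5 the rare features never reach the trainer at all, which
is the information a cutoff of 0 would give back to the perceptron.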



2015-06-17 13:25 GMT-03:00 nikhil jain nikhil_jain1...@yahoo.com.invalid:

 Hello,


 Did anyone get a chance to look at this.

 Please provide some feedback.


 Thanks

 Nikhil Jain

 Sent from Yahoo Mail on Android

 From:nikhil jain nikhil_jain1...@yahoo.com.INVALID
 Date:Tue, Jun 16, 2015 at 4:36 PM
 Subject:Re: OpenNLP: Named Entity Recognition ( Token Name Finder )

 Hi William,


 Thanks for the link.


 I have tried both models, Maxent and Perceptron, on my problem, and Perceptron
 is working much better than Maxent.


 I have one question, when I am creating a perceptron model using cutoff 5
 and iterations 100 then after 5th iteration model is adjusting itself and
 not going forward for further iterations, so my question is, is it correct
 behaviour or I am doing something wrong.


 Adding some code and logs for the reference.


 ObjectStream<NameSample> sampleStream = new NameSampleDataStream(lineStream);

 TokenNameFinderModel model = null;

 TrainingParameters tp = new TrainingParameters();
 // tp.put(TrainingParameters.ALGORITHM_PARAM, "MAXENT");
 tp.put(TrainingParameters.ALGORITHM_PARAM, "PERCEPTRON");
 System.out.println("244:Hybrid parser:PERCEPTRON");
 tp.put(TrainingParameters.ITERATIONS_PARAM, Integer.toString(100));
 tp.put(TrainingParameters.CUTOFF_PARAM, Integer.toString(5));
 tp.put("Threads", "3");

 // No custom feature generator is configured here, so null is passed.
 opennlp.tools.util.featuregen.AdaptiveFeatureGenerator generator = null;

 try {
   // No additional resources are passed to the trainer.
   Map<String, Object> resources = null;
   model = NameFinderME.train("en", "security", sampleStream, tp, generator, resources);
 } catch (IOException e) {





 Indexing events using cutoff of 5



Computing event counts...  done. 8209384 events

Indexing...  done.

 Collecting events... Done indexing.

 Incorporating indexed data for training...

 done.

Number of Event Tokens: 8209384

Number of Outcomes: 34

  Number of Predicates: 325780

 Computing model parameters...

 Performing 100 iterations.

   1:  . (8209184/8209384) 0.999975637636149

   2:  . (8209291/8209384) 0.9999886715008093

   3:  . (8209340/8209384) 0.9999946402799528

   4:  . (8209356/8209384) 0.9999965892690609

   5:  . (8209357/8209384) 0.9999967110808802

 Stopping: change in training set accuracy less than 1.0E-5

 Stats: (8104703/8209384) 0.9872486169486042

 ...done.

 Compressed 325780 parameters to 3957

 532 outcome patterns



 Thanks

 Nikhil


 Sent from Yahoo Mail on Android

 From:William Colen william.co...@gmail.com
 Date:Fri, May 29, 2015 at 5:47 PM
 Subject:Re: OpenNLP: Named Entity Recognition ( Token Name Finder )

 The answer about the differences would be quite long. You can learn about
 the theory researching online. Try some papers from here:
 https://cwiki.apache.org/confluence/display/OPENNLP/NLP+Papers

 Which algorithm is better for you depends on your task and your data. You
 can start developing using the standard Maxent and when your environment is
 ready you can try other ML implementations.

 Regards,
 William


 2015-05-29 7:07 GMT-03:00 nikhil jain nikhil_jain1...@yahoo.com.invalid:

  Hello,
 
 
  Did anyone get a chance to look at the email? I know I am asking a very
  basic question, but being new to this subject, it's very difficult to
  understand the terms.


  I tried to understand by reading wiki pages but did not fully understand,
  which is why I raised the question.
 
 
  Thanks
 
  Nikhil
 
  Sent from Yahoo Mail on Android
 
  From:nikhil jain nikhil_jain1...@yahoo.com
  Date:Tue, May 19, 2015 at 11:51 PM
  Subject:OpenNLP: Named Entity Recognition ( Token Name Finder )
 
  Hello Everyone,
 
 
  I was reading a openNLP documentation, and found that OpenNLP supports
  Maxent, Perceptron and Perceptron sequence type models.
 
 
  Could someone please explain me the difference in between them?
 
 
  I am trying to understand which one would be good for tagging sequence of
  data.
 
 
  BTW, I am new in NLP and Machine learning. so please help me to
 understand
  this.
 
 
  Thanks
 
  Nikhil Jain
 
 
 
 
 
 
 




[VOTE] Release OpenNLP 1.6.0 RC 6

2015-06-16 Thread William Colen
Hello,

Let's vote to release RC 6 as OpenNLP 1.6.0.

The testing of it is documented here:
https://cwiki.apache.org/confluence/display/OPENNLP/TestPlan1.6.0

The RC can be downloaded here:
http://people.apache.org/~colen/releases/opennlp-1.6.0/rc6

The release notes can be found here:
http://people.apache.org/~colen/releases/opennlp-1.6.0/rc6/RELEASE_NOTES.html

Please vote to approve this release:
[ ] +1 Approve the release
[ ] -1 Veto the release (please provide specific comments)
[ ] 0   Don't care

Please report any problems you may find.


Re: OpenNLP: Named Entity Recognition ( Token Name Finder )

2015-05-29 Thread William Colen
The answer about the differences would be quite long. You can learn about
the theory researching online. Try some papers from here:
https://cwiki.apache.org/confluence/display/OPENNLP/NLP+Papers

Which algorithm is better for you depends on your task and your data. You
can start developing using the standard Maxent and when your environment is
ready you can try other ML implementations.

Regards,
William


2015-05-29 7:07 GMT-03:00 nikhil jain nikhil_jain1...@yahoo.com.invalid:

 Hello,


 Did anyone get a chance to look at the email? I know I am asking a very
 basic question, but being new to this subject, it's very difficult to
 understand the terms.


 I tried to understand by reading wiki pages but did not fully understand,
 which is why I raised the question.


 Thanks

 Nikhil

 Sent from Yahoo Mail on Android

 From:nikhil jain nikhil_jain1...@yahoo.com
 Date:Tue, May 19, 2015 at 11:51 PM
 Subject:OpenNLP: Named Entity Recognition ( Token Name Finder )

 Hello Everyone,


 I was reading a openNLP documentation, and found that OpenNLP supports
 Maxent, Perceptron and Perceptron sequence type models.


 Could someone please explain me the difference in between them?


 I am trying to understand which one would be good for tagging sequence of
 data.


 BTW, I am new in NLP and Machine learning. so please help me to understand
 this.


 Thanks

 Nikhil Jain









Re: OpenNLP 1.6.0 RC 4 ready for testing

2015-05-28 Thread William Colen
Yes, the sentence detector does not match, but it is such a small change
that I would not even investigate.
It is an empty line that used to be included in the 1.5.3 output, but it is
not included in 1.6.0 anymore:


*Output in 1.5.3:*
[image: inline image 1]

*Output in 1.6.0:*
[image: inline image 3]


I would ignore the change.


Thank you,
William


2015-05-28 7:28 GMT-03:00 Joern Kottmann kottm...@gmail.com:

 The chunker and parser tests are fine now.

 Do you know what's the deal with the sentence detector?

 The compatibility test is marked as failed. Can we leave it like that or
 do we have to fix some bugs?

 Jörn
 On May 23, 2015 5:35 AM, William Colen co...@apache.org wrote:

 Our fourth release candidate is ready for testing. RC 3 failed in the
 compatibility, regression and performance tests, which are fixed in RC 4.

 The RC 4 can be downloaded from here:
 http://people.apache.org/~colen/releases/opennlp-1.6.0/rc4/

 To use it in a maven build set the version for opennlp-tools or
 opennlp-uima to 1.6.0 and add the following URL to your settings.xml file:
 https://repository.apache.org/content/repositories/orgapacheopennlp-1003

 The current test plan can be found here:
 https://cwiki.apache.org/confluence/display/OPENNLP/TestPlan1.6.0

 Please sign up for tasks in the test plan.

 The release plan can be found here:

 https://cwiki.apache.org/confluence/display/OPENNLP/ReleasePlanAndTasks1.6.0

 The release contains quite a few changes, please refer to the contained
 issue list for details.

 For your convenience, a copy of the issue list, as well as the release
 notes and the readme, can be found in the following link:


 http://people.apache.org/~colen/releases/opennlp-1.6.0/rc4/RELEASE_NOTES.html


 Thank you,
 William




OpenNLP 1.6.0 RC 3 ready for testing

2015-04-30 Thread William Colen
Our third release candidate is ready for testing. RC 2 failed in the
compatibility and regression tests, which are fixed in RC 3.

The RC 3 can be downloaded from here:
http://people.apache.org/~colen/releases/opennlp-1.6.0/rc3/

To use it in a maven build set the version for opennlp-tools or
opennlp-uima to 1.6.0 and add the following URL to your settings.xml file:
https://repository.apache.org/content/repositories/orgapacheopennlp-1002

The current test plan can be found here:
https://cwiki.apache.org/confluence/display/OPENNLP/TestPlan1.6.0

Please sign up for tasks in the test plan.

The release plan can be found here:
https://cwiki.apache.org/confluence/display/OPENNLP/ReleasePlanAndTasks1.6.0

The release contains quite a few changes, please refer to the contained
issue list for details.

For your convenience, a copy of the issue list, as well as the release
notes and the readme, can be found in the following link:

http://people.apache.org/~colen/releases/opennlp-1.6.0/rc3/RELEASE_NOTES.html


Thank you,
William


Re: Automated testing with public data

2015-04-29 Thread William Colen
+1

The script would also be great for documentation.

2015-04-29 11:15 GMT-03:00 Joern Kottmann kottm...@gmail.com:

 Or we just make a download script which bootstraps the users corpus folder.

 Could be a couple of wget lines or so ...


 Jörn

 On Wed, Apr 29, 2015 at 6:17 AM, William Colen william.co...@gmail.com
 wrote:

  Automating the download would be fine as long as we cache it, as Richard
  suggested. Maybe it could be done by a script to prepare the environment,
  and not be part of the unit test itself.
  Anyway, it would be a good idea to save the data somewhere because we
 never
  know if some of the websites will become unavailable in the future.
 
 
  2015-04-15 5:31 GMT-03:00 Richard Eckart de Castilho 
  richard.eck...@gmail.com:
 
   On 15.04.2015, at 10:23, Joern Kottmann kottm...@gmail.com wrote:
  
With publicly accessible data I mean a corpus you can somehow
 acquire,
opposed to the data you create on your own for a project.
   
All the corpora we support in the formats package are publicly
   accessible.
Maybe
some you have to buy and for others you just have to sign some
  agreement.
   
A very interesting corpus for testing (and training models on) is
   OntoNotes.
   
Here is a link to the LDC entry:
https://catalog.ldc.upenn.edu/LDC2011T03
   
You can get it for free (or for a small distribution fee) but you
 can't
just download it.
It would be great if the ASF could acquire this data set so we can
  share
   it
among the committers.
   
Is that what you mean with proprietary data?
  
   Yes, that is what I mean.
  
   E.g. the TIGER corpus requires clicking through some pages and forms to
   reach a download page, but in principle, it appears as if the corpus
 was
   simply downloadable by a deep-link URL. The license terms state, that
 the
   corpus must not be redistributed.
  
   Some tools are also publicly accessible and downloadable but not
   redistributable. For example anybody can download TreeTagger and its
   models, but only from the original homepage. It is not permitted to
   redistribute it, i.e. to publish it to a repository or offer it on an
   alternative homepage.
  
   So there is a (small) class of resources between being redistributable
  and
   proprietary (for fee), namely being in principle publicly accessible
 (for
   free) but not redistributable.
  
   Cheers,
  
   -- Richard
 



Re: Automated testing with public data

2015-04-28 Thread William Colen
Automating the download would be fine as long as we cache it, as Richard
suggested. Maybe it could be done by a script to prepare the environment,
and not be part of the unit test itself.
Anyway, it would be a good idea to save the data somewhere because we never
know if some of the websites will become unavailable in the future.
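
A minimal sketch of the "download once, then cache" idea, assuming a plain local cache directory. The class name `CorpusCache`, the method `fetch`, and the cache layout are hypothetical and not part of OpenNLP; license checks for non-redistributable corpora are deliberately left out.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper: fetches a corpus file once and reuses the cached copy afterwards. */
class CorpusCache {

  private final Path cacheDir;

  CorpusCache(Path cacheDir) {
    this.cacheDir = cacheDir;
  }

  /** Returns the cached file, downloading it from the source only if it is missing. */
  Path fetch(String fileName, URI source) throws IOException {
    Path target = cacheDir.resolve(fileName);
    if (Files.exists(target)) {
      return target; // cache hit: the source is not contacted at all
    }
    Files.createDirectories(cacheDir);
    // Download into a temporary file so an aborted transfer never looks like a valid entry.
    Path tmp = Files.createTempFile(cacheDir, fileName, ".part");
    try (InputStream in = source.toURL().openStream()) {
      Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
    } catch (IOException e) {
      Files.deleteIfExists(tmp);
      throw e;
    }
    Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING);
    return target;
  }
}
```

Run from an environment preparation step rather than from the unit tests, this keeps the tests themselves offline: they only read files that are already in the cache directory.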


2015-04-15 5:31 GMT-03:00 Richard Eckart de Castilho 
richard.eck...@gmail.com:

 On 15.04.2015, at 10:23, Joern Kottmann kottm...@gmail.com wrote:

  With publicly accessible data I mean a corpus you can somehow acquire,
  opposed to the data you create on your own for a project.
 
  All the corpora we support in the formats package are publicly
 accessible.
  Maybe
  some you have to buy and for others you just have to sign some agreement.
 
  A very interesting corpus for testing (and training models on) is
 OntoNotes.
 
  Here is a link to the LDC entry:
  https://catalog.ldc.upenn.edu/LDC2011T03
 
  You can get it for free (or for a small distribution fee) but you can't
  just download it.
  It would be great if the ASF could acquire this data set so we can share
 it
  among the committers.
 
  Is that what you mean with proprietary data?

 Yes, that is what I mean.

 E.g. the TIGER corpus requires clicking through some pages and forms to
 reach a download page, but in principle, it appears as if the corpus was
 simply downloadable by a deep-link URL. The license terms state, that the
 corpus must not be redistributed.

 Some tools are also publicly accessible and downloadable but not
 redistributable. For example anybody can download TreeTagger and its
 models, but only from the original homepage. It is not permitted to
 redistribute it, i.e. to publish it to a repository or offer it on an
 alternative homepage.

 So there is a (small) class of resources between being redistributable and
 proprietary (for fee), namely being in principle publicly accessible (for
 free) but not redistributable.

 Cheers,

 -- Richard


Re: Parser performance bug

2015-02-20 Thread William Colen
I might be totally wrong, but I have a feeling that the change is
in ChunkerModel.java, because I also notice a change in the Chunker tool
results. It could be somehow related to the changes in the parameters in
that file. We can't discard the possibility that there was a bug that was
fixed with the changes.


Regards,
William

2015-02-16 12:17 GMT-02:00 Joern Kottmann kottm...@gmail.com:

 Hi all,

 the performance of the parser changed a bit. The output of the current
 version in 1.6.0 RC2 is different from the output of the 1.5.3 release.
 Even though there shouldn't have been any difference as far as I can see.

 The question of what caused that difference came up and I started to
 bisect it.

 Here are my results so far:
 1655561 - 1fe53c0aeaae1eb978dbb83f34b13944f2692b1f (head)
 1591889 - 1fe53c0aeaae1eb978dbb83f34b13944f2692b1f (5/2/14)
 1576093 - 1fe53c0aeaae1eb978dbb83f34b13944f2692b1f  (3/10/14)
 1574819 - 1fe53c0aeaae1eb978dbb83f34b13944f2692b1f (3/6/14)
 1574524 - 1fe53c0aeaae1eb978dbb83f34b13944f2692b1f (3/5/14)
 1574505 - 93c912e100932384465ec740d144a94656f214d3 (3/5/14)
 1573000 - 93c912e100932384465ec740d144a94656f214d3 (2/28/14)
 1569434 - 93c912e100932384465ec740d144a94656f214d3 (2/18/14)
 1569285 - 93c912e100932384465ec740d144a94656f214d3 (2/18/14)
 1554795 - 93c912e100932384465ec740d144a94656f214d3 (1/2/14)
 1463979 - 93c912e100932384465ec740d144a94656f214d3 (1.5.3)

 The first column is the svn revision, the second column the hash of the
 output data and in the parenthesis is the date of the revision or the
 version.
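
A small sketch of how such an output fingerprint could be computed. The 40-character values in the second column look like SHA-1 digests; that is an inference from their length, not something stated in the thread. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper: reduces a tool's complete output to one comparable hex string. */
class OutputFingerprint {

  static String sha1Hex(String output) {
    try {
      MessageDigest md = MessageDigest.getInstance("SHA-1");
      byte[] digest = md.digest(output.getBytes(StandardCharsets.UTF_8));
      StringBuilder hex = new StringBuilder(digest.length * 2);
      for (byte b : digest) {
        // Mask to an unsigned value so negative bytes print as two hex digits.
        hex.append(String.format("%02x", b & 0xff));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      // SHA-1 is one of the algorithms every Java platform must provide.
      throw new IllegalStateException(e);
    }
  }
}
```

Two revisions whose parser output hashes to the same string can be treated as behaving identically on the test input, which is all the bisection needs.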

 The change in the code which caused the difference happened in 1574524.
 I had a quick look there and couldn't see within a few minutes what
 caused the issue. I will probably again use a more systematic approach
 to find the exact change in that commit that causes the difference.
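
The search over revisions can be sketched as a binary search, assuming the output changed exactly once between the oldest and the newest revision. `Bisect`, `firstChanged`, and the `hashOf` callback are hypothetical names; in practice the callback would check out the revision, run the parser, and hash its output.

```java
import java.util.function.LongFunction;

/** Hypothetical helper: finds the first revision whose output hash differs from the oldest one. */
class Bisect {

  /**
   * @param revisions revision numbers sorted from oldest to newest
   * @param hashOf    computes the output hash for one revision
   * @return the first revision with a different hash, or -1 if the newest one still matches
   */
  static long firstChanged(long[] revisions, LongFunction<String> hashOf) {
    String baseline = hashOf.apply(revisions[0]);
    int lo = 0;                     // invariant: revisions[lo] matches the baseline
    int hi = revisions.length - 1;  // invariant: revisions[hi] differs from it
    if (hashOf.apply(revisions[hi]).equals(baseline)) {
      return -1;
    }
    while (hi - lo > 1) {
      int mid = (lo + hi) >>> 1;
      if (hashOf.apply(revisions[mid]).equals(baseline)) {
        lo = mid;
      } else {
        hi = mid;
      }
    }
    return revisions[hi];
  }
}
```

With the eleven revisions listed above, this needs only a handful of parser runs instead of one per revision.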

 Jörn





OpenNLP 1.6.0 RC 2 ready for testing

2015-01-22 Thread William Colen
Hi all,

Our second release candidate is ready for testing. RC 1 failed to pass the
initial tests.

The RC 2 can be downloaded from here:
http://people.apache.org/~colen/releases/opennlp-1.6.0/rc2/

To use it in a maven build set the version for opennlp-tools or
opennlp-uima to 1.6.0 and add the following URL to your settings.xml file:
https://repository.apache.org/content/repositories/orgapacheopennlp-1001

The current test plan can be found here:
https://cwiki.apache.org/confluence/display/OPENNLP/TestPlan1.6.0

Please sign up for tasks in the test plan.

The release plan can be found here:
https://cwiki.apache.org/confluence/display/OPENNLP/ReleasePlanAndTasks1.6.0

The RC contains quite a few changes; please refer to the contained issue
list for details.

Thank you,
William


Re: OpenNLP 1.6.0 RC 2 ready for testing

2015-01-22 Thread William Colen
You can find the issue list in the distributable, in
issuesFixed/jira-report.html, together with a readme that summarizes the most
important issues.

For your convenience -

https://issues.apache.org/jira/browse/OPENNLP-650?jql=project%20%3D%20OPENNLP%20AND%20fixVersion%20%3D%201.6.0

README contents

Apache OpenNLP 1.6.0
====================


Building from the Source Distribution
-------------------------------------

At least Maven 3.0.0 is required for building.

To build everything execute the following command in the root folder:
mvn clean install

The results of the build will be placed in:
opennlp-distr/target/apache-opennlp-[version]-bin.tar.gz (or .zip)

What is new in Apache OpenNLP 1.6.0
-----------------------------------

This release introduces many new features, improvements and bug fixes. The API
has been improved for better consistency, and methods deprecated in 1.4 were
removed. Java 1.7 is now required.

Additionally the release contains the following noteworthy changes:

- Added evaluation support to the parser and doccat components
- Added support for the Evalita 07/09, Brat and OntoNotes corpus formats
- L-BFGS is now stable
- Added Snowball to the Stemmer package
- NameFinder now supports a user defined factory
- Added pluggable machine learning support
- Added a lemmatizer module
- Added Cluster, Document Begin and Clark feature generators to the Name
Finder
- Added Liblinear as a Machine Learning addon
- Entity Linker now has a command line interface
- Added sequence classification support

A detailed list of the issues related to this release can be found in the
release notes.

Requirements

Java 1.7 is required to run OpenNLP
Maven 3.0.0 is required for building it

Known OSGi Issues

In an OSGi environment the following things are not supported:
- The coreference resolution component

2015-01-22 18:13 GMT-02:00 Mike Reed mike.r...@pathar.net:

 To help those of us on the periphery follow along more closely, can you
 isolate and distribute the mentioned issues list please?

 I use the library in .NET via https://www.nuget.org/packages/OpenNLP.NET/
 so I'd like to follow along, but these releases are not directly available
 for me to drop into my solution and I am less aware of the code and
 structures than those who work at a lower level. I am however very
 interested in any and all advances of the library.

 -Original Message-
 From: William Colen [mailto:william.co...@gmail.com]
 Sent: Thursday, January 22, 2015 1:55 PM
 To: dev@opennlp.apache.org
 Subject: OpenNLP 1.6.0 RC 2 ready for testing

 Hi all,

 Our second release candidate is ready for testing. RC 1 failed to pass the
 initial tests.

 The RC 2 can be downloaded from here:
 http://people.apache.org/~colen/releases/opennlp-1.6.0/rc2/

 To use it in a maven build set the version for opennlp-tools or
 opennlp-uima to 1.6.0 and add the following URL to your settings.xml file:
 https://repository.apache.org/content/repositories/orgapacheopennlp-1001

 The current test plan can be found here:
 https://cwiki.apache.org/confluence/display/OPENNLP/TestPlan1.6.0

 Please sign up for tasks in the test plan.

 The release plan can be found here:

 https://cwiki.apache.org/confluence/display/OPENNLP/ReleasePlanAndTasks1.6.0

 The RC contains quite a few changes; please refer to the contained issue
 list for details.

 Thank you,
 William



OpenNLP 1.6.0 RC 1 ready for testing

2014-12-10 Thread William Colen
Hi all,

Our first release candidate is ready for testing.

The RC 1 can be downloaded from here:
http://people.apache.org/~colen/releases/opennlp-1.6.0/rc1/

To use it in a maven build set the version for opennlp-tools or
opennlp-uima to 1.6.0 and add the following URL to your settings.xml file:
https://repository.apache.org/content/repositories/orgapacheopennlp-1000

The current test plan can be found here:
https://cwiki.apache.org/confluence/display/OPENNLP/TestPlan1.6.0

Please sign up for tasks in the test plan.

The release plan can be found here:
https://cwiki.apache.org/confluence/display/OPENNLP/ReleasePlanAndTasks1.6.0

The RC contains quite a few changes; please refer to the contained issue
list for details.

Thank you,
William


Re: Next release (was: Re: 1.6.0 maven repo)

2014-11-21 Thread William Colen
+1 to start the release process

I volunteer to be the release manager for 1.6.0.

2014-11-20 18:32 GMT-02:00 Vinh Khuc knv...@gmail.com:

 +1 for the release of 1.6.0 RC

 Vinh

 On Thu, Nov 20, 2014 at 3:24 PM, Joern Kottmann kottm...@gmail.com
 wrote:

  Yes, all the important issues except one (OPENNLP-730) are closed now.
  There are still a couple of issues open about name finder feature
  generators, but those could also be added to OpenNLP in a 1.6.1 release
  or during testing.
 
  +1 to make the first RC for 1.6.0 and start testing it
 
  Jörn
 
  On Thu, 2014-11-20 at 07:33 +, Rodrigo Agerri wrote:
   +1 to start making a release. I would like to be involved too.
  
   R
   On 19 Nov 2014 23:40, Joern Kottmann kottm...@gmail.com wrote:
  
Hello,
   
yes, that should be the current state.
   
Can you please elaborate on the issue you have.
Do you get an old version?
   
We should try to make a release of 1.6.0, I think most issues
are already solved and remaining bugs we will uncover during the
 manual
testing phase.
   
Jörn
   
On Wed, 2014-11-19 at 21:20 +0100, Rodrigo Agerri wrote:
 Hi

 Any chance to release snapshot repos to maven central? Or to an
  apache
 snapshots repo?

 It would make the use of current trunk via API much easier.

 Cheers

 Rodrigo
   
   
   
 
 
 


 --
 Vinh Khuc



Re: How to sanitize and parse noisy text

2014-07-15 Thread William Colen
A while back I had a similar problem while extracting text from HTML using
Tika. What I did was hack the Tika HTML parser to extract the text the way I
needed. I can't remember exactly how it was, but as far as I remember Tika
raises events when it finds markup (at least HTML markup) that is not
handled by default. If you know the structure of the document you are
reading, you can decide what to do with the markup and maybe change the
output (adding a space, a line break, etc.).
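
A rough sketch of the idea, not the original hack. It assumes the markup arrives as SAX events on a `ContentHandler`, which is how Tika parsers report XHTML. To stay self-contained it feeds the handler from the JDK's own SAX parser instead of Tika, and the class name and the choice of block elements are hypothetical.

```java
import java.io.StringReader;
import java.util.Locale;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: collects text and turns selected markup events into whitespace. */
class BlockAwareTextHandler extends DefaultHandler {

  private final StringBuilder text = new StringBuilder();

  @Override
  public void characters(char[] ch, int start, int length) {
    text.append(ch, start, length);
  }

  @Override
  public void endElement(String uri, String localName, String qName) {
    String name = qName.toLowerCase(Locale.ROOT);
    if (name.equals("p") || name.equals("div") || name.equals("li") || name.matches("h[1-6]")) {
      text.append('\n'); // block elements end a line, so headlines do not glue to body text
    } else if (name.equals("td") || name.equals("th")) {
      text.append(' ');  // table cells are separated by a space
    }
  }

  String getText() {
    return text.toString();
  }

  /** Convenience entry point for well-formed XHTML input. */
  static String extract(String xhtml) throws Exception {
    BlockAwareTextHandler handler = new BlockAwareTextHandler();
    SAXParserFactory.newInstance().newSAXParser()
        .parse(new InputSource(new StringReader(xhtml)), handler);
    return handler.getText();
  }
}
```

Without the `endElement` override, a headline and the paragraph after it would be concatenated with no separator, which is the kind of fused text shown later in this thread.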



2014-07-15 5:00 GMT-03:00 Jörn Kottmann kottm...@gmail.com:

 Text extracted from PDFs must often be cleaned up first, e.g.
 fix tokenization, remove page header/footer, fix hyphenation, detect
 headlines/titles, etc.

 If there are fundamental issues with the plain text the OpenNLP components
 trained on cleaned text will not work very well.
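
One of the cleanup steps mentioned above, fixing hyphenation, can be sketched with a regular expression. This is only an illustration under a simplifying assumption: a lowercase letter, a hyphen, a line break, and another lowercase letter always mark a word split by the layout. The class name is hypothetical.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: rejoins words that were split across lines with a hyphen. */
class Dehyphenator {

  // lowercase letter, hyphen, line break, optional indentation, lowercase letter
  private static final Pattern BROKEN_WORD =
      Pattern.compile("(\\p{Ll})-\\r?\\n[ \\t]*(\\p{Ll})");

  static String join(String text) {
    // Keep the two letters and drop the hyphen, the line break and the indentation.
    return BROKEN_WORD.matcher(text).replaceAll("$1$2");
  }
}
```

The rule also merges genuine compounds that happen to break at a line end ("well-" / "known"), so a dictionary lookup would be needed for better precision.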

 Jörn


 On 07/15/2014 05:38 AM, Carlos Scheidecker wrote:

 Hello all,

 I have an interesting problem here. More of a challenge.

 I have been doing text cleansing for bad characters and all.

 Then I have another interesting problem.

 Extracting a public PDF with Tika does not necessarily mean you will get
 clean text, because the original PDF might use different fonts within a
 section, which causes weird behavior.

 If you then divide it into sentences via OpenNLP, you will get some
 interesting sentences.

 Trying to parse those sentences makes it even worse.

 I am showing an example below and I would like to ask about solutions to
 it, considering the text can be noisy.

 I do not think that it will be easy to fix the sentence parser. Here is
 how I am thinking of approaching it:

 Instead, the best approach may be to look at the poorly parsed sentences,
 parse them, and extract the inner (S) nodes from the parse as separate sentences.
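
A sketch of that extraction, working directly on the bracketed parse string rather than on OpenNLP's parse objects. It returns the token text of every S node that has no S node below it. The class name `InnerClauses` and its behaviour on malformed input are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: extracts the text of the innermost (S ...) nodes of a bracketed parse. */
class InnerClauses {

  private final String src;
  private int pos;
  private final List<String> clauses = new ArrayList<>();

  private InnerClauses(String src) {
    this.src = src;
  }

  static List<String> extract(String bracketed) {
    InnerClauses p = new InnerClauses(bracketed);
    p.skipSpaces();
    if (p.pos < p.src.length() && p.src.charAt(p.pos) == '(') {
      p.node(new StringBuilder());
    }
    return p.clauses;
  }

  /** Parses one "(LABEL child...)" node, appends its tokens to out, reports whether it contains an S. */
  private boolean node(StringBuilder out) {
    pos++; // consume '('
    String label = word();
    StringBuilder own = new StringBuilder();
    boolean nestedS = false;
    skipSpaces();
    while (pos < src.length() && src.charAt(pos) != ')') {
      if (src.charAt(pos) == '(') {
        nestedS |= node(own);       // child node: its tokens are appended to own
      } else {
        if (own.length() > 0) own.append(' ');
        own.append(word());         // leaf token
      }
      skipSpaces();
    }
    pos++; // consume ')', or step past the end if the input is truncated
    boolean isS = label.equals("S");
    if (isS && !nestedS && own.length() > 0) {
      clauses.add(own.toString());  // an S with no S below it is one "inner" sentence
    }
    if (own.length() > 0) {
      if (out.length() > 0) out.append(' ');
      out.append(own);
    }
    return isS || nestedS;
  }

  private String word() {
    int start = pos;
    while (pos < src.length() && !Character.isWhitespace(src.charAt(pos))
        && src.charAt(pos) != '(' && src.charAt(pos) != ')') {
      pos++;
    }
    return src.substring(start, pos);
  }

  private void skipSpaces() {
    while (pos < src.length() && Character.isWhitespace(src.charAt(pos))) pos++;
  }
}
```

This only regroups tokens. Words that were fused during PDF extraction stay fused, so it complements text cleanup rather than replacing it.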

 What would you suggest?

 Here is an example of a piece of text extracted with Tika from a public
 pdf. This part is what OpenNLP considered to be a sentence:

 

 related research DocumentsBrief: Your Next Portal should Be An Engagement
 WorkplaceFebruary 3, 2014Microsoft Aims sharePoint To The CloudJanuary 27,
 2014setting The Technology Foundation For Your social Business And
 Collaboration strategyJuly 29, 2013The Forrester wave : enterprise social
 Platforms, Q2 2014The 13 Providers That Matter Most And How They stack
 Upby
 rob Koplowitzwith Peter Burris and Nancy Wang2257913JUNE 5, 2014For CIos
 The Forrester Wave : Enterprise social Platforms, Q2 2014 2  2014,
 Forrester Research, Inc. Reproduction Prohibited June 5, 2014 eNTeRPRIse
 sOCIaL PLaTFORM MaRKeT MaTuRes aMID CONsOLIDaTION aND INTegRaTIONThe
 enterprise social platform is no longer in its infancy as offerings become
 increasingly functional.

 

 It is now parsed as follows:


 (S (S (S (NP (VBN related) (NN research) (NNP DocumentsBrief:) (NNP Your)
 (NNP Next) (NNP Portal)) (VP (MD should) (VP (VB Be) (NP (NP (DT An) (NNP
 Engagement) (NNP WorkplaceFebruary) (CD 3,) (JJ 2014Microsoft) (NNP Aims)
 (NN sharePoint)) (PP (TO To) (NP (DT The) (NNP CloudJanuary))) (VP
 (VBD
 27,) (S (VP (VBG 2014setting) (NP (NP (NP (DT The) (NNP Technology) (NNP
 Foundation)) (PP (IN For) (NP (PRP$ Your) (JJ social) (NNP Business
 (CC
 And) (NP (NNP Collaboration))) (PP (RB strategyJuly) (NP (CD 29,) (CD
 2013The) (NNP Forrester) (NN wave))) (: :) (S (VP (VB enterprise) (NP
 (JJ social) (NN Platforms,)) (PP (IN Q2) (NP (NP (DT 2014The) (CD 13) (NNS
 Providers)) (NP (NP (DT That) (NNP Matter) (JJS Most)) (SBAR (S (CC And)
 (SBAR (WHADVP (WRB How)) (S (NP (PRP They)) (VP (VBP stack) (PP (IN Upby)
 (NP (NP (NN rob)) (PP (IN Koplowitzwith) (NP (NP (NNP Peter) (NNP Burris))
 (CC and) (NP (NP (NNP Nancy) (NNP Wang2257913JUNE) (CD 5,)) (PP (IN
 2014For) (NP (NP (NNP CIos)) (NP (DT The) (NNP Forrester) (NNP Wave)) (:
 :)
 (S (NP (NP (NP (NP (NN Enterprise) (JJ social) (NN Platforms,)) (PP (IN
 Q2)
 (NP (CD 2014) (CD 2) (JJ 2014,) (NNP Forrester) (NNP Research,) (NNP Inc.)
 (NNP Reproduction) (NNP Prohibited) (NNP June) (CD 5,) (CD 2014) (NN
 eNTeRPRIse) (NN sOCIaL (NP (NNP PLaTFORM) (NNP MaRKeT) (NNP MaTuRes)
 (NN aMID) (NN CONsOLIDaTION))) (PP (IN aND) (NP (DT INTegRaTIONThe) (NN
 enterprise) (JJ social) (NN platform (VP (VP (VBZ is) (ADVP (RB no)
 (RB
 longer)) (PP (IN in) (NP (PRP$ its) (NN infancy))) (SBAR (IN as) (S (NP
 (NNS offerings)) (VP (VBP become) (ADVP (RB increasingly)) (VBG
 functional.)


 Notice that I have more than one (S (S (S

 And then I have the first correct structure as (S (NP . (VP.

 What is the best way to deal with it?





Re: TokenNameFinder and Span probs

2014-05-11 Thread William Colen
+1 for the second too

On Wednesday, May 7, 2014, Joern Kottmann kottm...@gmail.com
wrote:

 Hello Mark,

 +1 for your second solution. I believe that is much more intuitive than
 calling a method afterwards to retrieve the prob for a Span.
 It is easier to use because the prob is delivered as part of the result and
 no user action is required to obtain it.

 We could use this solution everywhere where a span gets returned.

 Jörn



 On Wed, May 7, 2014 at 2:18 AM, Mark G giaconiam...@gmail.com
 wrote:

  I am currently working on a project in which we are using NER to pass
  toponyms into the GeoEntityLinker addon for geotagging, and I am passing on
  the locations, entities, and other info into SOLR for indexing. Over the
  years I have noticed that the TokenNameFinder interface does not include
  all the probs() methods that the NameFinderME has, and furthermore the Span
  object does not have a double field for storing a prob for itself. Also,
  the sentenceDetector has a method called getSentenceProbabilities rather
  than probs().
  When I pass the Spans into the GeoEntityLinker/EntityLinker I can't get the
  probs anymore because they are not in the Span objects. I can always extend
  Span and add the field, or keep a 2D array of the probs for each sentence,
  but wanted to see what everyone thinks about
  1. adding the probs methods to the TokenNameFinder interface
  2. adding a prob field to Span (a double)
  3. having the NameFinder return the prob with each Span so it doesn't have
  to be set after the call to find() using the double[] of probs
  4. having the SentenceDetectorME return its spans with a prob, adding a
  probs() method to the SentenceDetector interface, and deprecating
  getSentenceProbabilities...
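
Options 2 and 3 could look roughly like the following. This is a standalone sketch, not the OpenNLP `Span` class: `ScoredSpan` and `withProbs` are hypothetical names, and the pairing of spans with a parallel `double[]` mirrors the workaround described above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical value class: a span that carries its own probability. */
final class ScoredSpan {

  final int start;
  final int end;
  final String type;
  final double prob;

  ScoredSpan(int start, int end, String type, double prob) {
    if (start < 0 || end < start) {
      throw new IllegalArgumentException("invalid span bounds: " + start + ".." + end);
    }
    if (prob < 0.0 || prob > 1.0) {
      throw new IllegalArgumentException("prob must be within [0, 1]: " + prob);
    }
    this.start = start;
    this.end = end;
    this.type = type;
    this.prob = prob;
  }

  /** Pairs each {start, end} entry with the probability at the same index. */
  static List<ScoredSpan> withProbs(int[][] bounds, String type, double[] probs) {
    if (bounds.length != probs.length) {
      throw new IllegalArgumentException("one prob per span is required");
    }
    List<ScoredSpan> spans = new ArrayList<>(bounds.length);
    for (int i = 0; i < bounds.length; i++) {
      spans.add(new ScoredSpan(bounds[i][0], bounds[i][1], type, probs[i]));
    }
    return spans;
  }
}
```

Once the probability travels with the span, a downstream component such as an entity linker can filter or rank spans without access to the name finder that produced them.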
 
  Thoughts?
 



-- 
William Colen


Re: End of line whitespaces in Eclipse

2014-04-24 Thread William Colen
I think we should do it.

2014-04-22 8:50 GMT-03:00 Jörn Kottmann kottm...@gmail.com:

 We should maybe once remove all these white spaces
 at the end of lines. And maybe repeat that process
 for every release.

 Nowadays there are tools which can diff the files
 ignoring whitespace-only changes.

 Any opinions?

 Jörn

 On Thu, 2014-04-10 at 19:58 -0300, William Colen wrote:
  When I save a .java file in Eclipse, it is removing the end of line
  whitespaces. I am using the
  http://opennlp.apache.org/code-formatter/OpenNLP-Eclipse-Formatter.xml
 
  This is causing lots of changes in files where I actually needed to change
  only one line. Does anybody know how I can avoid it?
 
  Thank you,
  William





Re: DocumentSample in Doccat

2014-04-24 Thread William Colen
What do you think of adding the following field to the DocumentSample?

Map<String, Object> extraInformation


Also, we could add the following methods to the DocumentCategorizer
interface:

public double[] categorize(String text[], Map<String, Object>
extraInformation);
public double[] categorize(String documentText, Map<String, Object>
extraInformation);
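
To make the proposal concrete, here is a standalone sketch of a sample that carries such a map. `AnnotatedDocumentSample` is a hypothetical stand-in, not the actual `DocumentSample` class; only the field name `extraInformation` is taken from the proposal.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

/** Hypothetical stand-in for a document sample with free-form metadata. */
final class AnnotatedDocumentSample {

  private final String category;
  private final List<String> text;
  private final Map<String, Object> extraInformation;

  AnnotatedDocumentSample(String category, String[] text, Map<String, Object> extraInformation) {
    this.category = Objects.requireNonNull(category, "category");
    this.text = Collections.unmodifiableList(new ArrayList<>(Arrays.asList(text)));
    // Defensive copy; a missing map is treated as "no metadata".
    this.extraInformation = extraInformation == null
        ? Collections.<String, Object>emptyMap()
        : Collections.unmodifiableMap(new HashMap<>(extraInformation));
  }

  String getCategory() {
    return category;
  }

  List<String> getText() {
    return text;
  }

  Map<String, Object> getExtraInformation() {
    return extraInformation;
  }
}
```

A feature generator could then read, for example, a "Subject" entry from the map instead of having it concatenated into the text field.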

Any opinion?

Thank you,
William


2014-04-17 10:39 GMT-03:00 Mark G giaconiam...@gmail.com:

 Another general doccat thought I had is this. In my projects that use
 Doccat, I created a class called SampleCollection, which simply wrapped a
 List<DocumentSample> but then provided a method that returned the samples
 as a DoccatModel (using a properly formatted ByteArrayInputStream of the
 doccat training format of all the samples). This worked out well because I
 stored all the samples in a database, and users could CRUD samples for
 different categories. There was a map reduce job that at job startup read
 in the samples from the database into the samplecollection, dynamically
 generated the model, and then used the model to classify all the texts
 across the cluster; so every MR job ran the latest and greatest model based
 on current samples. Not sure if we're interested in something like that,
 but I see several questions on stack overflow asking about iterative model
 building, and a SampleCollection that returns a Model has worked for me.  I
 also created a SampleCRUD interface that abstracts storage and retrieval of
 the samples; I had Postgres and Accumulo impls for sample storage.
 Just a thought. I know this can get very specific and complicated, but I
 thought we may be able to find a middle ground by providing a framework and
 some generic impls.
 MG


 On Thu, Apr 17, 2014 at 8:28 AM, William Colen william.co...@gmail.com
 wrote:

  Yes, I don't see how to represent the sentences and paragraphs.
 
  +1 for the generic Map as suggested by Mark. We already have such things
 in
  other sample classes, like NameSample and the POSSample.
 
  A use case: the 20news corpus is a collection of articles, and each
 article
  contains fields like From, Subject, Organization. Mahout, which
  includes a formatter for this corpus, concatenate it all to the text
 field,
  but I think we could improve accuracy by handling this metadata in a
  separated feature generator.
 
 
  2014-04-17 8:37 GMT-03:00 Tech mail giaconiam...@gmail.com:
 
   I agree, this goes back to the concept of having a document model...
   I know in the prod systems I've used doccat, storing sentences and
   paragraphs wouldn't make sense, people usually have their own domain
  model
   for that. I still feel like if we augment the documentsample object
 with
  a
   generic Map it would be helpful in some cases and not constraining
  
   Sent from my iPhone
  
On Apr 17, 2014, at 6:35 AM, Jörn Kottmann kottm...@gmail.com
 wrote:
   
On 04/15/2014 07:45 PM, William Colen wrote:
Hello,
   
I've been working with the Doccat module and I am wondering if we
  could
improve its data structure for the 1.6.0 release.
   
Today the DocumentSample has the following attributes:
   
- String category
- List<String> text
   
I would suggest adding an attribute to hold metadata, or additional
contexts information. What do you think?
   
Right now the training format contains these two fields per line.
Do you want to change the format as well?
   
Also, what do you think of including sentences and paragraph
   information? I
don't know if there is anything a feature generator can extract from
  it
   to
improve the classification.
   
I guess we only want to do that if there is a use case for it. It
 will
   make the processing for the clients
more complex, since they then would have to provide sentences and
   paragraphs compared to just
a piece of text.
   
Jörn
  
 



Re: DocumentSample in Doccat

2014-04-24 Thread William Colen
Yes, it looks nice. Maybe we should redo all the DocumentCategorizer
interface. It is different from other tools, for example, we can't get the
best category of one document with only one call, we need to use two
methods.
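
The one-call convenience could be as small as the following sketch. `BestCategory.of` is a hypothetical helper that takes the category names and the probability array side by side; it is not part of the DocumentCategorizer interface.

```java
/** Hypothetical helper: picks the category with the highest probability. */
class BestCategory {

  static String of(String[] categories, double[] probs) {
    if (categories.length == 0 || categories.length != probs.length) {
      throw new IllegalArgumentException("need one probability per category");
    }
    int best = 0;
    for (int i = 1; i < probs.length; i++) {
      if (probs[i] > probs[best]) {
        best = i; // strictly greater: ties keep the earlier category
      }
    }
    return categories[best];
  }
}
```

Wrapped into a single interface method, this would spare callers the separate step of mapping the probability array back to a category name.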



2014-04-24 18:43 GMT-03:00 Mark G ma...@apache.org:

 William, that map looks good to me.
 In my current project I find this method convenient for getting back the
 probs over the categories in the model as a Map. Let me know if there's
 anything wrong with it :)

 // Requires java.util.Map and java.util.HashMap.
 // Returns each category of the model mapped to its probability for the text.
 public Map<String, Double> categoriesAsMap(String text) {
   Map<String, Double> probDist = new HashMap<String, Double>();

   double[] categorize = categorize(text);
   int catSize = getNumberOfCategories();
   for (int i = 0; i < catSize; i++) {
     String category = getCategory(i);
     probDist.put(category, categorize[getIndex(category)]);
   }
   return probDist;
 }

 Perhaps we should consider adding this method to abstract some
 details. Just a thought.





 On Thu, Apr 24, 2014 at 3:56 PM, William Colen william.co...@gmail.com
 wrote:

  What do you think of adding the following field to the DocumentSample?
 
  Map<String, Object> extraInformation
 
 
  Also, we could add the following methods to the DocumentCategorizer
  interface:
 
  public double[] categorize(String text[], Map<String, Object>
  extraInformation);
  public double[] categorize(String documentText, Map<String, Object>
  extraInformation);
 
  Any opinion?
 
  Thank you,
  William
 
 
  2014-04-17 10:39 GMT-03:00 Mark G giaconiam...@gmail.com:
 
   Another general doccat thought I had is this. in my projects that use
   Doccat, I created a class called SampleCollection, which simply wrapped a
   List<DocumentSample> but then provided a method that returned the samples
   as a DoccatModel (using a properly formatted ByteArrayInputStream of
 the
   doccat training format of all the samples). This worked out well
 because
  I
   stored all the samples in a database, and users could CRUD samples for
   different categories. There was a map reduce job that at job startup
 read
   in the samples from the database into the samplecollection, dynamically
   generated the model, and then used the model to classify all the texts
   across the cluster; so every MR job ran the latest and greatest model
  based
   on current samples. Not sure if we're interested in something like
 that,
   but I see several questions on stack overflow asking about iterative
  model
   building, and a SampleCollection that returns a Model has worked for
 me.
   I
   also created a SampleCRUD interface that abstracts storage and
 retrieval
  of
   the samples I had a Postgres and Accumulo impl for sample storage.
   just a thought, I know this can get very specific and complicated,
  thought
   we may be able to find a middle ground by providing a framework and
 some
   generic impls.
   MG
  
  
   On Thu, Apr 17, 2014 at 8:28 AM, William Colen 
 william.co...@gmail.com
   wrote:
  
Yes, I don't see how to represent the sentences and paragraphs.
   
+1 for the generic Map as suggested by Mark. We already have such
  things
   in
other sample classes, like NameSample and the POSSample.
   
A use case: the 20news corpus is a collection of articles, and each
   article
contains fields like From, Subject, Organization. Mahout, which
includes a formatter for this corpus, concatenate it all to the text
   field,
but I think we could improve accuracy by handling this metadata in a
separated feature generator.
   
   
2014-04-17 8:37 GMT-03:00 Tech mail giaconiam...@gmail.com:
   
 I agree, this goes back to the concept of having a document
  model...
 I know in the prod systems I've used doccat, storing sentences and
 paragraphs wouldn't make sense, people usually have their own
 domain
model
 for that. I still feel like if we augment the documentsample object
   with
a
 generic Map it would be helpful in some cases and not constraining

 Sent from my iPhone

  On Apr 17, 2014, at 6:35 AM, Jörn Kottmann kottm...@gmail.com
   wrote:
 
  On 04/15/2014 07:45 PM, William Colen wrote:
  Hello,
 
  I've been working with the Doccat module and I am wondering if
 we
could
  improve its data structure for the 1.6.0 release.
 
  Today the DocumentSample has the following attributes:
 
  - String category
  - List<String> text
 
  I would suggest adding an attribute to hold metadata, or
  additional
  contexts information. What do you think?
 
  Right now the training format contains these two fields per line.
  Do you want to change the format as well?
 
  Also, what do you think of including sentences and paragraph
 information? I
  don't know if there is anything a feature generator can extract
  from
it
 to
  improve the classification.
 
  I guess we only want to do that if there is a use

Re: DocumentSample in Doccat

2014-04-17 Thread William Colen
Yes, I don't see how to represent the sentences and paragraphs.

+1 for the generic Map as suggested by Mark. We already have such things in
other sample classes, like NameSample and the POSSample.

A use case: the 20news corpus is a collection of articles, and each article
contains fields like From, Subject, Organization. Mahout, which
includes a formatter for this corpus, concatenate it all to the text field,
but I think we could improve accuracy by handling this metadata in a
separated feature generator.


2014-04-17 8:37 GMT-03:00 Tech mail giaconiam...@gmail.com:

 I agree, this goes back to the concept of having a document model...
 I know in the prod systems I've used doccat, storing sentences and
 paragraphs wouldn't make sense, people usually have their own domain model
 for that. I still feel like if we augment the documentsample object with a
 generic Map it would be helpful in some cases and not constraining

 Sent from my iPhone

  On Apr 17, 2014, at 6:35 AM, Jörn Kottmann kottm...@gmail.com wrote:
 
  On 04/15/2014 07:45 PM, William Colen wrote:
  Hello,
 
  I've been working with the Doccat module and I am wondering if we could
  improve its data structure for the 1.6.0 release.
 
  Today the DocumentSample has the following attributes:
 
  - String category
  - List<String> text
 
  I would suggest adding an attribute to hold metadata, or additional
  contexts information. What do you think?
 
  Right now the training format contains these two fields per line.
  Do you want to change the format as well?
 
  Also, what do you think of including sentences and paragraph
 information? I
  don't know if there is anything a feature generator can extract from it
 to
  improve the classification.
 
  I guess we only want to do that if there is a use case for it. It will
 make the processing for the clients
  more complex, since they then would have to provide sentences and
 paragraphs compared to just
  a piece of text.
 
  Jörn



Re: svn commit: r1587944 [1/2] - in /opennlp/trunk/opennlp-tools/src: main/java/opennlp/tools/cmdline/doccat/ main/java/opennlp/tools/doccat/ main/java/opennlp/tools/sentdetect/ main/java/opennlp/tool

2014-04-16 Thread William Colen
Jörn,

Can you please review my change to the ExtensionLoader? I modified it to
accept singletons (private constructor and the field INSTANCE).
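
For readers without the diff at hand, the idea can be sketched as below. This is not the committed ExtensionLoader code: `SingletonAwareLoader`, `instantiate`, and the two sample classes are hypothetical, and the fallback order (public no-argument constructor first, then a public static `INSTANCE` field) is an assumption.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;

/** Hypothetical loader: creates an extension, falling back to a singleton INSTANCE field. */
class SingletonAwareLoader {

  static <T> T instantiate(Class<T> base, Class<?> impl) throws ReflectiveOperationException {
    Object instance;
    try {
      // Regular extensions expose a public no-argument constructor.
      instance = impl.getConstructor().newInstance();
    } catch (NoSuchMethodException | IllegalAccessException e) {
      // Singletons hide the constructor and publish a public static INSTANCE field instead.
      Field field = impl.getField("INSTANCE");
      if (!Modifier.isStatic(field.getModifiers())) {
        throw new NoSuchFieldException("INSTANCE must be static");
      }
      instance = field.get(null);
    }
    return base.cast(instance);
  }
}

/** Sample extension with an ordinary public constructor. */
class PlainExtension {
  public PlainExtension() {
  }
}

/** Sample singleton extension: private constructor plus an INSTANCE field. */
class SingletonExtension {
  public static final SingletonExtension INSTANCE = new SingletonExtension();

  private SingletonExtension() {
  }
}
```

With this fallback, a stateless extension can be shared as a single object instead of being re-created for every model that references it.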

Thank you,
William


2014-04-16 12:26 GMT-03:00 co...@apache.org:

 Author: colen
 Date: Wed Apr 16 15:26:24 2014
 New Revision: 1587944

 URL: http://svn.apache.org/r1587944
 Log:
 OPENNLP-674 Added factory to Doccat

 Added:

 opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/doccat/DoccatFactory.java
   (with props)

 opennlp/trunk/opennlp-tools/src/test/java/opennlp/tools/doccat/DoccatFactoryTest.java
   (with props)
 opennlp/trunk/opennlp-tools/src/test/resources/opennlp/tools/doccat/

 opennlp/trunk/opennlp-tools/src/test/resources/opennlp/tools/doccat/DoccatSample.txt
   (with props)
 Modified:

 opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/cmdline/doccat/DoccatCrossValidatorTool.java

 opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/cmdline/doccat/DoccatTrainerTool.java

 opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/cmdline/doccat/TrainingParams.java

 opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/doccat/DoccatCrossValidator.java

 opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/doccat/DoccatModel.java

 opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/doccat/DocumentCategorizerME.java

 opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/sentdetect/SentenceDetectorFactory.java

 opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/util/ext/ExtensionLoader.java

 Modified:
 opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/cmdline/doccat/DoccatCrossValidatorTool.java
 URL:
 http://svn.apache.org/viewvc/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/cmdline/doccat/DoccatCrossValidatorTool.java?rev=1587944&r1=1587943&r2=1587944&view=diff

 ==
 ---
 opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/cmdline/doccat/DoccatCrossValidatorTool.java
 (original)
 +++
 opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/cmdline/doccat/DoccatCrossValidatorTool.java
 Wed Apr 16 15:26:24 2014
 @@ -34,8 +34,10 @@ import opennlp.tools.cmdline.doccat.Docc
  import opennlp.tools.cmdline.params.CVParams;
  import opennlp.tools.doccat.DoccatCrossValidator;
  import opennlp.tools.doccat.DoccatEvaluationMonitor;
 +import opennlp.tools.doccat.DoccatFactory;
  import opennlp.tools.doccat.DocumentSample;
  import opennlp.tools.doccat.FeatureGenerator;
 +import opennlp.tools.tokenize.Tokenizer;
  import opennlp.tools.util.eval.EvaluationMonitor;
  import opennlp.tools.util.model.ModelUtil;

 @@ -88,13 +90,18 @@ public final class DoccatCrossValidatorT
  FeatureGenerator[] featureGenerators = DoccatTrainerTool
  .createFeatureGenerators(params.getFeatureGenerators());

 +Tokenizer tokenizer = DoccatTrainerTool.createTokenizer(params
 +.getTokenizer());
 +
  DoccatEvaluationMonitor[] listenersArr = listeners
  .toArray(new DoccatEvaluationMonitor[listeners.size()]);

  DoccatCrossValidator validator;
  try {
 +  DoccatFactory factory = DoccatFactory.create(params.getFactory(),
 +  tokenizer, featureGenerators);
validator = new DoccatCrossValidator(params.getLang(), mlParams,
 -  featureGenerators, listenersArr);
 +  factory, listenersArr);

validator.evaluate(sampleStream, params.getFolds());
  } catch (IOException e) {

 Modified:
 opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/cmdline/doccat/DoccatTrainerTool.java
 URL:
 http://svn.apache.org/viewvc/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/cmdline/doccat/DoccatTrainerTool.java?rev=1587944&r1=1587943&r2=1587944&view=diff

 ==
 ---
 opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/cmdline/doccat/DoccatTrainerTool.java
 (original)
 +++
 opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/cmdline/doccat/DoccatTrainerTool.java
 Wed Apr 16 15:26:24 2014
 @@ -26,16 +26,19 @@ import opennlp.tools.cmdline.TerminateTo
  import opennlp.tools.cmdline.doccat.DoccatTrainerTool.TrainerToolParams;
  import opennlp.tools.cmdline.params.TrainingToolParams;
  import opennlp.tools.doccat.BagOfWordsFeatureGenerator;
 +import opennlp.tools.doccat.DoccatFactory;
  import opennlp.tools.doccat.DoccatModel;
  import opennlp.tools.doccat.DocumentCategorizerME;
  import opennlp.tools.doccat.DocumentSample;
  import opennlp.tools.doccat.FeatureGenerator;
 +import opennlp.tools.tokenize.Tokenizer;
 +import opennlp.tools.tokenize.WhitespaceTokenizer;
  import opennlp.tools.util.ext.ExtensionLoader;
  import opennlp.tools.util.model.ModelUtil;

  public class DoccatTrainerTool
  extends AbstractTrainerTool<DocumentSample, TrainerToolParams> {
 -
 +
interface TrainerToolParams extends TrainingParams, TrainingToolParams {
}

 @@ -47,7 +50,7 @@ public class DoccatTrainerTool
public String 

Re: Doccat evaluator

2014-04-11 Thread William Colen
Now in the trunk we have the tools:

$ bin/opennlp DoccatEvaluator
Usage: opennlp DoccatEvaluator[.leipzig] [-reportOutputFile outputFile]
-model model [-misclassified true|false] -data sampleData [-encoding
charsetName]

Arguments description:
 -reportOutputFile outputFile
the path of the fine-grained report file.
 -model model
the model file to be evaluated.
 -misclassified true|false
if true will print false negatives and false positives.
 -data sampleData
data to be used, usually a file name.
-encoding charsetName
 encoding for reading and writing text, if absent the system default is
used.


$ bin/opennlp DoccatCrossValidator
Usage: opennlp DoccatCrossValidator[.leipzig] [-reportOutputFile
outputFile] [-misclassified true|false] [-folds num] [-featureGenerators
fg] [-params paramsFile] -lang language -data sampleData [-encoding
charsetName]

Arguments description:
-reportOutputFile outputFile
the path of the fine-grained report file.
 -misclassified true|false
if true will print false negatives and false positives.
 -folds num
number of folds, default is 10.
 -featureGenerators fg
Comma separated feature generator classes. Bag of words is used if not
specified.
 -params paramsFile
training parameters file.
 -lang language
language which is being processed.
-data sampleData
 data to be used, usually a file name.
-encoding charsetName
 encoding for reading and writing text, if absent the system default is
used.


If misclassified is true, the evaluator prints the misclassified documents
to stderr.
If reportOutputFile is set, the evaluator writes detailed reports to that
file, for example the F-measure for each outcome and the confusion matrix.
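
To make the reported figures concrete, here is a minimal sketch of how accuracy, a per-category F-measure and a confusion matrix could be computed from gold and predicted categories. The class DoccatMetricsSketch and its methods are hypothetical names used for illustration only; this is not the code of the OpenNLP evaluator.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper, not part of OpenNLP: scores predicted categories against gold ones. */
final class DoccatMetricsSketch {

  /** Fraction of documents whose predicted category equals the gold category. */
  static double accuracy(List<String> gold, List<String> predicted) {
    int correct = 0;
    for (int i = 0; i < gold.size(); i++) {
      if (gold.get(i).equals(predicted.get(i))) {
        correct++;
      }
    }
    return gold.isEmpty() ? 0.0 : (double) correct / gold.size();
  }

  /** matrix.get(g).get(p) is the number of documents with gold category g predicted as p. */
  static Map<String, Map<String, Integer>> confusionMatrix(List<String> gold,
      List<String> predicted) {
    Map<String, Map<String, Integer>> matrix = new TreeMap<>();
    for (int i = 0; i < gold.size(); i++) {
      matrix.computeIfAbsent(gold.get(i), k -> new TreeMap<>())
          .merge(predicted.get(i), 1, Integer::sum);
    }
    return matrix;
  }

  /** F1 for a single category: 2*TP / (2*TP + FP + FN), or 0 when there are no true positives. */
  static double f1(String category, List<String> gold, List<String> predicted) {
    int tp = 0;
    int fp = 0;
    int fn = 0;
    for (int i = 0; i < gold.size(); i++) {
      boolean isGold = category.equals(gold.get(i));
      boolean isPredicted = category.equals(predicted.get(i));
      if (isGold && isPredicted) {
        tp++;
      } else if (isPredicted) {
        fp++;
      } else if (isGold) {
        fn++;
      }
    }
    return tp == 0 ? 0.0 : 2.0 * tp / (2.0 * tp + fp + fn);
  }

  public static void main(String[] args) {
    // Made-up sample data: four documents, two categories.
    List<String> gold = Arrays.asList("sport", "sport", "politics", "politics");
    List<String> predicted = Arrays.asList("sport", "politics", "politics", "politics");

    System.out.println("accuracy     = " + accuracy(gold, predicted));
    System.out.println("F1(sport)    = " + f1("sport", gold, predicted));
    System.out.println("F1(politics) = " + f1("politics", gold, predicted));
    System.out.println("confusion    = " + confusionMatrix(gold, predicted));
  }
}
```

Because every document receives exactly one category, accuracy is well defined, while the per-category F1 and the matrix show which categories get confused with each other.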

2014-04-10 19:48 GMT-03:00 William Colen william.co...@gmail.com:

 Yes, I just finished implementing the confusion matrix report, just like
 the one I did for the POS Tagger. I will commit it today.

 I could not test it properly with the Leipzig corpus. For some reason
 Doccat never fails with this corpus!
 To effectively test it I used the 20news corpus.


 2014-04-10 19:37 GMT-03:00 Jörn Kottmann kottm...@gmail.com:

 I thought it should be done similar to the way pos tags are measured when
 I implemented that.

 A confusion matrix might also be helpful to see which categories are more
 difficult to classify for the system.

 Jörn


 On 04/10/2014 03:00 PM, William Colen wrote:

 Actually, since we always add a tag to each document, accuracy makes
 sense.
 We could implement F-1 for the individual categories.

 2014-04-09 17:23 GMT-03:00 William Colen william.co...@gmail.com:

  Hello,

 I was checking if there is any open issue related to Doccat, and I found
 this one -

 OPENNLP-81: Add a cli tool for the doccat evaluation support

 I noticed that there is already a class
 named DocumentCategorizerEvaluator, which is not used anywhere
 internally.
 This is evaluating performance in terms of accuracy, but I believe it
 would be better to do it in terms of F-measure.

 Any thoughts?

 As we are working in a major version, I think it would be OK to change
 it.


 Thank you,
 William






Re: Doccat evaluator

2014-04-10 Thread William Colen
Yes, I just finished implementing the confusion matrix report, just like
the one I did for the POS Tagger. I will commit it today.

I could not test it properly with the Leipzig corpus. For some reason Doccat
never fails with this corpus!
To effectively test it I used the 20news corpus.


2014-04-10 19:37 GMT-03:00 Jörn Kottmann kottm...@gmail.com:

 I thought it should be done similar to the way pos tags are measured when
 I implemented that.

 A confusion matrix might also be helpful to see which categories are more
 difficult to classify for the system.

 Jörn


 On 04/10/2014 03:00 PM, William Colen wrote:

 Actually, since we always add a tag to each document, accuracy makes
 sense.
 We could implement F-1 for the individual categories.

 2014-04-09 17:23 GMT-03:00 William Colen william.co...@gmail.com:

  Hello,

 I was checking if there is any open issue related to Doccat, and I found
 this one -

 OPENNLP-81: Add a cli tool for the doccat evaluation support

 I noticed that there is already a class
 named DocumentCategorizerEvaluator, which is not used anywhere
 internally.
 This is evaluating performance in terms of accuracy, but I believe it
 would be better to do it in terms of F-measure.

 Any thoughts?

 As we are working in a major version, I think it would be OK to change
 it.


 Thank you,
 William





End of line whitespaces in Eclipse

2014-04-10 Thread William Colen
When I save a .java file in Eclipse, it removes the end-of-line
whitespace. I am using
http://opennlp.apache.org/code-formatter/OpenNLP-Eclipse-Formatter.xml

This causes lots of changes in files where I actually needed to change only
one line. Does anybody know how I can avoid it?

Thank you,
William


Doccat evaluator

2014-04-09 Thread William Colen
Hello,

I was checking if there is any open issue related to Doccat, and I found
this one -

OPENNLP-81: Add a cli tool for the doccat evaluation support

I noticed that there is already a class named DocumentCategorizerEvaluator,
which is not used anywhere internally. This is evaluating performance in
terms of accuracy, but I believe it would be better to do it in terms of
F-measure.

Any thoughts?

As we are working in a major version, I think it would be OK to change it.


Thank you,
William


Re: CoNLL02 format issue

2014-03-12 Thread William Colen
If it helps, there is another Spanish corpus on the CoNLL02 page which has 3
fields:
   Xavier Carreras provides the Spanish data sets with part of speech
tags: http://www.lsi.upc.es/~nlp/tools/nerc/nerc.html (20030803)

William


2014-03-12 9:43 GMT-03:00 Roque Vera roqu...@gmail.com:

 I found an issue in the TokenNameFinderConverter module. Specifically, I am
 trying to convert a file in CoNLL 2002 format into the OpenNLP format. When
 I execute

   opennlp TokenNameFinderConverter conll02 -data esp.testa -lang es -types per > corpus_testa.txt

 on the command line, I get this error:

 IO error while converting data : Expected three fields per line in
 training data, got 2 for line 'Sao B-LOC'!
 Expected three fields per line in training data, got 2 for line 'Sao B-LOC'!
 java.io.IOException: Expected three fields per line in training data, got 2 for line 'Sao B-LOC'!
     at opennlp.tools.formats.Conll02NameSampleStream.read(Conll02NameSampleStream.java:140)
     at opennlp.tools.formats.Conll02NameSampleStream.read(Conll02NameSampleStream.java:49)
     at opennlp.tools.cmdline.AbstractConverterTool.run(AbstractConverterTool.java:110)
     at opennlp.tools.cmdline.CLI.main(CLI.java:222)



 The reason is clear: three fields are expected, but my file esp.testa
 has only two. The curious thing is that the file comes from the CoNLL
 test data set.


 I propose two solutions for this problem. The first is to add a third field
 between the two existing ones. For example, the file may originally
 contain a line in IOB2 format like "Sao B-LOC", and we would have to
 change it to "Sao VP B-LOC", where VP is a POS tag whose meaning does not
 really matter to the implementation. I created a modified
 version of the test data set along these lines.


 The other possible solution is to change the code in

 apache-opennlp-1.5.3-src\opennlp-tools\src\main\java\opennlp\tools\formats\Conll02NameSampleStream.java,
 beginning at line 133. The original code is shown first, followed by the
 proposed solution.

 Original code:

   String fields[] = line.split(" ");

   if (fields.length == 3) {
     sentence.add(fields[0]);
     tags.add(fields[2]);
   }
   else {
     throw new IOException("Expected three fields per line in training data, got " +
         fields.length + " for line '" + line + "'!");
   }

 Proposed code:

   String fields[] = line.split(" ");

   if (fields.length == 3) {
     // token, POS tag, entity tag
     sentence.add(fields[0]);
     tags.add(fields[2]);
   }
   else if (fields.length == 2) {
     // token, entity tag (no POS column)
     sentence.add(fields[0]);
     tags.add(fields[1]);
   }
   else {
     throw new IOException("Expected three or two fields per line in training data, got " +
         fields.length + " for line '" + line + "'!");
   }

 The first if statement is necessary because the CoNLL training data set
 has three fields. Note that the second if statement only serves
 the test data set (which is the case where I have the problem).
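
 As a stand-alone illustration of the idea, the following sketch parses a single line that has either two or three space-separated fields. The class ConllLineSketch and the method parseLine are hypothetical names; the real logic lives inside Conll02NameSampleStream.read(), and the POS tag "NC" in the sample is made up.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-alone sketch, not the OpenNLP class itself. */
final class ConllLineSketch {

  /**
   * Appends the token and its entity tag taken from one data line.
   * Accepts "token tag" as well as "token pos tag"; anything else is rejected.
   */
  static void parseLine(String line, List<String> sentence, List<String> tags)
      throws IOException {
    String[] fields = line.split(" ");

    if (fields.length == 3) {
      sentence.add(fields[0]);
      tags.add(fields[2]);      // entity tag is the last of three columns
    } else if (fields.length == 2) {
      sentence.add(fields[0]);
      tags.add(fields[1]);      // no POS column present
    } else {
      throw new IOException("Expected three or two fields per line in training data, got "
          + fields.length + " for line '" + line + "'!");
    }
  }

  public static void main(String[] args) throws IOException {
    List<String> sentence = new ArrayList<>();
    List<String> tags = new ArrayList<>();

    parseLine("Sao B-LOC", sentence, tags);      // two-field form
    parseLine("Paulo NC I-LOC", sentence, tags); // three-field form

    System.out.println(sentence + " " + tags);
  }
}
```

 The else-if chain matters here: with two independent if statements, a valid three-field line would fall through to the final else and raise the exception.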


 I hope this suggestion helps to solve this problem.

 Frankly,

 Roque Vera.
 Facultad Politécnica, Universidad Nacional de Asunción.
 Paraguay.



Re: svn commit: r1534864 - /opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/entitylinker/GeoHashBinScorer.java

2013-10-23 Thread William Colen
+1 to move to 6 or 7 for the next major release. We can ask what our users
think of it.

On Wednesday, October 23, 2013, Ioan Barbulescu wrote:

 Hi guys

 I would vote for java 7, as well.

 Thank you.

 BR,
 Ioan


 On Wed, Oct 23, 2013 at 6:24 PM, Mark G giaconiam...@gmail.com
 wrote:

  agree, straight to 7 makes sense to me... try with resources, better
  collections support, switch on strings etc all new in 7
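
  For readers unfamiliar with the two language features named above, here is a small sketch of try-with-resources and switch on strings. The class Java7FeaturesSketch and the tag-to-label mapping are made up for illustration and are not OpenNLP code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

/** Hypothetical sketch of two Java 7 language features. */
final class Java7FeaturesSketch {

  /** Switch on a String (Java 7+); the mapping itself is a made-up example. */
  static String describe(String tag) {
    switch (tag) {
      case "B-LOC":
        return "location";
      case "B-PER":
        return "person";
      default:
        return "other";
    }
  }

  /** try-with-resources (Java 7+): the reader is closed automatically, even on errors. */
  static int countLines(String text) throws IOException {
    try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
      int count = 0;
      while (reader.readLine() != null) {
        count++;
      }
      return count;
    }
  }

  public static void main(String[] args) throws IOException {
    System.out.println(describe("B-LOC"));
    System.out.println(countLines("first\nsecond\nthird"));
  }
}
```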
 
 
  On Wed, Oct 23, 2013 at 8:36 AM, Jörn Kottmann kottm...@gmail.com wrote:
 
   On 10/23/2013 01:21 PM, Mark G wrote:
  
   When will we move to 6?
  
  
   I don't have any strong opinions about moving forward. Some say it might
   be better to move directly to Java 7 or even wait for Java 8.
  
   There are not that many interesting new features in Java 6, that's why I
   believe it might be worth taking a bigger step and skipping one or two
   versions.
  
   Any opinions? Do we still have a Java 5 user here?
  
   Jörn
  
 



-- 
William Colen


Re: Size of training data

2013-04-26 Thread William Colen
From the command line you can specify memory using

MAVEN_OPTS=-Xmx4048m

You can also set it as a JVM argument if you are using the API:

java -Xmx4048m ...
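
To confirm that the -Xmx setting actually reached the JVM that runs the training, a small check like the following can help. This is only a sketch; HeapCheckSketch is a hypothetical class name, and the check relies on the standard Runtime.maxMemory() call.

```java
/** Hypothetical helper: reports the maximum heap the running JVM will try to use. */
final class HeapCheckSketch {

  /** Maximum heap size in megabytes, as reported by the JVM. */
  static long maxHeapMegabytes() {
    return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
  }

  public static void main(String[] args) {
    System.out.println("Max heap available to this JVM: " + maxHeapMegabytes() + " MB");
  }
}
```

Running it with the same options as the training, for example java -Xmx4048m HeapCheckSketch, should print a value close to the requested size.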



On Fri, Apr 26, 2013 at 4:30 AM, Svetoslav Marinov 
svetoslav.mari...@findwise.com wrote:

 I use the API. Can one specify the memory size via the command line? I
 think the default there is 1024M? At 8G memory during computing event
 counts..., at 16G during indexing: Computing event counts...  done.
 50153300 events
 Indexing...

 Svetoslav

 On 2013-04-26 09:12, Jörn Kottmann kottm...@gmail.com wrote:

 On 04/26/2013 09:06 AM, Svetoslav Marinov wrote:
  I'm wondering what the max size (if one exists) is for training a NER
 model. I have a corpus of 2 600 000 sentences annotated with just one
 category, 310M in size. However, the training never finishes: 8G of memory
 resulted in a Java out-of-memory exception, and when I increased it to 16G
 it just died with no error message.
 
 Do you use the command line interface or the API for the training?
 At which stage of the training did you get the out of memory exception?
 Where did it just die when you used 16G of memory (maybe do a jstack) ?
 
 Jörn
 





Re: Please review the 1.5.3 release announcement

2013-04-17 Thread William Colen
Jörn, thank you for updating the web site. I already added a news item. Are
we now ready to send the announcement?



On Mon, Apr 15, 2013 at 6:52 PM, Jörn Kottmann kottm...@gmail.com wrote:

 +1, let's wait until we have updated the website, the distributables are
 mirrored and the Maven artifacts are available.

 I already promoted the Maven repo and pushed the release to the dist area,
 but it might take a bit until everything is available; the mirrors might
 need 24 hours to mirror the distributables.

 Jörn


 On 04/15/2013 09:46 PM, William Colen wrote:

   Hello,

 Please review the release announcement for the OpenNLP version 1.5.3.

 https://cwiki.apache.org/confluence/display/OPENNLP/ReleasePlanAndTasks1.5.3

 The announce mails will be sent to the users and announce@apache lists.

 Thank you,
 William





Re: OpenNLP 1.5.3 RC 3 ready for testing

2013-04-09 Thread William Colen
+1

What about the similarity component? Should we build it only after the
1.5.3 release?

William Colen


On Tue, Apr 9, 2013 at 5:19 AM, Jörn Kottmann kottm...@gmail.com wrote:

 Are we ready to start with the release vote? The last important test item
 that is missing
 is checking the signatures, I can do that tonight.

 The UIMA trainers are not tested, but they should be ok, as far as I know
 they did not change,
 the UIMA AEs were run during my PEAR bundle test.

 Jörn


 On 04/03/2013 03:39 PM, William Colen wrote:

 Hi all,

 Our third release candidate is ready for testing. RC 2 failed to pass in
 only a few tests, including the creation of the issues list and the NOTICE
 file. Also, some new bug fixes and improvements were recently included.

 The RC 3 can be downloaded from here:
 http://people.apache.org/~colen/releases/opennlp-1.5.3/rc3/

 To use it in a maven build set the version for opennlp-tools or
 opennlp-uima to 1.5.3, and for opennlp-maxent to 3.0.3, and add this URL
 to your settings.xml file:
 https://repository.apache.org/content/repositories/orgapacheopennlp-059/

 The current test plan can be found here:
 https://cwiki.apache.org/OPENNLP/testplan153.html

 Please sign up for tasks in the test plan.

 The release plan can be found here:
 https://cwiki.apache.org/OPENNLP/releaseplanandtasks153.html

 The RC contains quite some changes, please refer to the contained issue
 list for details.

 William





[VOTE] Release OpenNLP 1.5.3 RC 3

2013-04-09 Thread William Colen
Hello,

Let's vote to release RC 3 as OpenNLP 1.5.3.

The testing of it is documented here:
https://cwiki.apache.org/confluence/display/OPENNLP/TestPlan1.5.3

The RC can be downloaded here:
http://people.apache.org/~colen/releases/opennlp-1.5.3/rc3

Please vote to approve this release:
[ ] +1 Approve the release
[ ] -1 Veto the release (please provide specific comments)
[ ] 0   Don't care

Please report any problems you may find.


Re: [VOTE] Release OpenNLP 1.5.3 RC 3

2013-04-09 Thread William Colen
+1 Approve the release


On Tue, Apr 9, 2013 at 9:51 AM, William Colen william.co...@gmail.com wrote:

 Hello,

 Let's vote to release RC 3 as OpenNLP 1.5.3.

 The testing of it is documented here:
 https://cwiki.apache.org/confluence/display/OPENNLP/TestPlan1.5.3

 The RC can be downloaded here:
 http://people.apache.org/~colen/releases/opennlp-1.5.3/rc3

 Please vote to approve this release:
 [ ] +1 Approve the release
 [ ] -1 Veto the release (please provide specific comments)
 [ ] 0   Don't care

 Please report any problems you may find.



Re: OpenNLP 1.5.3 RC 2 ready for testing

2013-04-03 Thread William Colen
Thank you, I fixed it. I will start the build of RC3 right now.



On Wed, Apr 3, 2013 at 5:01 AM, Jörn Kottmann kottm...@gmail.com wrote:

 On 04/03/2013 02:10 AM, William Colen wrote:

 Thank you, Jörn.

 I also had to update the maven-changes-plugin version. The 2.3 was failing
 to download the issue list. Changing to the latest solved the issue.


 The date in the NOTICE file still says 2011, that needs to be changed to
 2013.

 Jörn



Re: OpenNLP 1.5.3 RC 2 ready for testing

2013-04-03 Thread William Colen
In fact you already fixed the year. Thank you.

Yes, I can start the build after it.


On Wed, Apr 3, 2013 at 8:26 AM, Jörn Kottmann kottm...@gmail.com wrote:

 Before you build we should either commit OPENNLP-564 or remove it from the
 issue list.
 Should I quickly commit the rules file?

 Jörn


 On 04/03/2013 01:23 PM, William Colen wrote:

 Thank you, I fixed it. I will start the build of RC3 right now.



 On Wed, Apr 3, 2013 at 5:01 AM, Jörn Kottmann kottm...@gmail.com wrote:

  On 04/03/2013 02:10 AM, William Colen wrote:

  Thank you, Jörn.

 I also had to update the maven-changes-plugin version. The 2.3 was
 failing
 to download the issue list. Changing to the latest solved the issue.


  The date in the NOTICE file still says 2011, that needs to be changed
 to
 2013.

 Jörn





OpenNLP 1.5.3 RC 3 ready for testing

2013-04-03 Thread William Colen
Hi all,

Our third release candidate is ready for testing. RC 2 failed to pass in
only a few tests, including the creation of the issues list and the NOTICE
file. Also, some new bug fixes and improvements were recently included.

The RC 3 can be downloaded from here:
http://people.apache.org/~colen/releases/opennlp-1.5.3/rc3/

To use it in a maven build set the version for opennlp-tools or
opennlp-uima to 1.5.3, and for opennlp-maxent to 3.0.3, and add this URL to
your settings.xml file:
https://repository.apache.org/content/repositories/orgapacheopennlp-059/

The current test plan can be found here:
https://cwiki.apache.org/OPENNLP/testplan153.html

Please sign up for tasks in the test plan.

The release plan can be found here:
https://cwiki.apache.org/OPENNLP/releaseplanandtasks153.html

The RC contains quite some changes, please refer to the contained issue
list for details.

William


Re: Re: English 300k sentences Leipzig Corpus for test

2013-03-14 Thread William Colen
Hi,

I could not find a way to convert from Leipzig to other formats than DocCat
sample. Is it possible to convert from Leipzig to SentenceSample using the
OpenNLP tools?

Thank you,
William


On Thu, Mar 14, 2013 at 9:51 AM, Jörn Kottmann kottm...@gmail.com wrote:




  Original Message 
 Subject:Re: English 300k sentences Leipzig Corpus for test
 Date:   Thu, 14 Mar 2013 09:48:21 -0300
 From:   William Colen william.co...@gmail.com
 To: Jörn Kottmann kottm...@gmail.com



 Yes, you can forward.

 It is not clear to me how to convert it. I could only find converters from
 Leipzig to DocCat.


 On Thu, Mar 14, 2013 at 6:09 AM, Jörn Kottmann kottm...@gmail.com wrote:

  Do you mind if I forward this to the dev list?

 Yes, you need to convert the data into input data. The idea
 is that we process the data with 1.5.2 and 1.5.3 and see if the output
 is still identical; if it's not identical, it's either a change in our code
 or a bug.

 It doesn't really matter which file you download as long as it has enough
 sentences,
 would be nice if you can note in the test plan which one you used.

 Hopefully I will have sometime over the weekend to do the tests on the
 private data I have.

 Jörn


 On 03/13/2013 11:38 PM, William Colen wrote:

  Hi, Jörn,

 I would like to start testing with the Leipzig Corpus. Do you know the
 steps to do it?

 I downloaded the file named eng_news_2010_300K-text.tar.gz,
 and now I would use the converter to extract documents from it.

 After that, I would try to use the output of a module as input to the
 next.
 Is it correct?

 Thank you,
 William









Re: Next release

2013-02-19 Thread William Colen
Should we try to upload it to Central Repo using jwnl as groupid? What do
you think?


On Mon, Feb 18, 2013 at 3:03 PM, Benson Margulies bimargul...@gmail.com wrote:

 upload to central via ossrh.

 On Feb 18, 2013, at 12:46 PM, William Colen william.co...@gmail.com
 wrote:

  We are using jwnl as groupid; I don't know if we can upload using this
  groupid. The best would be to reflect the Java package, which is
  net.didion.jwnl, or to follow the one used by 1.4 rc3, which
  is net.sf.jwordnet.jwnl.
  But changing the groupid would not help the already released OpenNLP
  versions.
 
 
  On Mon, Feb 18, 2013 at 12:07 PM, Jörn Kottmann kottm...@gmail.com
 wrote:
 
  On 02/18/2013 03:17 PM, William Colen wrote:
 
  I suppose we can't use opennlp.apache.org to host it, can we?
  We probably could somehow distribute it from Apache servers, but if we
  can get it into the central maven repository it would also fix the issue
  for the already
  released OpenNLP versions as far as I know.
 
  Here is a guide which explains the process:
  https://docs.sonatype.org/display/Repository/Uploading+3rd-party+Artifacts+to+The+Central+Repository
 
 
  Jörn
 



Uploading JWNL to Maven Central Repo

2013-02-19 Thread William Colen
Hi,

I changed the topic so we can focus on this issue.

To upload JWNL to the Central Repo we need to bundle the Source and JavaDoc.

We don't have it at http://opennlp.sourceforge.net/maven2/jwnl/jwnl/1.3.3/

So, I tried to get the JavaDoc and Source from the JWNL website.

I downloaded all jars from 1.3 releases of JWNL from here:
http://sourceforge.net/projects/jwordnet/files/jwnl/JWNL%201.3/


None of them matched the SHA1 of the one we are using. Apparently someone
built the JWNL we are using from the source.
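
For anyone who wants to repeat the comparison, the following sketch computes the SHA-1 of a jar file using the standard java.security.MessageDigest class. Sha1Sketch is a hypothetical class name; the path is taken from the command line, and the expected checksum has to come from the repository's own .sha1 file.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper: prints the SHA-1 checksum of a file as lowercase hex. */
final class Sha1Sketch {

  /** Returns the SHA-1 digest of the given bytes as a 40-character hex string. */
  static String sha1Hex(byte[] data) throws NoSuchAlgorithmException {
    MessageDigest digest = MessageDigest.getInstance("SHA-1");
    StringBuilder hex = new StringBuilder();
    for (byte b : digest.digest(data)) {
      hex.append(String.format("%02x", b));
    }
    return hex.toString();
  }

  public static void main(String[] args) throws Exception {
    if (args.length != 1) {
      System.err.println("Usage: java Sha1Sketch <path-to-jar>");
      return;
    }
    // Reads the whole file into memory, which is fine for a small jar.
    byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
    System.out.println(sha1Hex(bytes));
  }
}
```

Two jars built from the same source can still differ in SHA-1, since jar entries carry timestamps, so a mismatch alone does not prove that the sources differ.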

Do you know if we are safe to use the SourceCode and JavaDoc of the release
1.3 rc3 they distribute?

Thank you,
William


On Tue, Feb 19, 2013 at 11:09 AM, Benson Margulies bimargul...@gmail.com wrote:

 yes, ossrh will do that

 On Feb 19, 2013, at 8:38 AM, William Colen william.co...@gmail.com
 wrote:

  Should we try to upload it to Central Repo using jwnl as groupid? What
 do
  you think?
 
 
  On Mon, Feb 18, 2013 at 3:03 PM, Benson Margulies bimargul...@gmail.com
 wrote:
 
  upload to central via ossrh.
 
  On Feb 18, 2013, at 12:46 PM, William Colen william.co...@gmail.com
  wrote:
 
  We are using jwnl as groupid; I don't know if we can upload using this
  groupid. The best would be to reflect the Java package, which is
  net.didion.jwnl, or to follow the one used by 1.4 rc3, which
  is net.sf.jwordnet.jwnl.
  But changing the groupid would not help the already released OpenNLP
  versions.
 
 
  On Mon, Feb 18, 2013 at 12:07 PM, Jörn Kottmann kottm...@gmail.com
  wrote:
 
  On 02/18/2013 03:17 PM, William Colen wrote:
 
  I suppose we can't use opennlp.apache.org to host it, can we?
  We probably could somehow distribute it from Apache servers, but if we
  can get it into the central maven repository it would also fix the
 issue
  for the already
  released OpenNLP versions as far as I know.
 
  Here is a guide which explains the process:
  https://docs.sonatype.org/display/Repository/Uploading+3rd-party+Artifacts+to+The+Central+Repository
 
 
  Jörn
 



Re: Next release

2013-02-18 Thread William Colen
With jwnl 1.4_rc3 the code at least compiles.

Also, it would be nice if someone familiar with the Coreference module
could add some tests to the test plan:

https://cwiki.apache.org/OPENNLP/testplan153.html


On Sun, Feb 17, 2013 at 10:07 PM, Lance Norskog goks...@gmail.com wrote:

 OPENNLP-510 Maven dependency on jwnl is broken

 The version of JWNL used in coreference does not have an available Maven
 download. This made it hard to add OpenNLP to the Lucene/Solr project.

 That project made a final (abandonment) release that is in Maven.
 http://search.maven.org/#artifactdetails%7Cnet.sf.jwordnet%7Cjwnl%7C1.4_rc3%7Cjar

 Are there any coref users out there? Could you please check if this
 version works?


 On 12/19/2012 12:17 PM, Jörn Kottmann wrote:

 Let's start to get the release done. Are there any issues, except the two
 open ones, which need to go into this release?

 Open issues are:
 OPENNLP-541 Improve ADChunkSampleStream
 OPENNLP-402 CLI tools and formats refactored

 Jörn

 On 09/12/2012 03:56 PM, Jörn Kottmann wrote:

 Hi all,

 it has been a while since we released 1.5.2 and to me it looks
 like its time for 1.5.3. I usually work now with the trunk version
 because it just contain too many fixes I need for my day job.

 I will volunteer to be release manager if nobody else wants to
 take this role.

 Any opinions?

 Jörn






Re: Next release

2013-02-18 Thread William Colen
I suppose we can't use opennlp.apache.org to host it, can we?


On Mon, Feb 18, 2013 at 10:57 AM, Jörn Kottmann kottm...@gmail.com wrote:

 On 02/18/2013 02:07 AM, Lance Norskog wrote:

 OPENNLP-510 Maven dependency on jwnl is broken

 The version of JWNL used in coreference does not have an available
 Maven download. This made it hard to add OpenNLP to the Lucene/Solr
 project.

 That project made a final (abandonment) release that is in Maven.
 http://search.maven.org/#artifactdetails%7Cnet.sf.jwordnet%7Cjwnl%7C1.4_rc3%7Cjar


 Are there any coref users out there? Could you please check if this
 version works?


 The dependency is hosted in the maven repository on SourceForge, so it
 should be possible
 to get the 1.3 dependency automatically. From time to time this site gets
 too much traffic
 and is blocked, which makes the build unreliable.

 We shouldn't update to a newer version just for the sake of solving the
 repository problem,
 we could instead just create a new repository somewhere else or try to get
 it into an existing one.

 Jörn


