We already have a properties file inside the model. It wouldn't be difficult to add a property to it that stores the EOS characters used during training.
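
A minimal sketch of that idea, using only java.util.Properties. The key name "eosCharacters" and the helper class are hypothetical and are not part of the current OpenNLP model format; the sketch only shows how the characters could be written at training time and read back at detection time, falling back to defaults for older models.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Properties;

// Hypothetical helper, not an OpenNLP class.
class EosCharsProperty {

    // Assumed property key; OpenNLP does not define this entry today.
    static final String EOS_KEY = "eosCharacters";

    // Called after training: record the EOS characters that were used.
    static void storeEosChars(Properties props, char[] eosChars) {
        props.setProperty(EOS_KEY, new String(eosChars));
    }

    // Called when loading a model: use the stored characters if present,
    // otherwise fall back to the given defaults (e.g. for older models).
    static char[] loadEosChars(Properties props, char[] defaults) {
        String value = props.getProperty(EOS_KEY);
        return value == null ? defaults.clone() : value.toCharArray();
    }

    public static void main(String[] args) throws IOException {
        Properties trained = new Properties();
        storeEosChars(trained, new char[] {'.', '!', '?', ':'});

        // Round-trip through the properties file format, as a model
        // package would when it is written to and read from disk.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        trained.store(out, null);
        Properties loaded = new Properties();
        loaded.load(new ByteArrayInputStream(out.toByteArray()));

        char[] eos = loadEosChars(loaded, new char[] {'.', '!', '?'});
        System.out.println(new String(eos));
    }
}
```

With something like this, the sentence detector would pick up ":" at detection time without a custom Factory class, because the set of characters travels with the model.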
Jörn

On Thu, Feb 9, 2012 at 10:06 AM, Katrin Tomanek <katrin.toma...@averbis.com> wrote:

> Hi Jörn,
>
> thanks for this explanation.
> What you are saying means that the context generator and the EOS scanner
> are not stored in the model, right?
>
> I had assumed this... other ML toolkits, such as Mallet (which uses
> the "Pipe" logic where OpenNLP uses event streams), actually do this.
>
> Maybe this would also be a good improvement...
>
> Best
> Katrin
>
> On 02/09/2012 09:56 AM, Joern Kottmann wrote:
>
>> When you only do it during training then it will not consider ":" as
>> a possible split during detection. That explains your drop in accuracy.
>>
>> It looks like it is not possible to modify the EOS characters properly
>> with the current version. I suggest that you check out the source code
>> and then change the defaultEosCharacters array in
>> opennlp.tools.sentdetect.Factory.
>> With that you are able to do your test and get it working for now.
>>
>> Anyway, we should have an easy way to specify the EOS characters without
>> implementing a custom Factory class.
>>
>> Please open a Jira issue to improve this.
>>
>> Jörn
>>
>> On Thu, Feb 9, 2012 at 9:21 AM, Katrin Tomanek
>> <katrin.toma...@averbis.com> wrote:
>>
>>> Hi Jörn,
>>>
>>> I only modified the training process.
>>>
>>> However, when I check the predictions it turns out that the model never
>>> learns to split at ":" positions.
>>>
>>> Shouldn't it be enough to modify the DefaultSDContextGenerator and the
>>> DefaultEndOfSentenceScanner so that these know about ":" as an EOS?
>>> Or are there other places where ":" should be added?
>>>
>>> Best
>>> Katrin
>>>
>>> On 02/09/2012 09:18 AM, Joern Kottmann wrote:
>>>
>>>> Did you modify the evaluation as well? If you just do it during
>>>> training, the evaluator will not be able to consider ":" as an EOS
>>>> character.
>>>>
>>>> To me it sounds like it fails to split on the ":" in some place.
>>>>
>>>> The sentence detector uses a maxent model to classify every EOS
>>>> character as either a SPLIT or NO_SPLIT.
>>>>
>>>> Jörn
>>>>
>>>> On Thu, Feb 9, 2012 at 8:59 AM, Katrin Tomanek
>>>> <katrin.toma...@averbis.com> wrote:
>>>>
>>>>> Hi Willian,
>>>>>
>>>>> I am currently using opennlp-1.5.2 and try to use it as an API, i.e.
>>>>> not to modify its code but to write my own code around it. However,
>>>>> what I described below (with the SDEventStream) results in the same
>>>>> as you are describing: I am changing the set of EOS characters.
>>>>>
>>>>> I am just wondering why adding ":" as an EOS character decreases the
>>>>> results (dropping from ~80F to 45F in sentence splitting, and ":" is
>>>>> always a sentence boundary symbol in my data!)
>>>>>
>>>>> Looks like I need to debug a little bit more what's happening in the
>>>>> DefaultSDContextGenerator.
>>>
>>> --
>>> Dr. Katrin Tomanek
>>> Averbis GmbH
>>> Tennenbacher Strasse 11
>>> D-79106 Freiburg
>>>
>>> Fon: +49 (0) 761 - 203 97696
>>> Fax: +49 (0) 761 - 203 97694
>>> E-Mail: katrin.toma...@averbis.com
>>>
>>> Geschäftsführer: Dr. med. Philipp Daumke, Dr. Kornél Markó
>>> Sitz der Gesellschaft: Freiburg i. Br.
>>> AG Freiburg i. Br., HRB 701080