[ 
https://issues.apache.org/jira/browse/OPENNLP-170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Assaf Urieli updated OPENNLP-170:
---------------------------------

    Attachment: GISTrainerChangeLog.txt
                ScaleDoesntMatterTest.java
                GISTrainer.java

Patch to fix this issue

> OpenNLP Maxent miscalculates for real values < 1
> ------------------------------------------------
>
>                 Key: OPENNLP-170
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-170
>             Project: OpenNLP
>          Issue Type: Bug
>          Components: Maxent
>    Affects Versions: maxent-3.0.0-sourceforge
>         Environment: Windows 7, Java 1.6
>            Reporter: Assaf Urieli
>            Assignee: Jason Baldridge
>             Fix For: tools-1.5.2-incubating, maxent-3.0.2-incubating
>
>         Attachments: GISTrainer.java, GISTrainerChangeLog.txt, 
> ScaleDoesntMatterTest.java
>
>
> When using predicates with real values, training on predA=0.1, predB=0.2 
> gives different results than training on predA=10, predB=20.
> However, using predA=1, predB=2 gives the same results as predA=10, predB=20.
> Test below:
> package openMaxentTest;
> import java.io.StringReader;
> import junit.framework.TestCase;
> import opennlp.maxent.GIS;
> import opennlp.maxent.PlainTextByLineDataStream;
> import opennlp.maxent.RealBasicEventStream;
> import opennlp.model.EventStream;
> import opennlp.model.MaxentModel;
> import opennlp.model.OnePassRealValueDataIndexer;
> import opennlp.model.RealValueFileEventStream;
> public class ScaleDoesntMatterTest extends TestCase {
>       /**
>        * This test sets out to prove that the scale you use on real-valued
>        * predicates doesn't matter when it comes to the probability assigned
>        * to each outcome.
>        * Strangely, if we use (1,2) and (10,20) there's no difference.
>        * If we use (0.1,0.2) and (10,20) there is a difference.
>        * @throws Exception
>        */
>       public void testScaleResults() throws Exception {
>               String smallValues = "predA=0.1 predB=0.2 A\n" +
>                               "predB=0.3 predA=0.1 B\n";
>               
>               String smallTest = "predA=0.2 predB=0.2";
>               
>               String largeValues = "predA=10 predB=20 A\n" +
>                               "predB=30 predA=10 B\n";
>               
>               String largeTest = "predA=20 predB=20";
>               
>               StringReader smallReader = new StringReader(smallValues);
>               EventStream smallEventStream = new RealBasicEventStream(
>                               new PlainTextByLineDataStream(smallReader));
>               MaxentModel smallModel = GIS.trainModel(2,
>                               new OnePassRealValueDataIndexer(smallEventStream, 0), false);
>               String[] contexts = smallTest.split(" ");
>               float[] values = RealValueFileEventStream.parseContexts(contexts);
>               double[] ocs = smallModel.eval(contexts, values);
>               
>               String smallResults = smallModel.getAllOutcomes(ocs);
>               System.out.println("smallResults: " + smallResults);
>               
>               StringReader largeReader = new StringReader(largeValues);
>               EventStream largeEventStream = new RealBasicEventStream(
>                               new PlainTextByLineDataStream(largeReader));
>               MaxentModel largeModel = GIS.trainModel(2,
>                               new OnePassRealValueDataIndexer(largeEventStream, 0), false);
>               contexts = largeTest.split(" ");
>               values = RealValueFileEventStream.parseContexts(contexts);
>               ocs = largeModel.eval(contexts, values);
>               
>               String largeResults = largeModel.getAllOutcomes(ocs);
>               System.out.println("largeResults: " + largeResults);
>               
>               assertEquals(smallResults, largeResults);
>               
>       }
> }
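> The problem concerns the correctionConstant in GISTrainer, which is declared 
> as an int: it starts at 1 and summed values are rounded up with Math.ceil, so 
> it can never drop below 1 even when every event's values sum to less than 1. 
> I implemented the following fix in class GISTrainer: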
>     // determine the correction constant and its inverse
>     //int correctionConstant = 1;
>     float correctionConstant = 0;
>     for (int ci = 0; ci < contexts.length; ci++) {
>       if (values == null || values[ci] == null) {
>         if (contexts[ci].length > correctionConstant) {
>           correctionConstant = contexts[ci].length;
>         }
>       }
>       else {
>         float cl = values[ci][0];
>         for (int vi = 1; vi < values[ci].length; vi++) {
>           cl += values[ci][vi];
>         }
>
>         if (cl > correctionConstant) {
>           //correctionConstant=(int) Math.ceil(cl);
>           correctionConstant = cl;
>         }
>       }
>     }
> I'd be curious to know if there's a reason for using an integer 
> correctionConstant.
> Rgds,
> Assaf Urieli
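
The scale dependence reported above can be illustrated without OpenNLP. The following is a minimal sketch, not the actual GISTrainer code: the class name CorrectionConstantSketch and both method names are hypothetical, the "integer" variant is reconstructed only from the commented-out lines in the quoted patch, and the (1,2)-scale rows are assumed to be the test data scaled by ten. It computes the largest per-event sum of values both ways for the data used in the test.

```java
// Sketch only: mirrors the two variants visible in the quoted patch,
// not the real GISTrainer class. All names here are hypothetical.
class CorrectionConstantSketch {

    // Integer variant, as suggested by the commented-out lines:
    // starts at 1 and rounds each larger sum up with Math.ceil.
    static int integerConstant(float[][] values) {
        int correctionConstant = 1;
        for (float[] row : values) {
            float cl = 0f;
            for (float v : row) {
                cl += v;
            }
            if (cl > correctionConstant) {
                correctionConstant = (int) Math.ceil(cl);
            }
        }
        return correctionConstant;
    }

    // Real-valued variant, following the patched loop:
    // starts at 0 and keeps the largest sum unrounded.
    static float realConstant(float[][] values) {
        float correctionConstant = 0f;
        for (float[] row : values) {
            float cl = 0f;
            for (float v : row) {
                cl += v;
            }
            if (cl > correctionConstant) {
                correctionConstant = cl;
            }
        }
        return correctionConstant;
    }

    public static void main(String[] args) {
        // Rows correspond to the two training events in the test.
        float[][] small = {{0.1f, 0.2f}, {0.3f, 0.1f}};
        float[][] unit  = {{1f, 2f}, {3f, 1f}};      // assumed (1,2)-scale data
        float[][] large = {{10f, 20f}, {30f, 10f}};

        System.out.println("int:   " + integerConstant(small) + " "
                + integerConstant(unit) + " " + integerConstant(large));
        System.out.println("float: " + realConstant(small) + " "
                + realConstant(unit) + " " + realConstant(large));
    }
}
```

With the integer variant the constant for the small data stays at 1 while the large data gives 40, so the constant no longer scales with the values (a factor of 40 instead of 100). For the (1,2) and (10,20) data both variants give 4 and 40, which matches the observation that those two scales agree.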

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
