OpenNLP Maxent miscalculates for real values < 1
------------------------------------------------

                 Key: OPENNLP-170
                 URL: https://issues.apache.org/jira/browse/OPENNLP-170
             Project: OpenNLP
          Issue Type: Bug
          Components: Maxent
    Affects Versions: maxent-3.0.0-sourceforge
         Environment: Windows 7, Java 1.6
            Reporter: Assaf Urieli


When using predicates with real values, training on predA=0.1 predB=0.2
gives different results than training on predA=10 predB=20.
However, using predA=1 predB=2 gives the same results as predA=10 predB=20.

Test below:
package openMaxentTest;

import java.io.StringReader;
import junit.framework.TestCase;

import opennlp.maxent.GIS;
import opennlp.maxent.PlainTextByLineDataStream;
import opennlp.maxent.RealBasicEventStream;
import opennlp.model.EventStream;
import opennlp.model.MaxentModel;
import opennlp.model.OnePassRealValueDataIndexer;
import opennlp.model.RealValueFileEventStream;


public class ScaleDoesntMatterTest extends TestCase {

        /**
         * This test sets out to prove that the scale you use on real-valued
         * predicates doesn't matter when it comes to the probability assigned
         * to each outcome.
         * Strangely, if we use (1,2) and (10,20) there's no difference.
         * If we use (0.1,0.2) and (10,20) there is a difference.
         * @throws Exception
         */
        public void testScaleResults() throws Exception {
                String smallValues = "predA=0.1 predB=0.2 A\n" +
                                "predB=0.3 predA=0.1 B\n";
                
                String smallTest = "predA=0.2 predB=0.2";
                
                String largeValues = "predA=10 predB=20 A\n" +
                                "predB=30 predA=10 B\n";
                
                String largeTest = "predA=20 predB=20";
                
                StringReader smallReader = new StringReader(smallValues);
                EventStream smallEventStream = new RealBasicEventStream(
                                new PlainTextByLineDataStream(smallReader));

                MaxentModel smallModel = GIS.trainModel(2,
                                new OnePassRealValueDataIndexer(smallEventStream, 0), false);
                String[] contexts = smallTest.split(" ");
                float[] values = RealValueFileEventStream.parseContexts(contexts);
                double[] ocs = smallModel.eval(contexts, values);
                
                String smallResults = smallModel.getAllOutcomes(ocs);
                System.out.println("smallResults: " + smallResults);
                
                StringReader largeReader = new StringReader(largeValues);
                EventStream largeEventStream = new RealBasicEventStream(
                                new PlainTextByLineDataStream(largeReader));

                MaxentModel largeModel = GIS.trainModel(2,
                                new OnePassRealValueDataIndexer(largeEventStream, 0), false);
                contexts = largeTest.split(" ");
                values = RealValueFileEventStream.parseContexts(contexts);
                ocs = largeModel.eval(contexts, values);
                
                String largeResults = largeModel.getAllOutcomes(ocs);
                System.out.println("largeResults: " + largeResults);
                
                assertEquals(smallResults, largeResults);
                
        }
}

The problem concerns the correctionConstant in GISTrainer, which is set to be 
an integer. I implemented the following fix in class GISTrainer:
    // determine the correction constant and its inverse
    //int correctionConstant = 1;
    float correctionConstant = 0;
    for (int ci = 0; ci < contexts.length; ci++) {
      if (values == null || values[ci] == null) {
        if (contexts[ci].length > correctionConstant) {
          correctionConstant = contexts[ci].length;
        }
      }
      else {
        float cl = values[ci][0];
        for (int vi=1;vi<values[ci].length;vi++) {
          cl+=values[ci][vi];
        }
        
        if (cl > correctionConstant) {
          //correctionConstant=(int) Math.ceil(cl);
          correctionConstant= cl;
        }
      }
    }

I'd be curious to know if there's a reason for using an integer 
correctionConstant.
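
To illustrate the effect, here is a minimal standalone sketch (not the actual
GISTrainer code) that compares the two ways of computing the constant. The
class and method names are made up for this example, and the integer variant
is reconstructed from the commented-out lines in the fix above, so treat it as
an assumption about the original behaviour. With the integer version, the
constant for the (0.1,0.2) data stays at 1 instead of shrinking with the data,
while the (1,2) and (10,20) data scale proportionally, which seems consistent
with what the test shows.

```java
public class CorrectionConstantSketch {

    // Assumed original behaviour: an int constant that starts at 1
    // and is rounded up to the next integer.
    static int integerConstant(float[][] values) {
        int correctionConstant = 1;
        for (float[] row : values) {
            float cl = 0f;
            for (float v : row) {
                cl += v;
            }
            if (cl > correctionConstant) {
                correctionConstant = (int) Math.ceil(cl);
            }
        }
        return correctionConstant;
    }

    // Behaviour of the proposed fix: a float constant equal to the
    // largest sum of values over all contexts.
    static float floatConstant(float[][] values) {
        float correctionConstant = 0f;
        for (float[] row : values) {
            float cl = 0f;
            for (float v : row) {
                cl += v;
            }
            if (cl > correctionConstant) {
                correctionConstant = cl;
            }
        }
        return correctionConstant;
    }

    public static void main(String[] args) {
        // Same training values as in the test, at three scales.
        float[][] small  = {{0.1f, 0.2f}, {0.3f, 0.1f}};
        float[][] medium = {{1f, 2f}, {3f, 1f}};
        float[][] large  = {{10f, 20f}, {30f, 10f}};

        // Integer version: 1, 4, 40 (small is not 10x smaller than medium).
        System.out.println(integerConstant(small) + " "
                + integerConstant(medium) + " " + integerConstant(large));

        // Float version: roughly 0.4, 4.0, 40.0 (proportional to the scale).
        System.out.println(floatConstant(small) + " "
                + floatConstant(medium) + " " + floatConstant(large));
    }
}
```

The sketch only looks at the constant itself; it does not show how the
constant is used later in training.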

Rgds,
Assaf Urieli
