CoNLL02 format issue

Roque Vera Wed, 12 Mar 2014 05:47:22 -0700

I found an issue in the TokenNameFinderConverter tool. I am trying to
convert a file in CoNLL 2002 format into the OpenNLP format. When I run
"opennlp TokenNameFinderConverter conll02 -data esp.testa -lang es -types
per > corpus_testa.txt" on the command line, I get the following error:

IO error while converting data : Expected three fields per line in training data, got 2 for line 'Sao B-LOC'!
Expected three fields per line in training data, got 2 for line 'Sao B-LOC'!
java.io.IOException: Expected three fields per line in training data, got 2 for line 'Sao B-LOC'!
        at opennlp.tools.formats.Conll02NameSampleStream.read(Conll02NameSampleStream.java:140)
        at opennlp.tools.formats.Conll02NameSampleStream.read(Conll02NameSampleStream.java:49)
        at opennlp.tools.cmdline.AbstractConverterTool.run(AbstractConverterTool.java:110)
        at opennlp.tools.cmdline.CLI.main(CLI.java:222)



The reason is clear: the converter expects three fields per line, but my
file "esp.testa" has only two. The curious thing is that this file comes
from CoNLL's test data set.


I propose two solutions for this problem. The first is to insert a third
field between the two existing ones. For example, the file may originally
contain a line in IOB2 format such as "Sao B-LOC", which we change to
"Sao VP B-LOC". Here "VP" is a POS tag whose value does not matter to the
implementation, because the converter only reads the first and third
fields. I created a modified version of the test data set accordingly.
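The first workaround can be sketched as a small preprocessing step. This is only an illustration: the class name Conll02ColumnPadder and its methods are hypothetical and not part of OpenNLP, the sample lines other than "Sao B-LOC" are made up, and file reading (including the character encoding of the data file) is left out on purpose.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; it is not part of OpenNLP.
class Conll02ColumnPadder {

    // Placeholder POS tag. Its value is an arbitrary choice.
    static final String DUMMY_POS = "VP";

    // Turns "token tag" into "token VP tag"; any other line is returned unchanged,
    // so blank sentence separators and three-field lines survive as they are.
    static String padLine(String line) {
        String[] fields = line.split(" ");
        if (fields.length == 2) {
            return fields[0] + " " + DUMMY_POS + " " + fields[1];
        }
        return line;
    }

    // Applies padLine to every line, preserving order.
    static List<String> padLines(List<String> lines) {
        List<String> padded = new ArrayList<>(lines.size());
        for (String line : lines) {
            padded.add(padLine(line));
        }
        return padded;
    }

    public static void main(String[] args) {
        // Made-up sample input in the two-field layout.
        List<String> sample = Arrays.asList("Sao B-LOC", "Paulo I-LOC", "", "( O");
        for (String line : padLines(sample)) {
            System.out.println(line);
        }
    }
}
```

The padded lines can then be written to a new file and passed to the converter in place of the original two-field file.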


The other possible solution is to change the code in
"apache-opennlp-1.5.3-src\opennlp-tools\src\main\java\opennlp\tools\formats\Conll02NameSampleStream.java",
beginning at line 133. The original code is shown first, followed by the
proposed replacement.

Original code:

      String fields[] = line.split(" ");

      if (fields.length == 3) {
        sentence.add(fields[0]);
        tags.add(fields[2]);
      }
      else {
        throw new IOException("Expected three fields per line in training data, got " +
            fields.length + " for line '" + line + "'!");
      }

Proposed code:

      String fields[] = line.split(" ");

      if (fields.length == 3) {
        sentence.add(fields[0]);
        tags.add(fields[2]);
      }
      // "else if", so that a three-field line does not fall through to the exception
      else if (fields.length == 2) {
        sentence.add(fields[0]);
        tags.add(fields[1]);
      }
      else {
        throw new IOException("Expected three or two fields per line in training data, got " +
            fields.length + " for line '" + line + "'!");
      }

The first branch is necessary because the CoNLL training data set has three
fields per line. The second branch only serves the test data set, which is
the case where I have the problem.
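The two-or-three-field check can also be tried outside OpenNLP with a small stand-alone sketch. The class Conll02FieldCheck and the method tokenAndTag are hypothetical names used only for this illustration; they mirror the branching described above but are not the OpenNLP implementation. The branches are chained so that a three-field line never reaches the exception.

```java
import java.io.IOException;

// Hypothetical stand-alone sketch; it is not the OpenNLP class itself.
class Conll02FieldCheck {

    // Returns {token, tag} for a line with two or three space-separated fields.
    static String[] tokenAndTag(String line) throws IOException {
        String[] fields = line.split(" ");
        if (fields.length == 3) {
            // Training layout: token, POS tag, entity tag.
            return new String[] {fields[0], fields[2]};
        } else if (fields.length == 2) {
            // Two-field layout: token, entity tag.
            return new String[] {fields[0], fields[1]};
        } else {
            throw new IOException("Expected three or two fields per line in training data, got "
                + fields.length + " for line '" + line + "'!");
        }
    }

    public static void main(String[] args) throws IOException {
        String[] twoFields = tokenAndTag("Sao B-LOC");
        String[] threeFields = tokenAndTag("Sao VP B-LOC");
        System.out.println(twoFields[0] + " " + twoFields[1]);
        System.out.println(threeFields[0] + " " + threeFields[1]);
    }
}
```

Both sample lines yield the same token and tag pair, which is the behaviour the proposed change aims for.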


I hope this suggestion helps to solve the problem.

Sincerely,

Roque Vera.
Facultad Politécnica, Universidad Nacional de Asunción.
Paraguay.
