I found an issue in the TokenNameFinderConverter tool while trying to
convert a file from CoNLL 2002 format into the OpenNLP format. When I run

    opennlp TokenNameFinderConverter conll02 -data esp.testa -lang es -types per > corpus_testa.txt

on the command line, I get the following error:

    IO error while converting data : Expected three fields per line in training data, got 2 for line 'Sao B-LOC'!
    Expected three fields per line in training data, got 2 for line 'Sao B-LOC'!
    java.io.IOException: Expected three fields per line in training data, got 2 for line 'Sao B-LOC'!
        at opennlp.tools.formats.Conll02NameSampleStream.read(Conll02NameSampleStream.java:140)
        at opennlp.tools.formats.Conll02NameSampleStream.read(Conll02NameSampleStream.java:49)
        at opennlp.tools.cmdline.AbstractConverterTool.run(AbstractConverterTool.java:110)
        at opennlp.tools.cmdline.CLI.main(CLI.java:222)

The cause is clear: the converter expects three fields per line, but my
file "esp.testa" has only two. The curious thing is that this file comes
from the CoNLL test data set.
I propose two solutions for this problem. The first is to insert a third
field between the two existing ones. For example, a line in IOB2 format
such as "Sao B-LOC" would be changed to "Sao VP B-LOC", where "VP" is a
placeholder POS tag whose value does not matter, because the converter
only reads the first and third fields. I created a modified version of the
test data set accordingly.
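
The first workaround can be automated instead of editing the file by hand.
The following is only a minimal sketch: ConllPadder, padLine and padLines
are hypothetical names that are not part of OpenNLP, and it assumes fields
are separated by a single space, as in the line quoted in the error message.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of OpenNLP: pads two-field lines with a dummy POS tag. */
class ConllPadder {

    /**
     * Inserts placeholderTag between token and label when the line has
     * exactly two space-separated fields; any other line is returned unchanged
     * (blank sentence separators, lines that already have three fields).
     */
    static String padLine(String line, String placeholderTag) {
        String[] fields = line.split(" ");
        if (fields.length == 2) {
            return fields[0] + " " + placeholderTag + " " + fields[1];
        }
        return line;
    }

    /** Applies padLine to every line, preserving order. */
    static List<String> padLines(List<String> lines, String placeholderTag) {
        List<String> padded = new ArrayList<>();
        for (String line : lines) {
            padded.add(padLine(line, placeholderTag));
        }
        return padded;
    }

    public static void main(String[] args) {
        // Illustrative input only: the line from the error message plus a blank separator.
        List<String> input = Arrays.asList("Sao B-LOC", "");
        for (String line : padLines(input, "VP")) {
            System.out.println(line);
        }
    }
}
```

Lines that already have three fields pass through untouched, so the same
helper could be run over both the training and the test files.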
The other possible solution is to change the code in
"apache-opennlp-1.5.3-src\opennlp-tools\src\main\java\opennlp\tools\formats\Conll02NameSampleStream.java",
starting at line 133. The original code is listed first, followed by the
proposed replacement.

Original code:

    String fields[] = line.split(" ");
    if (fields.length == 3) {
      sentence.add(fields[0]);
      tags.add(fields[2]);
    }
    else {
      throw new IOException("Expected three fields per line in training data, got " +
          fields.length + " for line '" + line + "'!");
    }

Proposed code:

    String fields[] = line.split(" ");
    if (fields.length == 3) {
      // Three fields: token, POS tag, label (training data)
      sentence.add(fields[0]);
      tags.add(fields[2]);
    }
    // "else if", so that a three-field line does not fall through to the exception
    else if (fields.length == 2) {
      // Two fields: token, label (test data)
      sentence.add(fields[0]);
      tags.add(fields[1]);
    }
    else {
      throw new IOException("Expected two or three fields per line in training data, got " +
          fields.length + " for line '" + line + "'!");
    }

The first branch (three fields) is needed because the CoNLL training data
set has three fields per line. The second branch (two fields) only serves
the test data set, which is the case where I ran into the problem.
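
To check the branching outside OpenNLP, the logic can be isolated in a
small standalone class. This is only a sketch under my own assumptions:
ConllLineParser and parseLine are hypothetical names, not the actual
Conll02NameSampleStream code. The two length checks are chained with
"else if" so that a three-field line never reaches the exception.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical standalone version of the field handling, not the OpenNLP class itself. */
class ConllLineParser {

    /** Appends the token and its label to the given lists; accepts two or three fields. */
    static void parseLine(String line, List<String> sentence, List<String> tags)
            throws IOException {
        String[] fields = line.split(" ");
        if (fields.length == 3) {
            // token, POS tag, label: the POS tag is ignored
            sentence.add(fields[0]);
            tags.add(fields[2]);
        } else if (fields.length == 2) {
            // token, label
            sentence.add(fields[0]);
            tags.add(fields[1]);
        } else {
            throw new IOException("Expected two or three fields per line in training data, got "
                    + fields.length + " for line '" + line + "'!");
        }
    }

    public static void main(String[] args) throws IOException {
        List<String> sentence = new ArrayList<>();
        List<String> tags = new ArrayList<>();
        parseLine("Sao B-LOC", sentence, tags);     // two-field form
        parseLine("Sao VP B-LOC", sentence, tags);  // three-field form
        System.out.println(sentence + " " + tags);
    }
}
```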
I hope this suggestion helps to solve the problem.
Sincerely,
Roque Vera.
Facultad Politécnica, Universidad Nacional de Asunción.
Paraguay.