On 02/19/2014 01:25 PM, William Colen wrote:
> Is the SequenceValidator the only thing we need to change? If a corpus uses
> BILOU, the formatters need to convert it to IOB2?
The format parsing code creates Span objects. The name finder and
chunker take these Span objects and then perform IOB2 coding on them
(start, cont, other).

The coding is done in two places: during training the Span objects are
encoded into tag sequences, and during tagging the tag sequences are
decoded into Span objects again.
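
To make the two directions concrete, here is a rough sketch of the start/cont/other coding. It is not the actual name finder code: SimpleSpan is a hypothetical stand-in for the real Span class (token offsets, end exclusive), StartContOtherCodec is an invented name, and the sketch ignores name types and assumes the spans do not overlap.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical minimal stand-in for Span: token offsets, end is exclusive.
final class SimpleSpan {
    final int start;
    final int end;

    SimpleSpan(int start, int end) {
        this.start = start;
        this.end = end;
    }
}

// Hypothetical helper showing both directions of the coding.
final class StartContOtherCodec {
    static final String START = "start";
    static final String CONT = "cont";
    static final String OTHER = "other";

    // Training direction: spans -> one tag per token.
    static String[] encode(SimpleSpan[] names, int length) {
        String[] tags = new String[length];
        Arrays.fill(tags, OTHER);
        for (SimpleSpan name : names) {
            tags[name.start] = START;
            for (int i = name.start + 1; i < name.end; i++) {
                tags[i] = CONT;
            }
        }
        return tags;
    }

    // Tagging direction: tag sequence -> spans.
    static SimpleSpan[] decode(List<String> tags) {
        List<SimpleSpan> spans = new ArrayList<>();
        int open = -1; // start index of the span being built, -1 if none
        for (int i = 0; i < tags.size(); i++) {
            String tag = tags.get(i);
            if (START.equals(tag)) {
                if (open != -1) {
                    spans.add(new SimpleSpan(open, i)); // close the previous span
                }
                open = i;
            } else if (OTHER.equals(tag) && open != -1) {
                spans.add(new SimpleSpan(open, i));
                open = -1;
            }
            // "cont" extends the open span; a stray "cont" is ignored here.
        }
        if (open != -1) {
            spans.add(new SimpleSpan(open, tags.size()));
        }
        return spans.toArray(new SimpleSpan[0]);
    }
}
```

With spans [0, 2) and [3, 4) over five tokens, encode yields start, cont, other, start, other, and decode on that sequence gives the same two spans back. A BILOU codec would differ only in these two methods, which is the reason for pulling them behind an interface.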
An interface like this could work for the name finder (didn't check the
chunker yet):
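public interface SequenceCodec {
    // Tagging: decode a tag sequence back into Span objects.
    Span[] decode(List<String> c);
    // Training: encode the Span objects as a tag sequence of the given length.
    String[] encode(Span[] names, int length);
    // Factory method for the validator that matches this codec.
    SequenceValidator createSequenceValidator();
}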
The SequenceValidator of course depends on the codec in use and could be
created by a factory method.
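
As an illustration of why the validator has to come from the codec, here is a small sketch with one validator per coding scheme. Everything in it is hypothetical: TagSequenceValidator is a simplified stand-in modeled loosely on our SequenceValidator, the factory method names are invented, and the BILOU tags are assumed to be the plain letters B, I, L, O, U without type prefixes.

```java
// Hypothetical, simplified validator interface for this sketch only.
interface TagSequenceValidator {
    // i: index of the token being tagged.
    // previousOutcomes: tags already chosen for tokens 0..i-1.
    // outcome: candidate tag for token i.
    boolean validSequence(int i, String[] previousOutcomes, String outcome);
}

// Hypothetical factory methods, one per coding scheme.
final class ValidatorFactories {

    // start/cont/other: "cont" may only follow "start" or "cont".
    static TagSequenceValidator forStartContOther() {
        return (i, prev, outcome) -> {
            if (!"cont".equals(outcome)) {
                return true;
            }
            if (i == 0) {
                return false; // a sequence cannot begin with "cont"
            }
            String last = prev[i - 1];
            return "start".equals(last) || "cont".equals(last);
        };
    }

    // BILOU: after B or I only I or L may follow; otherwise only B, O or U.
    static TagSequenceValidator forBilou() {
        return (i, prev, outcome) -> {
            boolean spanOpen = i > 0
                && ("B".equals(prev[i - 1]) || "I".equals(prev[i - 1]));
            boolean continuesSpan = "I".equals(outcome) || "L".equals(outcome);
            return spanOpen == continuesSpan;
        };
    }
}
```

The same candidate tag sequence can be legal under one scheme and illegal under the other, so a codec that hands out its own validator keeps the two consistent. Note that a per-step check like this cannot reject a BILOU sequence that ends on B or I; that would need an extra check at the end of the sequence.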
Some machine learners, e.g. Mallet CRF, don't support our sequence
validation. I am not yet sure how we should handle that case.
Jörn