Hi all,
The chunker and name finder both use IOB2 sequence coding. The logic
for this is hard-coded in both components.
I would like to suggest that we introduce a SequenceCodec interface to
abstract this code. That would allow us to share one codec implementation
between both components, and to replace it with other sequence codecs
such as BILOU.
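To make the idea more concrete, here is a rough sketch of what such an
interface could look like. Only the name SequenceCodec comes from the
proposal; the method signatures, the TypedSpan helper and the two codec
classes are assumptions for illustration, not existing API. The sketch
also assumes that spans are non-empty, non-overlapping and within the
sentence length.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: a typed token range [start, end), assumed non-empty.
final class TypedSpan {
    final int start;
    final int end;
    final String type;

    TypedSpan(int start, int end, String type) {
        this.start = start;
        this.end = end;
        this.type = type;
    }
}

// Hypothetical interface: the name is from the proposal, the methods are assumptions.
interface SequenceCodec {
    // Turns non-overlapping, in-range spans into one outcome per token.
    String[] encode(List<TypedSpan> spans, int length);

    // Turns a sequence of outcomes back into spans.
    List<TypedSpan> decode(String[] outcomes);
}

final class Iob2Codec implements SequenceCodec {
    public String[] encode(List<TypedSpan> spans, int length) {
        String[] out = new String[length];
        Arrays.fill(out, "O");
        for (TypedSpan s : spans) {
            out[s.start] = "B-" + s.type;
            for (int i = s.start + 1; i < s.end; i++) {
                out[i] = "I-" + s.type;
            }
        }
        return out;
    }

    public List<TypedSpan> decode(String[] outcomes) {
        List<TypedSpan> spans = new ArrayList<>();
        int start = -1;
        String type = null;
        // One extra iteration with a virtual "O" closes a span at the end.
        for (int i = 0; i <= outcomes.length; i++) {
            String tag = i < outcomes.length ? outcomes[i] : "O";
            if (start >= 0 && !tag.equals("I-" + type)) {
                spans.add(new TypedSpan(start, i, type));
                start = -1;
            }
            if (tag.startsWith("B-")) {
                start = i;
                type = tag.substring(2);
            }
        }
        return spans;
    }
}

final class BilouCodec implements SequenceCodec {
    public String[] encode(List<TypedSpan> spans, int length) {
        String[] out = new String[length];
        Arrays.fill(out, "O");
        for (TypedSpan s : spans) {
            if (s.end - s.start == 1) {
                out[s.start] = "U-" + s.type;
            } else {
                out[s.start] = "B-" + s.type;
                for (int i = s.start + 1; i < s.end - 1; i++) {
                    out[i] = "I-" + s.type;
                }
                out[s.end - 1] = "L-" + s.type;
            }
        }
        return out;
    }

    public List<TypedSpan> decode(String[] outcomes) {
        // Lenient: only U tags and B ... L runs become spans; types are not cross-checked.
        List<TypedSpan> spans = new ArrayList<>();
        int start = -1;
        for (int i = 0; i < outcomes.length; i++) {
            String tag = outcomes[i];
            if (tag.startsWith("B-")) {
                start = i;
            } else if (tag.startsWith("U-")) {
                spans.add(new TypedSpan(i, i + 1, tag.substring(2)));
            } else if (tag.startsWith("L-") && start >= 0) {
                spans.add(new TypedSpan(start, i + 1, tag.substring(2)));
            }
            // Anything other than B- or I- ends the currently open run.
            if (!tag.startsWith("B-") && !tag.startsWith("I-")) {
                start = -1;
            }
        }
        return spans;
    }
}
```

With something along these lines, the chunker and the name finder would
only talk to the interface. For a five-token sentence with a "person"
span over tokens 0-1 and a "location" span on token 3, the IOB2 codec
above yields B-person I-person O B-location O, and the BILOU codec
yields B-person L-person O U-location O.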
On my NER test datasets, the F-measure went up or down by around 1% with
BILOU coding compared to IOB2 coding, depending on the machine learner
and dataset.
I haven't done any testing with the chunker.
Any opinions? Is it worth the effort?
Jörn