Have a look at the Sequence Coding thread here on the list.

The name finder has always used IOB2 coding. We have now made this
configurable, so it can be replaced by other codecs such as BILOU or,
once that work is done, by a user-implemented codec.

To detect names in a sentence, the name finder uses a learnable classifier. The classifier has to decide whether a token is part of a name or not. The logic that determines which labels are used to encode/
decode name spans is now the responsibility of the SequenceCodec object.

In the IOB2 codec (see the BioCodec class) the tokens are labeled as Begin, Inside, or Other.
Each new name span has to start with the Begin label.
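
As a rough illustration of the IOB2 scheme described above, here is a minimal standalone sketch. It does not use the OpenNLP API: the class name Iob2Sketch, the method names, the {start, end} span representation, and the single-letter label strings "B", "I" and "O" are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: names and label strings here are assumptions, not OpenNLP's API.
class Iob2Sketch {

    // Encodes name spans as one label per token.
    // Each span is {start, end} with end exclusive; spans must be non-empty and not overlap.
    static String[] encode(int tokenCount, int[][] spans) {
        String[] labels = new String[tokenCount];
        Arrays.fill(labels, "O");                  // every token starts as Other
        for (int[] span : spans) {
            labels[span[0]] = "B";                 // first token of a name: Begin
            for (int i = span[0] + 1; i < span[1]; i++) {
                labels[i] = "I";                   // remaining tokens of the name: Inside
            }
        }
        return labels;
    }

    // Decodes labels back into {start, end} spans (end exclusive).
    // An "I" that does not follow a "B" is ignored in this sketch.
    static List<int[]> decode(String[] labels) {
        List<int[]> spans = new ArrayList<>();
        int start = -1;                            // -1 means "not inside a name"
        for (int i = 0; i < labels.length; i++) {
            if (labels[i].equals("B")) {
                if (start != -1) {
                    spans.add(new int[] {start, i}); // a new Begin closes the open span
                }
                start = i;
            } else if (labels[i].equals("O")) {
                if (start != -1) {
                    spans.add(new int[] {start, i});
                }
                start = -1;
            }
        }
        if (start != -1) {
            spans.add(new int[] {start, labels.length});
        }
        return spans;
    }

    public static void main(String[] args) {
        // Tokens: Angela Merkel visited New York today
        String[] labels = encode(6, new int[][] {{0, 2}, {3, 5}});
        System.out.println(String.join(" ", labels)); // prints "B I O B I O"
    }
}
```

Because every span starts with Begin, two names that directly follow each other (labels B I B) can still be decoded as two separate spans.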

The BILOU codec uses the following labels: Begin, Inside, Last, Unit and Other.
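
To make the five BILOU labels concrete, here is a minimal standalone sketch of how name spans could be turned into labels. It does not use the OpenNLP API: the class name BilouSketch, the {start, end} span representation, and the single-letter label strings are assumptions made for this example.

```java
import java.util.Arrays;

// Illustrative only: names and label strings here are assumptions, not OpenNLP's API.
class BilouSketch {

    // Encodes name spans as one label per token.
    // Each span is {start, end} with end exclusive; spans must be non-empty and not overlap.
    static String[] encode(int tokenCount, int[][] spans) {
        String[] labels = new String[tokenCount];
        Arrays.fill(labels, "O");                  // every token starts as Other
        for (int[] span : spans) {
            int first = span[0];
            int last = span[1] - 1;
            if (first == last) {
                labels[first] = "U";               // single-token name: Unit
            } else {
                labels[first] = "B";               // first token: Begin
                for (int i = first + 1; i < last; i++) {
                    labels[i] = "I";               // middle tokens: Inside
                }
                labels[last] = "L";                // final token: Last
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        // Tokens: Bank of America hired Smith
        String[] labels = encode(5, new int[][] {{0, 3}, {4, 5}});
        System.out.println(String.join(" ", labels)); // prints "B I L O U"
    }
}
```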

There might be advantages to switching the codec depending on the data you are using. On the German CoNLL03 data, the evaluation results are slightly better with BILOU
than with IOB2.

The BILOU codec uses more labels and will therefore be more resource intensive than IOB2.
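
To get a feeling for the difference, here is a small back-of-the-envelope sketch. It assumes that every name type gets its own copy of each label except Other, which is shared; that is an assumption for this example, as are the class and method names.

```java
// Illustrative only: assumes one label set per name type plus a single shared Other label.
class LabelCountSketch {

    // IOB2: Begin and Inside per name type, plus Other.
    static int iob2Labels(int nameTypes) {
        return 2 * nameTypes + 1;
    }

    // BILOU: Begin, Inside, Last and Unit per name type, plus Other.
    static int bilouLabels(int nameTypes) {
        return 4 * nameTypes + 1;
    }

    public static void main(String[] args) {
        int nameTypes = 4; // CoNLL03 annotates four types: PER, LOC, ORG, MISC
        System.out.println("IOB2:  " + iob2Labels(nameTypes));  // prints "IOB2:  9"
        System.out.println("BILOU: " + bilouLabels(nameTypes)); // prints "BILOU: 17"
    }
}
```

Under that assumption the classifier has to choose between roughly twice as many outcomes with BILOU.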

Also have a look at the Wikipedia article about IOB:
http://en.wikipedia.org/wiki/Inside_Outside_Beginning

HTH,
Jörn

On 03/05/2014 02:18 PM, Mark G wrote:
Hello, I updated the tools trunk two days ago and stopped getting NER
results. I chatted with Joern and he made a change to the seq codec that
brought everything back to normal. For the benefit of everyone on the dev
list, would it be possible for someone to explain the changes regarding the
sequence codec: its benefits, the differences, and where in the code to
look to see what it is actually doing. Don't need anything elaborate, just
a point of departure for inquiry.
MG

