Hi Daniel,

I have some news. I wrote a tool to extract the chunk information from
the Bosque AD format and create OpenNLP training data. It is not working
well yet, but it is a starting point.
I'm getting the following results:

Precision: 0.7680814205283673
Recall: 0.8237343241987923
F-Measure: 0.7949350067234425
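
(Just to spell out the relation: the F-measure is the harmonic mean of
precision and recall, so F = 2*P*R / (P+R) = 2*0.768*0.824 / (0.768+0.824)
≈ 0.795 here.)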

These numbers are poor compared with the ones we get using the English
data. The problem is probably the heuristic used to extract the chunk
information.

What I did:
1. As described in that PUC-Rio paper: "Defined as chunk all consecutive
tokens within the same deepest-level phrase."
2. I'm considering the group forms described in section 2.1 of the Floresta
Symbolset <http://beta.visl.sdu.dk/visl/pt/info/symbolset-floresta.html>
A rough sketch of this heuristic is shown right after this list.
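
To make the heuristic concrete, here is a minimal sketch in Java of how the
BIO tags could be derived once each token is annotated with the type and the
node id of its deepest enclosing phrase. This is not the actual
ADChunkSampleStream code; the Token class and the phraseId field are just
assumptions for illustration.

import java.util.*;

/** Sketch only: assumes tokens already carry their deepest phrase node. */
public class ChunkHeuristicSketch {

    /** Hypothetical token holder: surface form, type of the deepest
     *  enclosing group ("np", "vp", "pp", ... or null if none), and an
     *  id shared by all tokens of the same deepest-level phrase node. */
    static class Token {
        final String form;
        final String phraseType;
        final int phraseId;
        Token(String form, String phraseType, int phraseId) {
            this.form = form; this.phraseType = phraseType; this.phraseId = phraseId;
        }
    }

    /** All consecutive tokens inside the same deepest-level phrase form one
     *  chunk: the first gets B-XX, the rest I-XX; tokens outside a
     *  recognized group get O. */
    static String[] toBIO(List<Token> sentence) {
        String[] tags = new String[sentence.size()];
        for (int i = 0; i < sentence.size(); i++) {
            Token t = sentence.get(i);
            if (t.phraseType == null) {
                tags[i] = "O";
            } else if (i > 0 && sentence.get(i - 1).phraseId == t.phraseId) {
                tags[i] = "I-" + t.phraseType.toUpperCase();
            } else {
                tags[i] = "B-" + t.phraseType.toUpperCase();
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        // "a festa continua" -> [B-NP, I-NP, B-VP]
        List<Token> s = Arrays.asList(
            new Token("a", "np", 1),
            new Token("festa", "np", 1),
            new Token("continua", "vp", 2));
        System.out.println(Arrays.toString(toBIO(s)));
    }
}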

Here is a sample:

AD format:
STA:cu
=CJT:fcl
==ADVL:adv("depois" <left>)    depois
==ACC-PASS:pron-pers("se" <coll> <left> M 3P ACC)    se
==P:v-fin("encontrar" <se-passive> <nosubj> <cjt-head> <fmc> <mv> PR 3P IND VFIN)    encontram
==PIV:pp
===H:prp("com" <right>)    com
===P<:np
====>N:art("o" <artd> DET F S)    a
====H:n("dissidência" <np-def> <ac> <am> F S)    dissidência
====N<:pp
=====H:prp("de" <sam-> <np-close>)    de
=====P<:np
======>N:art("o" <artd> <-sam> DET M S)    o
======H:n("grupo" <np-def> <HH> M S)    grupo
======,
======APP:np
=======>N:art("o" <artd> DET M P)    os
=======H:prop("Bacamarteiros_de_Pinga_Fogo" <org> <np-close> M P)    Bacamarteiros_de_Pinga_Fogo
=,
=CO:conj-c("e" <co-fin> <co-fmc>)    e
=CJT:x
==SUBJ:np
===>N:art("o" <artd> DET F S)    a
===H:n("festa" <np-def> <occ> <left> F S)    festa
==P:v-fin("continuar" <cjt-sta> <fmc> <mv> PR 3S IND VFIN)    continua
==ADVL:pp
===H:prp("por" <right>)    por
===P<:n("muito_tempo" <np-idf> <dur> M S)    muito_tempo
.

Result:

depois adv O
se pron-pers O
encontram v-fin B-VP
com prp B-PP
a art B-NP
dissidência n I-NP
de prp B-PP
o art B-NP
grupo n I-NP
, , I-NP
os art B-NP
Bacamarteiros_de_Pinga_Fogo prop I-NP
, , O
e conj-c O
a art B-NP
festa n I-NP
continua v-fin B-VP
por prp B-PP
muito_tempo n I-PP
. . O

The code that performs the conversion is
opennlp.tools.formats.ADChunkSampleStream (only in SVN trunk for now).

Follow the instructions below if you want to reproduce the experiment and
check the results.

A. Prepare the environment

- Get the code from SVN trunk, as described here:
http://incubator.apache.org/opennlp/source-code.html
- You will need Maven 3.0.1 to compile the project. If you don't have it
yet, please get it from http://maven.apache.org/download.html; the
installation instructions are on the same page.
- Compile the project. To do that, go to the folder <project-root>/opennlp/
on the command line and run the command "mvn install". The first build can
take a while to execute.
- Now go to the folder <project-root>/opennlp-tools
- Execute the command:
   mvn dependency:copy-dependencies -DoutputDirectory="lib"
to copy the libraries to the lib folder
- Copy the file
<project-root>/opennlp-tools/target/opennlp-tools-1.5.1-incubating-SNAPSHOT.jar
to <project-root>/opennlp-tools
- Now we are ready to execute Apache OpenNLP!

B. Use the ChunkConverter

- Download the Amazonia Corpus from Bosque and extract it somewhere:
http://www.linguateca.pt/floresta/ficheiros/gz/amazonia.ad.gz
- Now we have to split the corpus into a size we can handle. I counted
almost 271,000 sentences in the corpus; since my computer can't handle that
many sentences, I'll extract the first 2,000 for evaluation and the next
20,000 for training.
- Create the evaluation data:
   bin/opennlp ChunkerConverter ad -encoding ISO-8859-1 -data ../../../corpus/amazonia.ad -start 0 -end 2000 > amazonia-chunk.eval
You will see some "Couldn't parse leaf" messages; a few leaves do not
follow the expected format. We will have to check how to handle them later.
- Create the training data:
   bin/opennlp ChunkerConverter ad -encoding ISO-8859-1 -data ../../../corpus/amazonia.ad -start 2001 -end 22000 > amazonia-chunk.train
You can check the results and verify that they are consistent (a sketch
for doing that automatically follows this section).
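
If you want to automate that consistency check, below is a small
self-contained sketch (not part of OpenNLP; the CheckBIO name and the
UTF-8 / "token pos chunk" layout with blank lines between sentences are my
assumptions about the converted file) that verifies every I-XX tag follows
a B-XX or I-XX tag of the same type. Compile it with javac and run it as
"java CheckBIO amazonia-chunk.train".

import java.io.*;
import java.nio.charset.StandardCharsets;

/** Sanity-check sketch for the converted chunk data, not part of OpenNLP. */
public class CheckBIO {
    public static void main(String[] args) throws IOException {
        // args[0] = path to amazonia-chunk.train or amazonia-chunk.eval
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new FileInputStream(args[0]), StandardCharsets.UTF_8))) {
            String line;
            String prev = "O";
            int lineNo = 0, errors = 0;
            while ((line = in.readLine()) != null) {
                lineNo++;
                if (line.trim().isEmpty()) { prev = "O"; continue; } // sentence boundary
                String[] cols = line.trim().split("\\s+");
                String tag = cols[cols.length - 1]; // chunk tag is the last column
                if (tag.startsWith("I-")) {
                    String type = tag.substring(2);
                    if (!prev.equals("B-" + type) && !prev.equals("I-" + type)) {
                        System.out.println("Inconsistent tag at line " + lineNo + ": " + line);
                        errors++;
                    }
                }
                prev = tag;
            }
            System.out.println(errors + " inconsistent tags found.");
        }
    }
}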

C. Train
- Execute the command:
   bin/opennlp ChunkerTrainerME -encoding UTF-8 -lang pt -data amazonia-chunk.train -model pt-chunker.bin

D. Evaluation
- Execute the command:
   bin/opennlp ChunkerEvaluator -data amazonia-chunk.eval -model pt-chunker.bin

Regards,
William
