Good news!

After some modifications and using another corpus, we got much nicer
results:

Precision: 0.9413606010016694
Recall: 0.9379938451301671
F-Measure: 0.9396742073907428

For these results I used the corpus Bosque_CF_8.0.ad
<http://www.linguateca.pt/floresta/ficheiros/gz/Bosque_CF_8.0.ad.txt.gz>
to perform a 10-fold cross validation.
Maybe the poor performance we got before was related to Amazonia.ad, which
is an unrevised, automatically annotated corpus. The problem with
Bosque_CF_8.0 is that it is quite small (< 10k sentences).
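
In case you want to reproduce it: the cross validation can be run from the
CLI with the ChunkerCrossValidator tool, roughly like this (the flag names
are from memory and bosque-chunk.train stands for the converted Bosque
data, so check the tool's usage output; as far as I remember, 10 folds is
the default):

   bin/opennlp ChunkerCrossValidator -encoding ISO-8859-1 -lang pt -data bosque-chunk.train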

Regards
William

On Fri, Jan 7, 2011 at 1:25 AM, William Colen <[email protected]> wrote:

> Hi Daniel,
>
> I have some news. I wrote a tool that extracts the chunk information from
> the Bosque AD format and creates OpenNLP training data. It is not working
> well yet, but it is a starting point.
> I'm getting the following results:
>
> Precision: 0.7680814205283673
> Recall: 0.8237343241987923
> F-Measure: 0.7949350067234425
>
> These results are poor compared with the ones we get using the English
> data. The problem is probably the heuristic used to extract the chunk
> information.
>
> What I did:
> 1. As described in that PUC-Rio paper: "Defined as chunk all consecutive
> tokens within the same deepest-level phrase."
> 2. I'm considering the group forms described in section 2.1 of the Floresta
> Symbolset <http://beta.visl.sdu.dk/visl/pt/info/symbolset-floresta.html>
>
> Here is a sample:
>
> AD format:
> STA:cu
> =CJT:fcl
> ==ADVL:adv("depois" <left>)    depois
> ==ACC-PASS:pron-pers("se" <coll> <left> M 3P ACC)    se
> ==P:v-fin("encontrar" <se-passive> <nosubj> <cjt-head> <fmc> <mv> PR 3P IND
> VFIN)    encontram
> ==PIV:pp
> ===H:prp("com" <right>)    com
> ===P<:np
> ====>N:art("o" <artd> DET F S)    a
> ====H:n("dissidência" <np-def> <ac> <am> F S)    dissidência
> ====N<:pp
> =====H:prp("de" <sam-> <np-close>)    de
> =====P<:np
> ======>N:art("o" <artd> <-sam> DET M S)    o
> ======H:n("grupo" <np-def> <HH> M S)    grupo
> ======,
> ======APP:np
> =======>N:art("o" <artd> DET M P)    os
> =======H:prop("Bacamarteiros_de_Pinga_Fogo" <org> <np-close> M P)
> Bacamarteiros_de_Pinga_Fogo
> =,
> =CO:conj-c("e" <co-fin> <co-fmc>)    e
> =CJT:x
> ==SUBJ:np
> ===>N:art("o" <artd> DET F S)    a
> ===H:n("festa" <np-def> <occ> <left> F S)    festa
> ==P:v-fin("continuar" <cjt-sta> <fmc> <mv> PR 3S IND VFIN)    continua
> ==ADVL:pp
> ===H:prp("por" <right>)    por
> ===P<:n("muito_tempo" <np-idf> <dur> M S)    muito_tempo
> .
>
> Result:
>
> depois adv O
> se pron-pers O
> encontram v-fin B-VP
> com prp B-PP
> a art B-NP
> dissidência n I-NP
> de prp B-PP
> o art B-NP
> grupo n I-NP
> , , I-NP
> os art B-NP
> Bacamarteiros_de_Pinga_Fogo prop I-NP
> , , O
> e conj-c O
> a art B-NP
> festa n I-NP
> continua v-fin B-VP
> por prp B-PP
> muito_tempo n I-PP
> . . O
>
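> In case the heuristic itself is not clear, here is a minimal sketch of the
> idea in Java. This is only my reading of it, not the actual converter code;
> the Leaf type and the phrase ids are made up for the illustration:
>
>    import java.util.ArrayList;
>    import java.util.Arrays;
>    import java.util.List;
>
>    /** Sketch of the "deepest-level phrase" heuristic; illustration
>     *  only, not the real conversion code. */
>    public class DeepestPhraseSketch {
>
>        /** A leaf token plus the id and type of its innermost phrase
>         *  node. An id of -1 means the token hangs directly off the
>         *  clause, so it gets the "O" tag. */
>        static class Leaf {
>            final String token;
>            final int phraseId;
>            final String type;
>            Leaf(String token, int phraseId, String type) {
>                this.token = token;
>                this.phraseId = phraseId;
>                this.type = type;
>            }
>        }
>
>        static List<String> toBio(List<Leaf> leaves) {
>            List<String> tags = new ArrayList<String>();
>            int prevId = -1;
>            for (Leaf l : leaves) {
>                if (l.phraseId < 0) {
>                    tags.add("O");               // not inside any phrase
>                } else if (l.phraseId == prevId) {
>                    tags.add("I-" + l.type);     // same phrase continues
>                } else {
>                    tags.add("B-" + l.type);     // a new chunk starts
>                }
>                prevId = l.phraseId;
>            }
>            return tags;
>        }
>
>        public static void main(String[] args) {
>            // "com a dissidência" from the sample: "com" is the head of
>            // the pp itself, while "a dissidência" sits in the np nested
>            // inside it, so the deepest-level rule splits them.
>            List<Leaf> leaves = Arrays.asList(
>                    new Leaf("com", 1, "PP"),
>                    new Leaf("a", 2, "NP"),
>                    new Leaf("dissidência", 2, "NP"));
>            System.out.println(toBio(leaves)); // [B-PP, B-NP, I-NP]
>        }
>    }
>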
> The code that performs the conversion is
> opennlp.tools.formats.ADChunkSampleStream (only in SVN trunk).
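>
> If you prefer to drive it from Java instead of the command line, the stream
> can be read roughly like this (the constructors and helper classes are from
> memory, so please check them against the classes in trunk):
>
>    import java.io.FileInputStream;
>    import java.io.InputStreamReader;
>
>    import opennlp.tools.chunker.ChunkSample;
>    import opennlp.tools.formats.ADChunkSampleStream;
>    import opennlp.tools.util.ObjectStream;
>    import opennlp.tools.util.PlainTextByLineStream;
>
>    public class ADChunkDemo {
>        public static void main(String[] args) throws Exception {
>            // read the AD file line by line in its original encoding
>            ObjectStream<String> lines = new PlainTextByLineStream(
>                    new InputStreamReader(
>                            new FileInputStream("amazonia.ad"), "ISO-8859-1"));
>            ObjectStream<ChunkSample> samples = new ADChunkSampleStream(lines);
>            ChunkSample sample;
>            while ((sample = samples.read()) != null) {
>                System.out.println(sample); // one converted sentence each
>            }
>            samples.close();
>        }
>    }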
>
> Follow these instructions if you want to reproduce the experiment and
> check the results.
>
> A. Prepare the environment
>
> - Get the code from SVN trunk, as described here:
> http://incubator.apache.org/opennlp/source-code.html
> - You will need Maven 3.0.1 to compile the project. If you don't have it
> yet, please get it from http://maven.apache.org/download.html; the
> installation instructions are on the same page.
> - Compile the project: go to the folder <project-root>/opennlp/ on the
> command line and run the command "mvn install". The first build can take
> a while.
> - Now go to the folder <project-root>/opennlp-tools
> - Execute the command:
>    mvn dependency:copy-dependencies -DoutputDirectory="lib"
> to copy the libraries to the lib folder
> - Copy the file
> <project-root>/opennlp-tools/target/opennlp-tools-1.5.1-incubating-SNAPSHOT.jar
> to <project-root>/opennlp-tools
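> - To double-check the setup, you can run "bin/opennlp" from the
> opennlp-tools folder with no arguments; it should print the list of
> available tools, including the ones we use below.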
> - Now we are ready to execute Apache OpenNLP!
>
> B. Use the ChunkConverter
>
> - Download the Amazonia corpus and extract it somewhere:
> http://www.linguateca.pt/floresta/ficheiros/gz/amazonia.ad.gz
> - Now we have to cut the corpus down to a size we can handle. I counted
> almost 271,000 sentences in the corpus. Since my computer can't handle that
> many sentences, I'll extract the first 2,000 for evaluation and the next
> 20,000 for training.
> - Create the evaluation data
>    bin/opennlp ChunkerConverter ad -encoding ISO-8859-1 -data
> ../../../corpus/amazonia.ad -start 0 -end 2000 > amazonia-chunk.eval
> You will see some "Couldn't parse leaf" messages: a few leaves do not
> follow the expected format. We will have to check how to handle them later.
> - Create the train data
>    bin/opennlp ChunkerConverter ad -encoding ISO-8859-1 -data
> ../../../corpus/amazonia.ad -start 2001 -end 22000 > amazonia-chunk.train
> You can inspect the results and verify that they are consistent.
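>
> For a quick sanity check of the generated files you can use plain Unix
> tools, for example:
>
>    head amazonia-chunk.train
>    grep -c "^$" amazonia-chunk.train
>
> The first shows the initial word/tag/chunk lines; the second counts the
> blank lines that separate sentences, which should roughly match the number
> of training sentences.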
>
> C. Train
> - Execute the command:
>    bin/opennlp ChunkerTrainerME -encoding UTF-8 -lang pt -data
> amazonia-chunk.train -model pt-chunker.bin
>
> D. Evaluation
> - Execute the command
>    bin/opennlp ChunkerEvaluator -data amazonia-chunk.eval -model
> pt-chunker.bin
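>
> The evaluator prints Precision, Recall and F-Measure; that is where the
> numbers at the beginning of this message come from.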
>
> Regards,
> William
>
