Good news! After some modifications and switching to another corpus we got much nicer results:

Precision: 0.9413606010016694
Recall: 0.9379938451301671
F-Measure: 0.9396742073907428

For these results I used the corpus Bosque_CF_8.0.ad
<http://www.linguateca.pt/floresta/ficheiros/gz/Bosque_CF_8.0.ad.txt.gz>
to perform a 10-fold cross-validation. The poor performance we got before was
probably related to Amazonia.ad, which is an unrevised, automatically
generated corpus. The problem with Bosque_CF_8.0 is that it is quite small
(fewer than 10,000 sentences), which is part of why I evaluated with
cross-validation rather than a fixed split; the fold assignment is roughly
the sketch below.
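(Illustrative only; this is not the actual evaluation code, and
loadSentences() is just a placeholder.)

import java.util.ArrayList;
import java.util.List;

public class TenFoldSketch {

    public static void main(String[] args) {
        // one entry per Bosque_CF_8.0 sentence
        List<String> sentences = loadSentences();

        int folds = 10;
        for (int fold = 0; fold < folds; fold++) {
            List<String> train = new ArrayList<String>();
            List<String> eval = new ArrayList<String>();
            for (int i = 0; i < sentences.size(); i++) {
                // each sentence lands in the eval set of exactly one fold
                if (i % folds == fold) {
                    eval.add(sentences.get(i));
                } else {
                    train.add(sentences.get(i));
                }
            }
            // train a chunker on 'train', evaluate it on 'eval', then
            // average precision/recall/F-measure over the 10 runs
            System.out.println("fold " + fold + ": train=" + train.size()
                    + " eval=" + eval.size());
        }
    }

    private static List<String> loadSentences() {
        return new ArrayList<String>(); // parsing the AD file is out of scope here
    }
}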
Regards,
William

On Fri, Jan 7, 2011 at 1:25 AM, William Colen <[email protected]> wrote:
> Hi Daniel,
>
> I have some news. I wrote a tool to extract the chunk information from the
> Bosque AD format and create OpenNLP training data. It is not working
> nicely yet, but it is a starting point.
> I'm getting the following results:
>
> Precision: 0.7680814205283673
> Recall: 0.8237343241987923
> F-Measure: 0.7949350067234425
>
> These are bad compared with the ones we get using the English data. The
> problem is probably the heuristic used to extract the chunk information.
>
> What I did:
> 1. As described in the PUC-Rio paper: "Defined as chunk all consecutive
> tokens within the same deepest-level phrase."
> 2. I'm considering the group forms described in section 2.1 of the
> Floresta Symbolset
> <http://beta.visl.sdu.dk/visl/pt/info/symbolset-floresta.html>
>
> Here is a sample:
>
> AD format:
>
> STA:cu
> =CJT:fcl
> ==ADVL:adv("depois" <left>) depois
> ==ACC-PASS:pron-pers("se" <coll> <left> M 3P ACC) se
> ==P:v-fin("encontrar" <se-passive> <nosubj> <cjt-head> <fmc> <mv> PR 3P IND VFIN) encontram
> ==PIV:pp
> ===H:prp("com" <right>) com
> ===P<:np
> ====>N:art("o" <artd> DET F S) a
> ====H:n("dissidência" <np-def> <ac> <am> F S) dissidência
> ====N<:pp
> =====H:prp("de" <sam-> <np-close>) de
> =====P<:np
> ======>N:art("o" <artd> <-sam> DET M S) o
> ======H:n("grupo" <np-def> <HH> M S) grupo
> ======,
> ======APP:np
> =======>N:art("o" <artd> DET M P) os
> =======H:prop("Bacamarteiros_de_Pinga_Fogo" <org> <np-close> M P) Bacamarteiros_de_Pinga_Fogo
> =,
> =CO:conj-c("e" <co-fin> <co-fmc>) e
> =CJT:x
> ==SUBJ:np
> ===>N:art("o" <artd> DET F S) a
> ===H:n("festa" <np-def> <occ> <left> F S) festa
> ==P:v-fin("continuar" <cjt-sta> <fmc> <mv> PR 3S IND VFIN) continua
> ==ADVL:pp
> ===H:prp("por" <right>) por
> ===P<:n("muito_tempo" <np-idf> <dur> M S) muito_tempo
> .
>
> Result:
>
> depois adv O
> se pron-pers O
> encontram v-fin B-VP
> com prp B-PP
> a art B-NP
> dissidência n I-NP
> de prp B-PP
> o art B-NP
> grupo n I-NP
> , , I-NP
> os art B-NP
> Bacamarteiros_de_Pinga_Fogo prop I-NP
> , , O
> e conj-c O
> a art B-NP
> festa n I-NP
> continua v-fin B-VP
> por prp B-PP
> muito_tempo n I-PP
> . . O
>
> The code that performs the conversion is
> opennlp.tools.formats.ADChunkSampleStream (only in SVN trunk). The core of
> the heuristic is roughly the sketch below.
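> To make (1) concrete, the idea is roughly the sketch below. This is NOT
> the actual ADChunkSampleStream code: the Node type is made up for the
> example, and the real converter handles cases the sketch ignores (for
> instance it gives a bare finite verb its own B-VP, as you can see for
> "encontram" and "continua" above).
>
> import java.util.List;
>
> // Minimal stand-in for a parsed AD tree node; made up for this sketch.
> interface Node {
>     boolean isLeaf();
>     String getSyntacticTag(); // the form after ':', e.g. "np", "pp", "fcl"
>     List<Node> getChildren();
> }
>
> class ChunkSketch {
>
>     // Emit one BIO tag per token: consecutive leaf children of the same
>     // group-form node share one chunk; an embedded phrase interrupts it.
>     void tagNode(Node node, List<String> out) {
>         String chunk = chunkType(node); // "NP", "VP", "PP", ... or null
>         boolean inChunk = false;
>         for (Node child : node.getChildren()) {
>             if (child.isLeaf()) {
>                 if (chunk != null) {
>                     out.add((inChunk ? "I-" : "B-") + chunk);
>                     inChunk = true;
>                 } else {
>                     out.add("O"); // token not directly inside a group form
>                 }
>             } else {
>                 tagNode(child, out);
>                 inChunk = false; // the embedded phrase broke the sequence
>             }
>         }
>     }
>
>     // Group forms from section 2.1 of the Floresta Symbolset; null means
>     // the node does not open a chunk (clauses, coordination, etc.).
>     String chunkType(Node node) {
>         String tag = node.getSyntacticTag();
>         if ("np".equals(tag)) return "NP";
>         if ("vp".equals(tag)) return "VP";
>         if ("pp".equals(tag)) return "PP";
>         return null; // extend with the remaining group forms as needed
>     }
> }
>
> Walking the tree above with tagNode reproduces the Result block, except
> for the two verb chunks.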
> Follow the instructions below if you want to reproduce the experiment and
> check the results.
>
> A. Prepare the environment
>
> - Get the code from SVN trunk, as described here:
> http://incubator.apache.org/opennlp/source-code.html
> - You will need Maven 3.0.1 to compile the project. If you don't have it
> yet, get it from http://maven.apache.org/download.html; the installation
> instructions are on the same page.
> - Compile the project: go to the folder <project-root>/opennlp/ on the
> command line and run "mvn install". The first build can take a while.
> - Now go to the folder <project-root>/opennlp-tools and execute
> mvn dependency:copy-dependencies -DoutputDirectory="lib"
> to copy the libraries to the lib folder.
> - Copy the file
> <project-root>/opennlp-tools/target/opennlp-tools-1.5.1-incubating-SNAPSHOT.jar
> to <project-root>/opennlp-tools
> - Now we are ready to execute Apache OpenNLP!
>
> B. Use the ChunkerConverter
>
> - Download the Amazonia corpus from the Floresta site and extract it
> somewhere:
> http://www.linguateca.pt/floresta/ficheiros/gz/amazonia.ad.gz
> - Now we have to split the corpus to a size we can handle. I counted
> almost 271,000 sentences in it; since my computer can't handle that many,
> I'll extract the first 2,000 sentences for evaluation and the next 20,000
> for training.
> - Create the evaluation data:
> bin/opennlp ChunkerConverter ad -encoding ISO-8859-1 -data ../../../corpus/amazonia.ad -start 0 -end 2000 > amazonia-chunk.eval
> You will see some "Couldn't parse leaf" messages; a few leaves do not
> follow the expected format, and we will have to check how to handle them
> later.
> - Create the training data:
> bin/opennlp ChunkerConverter ad -encoding ISO-8859-1 -data ../../../corpus/amazonia.ad -start 2001 -end 22000 > amazonia-chunk.train
> You can inspect the output files to verify that they are consistent.
>
> C. Train
>
> - Execute the command:
> bin/opennlp ChunkerTrainerME -encoding UTF-8 -lang pt -data amazonia-chunk.train -model pt-chunker.bin
>
> D. Evaluation
>
> - Execute the command:
> bin/opennlp ChunkerEvaluator -data amazonia-chunk.eval -model pt-chunker.bin
>
> Regards,
> William
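P.S.: If you want to call the resulting chunker from Java instead of the
command line, loading the model from step C should look roughly like this
(an untested sketch against the 1.5.x API; the token and POS arrays are
just a made-up example using the AD tagset):

import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.chunker.ChunkerME;
import opennlp.tools.chunker.ChunkerModel;

public class PtChunkerDemo {

    public static void main(String[] args) throws Exception {
        // load the model produced by ChunkerTrainerME in step C
        InputStream modelIn = new FileInputStream("pt-chunker.bin");
        ChunkerModel model = new ChunkerModel(modelIn);
        modelIn.close();

        ChunkerME chunker = new ChunkerME(model);

        // "a festa continua por muito_tempo ." with its AD POS tags
        String[] tokens = { "a", "festa", "continua", "por", "muito_tempo", "." };
        String[] posTags = { "art", "n", "v-fin", "prp", "n", "." };

        String[] chunkTags = chunker.chunk(tokens, posTags);
        for (int i = 0; i < tokens.length; i++) {
            System.out.println(tokens[i] + " " + posTags[i] + " " + chunkTags[i]);
        }
    }
}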
