Hi Daniel, I have some news. I wrote a tool to extract the chunk information from the Bosque AD format and create OpenNLP training data. It is not working well yet, but it is a starting point. I'm getting the following results:
Precision: 0.7680814205283673
Recall: 0.8237343241987923
F-Measure: 0.7949350067234425

These results are poor compared with the ones we get using the English data. The problem is probably the heuristic used to extract the chunk information. What I did:

1. As described in that PUC-Rio paper: "Defined as chunk all consecutive tokens within the same deepest-level phrase."
2. I'm considering the group forms described in section 2.1 of the Floresta Symbolset: http://beta.visl.sdu.dk/visl/pt/info/symbolset-floresta.html

Here is a sample.

AD format:

STA:cu
=CJT:fcl
==ADVL:adv("depois" <left>)  depois
==ACC-PASS:pron-pers("se" <coll> <left> M 3P ACC)  se
==P:v-fin("encontrar" <se-passive> <nosubj> <cjt-head> <fmc> <mv> PR 3P IND VFIN)  encontram
==PIV:pp
===H:prp("com" <right>)  com
===P<:np
====>N:art("o" <artd> DET F S)  a
====H:n("dissidência" <np-def> <ac> <am> F S)  dissidência
====N<:pp
=====H:prp("de" <sam-> <np-close>)  de
=====P<:np
======>N:art("o" <artd> <-sam> DET M S)  o
======H:n("grupo" <np-def> <HH> M S)  grupo
======,
======APP:np
=======>N:art("o" <artd> DET M P)  os
=======H:prop("Bacamarteiros_de_Pinga_Fogo" <org> <np-close> M P)  Bacamarteiros_de_Pinga_Fogo
=,
=CO:conj-c("e" <co-fin> <co-fmc>)  e
=CJT:x
==SUBJ:np
===>N:art("o" <artd> DET F S)  a
===H:n("festa" <np-def> <occ> <left> F S)  festa
==P:v-fin("continuar" <cjt-sta> <fmc> <mv> PR 3S IND VFIN)  continua
==ADVL:pp
===H:prp("por" <right>)  por
===P<:n("muito_tempo" <np-idf> <dur> M S)  muito_tempo
.

Result:

depois adv O
se pron-pers O
encontram v-fin B-VP
com prp B-PP
a art B-NP
dissidência n I-NP
de prp B-PP
o art B-NP
grupo n I-NP
, , I-NP
os art B-NP
Bacamarteiros_de_Pinga_Fogo prop I-NP
, , O
e conj-c O
a art B-NP
festa n I-NP
continua v-fin B-VP
por prp B-PP
muito_tempo n I-PP
. . O

The code that performs the conversion is opennlp.tools.formats.ADChunkSampleStream (currently only in SVN trunk).

Follow the instructions below if you want to reproduce the experiment and check the results.
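As a quick sanity check on the numbers above, the reported F-Measure is just the harmonic mean of the precision and recall (this snippet is only a check, not part of the tool):

```python
# F1 is the harmonic mean of precision and recall.
precision = 0.7680814205283673
recall = 0.8237343241987923

f_measure = 2 * precision * recall / (precision + recall)
print(f_measure)  # agrees with the reported 0.7949350067234425
```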
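To make the heuristic concrete, here is a toy sketch of the "deepest-level phrase" idea: each token is assigned to the deepest phrase that encloses it, and consecutive tokens belonging to the same phrase instance form one chunk. This is NOT the actual opennlp.tools.formats.ADChunkSampleStream code; the pre-parsed input format and the phrase-to-chunk mapping below are simplifying assumptions of mine.

```python
# Toy sketch of the "deepest-level phrase" chunking heuristic (my own
# simplification, not the ADChunkSampleStream implementation).

CHUNK_TYPES = {"np": "NP", "pp": "PP"}   # phrase tags promoted to chunks

def ad_to_bio(nodes):
    """nodes: (depth, phrase, token) triples in document order.
    Phrase nodes carry a phrase tag (e.g. 'np') and token=None;
    leaf nodes carry the token text and phrase=None."""
    stack = []                 # open phrases: (depth, chunk_type, phrase_id)
    next_id, prev_id = 0, None
    tagged = []
    for depth, phrase, token in nodes:
        while stack and stack[-1][0] >= depth:   # close finished phrases
            stack.pop()
        if phrase is not None:
            chunk = CHUNK_TYPES.get(phrase)
            if chunk is not None:
                stack.append((depth, chunk, next_id))
                next_id += 1
        else:
            if stack:
                _, chunk, pid = stack[-1]        # deepest enclosing chunk
                tag = ("I-" if pid == prev_id else "B-") + chunk
                prev_id = pid
            else:
                tag, prev_id = "O", None
            tagged.append((token, tag))
    return tagged

# A fragment of the sample above: "com a dissidência" (a PP containing an NP)
sample = [
    (2, "pp", None),
    (3, None, "com"),          # H:prp -> deepest phrase is the PP
    (3, "np", None),
    (4, None, "a"),            # >N:art -> deepest phrase is the NP
    (4, None, "dissidência"),
]
print(ad_to_bio(sample))
```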
A. Prepare the environment

- Get the code from SVN trunk, as described here: http://incubator.apache.org/opennlp/source-code.html
- You will need Maven 3.0.1 to compile the project. If you don't have it yet, please get it from http://maven.apache.org/download.html; the installation instructions are on the same page.
- Compile the project. To do that, go to the folder <project-root>/opennlp/ from the command line and run the command "mvn install". The first build can take a while.
- Now go to the folder <project-root>/opennlp-tools.
- Execute the command: mvn dependency:copy-dependencies -DoutputDirectory="lib" to copy the libraries to the lib folder.
- Copy the file <project-root>/opennlp-tools/target/opennlp-tools-1.5.1-incubating-SNAPSHOT.jar to <project-root>/opennlp-tools.
- Now we are ready to execute Apache OpenNLP!

B. Use the ChunkConverter

- Download the Amazonia corpus from the Floresta site and extract it somewhere: http://www.linguateca.pt/floresta/ficheiros/gz/amazonia.ad.gz
- Now we have to split the corpus into a size we can handle. I counted almost 271,000 sentences in the corpus; since my computer can't handle that many sentences, I'll extract the first 2,000 for evaluation and the next 20,000 for training.
- Create the evaluation data:

  bin/opennlp ChunkerConverter ad -encoding ISO-8859-1 -data ../../../corpus/amazonia.ad -start 0 -end 2000 > amazonia-chunk.eval

  You will see some "Couldn't parse leaf" messages. A few leaves were not following the expected format; we will have to check how to handle them later.
- Create the training data:

  bin/opennlp ChunkerConverter ad -encoding ISO-8859-1 -data ../../../corpus/amazonia.ad -start 2001 -end 22000 > amazonia-chunk.train

  You can check the results and verify that they are consistent.

C. Train

- Execute the command:

  bin/opennlp ChunkerTrainerME -encoding UTF-8 -lang pt -data amazonia-chunk.train -model pt-chunker.bin
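If you want to verify that the generated files are consistent before training, something like the following works. It assumes the three-column "token POS chunk-tag" layout shown in the sample above, with blank lines between sentences; this checker is my own sketch, not part of OpenNLP.

```python
# Sanity check for generated BIO files (amazonia-chunk.train / .eval).
# Assumption: one "token POS chunk-tag" triple per line, blank line
# between sentences, as in the sample output above.

def check_bio(lines):
    """Return (line_number, message) pairs for malformed lines or tags."""
    problems = []
    prev = "O"
    for n, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:                       # sentence boundary resets context
            prev = "O"
            continue
        parts = line.split()
        if len(parts) != 3:
            problems.append((n, "expected 3 columns: token POS tag"))
            prev = "O"
            continue
        tag = parts[2]
        if tag.startswith("I-"):
            # an I-X tag must continue a chunk of the same type X
            if not (prev.startswith(("B-", "I-")) and prev[2:] == tag[2:]):
                problems.append((n, f"{tag} does not continue a {tag[2:]} chunk"))
        elif tag != "O" and not tag.startswith("B-"):
            problems.append((n, f"unexpected tag {tag}"))
        prev = tag
    return problems
```

For example, `check_bio(open("amazonia-chunk.train", encoding="UTF-8"))` returns an empty list when the file is well formed.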
D. Evaluation

- Execute the command:

  bin/opennlp ChunkerEvaluator -data amazonia-chunk.eval -model pt-chunker.bin

Regards,
William
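For reference, chunk-level precision/recall/F-measure of the kind reported at the top can be computed by comparing the chunk spans decoded from the gold and predicted BIO sequences. This is a minimal sketch of that idea, not OpenNLP's ChunkerEvaluator implementation:

```python
# Minimal sketch of chunk-level P/R/F scoring from BIO tag sequences
# (my own simplification, not the OpenNLP evaluator).

def bio_spans(tags):
    """Extract (start, end, type) chunk spans from a BIO tag sequence."""
    spans, start, ctype = [], None, None
    for i, tag in enumerate(tags + ["O"]):          # sentinel flushes the last chunk
        if tag == "O" or tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != ctype):
            if start is not None:
                spans.append((start, i, ctype))
                start, ctype = None, None
        if tag.startswith("B-"):
            start, ctype = i, tag[2:]
        elif tag.startswith("I-") and start is None:  # stray I- starts a new chunk
            start, ctype = i, tag[2:]
    return spans

def prf(gold_tags, pred_tags):
    """Precision, recall and F-measure over exact chunk-span matches."""
    gold, pred = set(bio_spans(gold_tags)), set(bio_spans(pred_tags))
    correct = len(gold & pred)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

A chunk counts as correct only when both its boundaries and its type match, which is why a single wrong B-/I- tag can cost both a precision and a recall point.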
