[ https://issues.apache.org/jira/browse/MAHOUT-242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803274#action_12803274 ]
Isabel Drost commented on MAHOUT-242: ------------------------------------- First of all, thanks for the patch. The code looks good so far, patch applies cleanly and builds w/o problems. Some initial comments and questions I had when reading it: CollocMapper, Line 66: If I read your implementation correctly, this means that documents are always read fully into memory, right? So we would assume to only run the ngramCollector over documents that fit into main memory and unable to process larger documents. I am wondering whether this is an issue at all, and if so, whether there is any way around that. Gram, Line 192: You can omit the "else" clauses, in case the "if" already returns its result to the caller, however this is a question of style. I was wondering, why in line 177 you did not write "this.position != other.position"? NGramCollector, Line 47 (and a few others): Shouldn't we avoid using deprecated apis instead of suppressing deprecation warnings? LLRReducer, Line 143: How about making the method package private if it should be used in unit tests only anyway? Line 106: Would be nice to have an additional counter for the skipped grams. I agree with you that things like sentence boundary detection and more sophisticated tokenization should be left as work for an additional issue. Jake, would be great, if you could have a closer look to verify that this is about the pipeline you had in mind in the referenced e-mail threads and mention anything that might still be missing. > LLR Collocation Identifier > -------------------------- > > Key: MAHOUT-242 > URL: https://issues.apache.org/jira/browse/MAHOUT-242 > Project: Mahout > Issue Type: New Feature > Affects Versions: 0.3 > Reporter: Drew Farris > Priority: Minor > Attachments: MAHOUT-242.patch, mahout-colloc.tar.gz, > mahout-colloc.tar.gz > > > Identifies interesting Collocations in text using ngrams scored via the > LogLikelihoodRatio calculation. > As discussed in: > * > http://www.lucidimagination.com/search/document/d051123800ab6ce7/collocations_in_mahout#26634d6364c2c0d2 > * > http://www.lucidimagination.com/search/document/b8d5bb0745eef6e8/n_grams_for_terms#f16fa54417697d8e > Current form is a tar of a maven project that depends on mahout. Build as > usual with 'mvn clean install', can be executed using: > {noformat} > mvn -e exec:java -Dexec.mainClass="org.apache.mahout.colloc.CollocDriver" > -Dexec.args="--input src/test/resources/article --colloc target/colloc > --output target/output -w" > {noformat} > Output will be placed in target/output and can be viewed nicely using: > {noformat} > sort -rn -k1 target/output/part-00000 > {noformat} > Includes rudimentary unit tests. Please review and comment. Needs more work > to get this into patch state and integrate with Robin's document vectorizer > work in MAHOUT-237 > Some basic TODO/FIXME's include: > * use mahout math's ObjectInt map implementation when available > * make the analyzer configurable > * better input validation + negative unit tests. > * more flexible ways to generate units of analysis (n-1)grams. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.