[ 
https://issues.apache.org/jira/browse/MAHOUT-242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803274#action_12803274
 ] 

Isabel Drost commented on MAHOUT-242:
-------------------------------------

First of all, thanks for the patch. The code looks good so far, patch applies 
cleanly and builds w/o problems. Some initial comments and questions I had when 
reading it: 
 
CollocMapper, Line 66: If I read your implementation correctly, this means that 
documents are always read fully into memory, right? So we would assume to only 
run the ngramCollector over documents that fit into main memory and unable to 
process larger documents. I am wondering whether this is an issue at all, and 
if so, whether there is any way around that. 
 
Gram, Line 192: You can omit the "else" clauses, in case the "if" already 
returns its result to the caller, however this is a question of style. I was 
wondering, why in line 177 you did not write "this.position != other.position"? 
 
NGramCollector, Line 47 (and a few others): Shouldn't we avoid using deprecated 
apis instead of suppressing deprecation warnings? 
 
LLRReducer, Line 143: How about making the method package private if it should 
be used in unit tests only anyway? 
Line 106: Would be nice to have an additional counter for the skipped grams. 
 
 
I agree with you that things like sentence boundary detection and more 
sophisticated tokenization should be left as work for an additional issue. 
 
Jake, would be great, if you could have a closer look to verify that this is 
about the pipeline you had in mind in the referenced e-mail threads and mention 
anything that might still be missing.

> LLR Collocation Identifier
> --------------------------
>
>                 Key: MAHOUT-242
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-242
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.3
>            Reporter: Drew Farris
>            Priority: Minor
>         Attachments: MAHOUT-242.patch, mahout-colloc.tar.gz, 
> mahout-colloc.tar.gz
>
>
> Identifies interesting Collocations in text using ngrams scored via the 
> LogLikelihoodRatio calculation. 
> As discussed in: 
> * 
> http://www.lucidimagination.com/search/document/d051123800ab6ce7/collocations_in_mahout#26634d6364c2c0d2
> * 
> http://www.lucidimagination.com/search/document/b8d5bb0745eef6e8/n_grams_for_terms#f16fa54417697d8e
> Current form is a tar of a maven project that depends on mahout. Build as 
> usual with 'mvn clean install', can be executed using:
> {noformat}
> mvn -e exec:java  -Dexec.mainClass="org.apache.mahout.colloc.CollocDriver" 
> -Dexec.args="--input src/test/resources/article --colloc target/colloc 
> --output target/output -w"
> {noformat}
> Output will be placed in target/output and can be viewed nicely using:
> {noformat}
> sort -rn -k1 target/output/part-00000
> {noformat}
> Includes rudimentary unit tests. Please review and comment. Needs more work 
> to get this into patch state and integrate with Robin's document vectorizer 
> work in MAHOUT-237
> Some basic TODO/FIXME's include:
> * use mahout math's ObjectInt map implementation when available
> * make the analyzer configurable
> * better input validation + negative unit tests.
> * more flexible ways to generate units of analysis (n-1)grams.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to