[ https://issues.apache.org/jira/browse/MAHOUT-242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Drew Farris updated MAHOUT-242:
-------------------------------

    Attachment: mahout-colloc.tar.gz

Thanks for taking a look and providing some great feedback, Robin.

Here's a new version that includes the following changes:

* Now runs from a SequenceFile<Text,Text> of ID -> DOC by default. Tested on 
some medium-sized collections of 10k and 100k files using Robin's directory to 
sequence file util.
* Analyzer is now configurable from the command line via the --analyzerName 
option.
* Uses a Writable implementation instead of strings to move data around. No 
more parsing, splitting, or concatenating.
* Improved handling of the output directory: output from each pass is written 
to a subdirectory of this directory, so there is no longer any need to specify 
multiple output directories.
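
To illustrate the Writable point in the list above, here is a minimal sketch of a record that serializes itself in binary form instead of being packed into a delimited string. The class name GramRecord and its two fields are hypothetical and are not taken from the attached tarball. To stay self-contained, the sketch only mirrors the write/readFields contract of Hadoop's Writable using java.io.DataOutput and java.io.DataInput; it does not implement the Hadoop interface itself.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical record: mirrors the write/readFields contract of Hadoop's
// Writable without depending on Hadoop, so the sketch compiles on its own.
class GramRecord {
    String text = "";
    int frequency;

    // No-argument constructor so the record can be created empty and then filled by readFields.
    GramRecord() {
    }

    GramRecord(String text, int frequency) {
        this.text = text;
        this.frequency = frequency;
    }

    // Write the fields in a fixed binary order; there are no delimiters to escape.
    void write(DataOutput out) throws IOException {
        out.writeUTF(text);
        out.writeInt(frequency);
    }

    // Read the fields back in exactly the order they were written.
    void readFields(DataInput in) throws IOException {
        text = in.readUTF();
        frequency = in.readInt();
    }

    // Demo helper: serialize to a byte array and read it back into a fresh record.
    static GramRecord roundTrip(GramRecord original) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            original.write(new DataOutputStream(buffer));
            GramRecord copy = new GramRecord();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A text-based record would need a separator that can never occur inside the n-gram itself; the fixed field order above avoids that problem.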

After 'mvn clean install' a sample can be run like so:
{noformat}
mvn -e exec:java  -Dexec.mainClass="org.apache.mahout.colloc.CollocDriver" 
-Dexec.args="--input src/test/resources/article --output target/output -w -t"
{noformat}

I'd like to get this into patch form as a next step + get all of the license 
headers on the code here, but I'm not sure where it should live in terms of 
project/package names, etc. Any thoughts?

Also, I'm looking for feedback on the algorithm implementation -- this version 
differs from the one I presented on the list in that the implementation tracks 
the part of the original n-gram that the sub-part appears in (head, tail). I'm 
not 100% sure this is necessary or even correct.
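
One possible reading of the head/tail tracking described above, as a rough sketch: each n-gram is split into its leading and trailing (n-1)-gram, and each sub-part carries a tag saying which end of the original it came from. All names here (SubgramSketch, Subgram, Position) are hypothetical, and the interpretation of "head" and "tail" is an assumption; the attached code may do this differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names throughout; this is one reading of "head" and "tail",
// not the code from the attached tarball.
class SubgramSketch {
    enum Position { HEAD, TAIL }

    static final class Subgram {
        final String text;
        final Position position;

        Subgram(String text, Position position) {
            this.text = text;
            this.position = position;
        }
    }

    // For an n-gram of n >= 2 tokens, emit its two (n-1)-gram sub-parts,
    // each tagged with where it sits in the original n-gram.
    static List<Subgram> subgrams(List<String> ngram) {
        int n = ngram.size();
        if (n < 2) {
            throw new IllegalArgumentException("need at least two tokens");
        }
        List<Subgram> parts = new ArrayList<Subgram>();
        // HEAD: everything except the last token.
        parts.add(new Subgram(String.join(" ", ngram.subList(0, n - 1)), Position.HEAD));
        // TAIL: everything except the first token.
        parts.add(new Subgram(String.join(" ", ngram.subList(1, n)), Position.TAIL));
        return parts;
    }
}
```

For the tokens "new", "york", "city" this yields "new york" tagged HEAD and "york city" tagged TAIL. Keeping the tag would let a later pass count how often a sub-part starts an n-gram separately from how often it ends one.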

Also, it's a bummer to have to create an analyzer subclass just to provide an 
implementation with a no-argument constructor. Has anyone considered making use 
of a DI framework with Mahout? I know Grant has mentioned options such as Spring 
or Guice. Does anyone have strong objections to pulling one of those in as a 
dependency?
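
For context on why the no-argument constructor matters, here is a minimal sketch of loading an implementation by class name through reflection. It assumes the --analyzerName option works along these lines, which the message above does not spell out. TokenAnalyzer, WhitespaceTokenAnalyzer and AnalyzerLoader are hypothetical stand-ins, not Lucene or Mahout classes.

```java
// Hypothetical stand-in for an analyzer API; this is not Lucene's Analyzer class.
interface TokenAnalyzer {
    String[] tokenize(String text);
}

class WhitespaceTokenAnalyzer implements TokenAnalyzer {
    // The implicit no-argument constructor is what reflective loading relies on.
    public String[] tokenize(String text) {
        return text.trim().split("\\s+");
    }
}

class AnalyzerLoader {
    // Look the class up by name and call its no-argument constructor.
    // A class that only offers constructors with arguments fails here with
    // NoSuchMethodException, which is why a wrapper subclass would be needed.
    static TokenAnalyzer load(String className) {
        try {
            return Class.forName(className)
                    .asSubclass(TokenAnalyzer.class)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("cannot instantiate analyzer " + className, e);
        }
    }
}
```

A DI container would move the constructor arguments into configuration, so the loader would no longer need every implementation to expose a no-argument constructor.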


> LLR Collocation Identifier
> --------------------------
>
>                 Key: MAHOUT-242
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-242
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.3
>            Reporter: Drew Farris
>            Priority: Minor
>         Attachments: mahout-colloc.tar.gz, mahout-colloc.tar.gz
>
>
> Identifies interesting Collocations in text using ngrams scored via the 
> LogLikelihoodRatio calculation. 
> As discussed in: 
> * 
> http://www.lucidimagination.com/search/document/d051123800ab6ce7/collocations_in_mahout#26634d6364c2c0d2
> * 
> http://www.lucidimagination.com/search/document/b8d5bb0745eef6e8/n_grams_for_terms#f16fa54417697d8e
> Current form is a tar of a maven project that depends on mahout. Build as 
> usual with 'mvn clean install', can be executed using:
> {noformat}
> mvn -e exec:java  -Dexec.mainClass="org.apache.mahout.colloc.CollocDriver" 
> -Dexec.args="--input src/test/resources/article --colloc target/colloc 
> --output target/output -w"
> {noformat}
> Output will be placed in target/output and can be viewed nicely using:
> {noformat}
> sort -rn -k1 target/output/part-00000
> {noformat}
> Includes rudimentary unit tests. Please review and comment. Needs more work 
> to get this into patch state and integrate with Robin's document vectorizer 
> work in MAHOUT-237
> Some basic TODO/FIXME's include:
> * use mahout math's ObjectInt map implementation when available
> * make the analyzer configurable
> * better input validation + negative unit tests.
> * more flexible ways to generate units of analysis (n-1)grams.
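
For readers unfamiliar with the scoring mentioned in the description above, here is a minimal sketch of a log-likelihood ratio over a 2x2 contingency table, following Dunning's formulation. Whether the attached code computes it exactly this way is not stated in the issue, and the class and method names are hypothetical. For a bigram "A B", k11 is the count of A followed by B, k12 is A followed by something other than B, k21 is something other than A followed by B, and k22 is everything else.

```java
// Hypothetical helper; a sketch of Dunning's log-likelihood ratio for a
// 2x2 contingency table, not the scoring code from the attached tarball.
class LlrSketch {
    // x * ln(x), with the usual convention that 0 * ln(0) = 0.
    static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    // Unnormalized entropy of a set of counts: N*ln(N) - sum(k*ln(k)).
    static double entropy(long... counts) {
        long total = 0;
        double sumOfParts = 0.0;
        for (long k : counts) {
            total += k;
            sumOfParts += xLogX(k);
        }
        return xLogX(total) - sumOfParts;
    }

    // LLR = 2 * (rowEntropy + columnEntropy - matrixEntropy).
    static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        return 2.0 * (rowEntropy + columnEntropy - matrixEntropy);
    }
}
```

When the two words occur independently the score is close to zero; the more strongly they co-occur (or avoid each other), the larger it gets, which is what makes it usable for ranking candidate collocations.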

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
