[ https://issues.apache.org/jira/browse/MAHOUT-242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803918#action_12803918 ]
Zaki Rahaman commented on MAHOUT-242:
-------------------------------------

Thanks Drew. Some overdue feedback from testing this baby out...

My use case may differ from what's more typical, but basically I had 10MM+ semi-processed strings (lowercased, no punctuation, etc.) that I wanted to run my analysis on, not a large document or a series of smaller documents. As a result, I found using the Lucene Analyzers in the NGramCollector to be horribly inefficient (the job was taking hours to complete 1% on my modest 7 node cluster). Instead I used some custom logic to generate the n-grams and pass them to the collector. After this modification, the entire set of jobs ran in a little less than an hour, even on a larger dataset. I know Robin mentioned something about an Ngram generator in the Bayes classifier; I should check it out.

I guess I would advocate splitting the module as it exists into a couple of different pieces:

1. A general purpose tool for tokenizing/analyzing a document/documents (or in my case, 'documents' consisting of strings). I know there are some existing tools for vectorizing text, and indices, so maybe this is something similar, but the basic idea would be to go from document --> analyzed text

2+3 (These can remain as is in the module or be split). N gram generator and counter and LLR pieces.

Maybe this is all crazy talk, but it seems to me that refactoring/extracting the pieces out like this would prove useful and allow for code reuse in other modules (other NLP-type modules, elsewhere in the project).

> LLR Collocation Identifier
> --------------------------
>
>                 Key: MAHOUT-242
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-242
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.3
>            Reporter: Drew Farris
>            Priority: Minor
>         Attachments: MAHOUT-242.patch, MAHOUT-242.patch, mahout-colloc.tar.gz, mahout-colloc.tar.gz
>
>
> Identifies interesting Collocations in text using ngrams scored via the LogLikelihoodRatio calculation.
> As discussed in:
> * http://www.lucidimagination.com/search/document/d051123800ab6ce7/collocations_in_mahout#26634d6364c2c0d2
> * http://www.lucidimagination.com/search/document/b8d5bb0745eef6e8/n_grams_for_terms#f16fa54417697d8e
> Current form is a tar of a maven project that depends on mahout. Build as usual with 'mvn clean install', can be executed using:
> {noformat}
> mvn -e exec:java -Dexec.mainClass="org.apache.mahout.colloc.CollocDriver" -Dexec.args="--input src/test/resources/article --colloc target/colloc --output target/output -w"
> {noformat}
> Output will be placed in target/output and can be viewed nicely using:
> {noformat}
> sort -rn -k1 target/output/part-00000
> {noformat}
> Includes rudimentary unit tests. Please review and comment. Needs more work to get this into patch state and integrate with Robin's document vectorizer work in MAHOUT-237
> Some basic TODO/FIXME's include:
> * use mahout math's ObjectInt map implementation when available
> * make the analyzer configurable
> * better input validation + negative unit tests.
> * more flexible ways to generate units of analysis (n-1)grams.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
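
For reference, here is a minimal sketch of the split proposed in the comment above: a plain whitespace n-gram generator for strings that are already cleaned (so no Lucene analyzer is involved), kept separate from the counting and log-likelihood ratio scoring. The class and method names (NgramLlrSketch, ngrams, scoreBigrams, llr) are hypothetical and are not taken from the attached patch or from Mahout's API. The way the 2x2 contingency table is built from head and tail counts is an assumption about how bigram collocations are typically scored, and everything runs in memory rather than as map/reduce jobs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not the code in MAHOUT-242.patch.
final class NgramLlrSketch {

    /** Piece 1: whitespace n-grams over an already-normalized string. */
    static List<String> ngrams(String line, int n) {
        List<String> out = new ArrayList<>();
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return out;
        }
        String[] tokens = trimmed.split("\\s+");
        for (int i = 0; i + n <= tokens.length; i++) {
            out.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + n)));
        }
        return out;
    }

    private static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    /** Unnormalized entropy: N ln N minus the sum of c ln c over the counts. */
    private static double entropy(long... counts) {
        long sum = 0;
        double parts = 0.0;
        for (long c : counts) {
            parts += xLogX(c);
            sum += c;
        }
        return xLogX(sum) - parts;
    }

    /** Piece 3: log-likelihood ratio (G^2) for a 2x2 contingency table. */
    static double llr(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double colEntropy = entropy(k11 + k21, k12 + k22);
        double matEntropy = entropy(k11, k12, k21, k22);
        // Clamp tiny negative values caused by floating-point round-off.
        return Math.max(0.0, 2.0 * (rowEntropy + colEntropy - matEntropy));
    }

    /** Piece 2: count bigrams plus their head/tail words, then score each bigram. */
    static Map<String, Double> scoreBigrams(List<String> lines) {
        Map<String, Long> pair = new HashMap<>();
        Map<String, Long> head = new HashMap<>();
        Map<String, Long> tail = new HashMap<>();
        long total = 0;
        for (String line : lines) {
            for (String gram : ngrams(line, 2)) {
                String[] parts = gram.split(" ");
                pair.merge(gram, 1L, Long::sum);
                head.merge(parts[0], 1L, Long::sum);
                tail.merge(parts[1], 1L, Long::sum);
                total++;
            }
        }
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Long> e : pair.entrySet()) {
            String[] parts = e.getKey().split(" ");
            long k11 = e.getValue();                 // "A B"
            long k12 = head.get(parts[0]) - k11;     // "A, not B"
            long k21 = tail.get(parts[1]) - k11;     // "not A, B"
            long k22 = total - k11 - k12 - k21;      // neither
            scores.put(e.getKey(), llr(k11, k12, k21, k22));
        }
        return scores;
    }
}
```

Keeping ngrams() independent of the scoring means the tokenizing step could later be swapped for an analyzer-backed one without touching the counting or LLR code, which is the reuse argument made in the comment.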