[
https://issues.apache.org/jira/browse/DATAFU-14?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13883736#comment-13883736
]
Russell Jurney commented on DATAFU-14:
--------------------------------------
The work with Lucene is a continuation of the OpenNLP work. Lucene has
tokenizers and stemmers that help process natural language. They were created
for search, but they work well in other contexts.
The Lucene tokenizer is very handy.
http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/analysis/Tokenizer.html
It is a reliable way to split text into 1-grams.
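For example, here is a minimal sketch of driving a Lucene 4.x tokenizer by
hand (StandardTokenizer, which emits one token per word; the class name and
sample text are my own, for illustration):
{code:java}
import java.io.StringReader;

import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class TokenizeExample {
  public static void main(String[] args) throws Exception {
    // StandardTokenizer splits on Unicode word boundaries, yielding 1-grams.
    StandardTokenizer tokenizer = new StandardTokenizer(
        Version.LUCENE_44, new StringReader("Pigs eat acorns."));
    CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
    tokenizer.reset();
    while (tokenizer.incrementToken()) {
      System.out.println(term.toString()); // Pigs, eat, acorns
    }
    tokenizer.end();
    tokenizer.close();
  }
}
{code}
The same reset/incrementToken/end/close loop works for any Lucene TokenStream,
which is roughly what a Pig UDF wrapping these classes would do per input tuple.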
The NGramTokenizer splits text into arbitrary n-grams. It is common in NLP to
compute 1- to 3-grams, and NGramTokenizer does this for you via its
minGram/maxGram parameters. N-grams help vectorize natural language for
further processing; for example, you might use NGramTokenizer when creating
topics/keywords with TF-IDF. I would use this in conjunction with the OpenNLP
work.
http://lucene.apache.org/core/4_4_0/analyzers-common/org/apache/lucene/analysis/ngram/NGramTokenizer.html
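As a hedged sketch of the API as of Lucene 4.4 (the class name and sample text
are mine): note that NGramTokenizer emits character-level n-grams; word-level
n-grams (shingles) come from ShingleFilter, the subject of the Stack Overflow
question linked in the issue below.
{code:java}
import java.io.StringReader;

import org.apache.lucene.analysis.ngram.NGramTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class NGramExample {
  public static void main(String[] args) throws Exception {
    // Emit every 1- to 3-character gram of the input.
    NGramTokenizer tokenizer = new NGramTokenizer(
        Version.LUCENE_44, new StringReader("pig"), 1, 3);
    CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
    tokenizer.reset();
    while (tokenizer.incrementToken()) {
      System.out.println(term.toString()); // p, pi, pig, i, ig, g
    }
    tokenizer.end();
    tokenizer.close();
  }
}
{code}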
The Snowball stemmers are useful for stemming words:
http://en.wikipedia.org/wiki/Stemming
This operation would typically happen in the OpenNLP workflows, after POSTag.
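A minimal sketch of chaining a stemmer onto a tokenizer via Lucene's
SnowballFilter (the stemmer name "English" and the sample text are assumptions
for illustration):
{code:java}
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.snowball.SnowballFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class StemExample {
  public static void main(String[] args) throws Exception {
    // Tokenize, then stem each token with the English Snowball stemmer.
    TokenStream stream = new StandardTokenizer(
        Version.LUCENE_44, new StringReader("running runs ran"));
    stream = new SnowballFilter(stream, "English");
    CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
    stream.reset();
    while (stream.incrementToken()) {
      System.out.println(term.toString()); // run, run, ran
    }
    stream.end();
    stream.close();
  }
}
{code}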
That's about it. I can batch the work together, but it's more convenient to do
them one at a time. Maybe a master JIRA and 3 sub-JIRAs?
> Add NGram Tokenizer to datafu.pig.text.lucene
> ---------------------------------------------
>
> Key: DATAFU-14
> URL: https://issues.apache.org/jira/browse/DATAFU-14
> Project: DataFu
> Issue Type: Improvement
> Environment: plants
> Reporter: Russell Jurney
>
> See
> https://github.com/rjurney/datafu/blob/lucene/src/java/datafu/pig/text/lucene/NGramTokenize.java
> Held up by
> http://stackoverflow.com/questions/21064520/how-to-use-lucene-shinglefilter-could-not-find-implementing-class-for-org-apach/21067142?noredirect=1#21067142