[ 
https://issues.apache.org/jira/browse/DATAFU-14?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13883736#comment-13883736
 ] 

Russell Jurney commented on DATAFU-14:
--------------------------------------

The Lucene work is a continuation of the OpenNLP work. Lucene has 
tokenizers and stemmers that help process natural language. They were created 
for search, but they work well in other contexts.

The Lucene Tokenizer is a reliable way to split text into 1-grams (single 
tokens). 
http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/analysis/Tokenizer.html

The NGramTokenizer is a way to split text into arbitrary n-grams. It is common 
in NLP to compute 1- to 3-grams, and NGramTokenizer does this for you. N-grams 
help to vectorize natural language for additional processing. For example, in 
creating topics/keywords with TF-IDF, you might use NGramTokenizer. I would use 
this in conjunction with the OpenNLP work. 
http://lucene.apache.org/core/4_4_0/analyzers-common/org/apache/lucene/analysis/ngram/NGramTokenizer.html
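
To make the 1- to 3-gram idea concrete, here is a minimal plain-Java sketch 
that builds word-level n-grams from whitespace-separated tokens. It does not 
call Lucene at all: the class NGramSketch and the method wordNGrams are 
hypothetical names used only for illustration, and splitting on whitespace is a 
simplifying assumption that stands in for a real tokenizer. As far as I know, 
Lucene's NGramTokenizer works on characters, and word-level n-grams are what 
ShingleFilter produces, so treat this as a picture of the desired output rather 
than of either class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration class; not part of Lucene or DataFu.
final class NGramSketch {

    // Returns all word n-grams with minN <= n <= maxN, shortest first.
    // Assumption: lowercasing plus whitespace splitting is "good enough"
    // tokenization for this sketch.
    static List<String> wordNGrams(String text, int minN, int maxN) {
        if (minN < 1 || maxN < minN) {
            throw new IllegalArgumentException("need 1 <= minN <= maxN");
        }
        List<String> grams = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return grams; // no tokens, so no n-grams
        }
        String[] tokens = trimmed.toLowerCase(Locale.ROOT).split("\\s+");
        for (int n = minN; n <= maxN; n++) {
            // Slide a window of n tokens across the token array.
            for (int i = 0; i + n <= tokens.length; i++) {
                grams.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + n)));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(wordNGrams("The quick brown fox", 1, 3));
    }
}
```

For the four-token input above this yields nine n-grams: four 1-grams, three 
2-grams, and two 3-grams. Each distinct n-gram could then serve as one 
dimension when vectorizing text for something like TF-IDF.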

The Snowball stemmers reduce words to their stems. 
http://en.wikipedia.org/wiki/Stemming This operation would typically happen in 
the OpenNLP workflows, after POSTag.

That's about it. I can batch the work together, but it's more convenient to do 
them one at a time. Maybe a master JIRA and 3 sub-JIRAs?

 

> Add NGram Tokenizer to datafu.pig.text.lucene
> ---------------------------------------------
>
>                 Key: DATAFU-14
>                 URL: https://issues.apache.org/jira/browse/DATAFU-14
>             Project: DataFu
>          Issue Type: Improvement
>         Environment: plants
>            Reporter: Russell Jurney
>
> See 
> https://github.com/rjurney/datafu/blob/lucene/src/java/datafu/pig/text/lucene/NGramTokenize.java
> Held up by 
> http://stackoverflow.com/questions/21064520/how-to-use-lucene-shinglefilter-could-not-find-implementing-class-for-org-apach/21067142?noredirect=1#21067142



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
