If you want tokenization to depend on sentences, and you insist on staying inside Lucene, you have to be a Tokenizer. Your tokenizer can set an attribute on the token that ends a sentence. Then, downstream, filters can read ahead to collect the full sentence, buffering tokens as needed.
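In Lucene proper this read-ahead would be written against TokenFilter.incrementToken() with captureState()/restoreState(); the following plain-Java sketch (no Lucene dependency, all names hypothetical) only illustrates the buffering pattern described above: pull tokens from upstream until the end-of-sentence attribute is seen, then hand the whole sentence downstream at once.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a Lucene token carrying an
// end-of-sentence attribute set by the tokenizer.
record Token(String text, boolean endsSentence) {}

class SentenceBufferingFilter {
    private final Iterator<Token> input;            // upstream tokenizer
    private final Deque<Token> buffered = new ArrayDeque<>();

    SentenceBufferingFilter(Iterator<Token> input) {
        this.input = input;
    }

    /**
     * Read ahead, buffering tokens until one marked endsSentence is seen,
     * then return the buffered tokens as one full sentence.
     * Returns null once the upstream is exhausted.
     */
    List<Token> nextSentence() {
        buffered.clear();
        while (input.hasNext()) {
            Token t = input.next();
            buffered.add(t);
            if (t.endsSentence()) {
                return List.copyOf(buffered);
            }
        }
        // Trailing tokens with no sentence-end marker still form a sentence.
        return buffered.isEmpty() ? null : List.copyOf(buffered);
    }
}
```

A real Lucene filter would emit tokens one at a time from the buffer rather than returning a List, but the read-ahead step is the same.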
On Fri, Apr 17, 2015 at 1:00 PM, Ahmet Arslan <iori...@yahoo.com.invalid> wrote:

> Hi Hummel,
>
> There was an effort to bring OpenNLP capabilities to Lucene:
> https://issues.apache.org/jira/browse/LUCENE-2899
>
> Lance was working on keeping it up to date. But it looks like it is not
> always best to accomplish everything inside Lucene; I personally would do
> the sentence detection outside of Lucene.
>
> By the way, I remember there was a way to consume the entire upstream
> token stream: it consumed all input and injected one concatenated huge
> term/token.
>
> KeywordTokenizer has similar behaviour; it injects a single token:
> http://lucene.apache.org/core/3_0_3/api/all/org/apache/lucene/analysis/KeywordAnalyzer.html
>
> Ahmet
>
>
> On Wednesday, April 15, 2015 3:12 PM, Shay Hummel <shay.hum...@gmail.com>
> wrote:
> Hi Ahmet,
> Thank you for the reply.
> That's exactly what I am doing. At the moment, to index a document, I
> break it into sentences, and each sentence is analyzed (lemmatizing,
> stopword removal, etc.).
> Now, what I am looking for is a way to create an analyzer (a class which
> extends Lucene's Analyzer). This analyzer will be used for index and
> query processing. Like the EnglishAnalyzer, it will receive the text and
> produce tokens.
> The Analyzer API requires implementing createComponents, which does not
> depend on the text being analyzed. This is problematic because, as you
> know, OpenNLP's sentence breaking depends on the text it gets (OpenNLP
> uses the model files to find the span of each sentence and then breaks
> them).
> Is there a way around it?
>
> Shay
>
>
> On Wed, Apr 15, 2015 at 3:50 AM Ahmet Arslan <iori...@yahoo.com.invalid>
> wrote:
>
>> Hi Hummel,
>>
>> You can perform sentence detection outside of Solr, using OpenNLP for
>> instance, and then feed the sentences to Solr.
>>
>> https://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html#tools.sentdetect
>>
>> Ahmet
>>
>>
>> On Tuesday, April 14, 2015 8:12 PM, Shay Hummel <shay.hum...@gmail.com>
>> wrote:
>> Hi,
>> I would like to create a text-dependent analyzer.
>> That is, *given a string*, the analyzer will:
>> 1. Read the entire text and break it into sentences.
>> 2. Tokenize each sentence, remove possessives, lowercase, mark keyword
>>    terms, and stem.
>>
>> The second part is essentially what happens in the EnglishAnalyzer
>> (createComponents). However, that is not dependent on the text it
>> receives, which is the first part of what I am trying to do.
>>
>> So ... how can it be achieved?
>>
>> Thank you,
>>
>> Shay Hummel
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
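Ahmet's suggestion of doing sentence detection outside Lucene/Solr can be sketched with the JDK's java.text.BreakIterator as a lightweight stand-in for OpenNLP's SentenceDetectorME (the OpenNLP call needs a trained model file such as en-sent.bin, so it is omitted here; the class and method names below are otherwise illustrative only):

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SentenceSplitter {
    /**
     * Split text into sentences before it ever reaches Lucene/Solr.
     * BreakIterator is a rough, model-free stand-in for OpenNLP's
     * SentenceDetectorME.sentDetect(text).
     */
    static List<String> sentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE;
             start = end, end = it.next()) {
            String s = text.substring(start, end).trim();
            if (!s.isEmpty()) out.add(s);
        }
        return out;
    }
}
```

Each returned sentence can then be run through an ordinary per-sentence analysis chain (tokenize, lowercase, stem), sidestepping the problem that createComponents never sees the full text.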