Hi Hummel,

There was an effort to bring OpenNLP capabilities to Lucene:
https://issues.apache.org/jira/browse/LUCENE-2899
Lance was working to keep it up to date. But it looks like it is not always best to do everything inside Lucene. I personally would do the sentence detection outside of Lucene.

By the way, I remember there was a way to consume the entire upstream token stream. I think it consumed all the input and emitted one huge concatenated term/token. KeywordTokenizer behaves similarly: it emits the entire input as a single token.

http://lucene.apache.org/core/3_0_3/api/all/org/apache/lucene/analysis/KeywordAnalyzer.html

Ahmet

On Wednesday, April 15, 2015 3:12 PM, Shay Hummel <shay.hum...@gmail.com> wrote:

Hi Ahmet,

Thank you for the reply. That's exactly what I am doing. At the moment, to index a document, I break it into sentences, and each sentence is analyzed (lemmatization, stopword removal, etc.).

Now, what I am looking for is a way to create an analyzer (a class which extends Lucene's Analyzer). This analyzer will be used for both index and query processing. Like the English analyzer, it will receive the text and produce tokens.

The Analyzer API requires implementing createComponents, which does not depend on the text being analyzed. This is problematic because, as you know, OpenNLP sentence breaking depends on the text it gets (OpenNLP uses the model files to provide the span of each sentence and then breaks the text at those spans).

Is there a way around it?

Shay

On Wed, Apr 15, 2015 at 3:50 AM Ahmet Arslan <iori...@yahoo.com.invalid> wrote:

> Hi Hummel,
>
> You can perform sentence detection outside of Solr, using OpenNLP for
> instance, and then feed the sentences to Solr.
>
> https://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html#tools.sentdetect
>
> Ahmet
>
>
> On Tuesday, April 14, 2015 8:12 PM, Shay Hummel <shay.hum...@gmail.com>
> wrote:
> Hi
> I would like to create a text dependent analyzer.
> That is, *given a string*, the analyzer will:
> 1. Read the entire text and break it into sentences.
> 2. Each sentence will then be tokenized, stripped of possessives,
> lowercased, have its terms marked, and stemmed.
>
> The second part is essentially what happens in the English analyzer
> (createComponents). However, this does not depend on the text it receives,
> which is the first part of what I am trying to do.
>
> So ... How can it be achieved?
>
> Thank you,
>
> Shay Hummel
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
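
The approach suggested in the replies above (detect sentences outside Lucene, then analyze each sentence on its own) can be sketched roughly as follows. This is only a minimal illustration and not code from the thread: java.text.BreakIterator stands in for the OpenNLP sentence detector so that the example needs no model files, and SentenceFirstSketch, splitSentences, analyzeSentence and analyze are hypothetical names. The lowercase-and-split step is just a placeholder for the real Lucene analysis chain.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper: sentence detection happens before any per-sentence analysis. */
final class SentenceFirstSketch {

    /**
     * Splits the whole text into sentences.
     * Assumption: BreakIterator is a stand-in for an OpenNLP sentence detector.
     */
    static List<String> splitSentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    /**
     * Placeholder for the per-sentence chain (tokenize, lowercase, stem, ...).
     * Here it only lowercases and splits on non-alphanumeric characters.
     */
    static List<String> analyzeSentence(String sentence) {
        List<String> tokens = new ArrayList<>();
        for (String t : sentence.toLowerCase(Locale.ENGLISH).split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Text-dependent step first (sentences), text-independent step second (tokens). */
    static List<List<String>> analyze(String text) {
        List<List<String>> result = new ArrayList<>();
        for (String sentence : splitSentences(text)) {
            result.add(analyzeSentence(sentence));
        }
        return result;
    }
}
```

In a real setup, analyzeSentence would presumably be replaced by passing each sentence to an ordinary Lucene Analyzer, so that createComponents can stay independent of the text while the sentence splitting remains outside it.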