Unless you need to be able to search on sentences-as-terms, i.e. exact sentence matching, you should try to find an alternative; otherwise your term index will be unnecessarily huge.
Three things come to mind:

1. A single Lucene index can host mixed document types, e.g. full documents and sentences.

2. Nested documents, in Lucene's join module, could help, depending on what you need to do. Parent documents could correspond to original full documents, and sentences could be stored fields in child documents. The sentence offsets could be separate child-document fields, maybe also stored-only, depending on search/sort requirements. See <http://blog.mikemccandless.com/2012/01/searching-relational-content-with.html> and <http://lucene.apache.org/core/5_3_0/join/org/apache/lucene/search/join/ToParentBlockJoinQuery.html>, and the tests for ToParentBlockJoinQuery for example usages: <http://svn.apache.org/repos/asf/lucene/dev/tags/lucene_solr_5_3_0/lucene/join/src/test/org/apache/lucene/search/join/TestBlockJoin.java>. (There is a rough indexing sketch after this list.)

3. Or, more simply, just store the offsets as a prefix inline with the stored sentence field values (second sketch below), e.g.:

   original text:       Four score and seven years ago.... Something happened.
   words (indexed):     Four, score, and, seven, years, ago, Something, happened
   sentences (stored):  0,31|Four score and seven years ago....
   sentences (stored):  33,52|Something happened.
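For option 2, here is a rough indexing sketch, just to illustrate the block-join document layout. The field names ("type", "id", "body", "sentence", "offsets") are placeholders I made up; on the search side you would still wrap a child/parent query in ToParentBlockJoinQuery - the TestBlockJoin link above shows the exact usage for your version:

import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;

public class SentenceBlockIndexer {

  // Adds one parent (the full document) plus one child document per sentence,
  // all in a single block so the join module can relate them.
  // The parent document must come last in the block.
  public static void addDocumentWithSentences(IndexWriter writer, String docId,
      String fullText, List<String> sentences, List<int[]> offsets) throws Exception {

    List<Document> block = new ArrayList<>();

    for (int i = 0; i < sentences.size(); i++) {
      Document child = new Document();
      // stored-only: retrievable but not searchable
      child.add(new StoredField("sentence", sentences.get(i)));
      child.add(new StoredField("offsets", offsets.get(i)[0] + "," + offsets.get(i)[1]));
      block.add(child);
    }

    Document parent = new Document();
    parent.add(new StringField("type", "parent", Field.Store.NO)); // marks parents for the join filter
    parent.add(new StringField("id", docId, Field.Store.YES));
    parent.add(new TextField("body", fullText, Field.Store.NO));   // normally tokenized, searchable
    block.add(parent);

    writer.addDocuments(block); // one atomic block: children first, parent last
  }
}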
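And for option 3, a sketch of the single-document variant, using the JDK's BreakIterator for the sentence splitting (as suggested earlier in the thread). Again, the field names "words" and "sentences" and the locale are only assumptions:

import java.text.BreakIterator;
import java.util.Locale;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;

public class SentenceStoringDocBuilder {

  // Builds a single Lucene document: a normally tokenized "words" field,
  // plus one stored-only "sentences" value per sentence, each prefixed
  // with its character offsets as "start,end|sentence text".
  public static Document build(String text) {
    Document doc = new Document();
    doc.add(new TextField("words", text, Field.Store.NO)); // indexed with your normal analyzer

    BreakIterator it = BreakIterator.getSentenceInstance(Locale.US); // locale choice is up to you
    it.setText(text);
    int start = it.first();
    for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
      String sentence = text.substring(start, end);
      if (sentence.trim().isEmpty()) {
        continue; // skip whitespace-only segments
      }
      doc.add(new StoredField("sentences", start + "," + end + "|" + sentence));
    }
    return doc;
  }
}

At retrieval time, doc.getValues("sentences") returns all the stored values, and splitting each one on the first '|' recovers the offsets and the sentence text.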
Steve
www.lucidworks.com

> On Sep 23, 2015, at 3:26 PM, Ziqi Zhang <ziqi.zh...@sheffield.ac.uk> wrote:
>
> Thanks, Steve.
>
> It probably also makes sense to extract sentences and then store them. But
> along with each sentence I also need to store its start/end offset. I'm not
> sure how to do that without creating a separate index that stores each
> sentence as a document? Basically the field for sentences and the field for
> terms should be in the same index.
>
> Thanks
>
>
> On 23/09/2015 19:08, Steve Rowe wrote:
>> Hi Ziqi,
>>
>> Lucene has support for sentence chunking - see SegmentingTokenizerBase,
>> implemented in ThaiTokenizer and HMMChineseTokenizer. There is an example
>> in that class’s tests that creates tokens out of individual sentences:
>> TestSegmentingTokenizerBase.WholeSentenceTokenizer.
>>
>> However, it sounds like you only need to store the sentences, not search
>> against them, so I don’t think you need sentence *tokenization*.
>>
>> Why not simply use the JDK’s BreakIterator (or, as you say, OpenNLP) to do
>> the sentence splitting and add the sentences to the doc as stored fields?
>>
>> Steve
>> www.lucidworks.com
>>
>>> On Sep 23, 2015, at 11:39 AM, Ziqi Zhang <ziqi.zh...@sheffield.ac.uk> wrote:
>>>
>>> Thanks, that is understood.
>>>
>>> My application is a bit special in that I need both an indexed field with
>>> standard tokenization and an unindexed but stored field of sentences. Both
>>> must be present for each document.
>>>
>>> I could possibly make do with PatternTokenizer, but that is, of course, less
>>> accurate than e.g. wrapping the OpenNLP sentence splitter in a Lucene
>>> Tokenizer.
>>>
>>>
>>> On 23/09/2015 16:23, Doug Turnbull wrote:
>>>> Sentence recognition is usually an NLP problem, probably best handled
>>>> outside of Solr. For example, you probably want to train and run a sentence
>>>> recognition algorithm, inject a sentence delimiter, then use that delimiter
>>>> as the basis for tokenization.
>>>>
>>>> More info on sentence recognition:
>>>> http://opennlp.apache.org/documentation/manual/opennlp.html
>>>>
>>>> On Wed, Sep 23, 2015 at 11:18 AM, Ziqi Zhang <ziqi.zh...@sheffield.ac.uk>
>>>> wrote:
>>>>
>>>>> Hi
>>>>>
>>>>> I need a special kind of 'token' which is a sentence, so I need a
>>>>> tokenizer that splits texts into sentences.
>>>>>
>>>>> I wonder if there are already such or similar implementations?
>>>>>
>>>>> If I have to implement it myself, I suppose I need to implement a subclass
>>>>> of Tokenizer. Having looked at a few existing implementations, it does not
>>>>> look very straightforward how to do it. A few pointers would be highly
>>>>> appreciated.
>>>>>
>>>>> Many thanks
>>>>>
>>>
>>> --
>>> Ziqi Zhang
>>> Research Associate
>>> Department of Computer Science
>>> University of Sheffield
>>>
>
> --
> Ziqi Zhang
> Research Associate
> Department of Computer Science
> University of Sheffield
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org