Thank you. Yes, the Map<Integer, OffsetAttribute> charOffsets maps a token position (Integer) to the character offsets for that token... so I think I'm good?
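In case it's useful to anyone else reading this, the lookup step I elided with "..." at the end of the snippet below would be roughly the following. This is an untested sketch that reuses the s, spanPositions and charOffsets variables from that snippet; note that the "offsets" stored in spanPositions are really token positions:

    for (OffsetAttribute spanPosition : spanPositions) {
        //keys into the position-keyed map: start position and (inclusive) end position of the span
        OffsetAttribute startToken = charOffsets.get(spanPosition.startOffset());
        OffsetAttribute endToken = charOffsets.get(spanPosition.endOffset());
        if (startToken == null || endToken == null) {
            continue; //shouldn't happen if the collector saw every term in the span
        }
        //character offsets of the full span, not just the matching terms
        int charStart = startToken.startOffset();
        int charEnd = endToken.endOffset();
        String spanText = s.substring(charStart, charEnd);
        //do something with charStart/charEnd/spanText
    }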
As part of LUCENE-5317, I have a DocTokenOffsetsVisitor interface and a SpanCrawler that runs that visitor against an IndexSearcher... The visitor sees a Document and a list of character offsets for the hits in that document. Once I update that to include the more modern code below (optionally, for those storing offsets)... is there any interest in integrating LUCENE-5317, or components of it, so that others don't have to reinvent the wheel below?

Cheers,

Tim

-----Original Message-----
From: Alan Woodward [mailto:a...@flax.co.uk]
Sent: Tuesday, November 03, 2015 4:25 AM
To: java-user@lucene.apache.org
Subject: Re: extracting charoffsets from SpanWeight's getSpans() in 5.3.1?

The second parameter passed to SpanCollector.collectLeaf() is the position, rather than an index of any kind, which I think is going to mess things up for you. But other than that, you've got the right idea. :-)

Alan Woodward
www.flax.co.uk

On 3 Nov 2015, at 00:26, Allison, Timothy B. wrote:

> All,
>
> I'm trying to find all spans in a given String via stored offsets in Lucene 5.3.1. I wanted to use the Highlighter with a NullFragmenter, but that is highlighting only the matching terms, not the full Spans (related to LUCENE-6796?).
>
> My current code iterates through the spans, stores the span positions in one array, and gathers the character offsets via a SpanCollector in a Map<Integer, OffsetAttribute>. Is there a simpler way?
>
> Something like this:
>
> String s = "the quick brown fox jumped over the lazy dog";
> String field = "f";
> Analyzer analyzer = new StandardAnalyzer();
>
> SpanQuery spanQuery = new SpanNearQuery(
>         new SpanQuery[] {
>                 new SpanTermQuery(new Term(field, "fox")),
>                 new SpanTermQuery(new Term(field, "quick"))
>         },
>         3,
>         false
> );
>
> MemoryIndex index = new MemoryIndex(true); //true = store offsets
> index.addField(field, s, analyzer);
> index.freeze();
>
> IndexSearcher searcher = index.createSearcher();
> IndexReader reader = searcher.getIndexReader();
> spanQuery = (SpanQuery) spanQuery.rewrite(reader);
> SpanWeight weight = (SpanWeight) searcher.createWeight(spanQuery, false);
> Spans spans = weight.getSpans(reader.leaves().get(0), SpanWeight.Postings.OFFSETS);
>
> if (spans == null) {
>     //do something with full string
>     return;
> }
>
> OffsetSpanCollector offsetSpanCollector = new OffsetSpanCollector();
> List<OffsetAttribute> spanPositions = new ArrayList<>();
> while (spans.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
>     while (spans.nextStartPosition() != Spans.NO_MORE_POSITIONS) {
>         //record the span's start and end token positions (endPosition() is exclusive, hence -1)
>         OffsetAttributeImpl offsetAttribute = new OffsetAttributeImpl();
>         offsetAttribute.setOffset(spans.startPosition(), spans.endPosition() - 1);
>         spanPositions.add(offsetAttribute);
>         spans.collect(offsetSpanCollector);
>     }
> }
>
> Map<Integer, OffsetAttribute> charOffsets = offsetSpanCollector.getOffsets();
> //now iterate through the list of spanPositions and grab the character offsets
> //for the start and end tokens of each span from the charOffsets
> ...
>
> private class OffsetSpanCollector implements SpanCollector {
>     Map<Integer, OffsetAttribute> charOffsets = new HashMap<>();
>
>     @Override
>     public void collectLeaf(PostingsEnum postingsEnum, int i, Term term) throws IOException {
>         //i is the token position; map it to this term's character offsets
>         OffsetAttributeImpl offsetAttribute = new OffsetAttributeImpl();
>         offsetAttribute.setOffset(postingsEnum.startOffset(), postingsEnum.endOffset());
>         charOffsets.put(i, offsetAttribute);
>     }
>
>     @Override
>     public void reset() {
>         //don't think I need to do anything with this?
>     }
>
>     public Map<Integer, OffsetAttribute> getOffsets() {
>         return charOffsets;
>     }
> }
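For the SpanCrawler use case mentioned at the top, i.e. running against a full IndexSearcher rather than a single-document MemoryIndex, I'm assuming the same pattern just loops over every leaf of the index. Roughly like this (untested sketch; it assumes the field was indexed with character offsets and reuses the OffsetSpanCollector from the quoted code):

    SpanQuery rewritten = (SpanQuery) spanQuery.rewrite(searcher.getIndexReader());
    SpanWeight weight = (SpanWeight) searcher.createWeight(rewritten, false);
    for (LeafReaderContext leafCtx : searcher.getIndexReader().leaves()) {
        Spans spans = weight.getSpans(leafCtx, SpanWeight.Postings.OFFSETS);
        if (spans == null) {
            continue; //no matches in this segment
        }
        while (spans.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
            //one collector per document so token positions from different docs don't collide
            OffsetSpanCollector collector = new OffsetSpanCollector();
            while (spans.nextStartPosition() != Spans.NO_MORE_POSITIONS) {
                spans.collect(collector);
                //spans.startPosition()/endPosition() are token positions; resolve them to
                //character offsets via collector.getOffsets(), as in the lookup above
            }
            //leafCtx.docBase + spans.docID() is the top-level doc id for this document's hits
        }
    }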