Thank you. Yes, the Map<Integer, OffsetAttribute> charOffsets maps a token position (Integer) to the character offsets for that token... so I think I'm good?
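In case it's useful to anyone else reading this, the lookup step I elided with "..." at the end of the snippet below would be roughly the following. This is an untested sketch that reuses the s, spanPositions and charOffsets variables from that snippet; note that the "offsets" stored in spanPositions are really token positions:

    for (OffsetAttribute spanPosition : spanPositions) {
        //keys into the position-keyed map: start position and (inclusive) end position of the span
        OffsetAttribute startToken = charOffsets.get(spanPosition.startOffset());
        OffsetAttribute endToken = charOffsets.get(spanPosition.endOffset());
        if (startToken == null || endToken == null) {
            continue; //shouldn't happen if the collector saw every term in the span
        }
        //character offsets of the full span, not just the matching terms
        int charStart = startToken.startOffset();
        int charEnd = endToken.endOffset();
        String spanText = s.substring(charStart, charEnd);
        //do something with charStart/charEnd/spanText
    }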
As part of LUCENE-5317, I have a DocTokenOffsetsVisitor interface and a SpanCrawler that runs that visitor against an IndexSearcher... The visitor sees a Document and a list of character offsets for the hits in that document. Once I update that to include the more modern code below (optionally, for those storing offsets)... is there any interest in integrating LUCENE-5317, or components of it, so that others don't have to reinvent the wheel below?

Cheers,

Tim

-----Original Message-----
From: Alan Woodward [mailto:a...@flax.co.uk]
Sent: Tuesday, November 03, 2015 4:25 AM
To: java-user@lucene.apache.org
Subject: Re: extracting charoffsets from SpanWeight's getSpans() in 5.3.1?

The second parameter passed to SpanCollector.collectLeaf() is the position, rather than an index of any kind, which I think is going to mess things up for you. But other than that, you've got the right idea. :-)

Alan Woodward
www.flax.co.uk

On 3 Nov 2015, at 00:26, Allison, Timothy B. wrote:

> All,
>
> I'm trying to find all spans in a given String via stored offsets in Lucene 5.3.1. I wanted to use the Highlighter with a NullFragmenter, but that is highlighting only the matching terms, not the full Spans (related to LUCENE-6796?).
>
> My current code iterates through the spans, stores the span positions in one array, and gathers the character offsets via a SpanCollector in a Map<Integer, OffsetAttribute>. Is there a simpler way?
>
> Something like this:
>
> String s = "the quick brown fox jumped over the lazy dog";
> String field = "f";
> Analyzer analyzer = new StandardAnalyzer();
>
> SpanQuery spanQuery = new SpanNearQuery(
>         new SpanQuery[] {
>                 new SpanTermQuery(new Term(field, "fox")),
>                 new SpanTermQuery(new Term(field, "quick"))
>         },
>         3,
>         false
> );
>
> MemoryIndex index = new MemoryIndex(true); //true = store offsets
> index.addField(field, s, analyzer);
> index.freeze();
>
> IndexSearcher searcher = index.createSearcher();
> IndexReader reader = searcher.getIndexReader();
> spanQuery = (SpanQuery) spanQuery.rewrite(reader);
> SpanWeight weight = (SpanWeight) searcher.createWeight(spanQuery, false);
> Spans spans = weight.getSpans(reader.leaves().get(0), SpanWeight.Postings.OFFSETS);
>
> if (spans == null) {
>     //do something with full string
>     return;
> }
>
> OffsetSpanCollector offsetSpanCollector = new OffsetSpanCollector();
> List<OffsetAttribute> spanPositions = new ArrayList<>();
> while (spans.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
>     while (spans.nextStartPosition() != Spans.NO_MORE_POSITIONS) {
>         //record the span's start and end token positions (endPosition() is exclusive, hence -1)
>         OffsetAttributeImpl offsetAttribute = new OffsetAttributeImpl();
>         offsetAttribute.setOffset(spans.startPosition(), spans.endPosition() - 1);
>         spanPositions.add(offsetAttribute);
>         spans.collect(offsetSpanCollector);
>     }
> }
>
> Map<Integer, OffsetAttribute> charOffsets = offsetSpanCollector.getOffsets();
> //now iterate through the list of spanPositions and grab the character offsets
> //for the start and end tokens of each span from the charOffsets
> ...
>
> private class OffsetSpanCollector implements SpanCollector {
>     Map<Integer, OffsetAttribute> charOffsets = new HashMap<>();
>
>     @Override
>     public void collectLeaf(PostingsEnum postingsEnum, int i, Term term) throws IOException {
>         //i is the token position; map it to this term's character offsets
>         OffsetAttributeImpl offsetAttribute = new OffsetAttributeImpl();
>         offsetAttribute.setOffset(postingsEnum.startOffset(), postingsEnum.endOffset());
>         charOffsets.put(i, offsetAttribute);
>     }
>
>     @Override
>     public void reset() {
>         //don't think I need to do anything with this?
>     }
>
>     public Map<Integer, OffsetAttribute> getOffsets() {
>         return charOffsets;
>     }
> }
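For the SpanCrawler use case mentioned at the top, i.e. running against a full IndexSearcher rather than a single-document MemoryIndex, I'm assuming the same pattern just loops over every leaf of the index. Roughly like this (untested sketch; it assumes the field was indexed with character offsets and reuses the OffsetSpanCollector from the quoted code):

    SpanQuery rewritten = (SpanQuery) spanQuery.rewrite(searcher.getIndexReader());
    SpanWeight weight = (SpanWeight) searcher.createWeight(rewritten, false);
    for (LeafReaderContext leafCtx : searcher.getIndexReader().leaves()) {
        Spans spans = weight.getSpans(leafCtx, SpanWeight.Postings.OFFSETS);
        if (spans == null) {
            continue; //no matches in this segment
        }
        while (spans.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
            //one collector per document so token positions from different docs don't collide
            OffsetSpanCollector collector = new OffsetSpanCollector();
            while (spans.nextStartPosition() != Spans.NO_MORE_POSITIONS) {
                spans.collect(collector);
                //spans.startPosition()/endPosition() are token positions; resolve them to
                //character offsets via collector.getOffsets(), as in the lookup above
            }
            //leafCtx.docBase + spans.docID() is the top-level doc id for this document's hits
        }
    }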