Now it is getting clearer. "pos" (aka position) starts at "-1" and its
highest number is the last "node id" of the graph.
"pos" minus "positionLength" is the starting "node id" of the arc.

Is the tokenStream after each filter always a valid graph?

E.g. ShingleFilter with query "natural forest":

SF   text            start  end  positionLength  type     position
     natural         0      7    1               word     1
     natural forest  0      14   2               shingle  1
     forest          8      14   1               word     2

(0)--- natural --->(1)--- forest --->(2)

But how to insert the shingle into this graph?
This is why I added a SynonymPreFilter to correct the graph between
ShingleFilter and SynonymGraphFilter. But I had the wrong understanding
of pos, positionIncrement, positionLength,...

Another question, the API docs say
"...Injecting synonyms – here, synonyms of a token should be added after that token..."
But as I already mentioned, the synonyms are added before the token.
Are the docs outdated?

Regards
Bernd

On 13.02.2017 at 17:31, Michael McCandless wrote:
> On Mon, Feb 13, 2017 at 9:04 AM, Bernd Fehling
> <bernd.fehl...@uni-bielefeld.de> wrote:
>
>> Am I confused by the naming of pos, positionIncrement, offset,
>> positionLength, start and end between Lucene and Solr?
>
> "pos" is just accumulating the positionIncrement values, starting from
> -1. I don't think Solr's analysis UI would change the meaning of
> these attributes.
>
>> OK, the SynonymGraphFilter is ONLY for Lucene, right?
>
> No, it's also for Solr and Elasticsearch and any other search servers
> on top of Lucene as well.
>
>> But how are you going to build the multi-word synonym query
>> "natürlicher wald" from "natural forest"?
>
> Lucene's and Elasticsearch's query parsers have already been fixed to
> correctly handle token graphs by default; Solr has a fork of Lucene's
> query parser I think ... I'm not sure if it's been fixed yet to
> interpret graphs.
>
> See e.g. https://issues.apache.org/jira/browse/LUCENE-7603 and
> https://issues.apache.org/jira/browse/LUCENE-7638
>
>> And how are you going to highlight a synonym hit for "natürlicher wald"
>> when start and end is set to 0-14 and not to 0-18?
>> Or is start and end not used for highlighting?
>
> This start/end offset, at query time, is not normally used. If you
> have a document in the index that has "natürlicher wald" then it would
> have offsets X to X+18, stored in the index ideally as postings
> offsets, and should highlight correctly?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>> On 13.02.2017 at 14:24, Michael McCandless wrote:
>>> Unfortunately, I cannot reproduce the problem with a straight Lucene
>>> test case. I added this test case to TestSynonymGraphFilter.java:
>>>
>>> https://gist.github.com/mikemccand/318459ca507742052688e2fe800a10dd
>>>
>>> And when I run it, it produces the correct token graph:
>>>
>>> TOKEN: naturwald
>>>   offset: 0-14
>>>   pos: 0-4
>>>   type: SYNONYM
>>>
>>> TOKEN: forêt
>>>   offset: 0-14
>>>   pos: 0-1
>>>   type: SYNONYM
>>>
>>> TOKEN: natürlicher
>>>   offset: 0-14
>>>   pos: 0-2
>>>   type: SYNONYM
>>>
>>> TOKEN: natural
>>>   offset: 0-7
>>>   pos: 0-3
>>>   type: word
>>>
>>> TOKEN: naturelle
>>>   offset: 0-14
>>>   pos: 1-4
>>>   type: SYNONYM
>>>
>>> TOKEN: wald
>>>   offset: 0-14
>>>   pos: 2-4
>>>   type: SYNONYM
>>>
>>> TOKEN: forest
>>>   offset: 8-14
>>>   pos: 3-4
>>>   type: word
>>>
>>> Remember that the "pos: " output above is really "node IDs" and you
>>> can see the inserted side paths are correct. The offsets are
>>> necessarily always 0-14 for inserted tokens because that is the span
>>> of the two original tokens.
>>>
>>> Can you try removing the SPF filters in your test? Or otherwise
>>> simplify your test so it's closer to what my test case is doing?
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>> On Mon, Feb 13, 2017 at 7:52 AM, Michael McCandless
>>> <luc...@mikemccandless.com> wrote:
>>>> Thanks Bernd; I'll see if I can make a test case from this.
>>>>
>>>> Mike McCandless
>>>>
>>>> http://blog.mikemccandless.com
>>>>
>>>>
>>>> On Mon, Feb 13, 2017 at 5:00 AM, Bernd Fehling
>>>> <bernd.fehl...@uni-bielefeld.de> wrote:
>>>>> My very simple and small sysonym_test.txt has only one line:
>>>>> naturwald, natural\ forest, forêt\ naturelle, natürlicher\ wald
>>>>>
>>>>> If I only use WT (WhitespaceTokenizer) and SGF (with WhitespaceTokenizer)
>>>>> the result is:
>>>>>
>>>>> WT   text     start  end  positionLength  type  position
>>>>>      natural  0      7    1               word  1
>>>>>      forest   8      14   1               word  2
>>>>>
>>>>> SGF  text         start  end  positionLength  type     position
>>>>>      natural      0      7    3               word     1
>>>>>      naturelle    0      14   3               SYNONYM  2
>>>>>      wald         0      14   2               SYNONYM  3
>>>>>      naturwald    0      14   4               SYNONYM  1
>>>>>      forêt        0      14   1               SYNONYM  1
>>>>>      natürlicher  0      14   2               SYNONYM  1
>>>>>
>>>>>      forest       8      14   1               word     4
>>>>>
>>>>> The result is some kind of rubbish.
>>>>> Also note the empty line between "natürlicher" and "forest".
>>>>>
>>>>> Anything else I should try, maybe with KeywordTokenizer?
>>>>>
>>>>> p.s. You might have noticed the SPF filters in my setup.
>>>>> First is SynonymPreFilter to set all attributes to the right value,
>>>>> second is SynonymPostFilter to again fix all attribute settings but
>>>>> also set multi-word synonyms as phrase and also clean up the result
>>>>> of SGF.
>>>>>
>>>>> Regards
>>>>> Bernd
>>>>>
>>>>> On 11.02.2017 at 00:45, Michael McCandless wrote:
>>>>>> Yeah, those tokens should have position length 2.
>>>>>>
>>>>>> Can you reduce to a small set of synonyms and text? If you use only
>>>>>> whitespace tokenizer and SGF does the issue reproduce?
>>>>>>
>>>>>> Mike McCandless
>>>>>>
>>>>>> http://blog.mikemccandless.com
>>>>>>
>>>>>>
>>>>>> On Fri, Feb 10, 2017 at 10:07 AM, Bernd Fehling
>>>>>> <bernd.fehl...@uni-bielefeld.de> wrote:
>>>>>>> Example for position end and positionLength of SGF.
>>>>>>>
>>>>>>> query: natural forest
>>>>>>>
>>>>>>> WT   text     start  end  positionLength  type  position
>>>>>>>      natural  0      7    1               word  1
>>>>>>>      forest   8      14   1               word  2
>>>>>>> ...
>>>>>>>
>>>>>>> SPF  text            start  end  positionLength  type     position
>>>>>>>      natural         0      7    1               word     1
>>>>>>>      natural forest  0      14   2               shingle  2
>>>>>>>      forest          8      14   1               word     3
>>>>>>>
>>>>>>> SGF  text              start  end  positionLength  type     position
>>>>>>>      natural           0      7    1               word     1
>>>>>>>      naturwald         0      14   1               SYNONYM  2
>>>>>>>      forêt naturelle   0      14   1               SYNONYM  2
>>>>>>>      natürlicher wald  0      14   1               SYNONYM  2
>>>>>>>      natural forest    0      14   1               shingle  2
>>>>>>>      forest            8      14   1               word     3
>>>>>>>
>>>>>>> SPF  text                start  end  positionLength  type     position
>>>>>>>      natural             0      7    1               word     1
>>>>>>>      naturwald           0      9    1               SYNONYM  2
>>>>>>>      "forêt naturelle"   0      17   2               SYNONYM  2
>>>>>>>      "natürlicher wald"  0      18   2               SYNONYM  2
>>>>>>>      "natural forest"    0      16   2               shingle  2
>>>>>>>      forest              8      14   1               word     3
>>>>>>>
>>>>>>>
>>>>>>> SGF (SynonymGraphFilter) gives all SYNONYMs the same end position
>>>>>>> and positionLength.
>>>>>>> I suppose that is not correct?
>>>>>>>
>>>>>>> Regards
>>>>>>> Bernd
>>>>>>>
>>>>>>>
>>>>>>> On 09.02.2017 at 18:39, Michael McCandless wrote:
>>>>>>>> On Thu, Feb 9, 2017 at 2:40 AM, Bernd Fehling
>>>>>>>> <bernd.fehl...@uni-bielefeld.de> wrote:
>>>>>>>>> I tried SynonymGraphFilter with my setup and it works right away.
>>>>>>>>> It paid off that I did some modifications on my filters while
>>>>>>>>> testing 6.3 with my setup.
>>>>>>>>
>>>>>>>> Good!
>>>>>>>>
>>>>>>>>> I only replaced SynonymFilter with SynonymGraphFilter and did not
>>>>>>>>> use FlattenGraphFilter, pretty simple. So I can confirm that, up
>>>>>>>>> to this point, SynonymGraphFilter is a full replacement for
>>>>>>>>> SynonymFilter. At least for search-time synonym handling.
>>>>>>>>>
>>>>>>>>> But this also means there is still some work with the attributes,
>>>>>>>>> right?
>>>>>>>>> Position looks good, type and start are no problem anyway, but
>>>>>>>>> the end position is still wrong and the positionLength for multi-word
>>>>>>>>> synonyms.
>>>>>>>>
>>>>>>>> Can you give an example or make a small test case?
>>>>>>>> PositionLengthAttribute is supposed to be correct coming out of
>>>>>>>> SynonymGraphFilter.
>>>>>>>>
>>>>>>>>> One thing I noticed was that the originating token which "produces"
>>>>>>>>> synonyms comes out last from SynonymGraphFilter, after the
>>>>>>>>> "produced" synonyms.
>>>>>>>>> I will have a look inside with debugger but I guess this is due
>>>>>>>>> to output buffering of SynonymGraphFilter?
>>>>>>>>
>>>>>>>> Yeah they do come out in a different order, which token filters are
>>>>>>>> allowed to do in general for all tokens leaving from the same position
>>>>>>>> ...
>>>>>>>>
>>>>>>>> Mike McCandless
>>>>>>>>
>>>>>>>> http://blog.mikemccandless.com
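
For anyone following the node-ID discussion in this thread, here is a
minimal sketch in plain Python (not Lucene API) of how the "pos: X-Y" ranges
in Mike's test output can be reproduced. It assumes that "pos" accumulates
positionIncrement starting from -1, as Mike describes, and that each token's
arc runs from node "pos" to node "pos + positionLength". The positionIncrement
values below are not stated anywhere in the thread; they are inferred from
Mike's output. The names token_arcs and TOKENS are made up for illustration.

```python
def token_arcs(tokens):
    """Map (term, position_increment, position_length) triples to graph arcs.

    Assumption: an arc goes from node `pos` to node `pos + position_length`,
    where `pos` is the running sum of position increments starting at -1.
    """
    pos = -1  # accumulation starts from -1, per Mike's explanation
    arcs = []
    for term, pos_inc, pos_len in tokens:
        pos += pos_inc
        arcs.append((term, pos, pos + pos_len))
    return arcs


# Tokens in the order Mike's test prints them. positionLength values match
# Bernd's SGF table; positionIncrement values are inferred, not quoted.
TOKENS = [
    ("naturwald", 1, 4),
    ("forêt", 0, 1),
    ("natürlicher", 0, 2),
    ("natural", 0, 3),
    ("naturelle", 1, 3),
    ("wald", 1, 2),
    ("forest", 1, 1),
]

if __name__ == "__main__":
    for term, start_node, end_node in token_arcs(TOKENS):
        print(f"{term}: {start_node}-{end_node}")
```

Under these assumptions the computed ranges agree with the "pos:" lines in
Mike's output, and the "position" column in Bernd's WT + SGF table appears
to show the same positions counted from 1 instead of 0.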