On Mon, Feb 13, 2017 at 9:04 AM, Bernd Fehling <bernd.fehl...@uni-bielefeld.de> wrote:
> Am I confused by the naming of pos, positionIncrement, offset, positionLength,
> start and end between Lucene and Solr?

"pos" is just accumulating the positionIncrement values, starting from
-1. I don't think Solr's analysis UI would change the meaning of
these attributes.

> OK, the SynonymGraphFilter is ONLY for Lucene, right?

No, it's also for Solr and Elasticsearch and any other search servers
on top of Lucene as well.

> But how are you going to build the multi-word synonym query "natürlicher wald"
> from "natural forest"?

Lucene's and Elasticsearch's query parsers have already been fixed to
correctly handle token graphs by default; Solr has a fork of Lucene's
query parser I think ... I'm not sure if it's been fixed yet to
interpret graphs. See e.g.
https://issues.apache.org/jira/browse/LUCENE-7603 and
https://issues.apache.org/jira/browse/LUCENE-7638

> And how are you going to highlight a synonym hit for "natürlicher wald"
> when start and end is set to 0-14 and not to 0-18?
> Or is start and end not used for highlighting?

This start/end offset, at query time, is not normally used. If you
have a document in the index that has "natürlicher wald" then it would
have offsets X to X+18, stored in the index ideally as postings
offsets, and should highlight correctly?

Mike McCandless

http://blog.mikemccandless.com

> On 13.02.2017 at 14:24, Michael McCandless wrote:
>> Unfortunately, I cannot reproduce the problem with a straight Lucene
>> test case.
>> I added this test case to TestSynonymGraphFilter.java:
>>
>>   https://gist.github.com/mikemccand/318459ca507742052688e2fe800a10dd
>>
>> And when I run it, it produces the correct token graph:
>>
>> TOKEN: naturwald
>>   offset: 0-14
>>   pos: 0-4
>>   type: SYNONYM
>>
>> TOKEN: forêt
>>   offset: 0-14
>>   pos: 0-1
>>   type: SYNONYM
>>
>> TOKEN: natürlicher
>>   offset: 0-14
>>   pos: 0-2
>>   type: SYNONYM
>>
>> TOKEN: natural
>>   offset: 0-7
>>   pos: 0-3
>>   type: word
>>
>> TOKEN: naturelle
>>   offset: 0-14
>>   pos: 1-4
>>   type: SYNONYM
>>
>> TOKEN: wald
>>   offset: 0-14
>>   pos: 2-4
>>   type: SYNONYM
>>
>> TOKEN: forest
>>   offset: 8-14
>>   pos: 3-4
>>   type: word
>>
>> Remember that the "pos: " output above is really "node IDs" and you
>> can see the inserted side paths are correct. The offsets are
>> necessarily always 0-14 for inserted tokens because that is the span
>> of the two original tokens.
>>
>> Can you try removing the SPF filters in your test? Or otherwise
>> simplify your test so it's closer to what my test case is doing?
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> On Mon, Feb 13, 2017 at 7:52 AM, Michael McCandless
>> <luc...@mikemccandless.com> wrote:
>>> Thanks Bernd; I'll see if I can make a test case from this.
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>>
>>> On Mon, Feb 13, 2017 at 5:00 AM, Bernd Fehling
>>> <bernd.fehl...@uni-bielefeld.de> wrote:
>>>> My very simple and small sysonym_test.txt has only one line:
>>>> naturwald, natural\ forest, forêt\ naturelle, natürlicher\ wald
>>>>
>>>> If I only use WT (WhitespaceTokenizer) and SGF (with WhitespaceTokenizer)
>>>> the result is:
>>>>
>>>> WT   text         start  end  positionLength  type     position
>>>>      natural      0      7    1               word     1
>>>>      forest       8      14   1               word     2
>>>>
>>>> SGF  text         start  end  positionLength  type     position
>>>>      natural      0      7    3               word     1
>>>>      naturelle    0      14   3               SYNONYM  2
>>>>      wald         0      14   2               SYNONYM  3
>>>>      naturwald    0      14   4               SYNONYM  1
>>>>      forêt        0      14   1               SYNONYM  1
>>>>      natürlicher  0      14   2               SYNONYM  1
>>>>
>>>>      forest       8      14   1               word     4
>>>>
>>>> The result is some kind of rubbish.
>>>> Also note the empty line between "natürlicher" and "forest".
>>>>
>>>> Anything else I should try, maybe with KeywordTokenizer?
>>>>
>>>> P.S. You might have noticed the SPF filters in my setup.
>>>> First is SynonymPreFilter to set all attributes to the right value,
>>>> second is SynonymPostFilter to again fix all attribute settings but
>>>> also set multi-word synonyms as phrase and also clean up the result
>>>> of SGF.
>>>>
>>>> Regards
>>>> Bernd
>>>>
>>>> On 11.02.2017 at 00:45, Michael McCandless wrote:
>>>>> Yeah, those tokens should have position length 2.
>>>>>
>>>>> Can you reduce to a small set of synonyms and text? If you use only
>>>>> whitespace tokenizer and SGF does the issue reproduce?
>>>>>
>>>>> Mike McCandless
>>>>>
>>>>> http://blog.mikemccandless.com
>>>>>
>>>>>
>>>>> On Fri, Feb 10, 2017 at 10:07 AM, Bernd Fehling
>>>>> <bernd.fehl...@uni-bielefeld.de> wrote:
>>>>>> Example for position end and positionLength of SGF.
>>>>>>
>>>>>> query: natural forest
>>>>>>
>>>>>> WT   text                start  end  positionLength  type     position
>>>>>>      natural             0      7    1               word     1
>>>>>>      forest              8      14   1               word     2
>>>>>> ...
>>>>>>
>>>>>> SPF  text                start  end  positionLength  type     position
>>>>>>      natural             0      7    1               word     1
>>>>>>      natural forest      0      14   2               shingle  2
>>>>>>      forest              8      14   1               word     3
>>>>>>
>>>>>> SGF  text                start  end  positionLength  type     position
>>>>>>      natural             0      7    1               word     1
>>>>>>      naturwald           0      14   1               SYNONYM  2
>>>>>>      forêt naturelle     0      14   1               SYNONYM  2
>>>>>>      natürlicher wald    0      14   1               SYNONYM  2
>>>>>>      natural forest      0      14   1               shingle  2
>>>>>>      forest              8      14   1               word     3
>>>>>>
>>>>>> SPF  text                start  end  positionLength  type     position
>>>>>>      natural             0      7    1               word     1
>>>>>>      naturwald           0      9    1               SYNONYM  2
>>>>>>      "forêt naturelle"   0      17   2               SYNONYM  2
>>>>>>      "natürlicher wald"  0      18   2               SYNONYM  2
>>>>>>      "natural forest"    0      16   2               shingle  2
>>>>>>      forest              8      14   1               word     3
>>>>>>
>>>>>>
>>>>>> SGF (SynonymGraphFilter) produces the same end offset and the same
>>>>>> positionLength for all SYNONYMs.
>>>>>> I suppose that is not correct?
>>>>>>
>>>>>> Regards
>>>>>> Bernd
>>>>>>
>>>>>>
>>>>>> On 09.02.2017 at 18:39, Michael McCandless wrote:
>>>>>>> On Thu, Feb 9, 2017 at 2:40 AM, Bernd Fehling
>>>>>>> <bernd.fehl...@uni-bielefeld.de> wrote:
>>>>>>>> I tried SynonymGraphFilter with my setup and it works right away.
>>>>>>>> It paid off that I did some modifications on my filters while
>>>>>>>> testing 6.3 with my setup.
>>>>>>>
>>>>>>> Good!
>>>>>>>
>>>>>>>> I only replaced SynonymFilter with SynonymGraphFilter and did not
>>>>>>>> use FlattenGraphFilter, pretty simple. So I can confirm that, up
>>>>>>>> to this point, SynonymGraphFilter is a full replacement for
>>>>>>>> SynonymFilter. At least for search-time synonym handling.
>>>>>>>>
>>>>>>>> But this also means there is still some work with the attributes,
>>>>>>>> right?
>>>>>>>> Position looks good, type and start are no problem anyway, but
>>>>>>>> the end position is still wrong and the positionLength for multi-word
>>>>>>>> synonyms.
>>>>>>>
>>>>>>> Can you give an example or make a small test case?
>>>>>>> PositionLengthAttribute is supposed to be correct coming out of
>>>>>>> SynonymGraphFilter.
>>>>>>>
>>>>>>>> One thing I noticed was that the originating token which "produces"
>>>>>>>> synonyms comes out last from SynonymGraphFilter, after the
>>>>>>>> "produced" synonyms.
>>>>>>>> I will have a look inside with debugger but I guess this is due
>>>>>>>> to output buffering of SynonymGraphFilter?
>>>>>>>
>>>>>>> Yeah they do come out in a different order, which token filters are
>>>>>>> allowed to do in general for all tokens leaving from the same position
>>>>>>> ...
>>>>>>>
>>>>>>> Mike McCandless
>>>>>>>
>>>>>>> http://blog.mikemccandless.com
>>>>>>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
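
For readers following the thread: Mike explains that "pos" accumulates
positionIncrement starting from -1, and that the "pos: X-Y" pairs in his
test output are really node IDs in a token graph. That relationship can be
sketched in plain Java with no Lucene dependency. Everything here is an
assumption made for illustration: the class and method names
(TokenGraphSketch, Token, Arc, toArcs, paths) are hypothetical and not part
of Lucene, the positionLength values are taken from the SGF table earlier
in the thread, and the positionIncrement values are inferred from the
"pos:" start nodes in Mike's output.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; none of these types are Lucene classes.
class TokenGraphSketch {

    /** A token as emitted by an analysis chain. */
    static final class Token {
        final String term;
        final int posInc; // positionIncrement
        final int posLen; // positionLength
        Token(String term, int posInc, int posLen) {
            this.term = term;
            this.posInc = posInc;
            this.posLen = posLen;
        }
    }

    /** The same token viewed as an arc between two graph nodes. */
    static final class Arc {
        final String term;
        final int from;
        final int to;
        Arc(String term, int from, int to) {
            this.term = term;
            this.from = from;
            this.to = to;
        }
    }

    /** The seven tokens from the thread, in emitted order (posInc values are inferred). */
    static List<Token> exampleTokens() {
        List<Token> t = new ArrayList<>();
        t.add(new Token("naturwald", 1, 4));
        t.add(new Token("forêt", 0, 1));
        t.add(new Token("natürlicher", 0, 2));
        t.add(new Token("natural", 0, 3));
        t.add(new Token("naturelle", 1, 3));
        t.add(new Token("wald", 1, 2));
        t.add(new Token("forest", 1, 1));
        return t;
    }

    /** from-node = running sum of posInc (starting at -1); to-node = from + posLen. */
    static List<Arc> toArcs(List<Token> tokens) {
        List<Arc> arcs = new ArrayList<>();
        int pos = -1;
        for (Token t : tokens) {
            pos += t.posInc;
            arcs.add(new Arc(t.term, pos, pos + t.posLen));
        }
        return arcs;
    }

    /** Enumerates every path from node 0 to the highest node as a phrase. */
    static List<String> paths(List<Arc> arcs) {
        int end = 0;
        for (Arc a : arcs) {
            end = Math.max(end, a.to);
        }
        List<String> out = new ArrayList<>();
        walk(arcs, 0, end, "", out);
        return out;
    }

    // Depth-first walk; terminates because every arc moves forward (posLen >= 1).
    private static void walk(List<Arc> arcs, int node, int end, String prefix, List<String> out) {
        if (node == end) {
            out.add(prefix.trim());
            return;
        }
        for (Arc a : arcs) {
            if (a.from == node) {
                walk(arcs, a.to, end, prefix + " " + a.term, out);
            }
        }
    }

    public static void main(String[] args) {
        List<Arc> arcs = toArcs(exampleTokens());
        for (Arc a : arcs) {
            System.out.println(a.term + " pos: " + a.from + "-" + a.to);
        }
        for (String p : paths(arcs)) {
            System.out.println(p);
        }
    }
}
```

Under these assumptions the arcs reproduce the "pos:" pairs from the test
output (for example naturwald 0-4 and forest 3-4), and walking the graph
yields the four alternatives from the synonym file: "naturwald",
"forêt naturelle", "natürlicher wald" and "natural forest". Read this way,
the positionLength values 4, 1, 2 and 3 in the Solr SGF table are what keep
the side paths apart, even though the rows look scrambled when read as a
flat list.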