Thanks Bernd; I'll see if I can make a test case from this.

Mike McCandless

http://blog.mikemccandless.com

On Mon, Feb 13, 2017 at 5:00 AM, Bernd Fehling
<bernd.fehl...@uni-bielefeld.de> wrote:
> My very simple and small sysonym_test.txt has only one line:
> naturwald, natural\ forest, forêt\ naturelle, natürlicher\ wald
>
> If I only use WT (WhitespaceTokenizer) and SGF (with WhitespaceTokenizer)
> the result is:
>
> WT   text         start  end  positionLength  type     position
>      natural      0      7    1               word     1
>      forest       8      14   1               word     2
>
> SGF  text         start  end  positionLength  type     position
>      natural      0      7    3               word     1
>      naturelle    0      14   3               SYNONYM  2
>      wald         0      14   2               SYNONYM  3
>      naturwald    0      14   4               SYNONYM  1
>      forêt        0      14   1               SYNONYM  1
>      natürlicher  0      14   2               SYNONYM  1
>
>      forest       8      14   1               word     4
>
> The result is some kind of rubbish.
> Also note the empty line between "natürlicher" and "forest".
>
> Anything else I should try, maybe with KeywordTokenizer?
>
> p.s. You might have noticed the SPF filters in my setup.
> The first is SynonymPreFilter, which sets all attributes to the right
> values; the second is SynonymPostFilter, which fixes all attribute
> settings again, sets multi-word synonyms as phrases, and cleans up the
> result of SGF.
>
> Regards
> Bernd
>
> On 11.02.2017 at 00:45, Michael McCandless wrote:
>> Yeah, those tokens should have position length 2.
>>
>> Can you reduce to a small set of synonyms and text? If you use only
>> whitespace tokenizer and SGF does the issue reproduce?
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Fri, Feb 10, 2017 at 10:07 AM, Bernd Fehling
>> <bernd.fehl...@uni-bielefeld.de> wrote:
>>> Example for position end and positionLength of SGF.
>>>
>>> query: natural forest
>>>
>>> WT   text     start  end  positionLength  type  position
>>>      natural  0      7    1               word  1
>>>      forest   8      14   1               word  2
>>> ...
>>>
>>> SPF  text                start  end  positionLength  type     position
>>>      natural             0      7    1               word     1
>>>      natural forest      0      14   2               shingle  2
>>>      forest              8      14   1               word     3
>>>
>>> SGF  text                start  end  positionLength  type     position
>>>      natural             0      7    1               word     1
>>>      naturwald           0      14   1               SYNONYM  2
>>>      forêt naturelle     0      14   1               SYNONYM  2
>>>      natürlicher wald    0      14   1               SYNONYM  2
>>>      natural forest      0      14   1               shingle  2
>>>      forest              8      14   1               word     3
>>>
>>> SPF  text                start  end  positionLength  type     position
>>>      natural             0      7    1               word     1
>>>      naturwald           0      9    1               SYNONYM  2
>>>      "forêt naturelle"   0      17   2               SYNONYM  2
>>>      "natürlicher wald"  0      18   2               SYNONYM  2
>>>      "natural forest"    0      16   2               shingle  2
>>>      forest              8      14   1               word     3
>>>
>>>
>>> SGF (SynonymGraphFilter) gives all SYNONYMs the same end position and
>>> the same positionLength.
>>> I suppose that is not correct?
>>>
>>> Regards
>>> Bernd
>>>
>>>
>>> On 09.02.2017 at 18:39, Michael McCandless wrote:
>>>> On Thu, Feb 9, 2017 at 2:40 AM, Bernd Fehling
>>>> <bernd.fehl...@uni-bielefeld.de> wrote:
>>>>> I tried SynonymGraphFilter with my setup and it works right away.
>>>>> It paid off that I did some modifications on my filters while
>>>>> testing 6.3 with my setup.
>>>>
>>>> Good!
>>>>
>>>>> I only replaced SynonymFilter with SynonymGraphFilter and did not
>>>>> use FlattenGraphFilter, pretty simple. So I can confirm that, up
>>>>> to this point, SynonymGraphFilter is a full replacement for
>>>>> SynonymFilter. At least for search-time synonym handling.
>>>>>
>>>>> But this also means there is still some work with the attributes, right?
>>>>> Position looks good, type and start are no problem anyway, but
>>>>> the end position is still wrong, and so is the positionLength for
>>>>> multi-word synonyms.
>>>>
>>>> Can you give an example or make a small test case?
>>>> PositionLengthAttribute is supposed to be correct coming out of
>>>> SynonymGraphFilter.
>>>>
>>>>> One thing I noticed was that the originating token which "produces"
>>>>> synonyms comes out last from SynonymGraphFilter, after the
>>>>> "produced" synonyms.
>>>>> I will have a look inside with the debugger, but I guess this is due
>>>>> to output buffering of SynonymGraphFilter?
>>>>
>>>> Yeah they do come out in a different order, which token filters are
>>>> allowed to do in general for all tokens leaving from the same position
>>>> ...
>>>>
>>>> Mike McCandless
>>>>
>>>> http://blog.mikemccandless.com
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>>
>>>
>>> --
>>> *************************************************************
>>> Bernd Fehling                    Bielefeld University Library
>>> Dipl.-Inform. (FH)                LibTec - Library Technology
>>> Universitätsstr. 25                  and Knowledge Management
>>> 33615 Bielefeld
>>> Tel. +49 521 106-4060       bernd.fehling(at)uni-bielefeld.de
>>>
>>> BASE - Bielefeld Academic Search Engine - www.base-search.net
>>> *************************************************************
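
For readers following the positionLength discussion in this thread, here is a minimal standalone sketch of why a single-token synonym that stands in for two input tokens needs position length 2. It is not Lucene code and does not call any Lucene API: the `Token` type, the `paths` helper, and the positions used are hypothetical and chosen only for illustration. The one assumption taken from Lucene is the meaning of the attribute: a token that starts at position p with positionLength n ends at position p + n.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    """Hypothetical stand-in for a token and two of its attributes."""
    text: str
    position: int          # node where the token starts
    position_length: int   # how many positions the token spans (>= 1)


def paths(tokens: List[Token]) -> List[str]:
    """Enumerate every token sequence from the first node to the last node."""
    start = min(t.position for t in tokens)
    final = max(t.position + t.position_length for t in tokens)

    def walk(node: int) -> List[List[str]]:
        if node == final:
            return [[]]
        found = []
        for t in tokens:
            if t.position == node:
                # Follow the arc to the node where this token ends.
                for rest in walk(node + t.position_length):
                    found.append([t.text] + rest)
        return found

    return [" ".join(p) for p in walk(start)]


# Synonym with position length 1: it ends where "natural" ends,
# so the graph also contains the unintended reading "naturwald forest".
flat = [Token("natural", 1, 1), Token("naturwald", 1, 1), Token("forest", 2, 1)]

# Synonym with position length 2: it ends where "forest" ends,
# so it replaces the whole phrase "natural forest".
graph = [Token("natural", 1, 1), Token("naturwald", 1, 2), Token("forest", 2, 1)]

print(paths(flat))
print(paths(graph))
```

In this model the only difference between the two token lists is the position length of the synonym, which is enough to change the set of readings a query parser could derive from the stream.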