+1 for patches to improve the documentation or to fix the bug ;)

SynonymFilter.java has a NOTE about not being able to handle multiple
incoming tokens at the same position, but we could make this stronger.
But note that there is resistance to improving "tokenstream as graph"
handling in Lucene. For example, see these "dying on the vine"
improvements to the synonym filter:
https://issues.apache.org/jira/browse/LUCENE-6638 and
https://issues.apache.org/jira/browse/LUCENE-6664, both of which are
good improvements (at least in my opinion!) but won't be committed any
time soon.

Mike McCandless

http://blog.mikemccandless.com


On Fri, Feb 26, 2016 at 10:13 AM, Ryan Josal <[email protected]> wrote:
> Is this by design or is there a Jira to track it? It makes it a little
> difficult to use my own synonyms with wordnet. Other use cases:
> *) SynonymFilters separated by other filters
> *) SynonymFilters with different analyzers configured
> *) SynonymFilters with different case sensitivity
>
> Expansion wouldn't be an issue since you can control it with file format.
>
> Would it be ok to update some documentation about this? The
> AnalyzersTokenizersTokenFilters page comes to mind (by the way, the
> tokenizerFactory search-lucene.com link is throwing an exception).
>
> Ryan
>
>
> On Friday, February 26, 2016, Michael McCandless <[email protected]>
> wrote:
>>
>> You can't put a SynonymFilter in front of another one, because the 2nd
>> one is unable to properly consume an arbitrary graph.
>>
>> For the same reason, you can't put e.g. JapaneseTokenizer before a
>> SynonymFilter and expect it to always work.
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Thu, Feb 25, 2016 at 6:10 PM, Ryan Josal <[email protected]> wrote:
>> > I know, there's a ton of documentation about the query parser
>> > whitespace issue, and there's also a fair bit of info on the
>> > positionLengthAttribute issue, but I seem to have stumbled upon a
>> > new issue with multi term synonyms: it doesn't seem to play well
>> > with a bunch of tokens in the same position.
>> >
>> > I have a synonym filter with this expansion:
>> > side table,end table
>> >
>> > I can see the synonym is applied when looking at the token stream
>> > output for "side table". Today I decided to throw an additional
>> > synonymFilter immediately before that one with wordnet synonym
>> > expansions. Wordnet expectedly bloats the tokenstream, but all of a
>> > sudden the original end table expansion doesn't get applied. I see
>> > "side" followed by a bunch of tokens in the same position, followed
>> > by a couple new tokens in the next position, followed by "table" in
>> > the same token position, followed by some more new tokens in the
>> > same position. Since side is still adjacent to table in token
>> > positions, I would expect the synonym to hit. Is this a known issue
>> > (what's the Jira)? The impact seems significant. Since wordnet is
>> > so comprehensive, it's likely going to cause this issue with most
>> > of my multi term synonyms. Maybe the workaround is to apply multi
>> > term synonyms first as best as possible, although I don't know if
>> > you have that kind of control if all your synonyms are applied by a
>> > single SynonymFilter.
>> >
>> > Thanks,
>> > Ryan
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>
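
To illustrate the symptom described in the thread, here is a toy model in plain Java. It is only a sketch and does not use Lucene or reproduce SynonymFilter's actual code. It assumes, for illustration, a matcher that walks tokens in stream order and ignores position increments, and compares it with one that groups tokens by position first. The class name, the helper methods, and the stand-in "wordnet" terms (flank, edge, board, desk) are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class StackedTokenDemo {
    /** Toy token: a term plus a position increment (0 = stacked on the previous token). */
    static final class Tok {
        final String term;
        final int posInc;
        Tok(String term, int posInc) { this.term = term; this.posInc = posInc; }
    }

    /** Looks for the phrase in raw stream order, ignoring position increments. */
    static boolean sequentialMatch(List<Tok> stream, String... phrase) {
        for (int start = 0; start + phrase.length <= stream.size(); start++) {
            boolean ok = true;
            for (int i = 0; i < phrase.length && ok; i++) {
                ok = stream.get(start + i).term.equals(phrase[i]);
            }
            if (ok) return true;
        }
        return false;
    }

    /** Groups tokens by absolute position, then looks for the phrase across adjacent positions. */
    static boolean positionAwareMatch(List<Tok> stream, String... phrase) {
        List<List<String>> byPos = new ArrayList<>();
        for (Tok t : stream) {
            // Open one new position slot per increment; the very first token always opens one.
            int steps = byPos.isEmpty() ? Math.max(1, t.posInc) : t.posInc;
            for (int k = 0; k < steps; k++) byPos.add(new ArrayList<>());
            byPos.get(byPos.size() - 1).add(t.term);
        }
        for (int start = 0; start + phrase.length <= byPos.size(); start++) {
            boolean ok = true;
            for (int i = 0; i < phrase.length && ok; i++) {
                ok = byPos.get(start + i).contains(phrase[i]);
            }
            if (ok) return true;
        }
        return false;
    }

    /** "side table" with no upstream expansion. */
    static List<Tok> plainStream() {
        return Arrays.asList(new Tok("side", 1), new Tok("table", 1));
    }

    /** "side table" after a hypothetical upstream expansion stacked extra terms on each position. */
    static List<Tok> stackedStream() {
        return Arrays.asList(
            new Tok("side", 1), new Tok("flank", 0), new Tok("edge", 0),
            new Tok("board", 1), new Tok("table", 0), new Tok("desk", 0));
    }

    public static void main(String[] args) {
        System.out.println("plain, sequential:       " + sequentialMatch(plainStream(), "side", "table"));
        System.out.println("stacked, sequential:     " + sequentialMatch(stackedStream(), "side", "table"));
        System.out.println("stacked, position-aware: " + positionAwareMatch(stackedStream(), "side", "table"));
    }
}
```

In this model the sequential matcher finds "side table" in the plain stream but misses it once other terms are stacked between "side" and "table" in stream order, even though the two are still at adjacent positions. The model ignores position lengths, so it does not cover the full graph case that the linked Jira issues are about.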
