Shad Storhaug created LUCENENET-612:
---------------------------------------

             Summary: SERIOUS issues with PerFieldAnalyzerWrapper in 4.8
                 Key: LUCENENET-612
                 URL: https://issues.apache.org/jira/browse/LUCENENET-612
             Project: Lucene.Net
          Issue Type: Bug
          Components: Lucene.Net.Analysis.Common
    Affects Versions: Lucene.Net 4.8.0
            Reporter: Shad Storhaug


This came in on the user mailing list on 15-July-2019 and was originally 
reported by Bryan Rojo (bryanr...@elliotelectric.com).

 
{quote}Not necessarily a bug, but for some people who use 
PerFieldAnalyzerWrapper like I do this might be worth noting.

PerFieldAnalyzerWrapper has been "improved" in 4.8 and now uses a 
PER_FIELD_REUSE_STRATEGY which means that the tokenized fields will be stored 
in a dictionary, so if you have multiple fields with the same name in your 
document, then you will only be able to index the very first one that makes it 
into that dictionary.

So the problem with this is that you can potentially lose thousands of terms in 
your index, which could cause your searches to be of very low quality.

BEWARE.
{quote}
 
There are 2 issues that need to be resolved to address this:

1. The documentation for {{PerFieldAnalyzerWrapper}} should be updated to 
inform users that if they need to use multiple dictionary keys, they should use 
{{TreeDictionary<K, V>}}.
2. {{TreeDictionary<K, V>}} does not currently implement 
{{System.Collections.Generic.IDictionary<TKey, TValue>}}, as it was brought 
over from C5 as-is.

Another thing of note is that C5 has added support for .NET Standard 1.0 since 
this was brought over.

However, there still seem to be a few problems that make the C5 types 
incompatible with Lucene.Net, most notably the lack of support for 
{{System.Collections.Generic.IDictionary<TKey, TValue>}} in {{TreeDictionary}} 
and {{System.Collections.Generic.ISet<T>}} in {{TreeSet}} (the latter of which 
has already been patched in {{Lucene.Net.Support.TreeSet}}).

I [reported|https://github.com/sestoft/C5/issues/53] the lack of support for 
{{ISet<T>}} on 6-Nov-2016, but although the maintainers agree this should be 
done, it still hasn't been. Perhaps a PR to the C5 project is the way to get 
this done, which would allow us to finally remove these collection copies from 
Lucene.Net.Support and add a package dependency on C5.

Another option is to shop around to see if there are any other generic 
TreeSet/TreeDictionary implementations that have popped up since late 2016 that 
we can check for compatibility.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
