[
https://issues.apache.org/jira/browse/SOLR-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hoss Man updated SOLR-1782:
---------------------------
Attachment: SOLR-1782.patch
Wojtek & David...
I spent some time today trying to look over the latest patch (
SOLR-1782.2013-01-07.patch ) and had some trouble wrapping my head arround it.
In particular i think the crux of my confusion stems from this comment from
Wojtek...
bq. New patch uses both UninvertedField and FieldCache.DocTermsIndex for
multi-valued facet fields
...which doens't really make sense to me.
If a field is multivalued, then (unless i'm missunderstanding something) a
DocTermsIndex won't make sense for that field at all. At best it's wasted RAM
to build up the DocTermsIndex when UnInvertedField is going to be used instead
-- but in this patch, as part of the "if (null == uif)" sprinkled throughout
FieldFacetStats, it looks like term ordinals from the
UnInvertedField.getTermNumbers method are then being used to look Term values
from the DocTermsIndex ... i don't understand that at all.
Hopefully this is just a minor bug in the patch, and the call to "si.lookup" in
the UIF block(s) of FieldFacetStats.facet & FieldFacetStats.facetTermNum are
ment to be calls to some similar method in UnivertedField, but it's not
entirely clear.
If that is the case, then it suggests to me that a better way to organize this
code so that it's more clear what's going on (and to save all that wasted RAM
used by the DocTermsIndex when going down the UIF code path!) would be to
refactor an abstract base class out of FieldFacetStats defining the API
contract and then have two concrete impls: one using DocTermsIndex provided in
teh constructor, and one using an UnInvertedField provided in it's constructor.
I'm also a little perplexed by the "UnInvertedField.getStats" method which _is_
modified in this patch, but seemingly only as an aside so it will still compile
because of the new param added to the FieldFacetStats constructor. At a
glance, I'm not clear on wether this method is even used anywhere, but that
makes it seem all the more suspicious: it appears that a lot of the logic in
getTermNumbers was cut/paste from this method -- should these methods be
refactored to eliminate that duplication? can UIF.getStats just be deprecated &
removed? Would be make more sense instead of adding some of these new methods
to just "fix" UIF.getStats so that it recognizes when another UIF instance is
needed for the facet field(s)?
Lastly: I noticed that while the patch did include tests, it didn't include the
original tests I wrote when this issue was filled (see SOLR-1782.test.patch).
When I attempted to merge those tests in, I got a failure in
testMicroStatsWithMultiValuedFacetField (even after i fixed the obvious case
sensitivity problem in the original test) -- hopefully this is just a
demonstration of the problem i mentioned above about mixing ords from the
DocTermsIndex and the UnInvertedField, but it may be a symptom of a diff
problem.
unified patch will all tests attached
> stats.facet assumes FieldCache.StringIndex - fails horribly on multivalued
> fields
> ---------------------------------------------------------------------------------
>
> Key: SOLR-1782
> URL: https://issues.apache.org/jira/browse/SOLR-1782
> Project: Solr
> Issue Type: Bug
> Components: search
> Affects Versions: 1.4
> Environment: reproduced on Win2k3 using 1.5.0-dev solr ($Id:
> CHANGES.txt 906924 2010-02-05 12:43:11Z noble $)
> Reporter: Gerald DeConto
> Assignee: Hoss Man
> Attachments: index.rar, SOLR-1782.2013-01-07.patch,
> SOLR-1782.2.patch, SOLR-1782.patch, SOLR-1782.patch, SOLR-1782.patch,
> SOLR-1782.test.patch
>
>
> the StatsComponent assumes any field specified in the stats.facet param can
> be faceted using FieldCache.DEFAULT.getStringIndex. This can cause problems
> with a variety of field types, but in the case of multivalued fields it can
> either cause erroneous false stats when the number of distinct values is
> small, or it can cause ArrayIndexOutOfBoundsException when the number of
> distinct values is greater then the number of documents.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]