[ 
https://issues.apache.org/jira/browse/SOLR-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man updated SOLR-1782:
---------------------------

    Attachment: SOLR-1782.patch

Wojtek & David...

I spent some time today trying to look over the latest patch ( 
SOLR-1782.2013-01-07.patch ) and had some trouble wrapping my head arround it.

In particular i think the crux of my confusion stems from this comment from 
Wojtek...

bq. New patch uses both UninvertedField and FieldCache.DocTermsIndex for 
multi-valued facet fields

...which doens't really make sense to me.  

If a field is multivalued, then (unless i'm missunderstanding something) a 
DocTermsIndex won't make sense for that field at all.  At best it's wasted RAM 
to build up the DocTermsIndex when UnInvertedField is going to be used instead 
-- but in this patch, as part of the "if (null == uif)" sprinkled throughout 
FieldFacetStats, it looks like term ordinals from the 
UnInvertedField.getTermNumbers method are then being used to look Term values 
from the DocTermsIndex ... i don't understand that at all.

Hopefully this is just a minor bug in the patch, and the call to "si.lookup" in 
the UIF block(s) of FieldFacetStats.facet & FieldFacetStats.facetTermNum are 
ment to be calls to some similar method in UnivertedField, but it's not 
entirely clear.

If that is the case, then it suggests to me that a better way to organize this 
code so that it's more clear what's going on (and to save all that wasted RAM 
used by the DocTermsIndex when going down the UIF code path!) would be to 
refactor an abstract base class out of FieldFacetStats defining the API 
contract and then have two concrete impls: one using DocTermsIndex provided in 
teh constructor, and one using an UnInvertedField provided in it's constructor.

I'm also a little perplexed by the "UnInvertedField.getStats" method which _is_ 
modified in this patch, but seemingly only as an aside so it will still compile 
because of the new param added to the FieldFacetStats constructor.  At a 
glance, I'm not clear on wether this method is even used anywhere, but that 
makes it seem all the more suspicious: it appears that a lot of the logic in 
getTermNumbers was cut/paste from this method -- should these methods be 
refactored to eliminate that duplication? can UIF.getStats just be deprecated & 
removed?  Would be make more sense instead of adding some of these new methods 
to just "fix" UIF.getStats so that it recognizes when another UIF instance is 
needed for the facet field(s)?

Lastly: I noticed that while the patch did include tests, it didn't include the 
original tests I wrote when this issue was filled (see SOLR-1782.test.patch).  
When I attempted to merge those tests in, I got a failure in 
testMicroStatsWithMultiValuedFacetField (even after i fixed the obvious case 
sensitivity problem in the original test) -- hopefully this is just a 
demonstration of the problem i mentioned above about mixing ords from the 
DocTermsIndex and the UnInvertedField, but it may be a symptom of a diff 
problem.

unified patch will all tests attached


                
> stats.facet assumes FieldCache.StringIndex - fails horribly on multivalued 
> fields
> ---------------------------------------------------------------------------------
>
>                 Key: SOLR-1782
>                 URL: https://issues.apache.org/jira/browse/SOLR-1782
>             Project: Solr
>          Issue Type: Bug
>          Components: search
>    Affects Versions: 1.4
>         Environment: reproduced on Win2k3 using 1.5.0-dev solr ($Id: 
> CHANGES.txt 906924 2010-02-05 12:43:11Z noble $)
>            Reporter: Gerald DeConto
>            Assignee: Hoss Man
>         Attachments: index.rar, SOLR-1782.2013-01-07.patch, 
> SOLR-1782.2.patch, SOLR-1782.patch, SOLR-1782.patch, SOLR-1782.patch, 
> SOLR-1782.test.patch
>
>
> the StatsComponent assumes any field specified in the stats.facet param can 
> be faceted using FieldCache.DEFAULT.getStringIndex.  This can cause problems 
> with a variety of field types, but in the case of multivalued fields it can 
> either cause erroneous false stats when the number of distinct values is 
> small, or it can cause ArrayIndexOutOfBoundsException when the number of 
> distinct values is greater then the number of documents.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to