[ 
https://issues.apache.org/jira/browse/SOLR-11391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16192388#comment-16192388
 ] 

Hoss Man commented on SOLR-11391:
---------------------------------


Comments on latest patch (broader thoughts below)...

* JoinQParserPlugin.java
** chooseJoinMethod
*** javadocs should be valid html, docs need to come before @param, etc...
*** IIUC, you've deleted the only code path that will give the user an error if 
the field points==true but docValues==false
**** validateJoinMethod will fail if the user explicitly asks for ENUM, but 
here in this method you'll attempt to use ENUM if the user sends 
{{method=null|SMART}}
*** treating "!= -1" as magic is dangerous, what if some code accidently passes 
-2? ... "< 0" is safer
**** either that or {{assert -1 <= domainSize}}
*** if we just want the LeafReaderContext then why ask for 
{{getSlowAtomicReader()}} ? why not just use {{searcher.getLeafContexts()}} ?
*** {{sumRatio = + segmentCardinality / segmentDocs;}} ... 100% certain that 
should be {{+=}}
*** the existence of the {{+=}} typo plus the "pick ENUM for poins w/o 
docValues w/o error" mentioned above tells me this method should most certainly 
have some very targeted unit tests
**** there should most certianly be a unit test of this specific method that 
asserts it gets ENUM back for some fields and DV back for others based on the 
individual segments
**** ideally, instead of needing to actaully build up tones of large indexes 
for this, this helper method should be refactored to take in some arrays of the 
data -- ie: {{chooseJoinMethod(SolrIndexSearcher, SchemaField, int)}} should 
delegate to some {{chooseJoinMethodByDvStats(long[] numDocsPerLeaf, long[] 
cardinalityPerLeaf)}} that we have direct unit tests for
** JoinQuery
*** constructor still needs jdocs
**** in particular needs to point out that if domainFilters != null it will 
have elements added to it when query is exececuted
**** (either that, or refactor the defensive copy down from createJoinQuery)
*** {{toString}} should definitely use the {{method}} in it's output
*** {{hashCode}} and {{equals}} currently don't consider {{method}} or 
{{domainFilters}} in processing
**** should diff {{method}} values make 2 {{JoinQuery}} objects not-equals?  
what about {{SMART}} values? ... not sure.
**** different {{domainFilters}} should *definitely* make 2 {{JoinQuery}} 
objects not-equals, othewise we're going to get weird caching behavior -- but 
this raises interesting questions about what happens when the {{JoinMethod}} 
type doesn't even support/use the {{domainFilters}} ... then should 2 
{{JoinQuery}} objects be considered equal?
**** no matter what, these questions make it clear we should add some more 
robust equal/not-equal checks for the join parser (and using createJoinQuery 
with diff domainFilters)
*** we should probably call {{method = chooseJoinMethod}} before 
{{validateJoinMethod(...)}} to help ensure we never intorduce bugs that cause 
our hueristic to pick broken options for the specified field
**** ideally this code path would/should even catch/retrow the SolrException 
and wrap it in a "code is broken, please file a bug" if the original method was 
SMART
**** either that, or just refactor validateJoinMethod and chooseJoinMethod into 
a single method that: 1) if SMART, attempts to narrow down by field props or 
error if nothing legal; 2) if still options, pick based on field stats.
**** either way, doing this choosing & validating should be done as early as 
possible -- why not put it in the {{JoinQueryWeight}} constructor instead of in 
the {{JoinQueryWeight.getDocSet()}} method ?
***** then we could also set the "actual" {{method}} in the {{dbg}} info 
created by {{JoinQueryWeight.scorer()}}
**** actually -- what would probably be best is if the {{JoinQuery}} 
constructor rejected {{null}} or {{SMART}} as input values, and required the 
caller to call thes new merged "choose+validate helper" (both the 
{{JoinQParser.parse() method and the facet code using {{createJoinQuery()}} are 
garunteed to have access the the neccessary SolrIndexSearcher) to resolve the 
user specified "method" into an "executable" method.  That way the 
{{JoinQuery}} constructor could explicitly set {{this.domainFilters=null}} 
unless the {{method}} supported them (which would simplify the questions about 
how {{hashCode}} and {{equals()}} should work.
** validateJoinMethod
*** once points is checked, why is there redundent nested if check for 
{{isPointField()}} ?
** executeJoinQuery
*** jdocs still need to mention this method modifies domainFilters
**** or better still: preserve the original {{domainFilter}} passed by the 
client all the way into the {{JoinQuery}} and assign to {{this.domainFilters}} 
(which should be final), but only make a copy here inside {{executeJoinQuery}} 
(to add the {{uncachedJoinQ}} to) and then let the copy by GCed after the 
{{toSearcher.getDocSet(...);}} call ... which would also help simplify the 
issues/risks/confusion as far as the {{hashCode}} and {{equals}} methods are 
concerned with {{domainFilter}}
*** why the redunent {{DocSet result = ...; return result;}} (in two places) ? 
... why not just "return ...}}
** getDocSetEnumerate
*** if this method *only* works with {{JoinMethod.ENUM}} there should be an 
assert to that effect
*** if this method ignores {{domainFilters}} then there should be an assert 
that it's {{null}}
*** (both of these will help protect us from developer mistakes down the road)

* TestCloudJSONFacetJoinDomain.java
** JoinDomain
*** jdocs need updated
*** constructor should probably assert {{null != from || null == method}} ... 
it will only confuse people (and future devs if there is a "method" specified 
even when there is no join and we're just filtering.
** buildRandomDomain()
*** we should randomize both {{null}} and {{"smart"}} values to test all the 
code paths (as things stand right now, we're never exercising the "hueristic" 
logic in these randomized tests
*** if {{from}} is {{null}}, method should _always_ be be {{null}} ... 
otherwise it could easily confuse people 9see above suggestion about adding 
assertion to the JoinDomain constructor)
**** probably best to start with {{null}} as the default, if from is non-null 
randomly set to {{"smart"}} sometimes, and/or randomly set to {{"dv"}} or 
{{"enum"}} if and only if we can tell from the field info that one of those 
options will be supported
** applyDomainToQuery
*** I notice that you did _not_ modify this method to use the new {{method}} 
option on the {{{!join\}}} query when verifying the facet results ... i'm not 
sure if that was intentional, but i think it's a good idea to leave the code 
the way it is in the current patch
*** this way, even as we randomize the {{method}} in the facet join {{domain}}, 
we verify the results using whatever the _default_ (implicit) algorithm is for 
the join query
*** i do think however, that "this method" (applyDomainToQuery)  should be 
updated with an explicit comment pointing out that we are explicitly ignoring 
{{this.method}} for this reason


* TestJoin.java
** runWithMethod + testJoin
*** in the event of failure, our test coverage will be a lot better if you just 
put each call to {{runWithMethod}} in it's own {{public void test*}} method
**** Example: {code}
public void testBasicJoinUsingEnum() { runWithMethod("enum"); }
{code}
**** that way if someone breaks something *only* in the DV code path, it will 
be really obvious if the enum test still passes -- but if noth tests fail, 
we'll know it's in the common code
*** likewise, there should probably be versions that calls 
{{runWithMethod("smart");}} and {{runWithMethod(null)'}}
*** better still: {{runWithMethod}} could be made a lot more robust by adding a 
{{String joinFieldSuffix}} option instead of picking the field names based on 
the method arg -- and then we could test all the permutations that are expected 
to be valid: {code}
public void testBasicJoinEnum() { runWithMethod("enum", "_s"); }
public void testBasicJoinDV() { runWithMethod("dv", "_ds"); }
public void testBasicJoinEnumDVField() { runWithMethod("enum", "_ds"); }
public void testBasicJoinHueristicDVField() { runWithMethod(null, "_ds"); }
public void testBasicJoinHueristicNoDVField() { runWithMethod(null, "_s"); }
{code}
** runWithMethod + error handling
*** docs 14 & 15 + the error handling checks for a missmatch between the 
{{method}} type and the field options don't seem like they belong in this 
method at all
**** these types of assertions should be refactored into their method
**** assertions of error messages should either use {{assertQEx(...)}} or 
{{Exception e = expectThrows(...}} (followed by assertions on the {{e}} to 
check it's message, type, etc...)
***** in particular it makes no sense to use {{assetJQ}} with an "expected" 
pattern match on the response ( numFound=3 + list of sorted docs) if the 
_actual_ expectation is a thrown exception.
**** we also seem to be missing a check for the case of someone requesting 
{{method=enum}} on a (non-points) field w/ {{indexed=false docValues=true}} ?
** testRandomJoin
*** we should just remove the dead code (ie: {{/* disable matching incompatible 
fields ...}}
*** {{System.getProperty("solr.tests.numeric.dv")}} ... replace with 
{{Boolean.getBoolean(NUMERIC_POINTS_SYSPROP)}}
**** in general this special case is weird & confusing an desrves a comment 
(why is groupNumber==1 special?)
*** this method should be also be randomly specifying {{method="smart"}} and 
"null|no method" (where no method localparam is specified at all)

* GraphTermsCollector.java
** the class javadocs should give some information about the specifics of how 
it extends the abstract base class -- not just link to it.

* FacetRequest.java
** createDomainQuery
*** I see once this method is called why it takes in a JoinMethod instead of 
using {{this.method}} but at first glance it's entirely unclear -- some 
docs/comments explaing will save other devs a lot of confusion down the road

* FacetProcessor.java
** handleJoinField
*** previous comment still applies....{quote}
since semantics have changed such that method assumes it won't be called if 
{{null==freq.domain.joinField}} method should start with an assert to that 
effect (and have jdocs noting it)
{quote}
*** the previously mentioned suggestion to merge {{chooseJoinMethod}} with 
{{validateJoinMethod}} would help simplify this method a fair bit
*** The more this patch evolves, and the more i think about it, the less 
comfortable i am with the brittle assumptions the FacetProcessor code has to 
make about if/when the qlist/domainFilters will be used.
**** right now it's {{if (method == JoinMethod.DV)}} but what if other methods 
are added later?
**** at a minimum we could add {{boolean getFiltersSupported()}} property to 
{{JoinMethod}} ... but what would {{SMART}} return?  (maybe throw an 
IllegalStateException?)
**** more broad picture thoughts on this below....
** handleBlockJoin
*** previous comment still applies...{quote}
since semantics have changed such that method assumes it won't be called if 
{{!(freq.domain.toChildren || freq.domain.toParent)}} method should start with 
an assert to that effect (and have jdocs noting it)
{quote}
*** there's a subtle change here (maybe it was in the last patch and i missed 
it) in the {{freq.domain.toChildren}} that i'm not clear on...
**** previuosly, {{appliedFilters = true;}} was only set if we had a filter ... 
now it's set (for the toChildren case) regardless of wether there is a filter
**** is this intentional? if so, then can't we just: refactor away both the 
{{boolean appliedFilters}} and {{DocSet result}} variables; make both code 
paths assign directly to {{fcontext.base}} ; make both bode paths {{return 
true}} or {{return false}} (respectively)

* other-parsers.adoc
** it doesn't really make sense to mention that {{score}} doesn't support 
{{method}} before we've even explained what {{method}} is
*** i would suggest that instead, it would make more sense to include an info 
(or warning) box in the "Join Parser Execution" section pointing out that 
specifying a {{method}} is incompatible with specifying a {{score}} mode
** the list of supported methods should include "smart" and it should note that 
this is the default method if non is specified

----


As mentioned above, I have some concerns about how FacetProcessor has to make 
some assumptions about if/when the filters are being handled by the JoinQuery 
and when the facet code has to handle them itself.  This also relates to the 
questions about wether or not 2 JoinQuery objects should be considered "equals" 
from a caching perspective if they have different {{domainFilters}} but 
{{method==enum}} so those filters aren't used anyway.

I really think that at a broader level we need to take 1 of 2 paths here...

* A
** add a property to the JoinMethod indicating if it supports domainFilters,
*** on SMART this property should throw illegal state exception
** force JoinQuery "creators" to pass in a "non-SMART" (non-null) JoinMethod 
when building the JoinQueries so the constuctor can definitively throw an 
exception if {{null!=domainFilters}} but the {{method}} doesn't support them.
*** give callers a public method to "choose" a non-SMART JoinMethod given the 
user's original input, the schema, and the index searcher -- so they can know 
if/when it's legal to pass in domainFilters, or when they must pass null
*** this way the FacetProcessor code doesn't have to make assumptions -- it 
knows exactly when it did/didn't pass {{qlist}} vs {{null}} to 
{{createDomainQuery(...)}}

* B
** make JoinQuery public & abstract
** refactor 2 diff concrete subclasses: one for TermEnum and one for DV
** only the DV based class should have a constructor that takes in domainFilters
** neither concrete subclass should know anything about JoinMethod -- that 
should be handled by the helper code in JoinQParserPlugin
** change the return type of {{JoinQParserPlugin.chooseJoinMethod}} to be 
{{JoinQuery}}
** add a {{List JoinQuery.getNestedFilters()}} method to JoinQuery that 
normally returns null, but the DV subclass returns it's domainFilters
** this way the FacetProcessor code doesn't have to make assumptions -- it can 
ask the JoinQuery it gets back from {{createDomainQuery(...)}} wether it has 
any nested filters or not 


...I don't know if i have a strong preferece between A or B.  What do other 
folkss think?


> JoinQParser for non point fields should use the GraphTermsCollector 
> --------------------------------------------------------------------
>
>                 Key: SOLR-11391
>                 URL: https://issues.apache.org/jira/browse/SOLR-11391
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Varun Thacker
>         Attachments: SOLR-11391.patch, SOLR-11391.patch, SOLR-11391.patch, 
> SOLR-11391.patch, SOLR-11391.patch, SOLR-11391.patch, SOLR-11391.patch, 
> SOLR-11391.patch, SOLR-11391.patch
>
>
> The Join Query Parser uses the GraphPointsCollector for point fields. 
> For non point fields if we use the GraphTermsCollector instead of the current 
> algorithm I am seeing quite a bit of performance gains.
> I'm going to attach a quick patch which I cooked up , making sure TestJoin 
> and TestCloudJSONFacetJoinDomain passed. 
> More tests, benchmarking and code cleanup to follow



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to