[
https://issues.apache.org/jira/browse/SOLR-11391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16192388#comment-16192388
]
Hoss Man commented on SOLR-11391:
---------------------------------
Comments on latest patch (broader thoughts below)...
* JoinQParserPlugin.java
** chooseJoinMethod
*** javadocs should be valid html, docs need to come before @param, etc...
*** IIUC, you've deleted the only code path that will give the user an error if
the field points==true but docValues==false
**** validateJoinMethod will fail if the user explicitly asks for ENUM, but
here in this method you'll attempt to use ENUM if the user sends
{{method=null|SMART}}
*** treating "!= -1" as magic is dangerous, what if some code accidently passes
-2? ... "< 0" is safer
**** either that or {{assert -1 <= domainSize}}
*** if we just want the LeafReaderContext then why ask for
{{getSlowAtomicReader()}} ? why not just use {{searcher.getLeafContexts()}} ?
*** {{sumRatio = + segmentCardinality / segmentDocs;}} ... 100% certain that
should be {{+=}}
*** the existence of the {{+=}} typo plus the "pick ENUM for poins w/o
docValues w/o error" mentioned above tells me this method should most certainly
have some very targeted unit tests
**** there should most certianly be a unit test of this specific method that
asserts it gets ENUM back for some fields and DV back for others based on the
individual segments
**** ideally, instead of needing to actaully build up tones of large indexes
for this, this helper method should be refactored to take in some arrays of the
data -- ie: {{chooseJoinMethod(SolrIndexSearcher, SchemaField, int)}} should
delegate to some {{chooseJoinMethodByDvStats(long[] numDocsPerLeaf, long[]
cardinalityPerLeaf)}} that we have direct unit tests for
** JoinQuery
*** constructor still needs jdocs
**** in particular needs to point out that if domainFilters != null it will
have elements added to it when query is exececuted
**** (either that, or refactor the defensive copy down from createJoinQuery)
*** {{toString}} should definitely use the {{method}} in it's output
*** {{hashCode}} and {{equals}} currently don't consider {{method}} or
{{domainFilters}} in processing
**** should diff {{method}} values make 2 {{JoinQuery}} objects not-equals?
what about {{SMART}} values? ... not sure.
**** different {{domainFilters}} should *definitely* make 2 {{JoinQuery}}
objects not-equals, othewise we're going to get weird caching behavior -- but
this raises interesting questions about what happens when the {{JoinMethod}}
type doesn't even support/use the {{domainFilters}} ... then should 2
{{JoinQuery}} objects be considered equal?
**** no matter what, these questions make it clear we should add some more
robust equal/not-equal checks for the join parser (and using createJoinQuery
with diff domainFilters)
*** we should probably call {{method = chooseJoinMethod}} before
{{validateJoinMethod(...)}} to help ensure we never intorduce bugs that cause
our hueristic to pick broken options for the specified field
**** ideally this code path would/should even catch/retrow the SolrException
and wrap it in a "code is broken, please file a bug" if the original method was
SMART
**** either that, or just refactor validateJoinMethod and chooseJoinMethod into
a single method that: 1) if SMART, attempts to narrow down by field props or
error if nothing legal; 2) if still options, pick based on field stats.
**** either way, doing this choosing & validating should be done as early as
possible -- why not put it in the {{JoinQueryWeight}} constructor instead of in
the {{JoinQueryWeight.getDocSet()}} method ?
***** then we could also set the "actual" {{method}} in the {{dbg}} info
created by {{JoinQueryWeight.scorer()}}
**** actually -- what would probably be best is if the {{JoinQuery}}
constructor rejected {{null}} or {{SMART}} as input values, and required the
caller to call thes new merged "choose+validate helper" (both the
{{JoinQParser.parse() method and the facet code using {{createJoinQuery()}} are
garunteed to have access the the neccessary SolrIndexSearcher) to resolve the
user specified "method" into an "executable" method. That way the
{{JoinQuery}} constructor could explicitly set {{this.domainFilters=null}}
unless the {{method}} supported them (which would simplify the questions about
how {{hashCode}} and {{equals()}} should work.
** validateJoinMethod
*** once points is checked, why is there redundent nested if check for
{{isPointField()}} ?
** executeJoinQuery
*** jdocs still need to mention this method modifies domainFilters
**** or better still: preserve the original {{domainFilter}} passed by the
client all the way into the {{JoinQuery}} and assign to {{this.domainFilters}}
(which should be final), but only make a copy here inside {{executeJoinQuery}}
(to add the {{uncachedJoinQ}} to) and then let the copy by GCed after the
{{toSearcher.getDocSet(...);}} call ... which would also help simplify the
issues/risks/confusion as far as the {{hashCode}} and {{equals}} methods are
concerned with {{domainFilter}}
*** why the redunent {{DocSet result = ...; return result;}} (in two places) ?
... why not just "return ...}}
** getDocSetEnumerate
*** if this method *only* works with {{JoinMethod.ENUM}} there should be an
assert to that effect
*** if this method ignores {{domainFilters}} then there should be an assert
that it's {{null}}
*** (both of these will help protect us from developer mistakes down the road)
* TestCloudJSONFacetJoinDomain.java
** JoinDomain
*** jdocs need updated
*** constructor should probably assert {{null != from || null == method}} ...
it will only confuse people (and future devs if there is a "method" specified
even when there is no join and we're just filtering.
** buildRandomDomain()
*** we should randomize both {{null}} and {{"smart"}} values to test all the
code paths (as things stand right now, we're never exercising the "hueristic"
logic in these randomized tests
*** if {{from}} is {{null}}, method should _always_ be be {{null}} ...
otherwise it could easily confuse people 9see above suggestion about adding
assertion to the JoinDomain constructor)
**** probably best to start with {{null}} as the default, if from is non-null
randomly set to {{"smart"}} sometimes, and/or randomly set to {{"dv"}} or
{{"enum"}} if and only if we can tell from the field info that one of those
options will be supported
** applyDomainToQuery
*** I notice that you did _not_ modify this method to use the new {{method}}
option on the {{{!join\}}} query when verifying the facet results ... i'm not
sure if that was intentional, but i think it's a good idea to leave the code
the way it is in the current patch
*** this way, even as we randomize the {{method}} in the facet join {{domain}},
we verify the results using whatever the _default_ (implicit) algorithm is for
the join query
*** i do think however, that "this method" (applyDomainToQuery) should be
updated with an explicit comment pointing out that we are explicitly ignoring
{{this.method}} for this reason
* TestJoin.java
** runWithMethod + testJoin
*** in the event of failure, our test coverage will be a lot better if you just
put each call to {{runWithMethod}} in it's own {{public void test*}} method
**** Example: {code}
public void testBasicJoinUsingEnum() { runWithMethod("enum"); }
{code}
**** that way if someone breaks something *only* in the DV code path, it will
be really obvious if the enum test still passes -- but if noth tests fail,
we'll know it's in the common code
*** likewise, there should probably be versions that calls
{{runWithMethod("smart");}} and {{runWithMethod(null)'}}
*** better still: {{runWithMethod}} could be made a lot more robust by adding a
{{String joinFieldSuffix}} option instead of picking the field names based on
the method arg -- and then we could test all the permutations that are expected
to be valid: {code}
public void testBasicJoinEnum() { runWithMethod("enum", "_s"); }
public void testBasicJoinDV() { runWithMethod("dv", "_ds"); }
public void testBasicJoinEnumDVField() { runWithMethod("enum", "_ds"); }
public void testBasicJoinHueristicDVField() { runWithMethod(null, "_ds"); }
public void testBasicJoinHueristicNoDVField() { runWithMethod(null, "_s"); }
{code}
** runWithMethod + error handling
*** docs 14 & 15 + the error handling checks for a missmatch between the
{{method}} type and the field options don't seem like they belong in this
method at all
**** these types of assertions should be refactored into their method
**** assertions of error messages should either use {{assertQEx(...)}} or
{{Exception e = expectThrows(...}} (followed by assertions on the {{e}} to
check it's message, type, etc...)
***** in particular it makes no sense to use {{assetJQ}} with an "expected"
pattern match on the response ( numFound=3 + list of sorted docs) if the
_actual_ expectation is a thrown exception.
**** we also seem to be missing a check for the case of someone requesting
{{method=enum}} on a (non-points) field w/ {{indexed=false docValues=true}} ?
** testRandomJoin
*** we should just remove the dead code (ie: {{/* disable matching incompatible
fields ...}}
*** {{System.getProperty("solr.tests.numeric.dv")}} ... replace with
{{Boolean.getBoolean(NUMERIC_POINTS_SYSPROP)}}
**** in general this special case is weird & confusing an desrves a comment
(why is groupNumber==1 special?)
*** this method should be also be randomly specifying {{method="smart"}} and
"null|no method" (where no method localparam is specified at all)
* GraphTermsCollector.java
** the class javadocs should give some information about the specifics of how
it extends the abstract base class -- not just link to it.
* FacetRequest.java
** createDomainQuery
*** I see once this method is called why it takes in a JoinMethod instead of
using {{this.method}} but at first glance it's entirely unclear -- some
docs/comments explaing will save other devs a lot of confusion down the road
* FacetProcessor.java
** handleJoinField
*** previous comment still applies....{quote}
since semantics have changed such that method assumes it won't be called if
{{null==freq.domain.joinField}} method should start with an assert to that
effect (and have jdocs noting it)
{quote}
*** the previously mentioned suggestion to merge {{chooseJoinMethod}} with
{{validateJoinMethod}} would help simplify this method a fair bit
*** The more this patch evolves, and the more i think about it, the less
comfortable i am with the brittle assumptions the FacetProcessor code has to
make about if/when the qlist/domainFilters will be used.
**** right now it's {{if (method == JoinMethod.DV)}} but what if other methods
are added later?
**** at a minimum we could add {{boolean getFiltersSupported()}} property to
{{JoinMethod}} ... but what would {{SMART}} return? (maybe throw an
IllegalStateException?)
**** more broad picture thoughts on this below....
** handleBlockJoin
*** previous comment still applies...{quote}
since semantics have changed such that method assumes it won't be called if
{{!(freq.domain.toChildren || freq.domain.toParent)}} method should start with
an assert to that effect (and have jdocs noting it)
{quote}
*** there's a subtle change here (maybe it was in the last patch and i missed
it) in the {{freq.domain.toChildren}} that i'm not clear on...
**** previuosly, {{appliedFilters = true;}} was only set if we had a filter ...
now it's set (for the toChildren case) regardless of wether there is a filter
**** is this intentional? if so, then can't we just: refactor away both the
{{boolean appliedFilters}} and {{DocSet result}} variables; make both code
paths assign directly to {{fcontext.base}} ; make both bode paths {{return
true}} or {{return false}} (respectively)
* other-parsers.adoc
** it doesn't really make sense to mention that {{score}} doesn't support
{{method}} before we've even explained what {{method}} is
*** i would suggest that instead, it would make more sense to include an info
(or warning) box in the "Join Parser Execution" section pointing out that
specifying a {{method}} is incompatible with specifying a {{score}} mode
** the list of supported methods should include "smart" and it should note that
this is the default method if non is specified
----
As mentioned above, I have some concerns about how FacetProcessor has to make
some assumptions about if/when the filters are being handled by the JoinQuery
and when the facet code has to handle them itself. This also relates to the
questions about wether or not 2 JoinQuery objects should be considered "equals"
from a caching perspective if they have different {{domainFilters}} but
{{method==enum}} so those filters aren't used anyway.
I really think that at a broader level we need to take 1 of 2 paths here...
* A
** add a property to the JoinMethod indicating if it supports domainFilters,
*** on SMART this property should throw illegal state exception
** force JoinQuery "creators" to pass in a "non-SMART" (non-null) JoinMethod
when building the JoinQueries so the constuctor can definitively throw an
exception if {{null!=domainFilters}} but the {{method}} doesn't support them.
*** give callers a public method to "choose" a non-SMART JoinMethod given the
user's original input, the schema, and the index searcher -- so they can know
if/when it's legal to pass in domainFilters, or when they must pass null
*** this way the FacetProcessor code doesn't have to make assumptions -- it
knows exactly when it did/didn't pass {{qlist}} vs {{null}} to
{{createDomainQuery(...)}}
* B
** make JoinQuery public & abstract
** refactor 2 diff concrete subclasses: one for TermEnum and one for DV
** only the DV based class should have a constructor that takes in domainFilters
** neither concrete subclass should know anything about JoinMethod -- that
should be handled by the helper code in JoinQParserPlugin
** change the return type of {{JoinQParserPlugin.chooseJoinMethod}} to be
{{JoinQuery}}
** add a {{List JoinQuery.getNestedFilters()}} method to JoinQuery that
normally returns null, but the DV subclass returns it's domainFilters
** this way the FacetProcessor code doesn't have to make assumptions -- it can
ask the JoinQuery it gets back from {{createDomainQuery(...)}} wether it has
any nested filters or not
...I don't know if i have a strong preferece between A or B. What do other
folkss think?
> JoinQParser for non point fields should use the GraphTermsCollector
> --------------------------------------------------------------------
>
> Key: SOLR-11391
> URL: https://issues.apache.org/jira/browse/SOLR-11391
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Reporter: Varun Thacker
> Attachments: SOLR-11391.patch, SOLR-11391.patch, SOLR-11391.patch,
> SOLR-11391.patch, SOLR-11391.patch, SOLR-11391.patch, SOLR-11391.patch,
> SOLR-11391.patch, SOLR-11391.patch
>
>
> The Join Query Parser uses the GraphPointsCollector for point fields.
> For non point fields if we use the GraphTermsCollector instead of the current
> algorithm I am seeing quite a bit of performance gains.
> I'm going to attach a quick patch which I cooked up , making sure TestJoin
> and TestCloudJSONFacetJoinDomain passed.
> More tests, benchmarking and code cleanup to follow
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]