[ 
https://issues.apache.org/jira/browse/SOLR-11391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16183317#comment-16183317
 ] 

Hoss Man commented on SOLR-11391:
---------------------------------


My review below on the latest patch.  In general I'm very confused about 
why/how the "join method" (used in the facet domain) seems to be choosen based 
on the "facet method" (for the field faceting) ... i don't really understand 
why those 2 things should be connected in any way, and none of the comments so 
far in this issue seems to suggest that should (or needs to) be any connection 
between them -- other then yonik's suggestion that they should use similar 
vocabulary (enum, dv, etc...)

I suspect this is a logic mistake in the design, and that the intent was to add 
a "method" option on the "join domain" just like a "method" param was added to 
the "join parser" ?

In any case, here's all of my feedback as is...

* JoinQParserPlugin
** boolean useGraphCollector
*** seems like an enum would be better given local param isn't just true/false
*** especially if we anticipate that there may be more "methods" down the road
*** if nothing else: it would make the code easier to read & understand 
(especially when the helper method gets used by the facet code, see comments 
below)
** AFAICT: "method" is now the first param that JoinQParserPlugin supports that 
ScoreJoinQParserPlugin does *not* support
*** which means that the parse() and parseJoin() methods need to be refactored 
to give a clear error if someone tries to combine the "score" param with a 
"method" param. (before this parser delegates to ScoreJoinQParserPlugin

* JoinQuery
** I'm confused about the {{isPointField()}} points code path...
*** "if points, then use GraphPointsCollector" ... even if 
{{false==useGraphCollector}} ?
*** if that's inentional, then shouldn't param validation (in 
JoinQParserPlugin) reject any explicit {{method=enum}} for points fields?
** executeJoinQuery
*** needs jdocs
*** modifies the {{List<Query> domainFilters}} passed by caller
**** will be problematic if caller is reusing the same {{domainFilters}} for 
multiple JoinQueries
**** jdocs (all the way up) should either be very clear about this, or 
JoinQuery should make a defensive copy
**** off the top of my head: i have no idea if any of the existing (caller) 
code (in FacetProcessor) may be sharing the {{qlist}} across multiple 
facets/code-paths
*** {{//TODO: is it okay to cache this? I guess we were previously caching as 
well?}}
**** AFAICT from the existing code: No, it was not cached (code uses 
{{getDocSetNC}})
**** in current patch, copy/past comment says "don't cache the resulting 
docSet" but then calls {{getDocSet()}} ... which does cache. So either comment 
needs fixed, or method needs changed.

* SolrIndexSearcher.getCardinalityForDVField
** needs jdocs
** is this really the "cardinality" ???
*** it looks like it's actually the sum of the cardinalities across all 
segments (not the same when terms exist in multiple segments)
** perhaps (in general) a better approach then looking at an absolute 
cardinality/docs threshold of 1000 / 100,000 would be to look at the aproximate 
ratio of field cardinality to docCount?
*** that could be computed per segment, and then aproximated over all segments
**** ie: {{( sum of (seg_cardinality/seg_doccount) for all segments ) / 
num_segments}}
*** (obviously this wouldn't make sense as a generic SolrIndexSearcher helper 
method, just something that occured to me in the context of reviewing this 
method)

* FacetProcessor.handleJoinField
** jdocs need updated to explain return value
** since semantics have changed such that method assumes it won't be called if 
{{null==freq.domain.joinField}} method should start with an assert to that 
effect (and have jdocs noting it)

* FacetProcessor.chooseCollector
** needs jdocs
** shouldn't this name be something more specific to the fact that it's 
choosing the join domain collector?  AFAICT it has nothing to do with any other 
type of facet processing collector
*** in general, all this code seems like it belongs in the {{JoinField}} class?
**** Note the existing comments in {{createDomainQuery}} -- the same rationale 
seems applicable here.
** in general, i'm confused as to the _intent_ of this code ... why are we 
choosing a ("method" / "boolean useGraphCollector") for the JoinQuery that will 
power the domain based on the "method" used by the FacetField itself? ... those 
seem completely orthoginal
*** is this all a mistake?  Was the intent to add an optional "method" option 
to the "join" domain specified by the request?
*** if this code is intentional (ie: if we really are intentionally choosing 
the join collector based on the term FacetMethod) then:
*** this should be extensively explained in the comments
*** the code/comments should more completely handle the various FacetMethod 
enum options available
**** Example: code currently returns false immediately for FacetMethod.ENUM, 
but not for FacetMethod.STREAM which is (IIUC) functionally equivilent???
** again: using an enum (in JoinQParser) to indicate the type of join collector 
(might) make this method a lot easier to read and a lot less brittle
*** (depending on what the goal actually is ... if nothing else it would help 
clarify the distinction between when the FacetMethod is relevant vs when the 
"JoinMethod" is relevant )
** what are the cases where this {{catch (IOException e)}} block may trigger?
*** Why is that a simple "return false" instead of rethrowing???
** why is the hueristic logic (for choosing hte collector based on cardinality 
and/or numDocs) in this method, instead of in a JoinQParserPlugin helper method?
*** wouldn't the same hueristic logic be useful for when a user does a 
{{"\{!join\}"}} query w/o a "method" param?

* FacetProcessor.handleBlockJoin
** since semantics have changed such that method assumes it won't be called if 
{{!(freq.domain.toChildren || freq.domain.toParent)}} method should start with 
an assert to that effect (and have jdocs noting it)
** {{acceptDocs}} initialization could be heavily simplified with terinary op...
*** {{DocSet acceptDocs = (null == qlist) ? fcontext.searcher.getLiveDocs() : 
fcontext.searcher.getDocSet(qlist);}}

* GraphEdgeCollector & GraphTermsCollector
** now that these classes are public, they should definitely get some javadocs

* schema12.xml & TestJoin
** rather then adding a new special sys prop just for this test, it seems a lot 
more straight forward to add some new field/dynamicField declarations that 
explicitly set docValues=true/false
** that would also allow the test methods to be refactored, so that a single 
test class run could test both the enum & dv methods against diff fields as 
needed (ie: paramaterize the meat of the tests)
*** randomly only testing one each time is far from ideal.
*** the pattern used for points/tries isn't ment to be an example to follow 
when dealing with a "choice" in a test -- it was a path of last resort for 
retrofiting hundreds of pre-existing tests knowing that one of those "choices" 
was going to be immediately deprecated and phased out.
** TestJoin should also have some "test the error code/msg" checks that asking 
for a particular method fails with an expected error message for all the 
potential error paths.
*** having non-randomized types will make this much easier to add.

* TestCloudJSONFacetJoinDomain
** see comments above regarding my confusion about conflating the TermFacet 
"method" with the "method" used by the "join" domain
*** more on this as it pertains to randomized testing below...
** buildRandomFacets
*** in general i don't really understand these changes...
**** it seems like you are randomizing the *facet* method based entirely on the 
properties of the random domain .... but it should be possible to randomize the 
method param entirely independently (from specifics of the domain) and across 
all possible values of the FacetMethod enum
**** the only possible restrictions the "TermFacet.method" randomization logic 
(in this test) should place on what method it chooses should be based on the 
"TermFacet.field" (not the "JoinDomain.from" field) if and only if we know that 
it's an "error" to request a certain type of FacetMethod for a certain 
type/properties of the "facet field"
**** if there are in fact some join (domain) code paths that may be invalid 
with the "graphCollector?" value chosen automatically by some "facet method" 
options (depend on the various field types/properties) then that should be 
considered a bug in the "chooseCollector()" code and fixed -- it should not 
cause the client to get an error (that would needs to be worked around in this 
test by only picking certain "TermFacet.method" values based on if/when a 
JoinDomain uses certain types of fields)
*** regardless of intent: in at least some situations, the "TermFacet.method" 
variable should be randomized to "null" to test the default hueristics.
** JoinDomain
*** if we are in fact adding a {{"join: \{ method: foo, ... \}"}} type 
"JoinDomain.method" option to the code, then it should be accounted for in the 
test model here, and randomized.
*** This should be completley independently from any FacetField.method 
randomization
** testBespoke
*** please add some simple sanity checks of the new TermFacet "method" option
*** and, perhaps, some testing of a distinct JoinDomain "method" (if that is 
being added)


> JoinQParser for non point fields should use the GraphTermsCollector 
> --------------------------------------------------------------------
>
>                 Key: SOLR-11391
>                 URL: https://issues.apache.org/jira/browse/SOLR-11391
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Varun Thacker
>         Attachments: SOLR-11391.patch, SOLR-11391.patch, SOLR-11391.patch, 
> SOLR-11391.patch, SOLR-11391.patch, SOLR-11391.patch
>
>
> The Join Query Parser uses the GraphPointsCollector for point fields. 
> For non point fields if we use the GraphTermsCollector instead of the current 
> algorithm I am seeing quite a bit of performance gains.
> I'm going to attach a quick patch which I cooked up , making sure TestJoin 
> and TestCloudJSONFacetJoinDomain passed. 
> More tests, benchmarking and code cleanup to follow



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to