[jira] [Comment Edited] (SOLR-9027) Add GraphTermsQuery to limit traversal on high frequency nodes

Joel Bernstein (JIRA) Wed, 27 Apr 2016 05:40:01 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-9027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15260053#comment-15260053
 ]


Joel Bernstein edited comment on SOLR-9027 at 4/27/16 12:39 PM:
----------------------------------------------------------------

bq. why wrap each BytesRef in a Term when in the end you just need the 
BytesRef? Or maybe I'm mistaken.

I haven't really optimized things yet. I'll take a look at optimizing this.

bq. equals and hashcode is on id yet you initialize that to new Object(). 
Firstly; why not have equals/hashcode actually work? Secondly, if for some 
reason it should be this way, then you can do away with id and do equals on 
instance equality of the query instance – you don't need id.

It's designed to be equal only on identity so that it isn't cached. The main 
reason for this is that graph traversals are typically one-time jobs, so I 
wanted to avoid the overhead of hashCode and equals on large term lists. There 
may be a better approach to the identity equality, so I'll review your suggestion.
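
For illustration, here is a minimal sketch of the identity-equality idea without a separate id field. IdentityQuerySketch is a hypothetical stand-in class, not the code in the attached patch, and it does not extend Lucene's Query. Comparing on `this == other` and hashing with System.identityHashCode gives constant-time equals/hashCode that never walk the term list.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a query class; NOT the SOLR-9027 patch.
final class IdentityQuerySketch {
    private final List<String> terms;

    IdentityQuerySketch(List<String> terms) {
        this.terms = terms;
    }

    List<String> terms() {
        return terms;
    }

    // Identity equality: equal only when both references point to the same object.
    @Override
    public boolean equals(Object other) {
        return this == other;
    }

    // Constant-time hash that does not depend on the size of the term list.
    @Override
    public int hashCode() {
        return System.identityHashCode(this);
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("a", "b", "c");
        IdentityQuerySketch q1 = new IdentityQuerySketch(terms);
        IdentityQuerySketch q2 = new IdentityQuerySketch(terms);

        // A plain HashMap stands in for a query cache here.
        Map<IdentityQuerySketch, String> cache = new HashMap<>();
        cache.put(q1, "cached result");

        System.out.println(q1.equals(q2));         // false: same terms, different instance
        System.out.println(cache.containsKey(q2)); // false: q2 never hits q1's entry
        System.out.println(cache.containsKey(q1)); // true: only the same instance matches
    }
}
```

The behavior matches the defaults inherited from java.lang.Object; the overrides are written out only to make the intent explicit.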

bq. I think it's very suspicious that GraphTermsQuery holds List<TermContext>; 
I think the Query object should not hold state pertaining to the actual index 
as it could cause issues with caching. Maybe you could do the construction of 
this in createWeight and hold it on the Weight?

This sounds like a good idea.
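
A rough sketch of that separation, using made-up stand-in types (SketchIndex, SketchQuery, SketchWeight) rather than the real Lucene Query/Weight API: the query object keeps only the terms and the cutoff, and everything derived from a particular index is built in createWeight and stored on the weight.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative stand-ins only; these are NOT Lucene classes.
final class WeightStateSketch {

    // Stand-in for one index snapshot: term -> document frequency.
    static final class SketchIndex {
        final Map<String, Integer> docFreqs;

        SketchIndex(Map<String, Integer> docFreqs) {
            this.docFreqs = docFreqs;
        }
    }

    // Holds only index-independent state, so it can be reused across indexes.
    static final class SketchQuery {
        final List<String> terms;
        final int maxDocFreq;

        SketchQuery(List<String> terms, int maxDocFreq) {
            this.terms = terms;
            this.maxDocFreq = maxDocFreq;
        }

        // Index-dependent state is computed here, per index, not in the constructor.
        SketchWeight createWeight(SketchIndex index) {
            List<String> resolved = new ArrayList<>();
            for (String term : terms) {
                Integer df = index.docFreqs.get(term);
                if (df != null && df <= maxDocFreq) {
                    resolved.add(term);
                }
            }
            return new SketchWeight(resolved);
        }
    }

    // Holds the state that is only valid for the index it was created against.
    static final class SketchWeight {
        final List<String> resolvedTerms;

        SketchWeight(List<String> resolvedTerms) {
            this.resolvedTerms = resolvedTerms;
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> first = new HashMap<>();
        first.put("a", 1);
        first.put("b", 100);
        Map<String, Integer> second = new HashMap<>();
        second.put("a", 100);
        second.put("b", 2);

        SketchQuery query = new SketchQuery(Arrays.asList("a", "b"), 10);
        System.out.println(query.createWeight(new SketchIndex(first)).resolvedTerms);  // [a]
        System.out.println(query.createWeight(new SketchIndex(second)).resolvedTerms); // [b]
    }
}
```

Because the query itself never changes, reusing it against a second index cannot leak state from the first one.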

bq. in no place do I see you sort the incoming terms. It's faster to seek 
sequentially and not randomly.

It appeared that the TermsQuery was sorting terms to account for different 
fields. But the GraphTermsQuery is always on one field. Since it's always doing 
a seekExact, I was assuming that it would always have to seek from the top of 
the terms enum anyway, because it can't make assumptions about the order of the 
terms. In this case it would seem sorting would just add overhead. But I could 
be wrong about this.
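
To make the question concrete, here is a toy model of the reviewer's point. It assumes a cursor over a sorted term dictionary that can continue forward from its current position but has to restart from the top whenever the next term sorts before the previous one. SortedSeekSketch is hypothetical and says nothing about how Lucene's TermsEnum.seekExact is actually implemented; it only shows why sorted input could help under that assumption.

```java
import java.util.Arrays;

// Toy model only; NOT Lucene's TermsEnum.
final class SortedSeekSketch {
    private final String[] dictionary; // must be sorted ascending
    private int pos = 0;               // current cursor position
    private String last = null;        // previously sought term
    private int rewinds = 0;           // how often we had to restart from the top

    SortedSeekSketch(String[] sortedDictionary) {
        this.dictionary = sortedDictionary;
    }

    boolean seekExact(String term) {
        // A term that sorts before the previous one forces a restart.
        if (last != null && term.compareTo(last) < 0) {
            pos = 0;
            rewinds++;
        }
        last = term;
        // Search only the part of the dictionary at or after the cursor.
        int idx = Arrays.binarySearch(dictionary, pos, dictionary.length, term);
        if (idx >= 0) {
            pos = idx;
            return true;
        }
        pos = -idx - 1; // insertion point: later terms cannot be before it
        return false;
    }

    static int countRewinds(String[] sortedDictionary, String[] terms) {
        SortedSeekSketch cursor = new SortedSeekSketch(sortedDictionary);
        for (String term : terms) {
            cursor.seekExact(term);
        }
        return cursor.rewinds;
    }

    public static void main(String[] args) {
        String[] dictionary = {"a", "c", "e", "g", "k"};
        String[] unsorted = {"g", "a", "k", "c"};
        String[] sorted = unsorted.clone();
        Arrays.sort(sorted);

        System.out.println(countRewinds(dictionary, unsorted)); // 2
        System.out.println(countRewinds(dictionary, sorted));   // 0
    }
}
```

Whether the one-time sort pays for itself would still need to be measured against the real terms enum.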







> Add GraphTermsQuery to limit traversal on high frequency nodes
> --------------------------------------------------------------
>
>                 Key: SOLR-9027
>                 URL: https://issues.apache.org/jira/browse/SOLR-9027
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Joel Bernstein
>            Priority: Minor
>         Attachments: SOLR-9027.patch, SOLR-9027.patch, SOLR-9027.patch, 
> SOLR-9027.patch
>
>
> The gatherNodes() Streaming Expression is currently using a basic disjunction 
> query to perform the traversals. This ticket is to create a specific 
> GraphTermsQuery for performing the traversals. 
> The GraphTermsQuery will be based on the TermsQuery, but will also 
> include an option for a docFreq cutoff. Terms whose docFreq is above the 
> cutoff will not be included in the query. This will help users do a more 
> precise and efficient traversal.
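
As a rough illustration of the docFreq cutoff described above, the sketch below filters a term list against a map of document frequencies. The class name, method name, and the map-based lookup are assumptions made for the example; the actual query would read docFreq from the index rather than from a Map. Skipping terms that are missing from the map is also an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; NOT the GraphTermsQuery implementation.
final class DocFreqCutoffSketch {

    // Keeps the terms whose document frequency is at or below maxDocFreq.
    static List<String> filterByDocFreq(List<String> terms,
                                        Map<String, Integer> docFreqs,
                                        int maxDocFreq) {
        List<String> kept = new ArrayList<>();
        for (String term : terms) {
            Integer docFreq = docFreqs.get(term);
            if (docFreq == null) {
                continue; // assumption: a term that is not indexed matches nothing
            }
            if (docFreq <= maxDocFreq) {
                kept.add(term);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Integer> docFreqs = new HashMap<>();
        docFreqs.put("solr", 12);
        docFreqs.put("the", 50000);
        docFreqs.put("graph", 7);

        List<String> terms = Arrays.asList("solr", "the", "graph", "missing");
        System.out.println(filterByDocFreq(terms, docFreqs, 1000)); // [solr, graph]
    }
}
```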



