[
https://issues.apache.org/jira/browse/SOLR-9027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15260053#comment-15260053
]
Joel Bernstein edited comment on SOLR-9027 at 4/27/16 12:39 PM:
----------------------------------------------------------------
bq. why wrap each BytesRef in a Term when in the end you just need the
BytesRef? Or maybe I'm mistaken.
I haven't really optimized things yet. I'll take a look at optimizing this.
bq. equals and hashcode is on id yet you initialize that to new Object().
Firstly; why not have equals/hashcode actually work? Secondly, if for some
reason it should be this way, then you can do away with id and do equals on
instance equality of the query instance – you don't need id.
It's designed to be equal only on identity so that it doesn't cache. The main reason
for this is that graph traversals are typically one-time jobs, so I wanted to
avoid the overhead of hashCode and equals on large term lists. There may be a
better approach to the identity equality, so I'll review your suggestion.
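For reference, a minimal sketch of the reviewer's suggestion: identity-only equality without a separate id field. The class name IdentityOnlyQuery is hypothetical and stands in for the real query class; it is not taken from the patch.

```java
import java.util.List;

// Hypothetical stand-in for the query class; not the actual GraphTermsQuery.
final class IdentityOnlyQuery {
    private final List<String> terms;

    IdentityOnlyQuery(List<String> terms) {
        this.terms = terms;
    }

    int termCount() {
        return terms.size();
    }

    // Equal only to itself, so two queries never share a cache entry,
    // and the (possibly large) term list is never compared.
    @Override
    public boolean equals(Object other) {
        return this == other;
    }

    // Identity-based hash; does not iterate over the term list.
    @Override
    public int hashCode() {
        return System.identityHashCode(this);
    }
}
```

This matches the default behavior inherited from Object; spelling it out only matters if a superclass overrides equals and hashCode.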
bq. I think it's very suspicious that GraphTermsQuery holds List<TermContext>;
I think the Query object should not hold state pertaining to the actual index
as it could cause issues with caching. Maybe you could do the construction of
this in createWeight and hold it on the Weight?
This sounds like a good idea.
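A rough sketch of that split, assuming plain Java stand-ins rather than the Lucene classes: SketchQuery and SketchWeight are hypothetical names, and the index is modeled as a term-to-docFreq map purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins; these are not Lucene's Query and Weight classes.
final class SketchQuery {
    // Index-independent state only: safe to reuse across index views.
    private final List<String> terms;

    SketchQuery(List<String> terms) {
        this.terms = List.copyOf(terms);
    }

    // Index-dependent state is resolved here, once per index view,
    // and handed to the weight instead of being stored on the query.
    SketchWeight createWeight(Map<String, Integer> index) {
        List<String> resolved = new ArrayList<>();
        for (String term : terms) {
            if (index.containsKey(term)) {
                resolved.add(term);
            }
        }
        return new SketchWeight(resolved);
    }
}

final class SketchWeight {
    // Terms that exist in the index this weight was created for.
    final List<String> resolvedTerms;

    SketchWeight(List<String> resolvedTerms) {
        this.resolvedTerms = List.copyOf(resolvedTerms);
    }
}
```

With this shape the same query object can be reused against different index views without carrying stale per-index state.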
bq. in no place do I see you sort the incoming terms. It's faster to seek
sequentially and not randomly.
It appeared that the TermsQuery was sorting terms to account for different
fields. But the GraphTermsQuery is always on one field. Since it's always doing
a seekExact, I was assuming that it would always have to seek from the top of
the terms enum anyway, because it can't make assumptions on the order of the
terms. In this case it would seem sorting would just add overhead. But I could
be wrong about this.
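If sorting does turn out to help, a minimal sketch of ordering the incoming terms before seeking could look like the following. It assumes terms are available as raw byte arrays and uses unsigned byte order, which is how BytesRef compares; TermSorting and sortedForSeeking are hypothetical names, and whether the sort pays for itself would need to be measured.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; not part of the patch or of Lucene.
final class TermSorting {
    // Returns a copy of the terms in unsigned byte order, so that
    // successive exact seeks move forward through the terms dictionary.
    static List<byte[]> sortedForSeeking(List<byte[]> terms) {
        List<byte[]> sorted = new ArrayList<>(terms);
        sorted.sort((a, b) -> Arrays.compareUnsigned(a, b));
        return sorted;
    }
}
```

Signed byte comparison would put multi-byte UTF-8 terms before ASCII ones, which is why the unsigned comparator is used here.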
> Add GraphTermsQuery to limit traversal on high frequency nodes
> --------------------------------------------------------------
>
> Key: SOLR-9027
> URL: https://issues.apache.org/jira/browse/SOLR-9027
> Project: Solr
> Issue Type: New Feature
> Reporter: Joel Bernstein
> Priority: Minor
> Attachments: SOLR-9027.patch, SOLR-9027.patch, SOLR-9027.patch,
> SOLR-9027.patch
>
>
> The gatherNodes() Streaming Expression is currently using a basic disjunction
> query to perform the traversals. This ticket is to create a specific
> GraphTermsQuery for performing the traversals.
> The GraphTermsQuery will be based off of the TermsQuery, but will also
> include an option for a docFreq cutoff. Terms that are above the docFreq
> cutoff will not be included in the query. This will help users do a more
> precise and efficient traversal.
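The docFreq cutoff described above can be sketched as follows. This is an illustration only: DocFreqCutoff, keepBelowCutoff, and the term-to-docFreq map are hypothetical stand-ins, not the GraphTermsQuery implementation, and dropping terms that are absent from the index is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper; the index is modeled as a term -> docFreq map.
final class DocFreqCutoff {
    // Keeps only terms whose document frequency is at or below maxDocFreq.
    static List<String> keepBelowCutoff(List<String> terms,
                                        Map<String, Integer> docFreqs,
                                        int maxDocFreq) {
        List<String> kept = new ArrayList<>();
        for (String term : terms) {
            Integer docFreq = docFreqs.get(term);
            if (docFreq == null) {
                continue; // assumption: terms absent from the index are dropped
            }
            if (docFreq <= maxDocFreq) {
                kept.add(term); // high-frequency nodes above the cutoff are skipped
            }
        }
        return kept;
    }
}
```

For example, with docFreqs of 3 for "rare" and 50000 for "hub" and a cutoff of 1000, only "rare" would be traversed.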