[jira] [Commented] (SOLR-9583) When the same <uniqueKey> exists across multiple collections that are searched with an alias, the document returned in the results list is indeterminate

Erick Erickson (JIRA) Fri, 30 Sep 2016 09:20:40 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-9583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15536384#comment-15536384
 ]


Erick Erickson commented on SOLR-9583:
--------------------------------------

[~dsmiley]

I disagree and think there's a bug here. I can be persuaded that there are two 
issues, though; maybe we can split this JIRA.

Bug:
In the situation I described above, we return one doc or the other, and 
currently it's indeterminate which one comes back. In fact, the one that comes 
back will change for the _exact_ same query without the underlying collections 
changing at all just by resubmitting the query (I turned the queryResultCache 
off and can reproduce at will). This is even true in a one-shard, leader-only 
pair of collections. You'll have to argue really hard to persuade me that this 
is correct behavior. It's certainly not satisfactory to say to a user "we have 
no idea which one will be returned and there's nothing you can do about it, 
don't even try".

bq: ...it's asking for trouble. Solr isn't supposed to be used this way.

I don't understand this. We allow collection aliasing. There are no rules 
whatsoever requiring that multiple collections have disjoint <uniqueKey>s. 
Arbitrarily returning only one is hard to justify.

Wish:
We add the ability to return all docs with the same ID when multiple 
collections have docs with the same ID under control of some flag.


[~noble.paul]

Not quite sure I understand the question. We "dedupe" currently, but it's 
arbitrary. I doubt it was designed; rather, it "just happens" as a side effect 
of merging the lists. My suspicion is that when we merge the results, the final 
result changes based on the order in which the collection returns are 
processed. But before diving into the code I wanted to get some idea of what we 
think _should_ happen.
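
To make that suspicion concrete, here is a toy model in Python. It is not 
Solr's merge code; naive_merge and the (collection, docs) response shape are 
made up for illustration. It only shows how a merge keyed on ID becomes 
order-dependent when the response processed last silently overwrites earlier ones.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for a per-collection response: (collection name, docs).
Response = Tuple[str, List[dict]]


def naive_merge(responses: List[Response]) -> Dict[str, dict]:
    """Merge docs by id; a later response overwrites an earlier one."""
    merged: Dict[str, dict] = {}
    for collection, docs in responses:
        for doc in docs:
            # No tie-break rule: the result depends purely on processing order.
            merged[doc["id"]] = {**doc, "collection": collection}
    return merged


c1 = ("C1", [{"id": "doc1", "title": "from C1"}])
c2 = ("C2", [{"id": "doc1", "title": "from C2"}])

print(naive_merge([c1, c2])["doc1"]["collection"])  # C2
print(naive_merge([c2, c1])["doc1"]["collection"])  # C1
```

If responses arrive in whatever order the network delivers them, the same 
query can flip between the two outcomes, which would match what I'm seeing.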

We at least should dedupe in a predictable fashion. What the algorithm should 
be is up for discussion. Perhaps "doc from last collection listed in the alias 
wins" (yuck, frankly, but at least I can explain it to someone). Or maybe "break 
ties by comparing the collection name" (also yuck). Or we have to use the sort 
criteria. Or.... I don't want to get complicated here, just predictable.
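
A rough Python sketch of the first two candidate rules, just to show what 
"predictable" would mean. Everything here (dedupe, last_in_alias_wins, 
by_collection_name, the response shape) is a hypothetical name for 
illustration, not anything in Solr, and "greater collection name wins" is an 
arbitrary choice of direction.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for a per-collection response: (collection name, docs).
Response = Tuple[str, List[dict]]


def dedupe(responses: List[Response], rank: Callable[[str], object]) -> Dict[str, dict]:
    """For each id keep the doc whose collection ranks highest,
    independent of the order in which responses are processed."""
    winners: Dict[str, dict] = {}
    for collection, docs in responses:
        for doc in docs:
            current = winners.get(doc["id"])
            if current is None or rank(collection) > rank(current["collection"]):
                winners[doc["id"]] = {**doc, "collection": collection}
    return winners


def last_in_alias_wins(alias_order: List[str]) -> Callable[[str], int]:
    """Rank a collection by its position in the alias definition."""
    position = {name: i for i, name in enumerate(alias_order)}
    return lambda collection: position[collection]


def by_collection_name(collection: str) -> str:
    """Rank by name; the lexicographically greater name wins (arbitrary)."""
    return collection


c1 = ("C1", [{"id": "doc1", "title": "from C1"}])
c2 = ("C2", [{"id": "doc1", "title": "from C2"}])

# Assume the alias was defined as "C2,C1", so C1 is listed last.
rule = last_in_alias_wins(["C2", "C1"])
print(dedupe([c1, c2], rule)["doc1"]["collection"])  # C1
print(dedupe([c2, c1], rule)["doc1"]["collection"])  # C1

print(dedupe([c1, c2], by_collection_name)["doc1"]["collection"])  # C2
print(dedupe([c2, c1], by_collection_name)["doc1"]["collection"])  # C2
```

Either rule gives the same answer no matter which response is handled first, 
which is the only property I really care about at this point.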

If we decide to return multiple docs with the same ID from separate collections 
then there's the whole question of how to sort them, but I'll leave that for 
another day. Maybe we just use whatever we use to dedupe as the sort in this 
case.

> When the same <uniqueKey> exists across multiple collections that are 
> searched with an alias, the document returned in the results list is 
> indeterminate
> --------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-9583
>                 URL: https://issues.apache.org/jira/browse/SOLR-9583
>             Project: Solr
>          Issue Type: Wish
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Erick Erickson
>
> Not quite sure whether to call this a bug or improvement...
> Currently if I have two collections C1 and C2 and an alias that points to 
> both _and_ I have a document in both collections with the _same_ <uniqueKey>, 
> the returned list sometimes has the doc from C1 and sometimes from C2.
> If I add shards.info=true I see the document found in each collection, but 
> only one in the document list. Which one changes if I re-submit the identical 
> query.
> This seems incorrect, perhaps a side effect of piggy-backing the collection 
> aliasing on searching multiple shards? (Thanks Shalin for that bit of 
> background).
> I can see both use-cases: 
> 1> aliasing multiple collections validly assumes that <uniqueKey>s should be 
> unique across them all and only one doc should be returned. Even in this case 
> which doc should be returned should be deterministic.
> 2> these are arbitrary collections without any a-priori relationship and 
> identical <uniqueKey>s do NOT identify the "same" document so both should be 
> returned.
> So I propose we do two things:
> a> provide a param for the CREATEALIAS command that controls whether docs 
> with the same <uniqueKey> from different collections should both be returned. 
> If they both should, there's still the question of in what order.
> b> provide a deterministic way dups from different collections are resolved. 
> What that algorithm is I'm not quite sure. The order the collections were 
> specified in the CREATEALIAS command? Some field in the documents? Other??? 
> What happens if this option is not specified on the CREATEALIAS command?
> Implicit in the above is my assumption that it's perfectly valid to have 
> different aliases in the same cluster behave differently if specified.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
