[ https://issues.apache.org/jira/browse/SOLR-6314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14090909#comment-14090909 ]
Erick Erickson commented on SOLR-6314:
--------------------------------------
bq: If you want to dedup facet parameters for some reason, then it should
probably be done in the faceting code.
Yeah, that's exactly what's making me uncomfortable about the patch: it's at
such a low level and it affects _everything_. Unintended consequences and all
that.
OTOH, what's the case for allowing dups? Do you have specific cases where
that's good, or is your comment more of a statement that we shouldn't restrict
future possibilities just because I'm not sufficiently imaginative ;) ?
I'm really having a tough time imagining scenarios where allowing dups is
useful, and I can come up with scenarios where allowing dups is harmful
(imagine multiple, expensive, identical fq clauses with cache=false, for
instance) that would be caught here. Hmmm, a WARN-level log message is
indicated for dups no matter what, I think.
The counter-argument is that users should be free to shoot themselves in the
foot if they want to.
The counter-counter argument is that when we identify potential traps we
should do something about them if we can.
What do you think about this alternative? (Note: I'm not proposing it so much
as throwing it out for discussion.) Leave the dup-detection where it is and log
a WARN-level message when dups are detected, and move the actual de-duping out
to the faceting code. Then de-dupe on a case-by-case basis as situations arise.
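A rough sketch of that split, in Python rather than Solr's Java just to keep it short. This is not Solr code: the function names `warn_on_duplicates` and `facet_fields`, and the list-of-pairs representation of request parameters, are assumptions made up for illustration. The low-level step only reports duplicates and leaves the parameters alone; the faceting step is the one that decides to drop them.

```python
import logging
from collections import Counter

log = logging.getLogger("params")


def warn_on_duplicates(params):
    """Low-level check: log a WARN for each repeated (name, value) pair.

    `params` is a list of (name, value) tuples. It is returned unchanged,
    so any caller that wants the duplicates still gets them.
    """
    for (name, value), n in Counter(params).items():
        if n > 1:
            log.warning("parameter %s=%s specified %d times", name, value, n)
    return params


def facet_fields(params):
    """Faceting-level step: unique facet.field values in first-seen order."""
    seen = set()
    fields = []
    for name, value in params:
        if name == "facet.field" and value not in seen:
            seen.add(value)
            fields.append(value)
    return fields


# Hypothetical request: two facet fields, each repeated five times.
params = [("q", "id:*")] + [
    ("facet.field", f"f{i}_ws") for i in range(2) for _ in range(5)
]
checked = warn_on_duplicates(params)
print(facet_fields(checked))  # ['f0_ws', 'f1_ws']
```

Other components would see `checked` with every duplicate still present, which is the point of doing the de-duping case by case.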
Where this started was that the exact same query over the exact same data set
returns different results in sharded and non-sharded situations. The results
have the same information, just repeated in the single-shard case. Which means
that somehow the sharded code manages to ignore the extra entries. I'll look at
how in a bit. At any rate, the sharded case manages to avoid returning the data
multiple times, so either there's code in there specifically to deal with this
or it's happening by chance, which is its own gotcha.
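To show how the "by chance" case could come about, here is a purely hypothetical Python illustration. It says nothing about what the distributed code in Solr actually does (I haven't looked yet). The idea is only that if per-shard results are merged into a structure keyed by field name, repeated entries collapse as a side effect, whereas a plain list keeps them. The per-shard counts and the function names are invented for the example.

```python
# Hypothetical facet results: one (field, counts) entry per facet.field
# parameter, so a parameter repeated five times yields five entries.
shard1 = [("f0_ws", {"zero_1": 12, "zero_2": 13})] * 5
shard2 = [("f0_ws", {"zero_1": 13, "zero_2": 12})] * 5


def single_node(entries):
    """Return the entries as-is: duplicates survive in the response."""
    return list(entries)


def merge_shards(*shards):
    """Merge by field name, counting each field once per shard."""
    merged = {}
    for shard in shards:
        per_shard = dict(shard)  # keying by name drops repeats within a shard
        for field, counts in per_shard.items():
            target = merged.setdefault(field, {})
            for term, n in counts.items():
                target[term] = target.get(term, 0) + n
    return merged


print(len(single_node(shard1)))      # 5
print(merge_shards(shard1, shard2))  # {'f0_ws': {'zero_1': 25, 'zero_2': 25}}
```

If something like this is going on, nobody chose to de-dupe; it just falls out of the data structure, and that's the gotcha.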
I've seen some very large queries out in the wild, and in many cases it's hard
to spot things like this, so logging a message would help users figure out that
their (perhaps machine-generated) code is doing things they _probably_ don't
want.
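Until something like that is logged on the server side, the same check is easy to run against a generated query string on the client. A minimal Python sketch using only the standard library; the helper name `duplicate_params` and the shortened query string are made up for the example.

```python
from collections import Counter
from urllib.parse import parse_qsl


def duplicate_params(query_string):
    """Return {(name, value): count} for pairs that occur more than once."""
    pairs = parse_qsl(query_string, keep_blank_values=True)
    return {pair: n for pair, n in Counter(pairs).items() if n > 1}


# Shortened, hypothetical version of a machine-generated query string.
qs = "q=id%3A*&facet=true&facet.field=f0_ws&facet.field=f0_ws&facet.field=f1_ws&rows=1"
for (name, value), n in duplicate_params(qs).items():
    print(f"{name}={value} appears {n} times")  # facet.field=f0_ws appears 2 times
```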
So this is a long-winded way of saying "Hell, I don't know". My _slight_
preference here would be to dedupe as it's being done in this patch (and log
warnings when doing so). It just feels "more correct" and may prevent weird
behavior in the future. But I'm not adamant about that; if the general
consensus is that doing this on a case-by-case basis is a better idea, I can
make it so for the facet case.
> Multi-threaded facet counts differ when SolrCloud has >1 shard
> --------------------------------------------------------------
>
> Key: SOLR-6314
> URL: https://issues.apache.org/jira/browse/SOLR-6314
> Project: Solr
> Issue Type: Bug
> Components: SearchComponents - other, SolrCloud
> Affects Versions: 5.0
> Reporter: Vamsee Yarlagadda
> Assignee: Erick Erickson
> Attachments: SOLR-6314.patch
>
>
> I am trying to work with multi-threaded faceting on SolrCloud, and in the
> process I ran into some issues.
> I am currently running the upstream test below on different SolrCloud
> configurations, and I am getting a different result set per configuration.
> https://github.com/apache/lucene-solr/blob/trunk/solr/core/src/test/org/apache/solr/request/TestFaceting.java#L654
> Setup:
> - *Indexed 50 docs into SolrCloud.*
> - *If the SolrCloud has only 1 shard, the facet field query has the below
> output (which matches with the expected upstream test output - # facet fields
> ~ 50).*
> {code}
> $ curl \
> "http://localhost:8983/solr/collection1/select?facet=true&fl=id&indent=true&q=id%3A*&facet.limit=-1&facet.threads=1000&facet.field=f0_ws&facet.field=f0_ws&facet.field=f0_ws&facet.field=f0_ws&facet.field=f0_ws&facet.field=f1_ws&facet.field=f1_ws&facet.field=f1_ws&facet.field=f1_ws&facet.field=f1_ws&facet.field=f2_ws&facet.field=f2_ws&facet.field=f2_ws&facet.field=f2_ws&facet.field=f2_ws&facet.field=f3_ws&facet.field=f3_ws&facet.field=f3_ws&facet.field=f3_ws&facet.field=f3_ws&facet.field=f4_ws&facet.field=f4_ws&facet.field=f4_ws&facet.field=f4_ws&facet.field=f4_ws&facet.field=f5_ws&facet.field=f5_ws&facet.field=f5_ws&facet.field=f5_ws&facet.field=f5_ws&facet.field=f6_ws&facet.field=f6_ws&facet.field=f6_ws&facet.field=f6_ws&facet.field=f6_ws&facet.field=f7_ws&facet.field=f7_ws&facet.field=f7_ws&facet.field=f7_ws&facet.field=f7_ws&facet.field=f8_ws&facet.field=f8_ws&facet.field=f8_ws&facet.field=f8_ws&facet.field=f8_ws&facet.field=f9_ws&facet.field=f9_ws&facet.field=f9_ws&facet.field=f9_ws&facet.field=f9_ws&rows=1&wt=xml"
> <?xml version="1.0" encoding="UTF-8"?>
> <response>
> <lst name="responseHeader">
> <int name="status">0</int>
> <int name="QTime">21</int>
> <lst name="params">
> <str name="facet">true</str>
> <str name="fl">id</str>
> <str name="indent">true</str>
> <str name="q">id:*</str>
> <str name="facet.limit">-1</str>
> <str name="facet.threads">1000</str>
> <arr name="facet.field">
> <str>f0_ws</str>
> <str>f0_ws</str>
> <str>f0_ws</str>
> <str>f0_ws</str>
> <str>f0_ws</str>
> <str>f1_ws</str>
> <str>f1_ws</str>
> <str>f1_ws</str>
> <str>f1_ws</str>
> <str>f1_ws</str>
> <str>f2_ws</str>
> <str>f2_ws</str>
> <str>f2_ws</str>
> <str>f2_ws</str>
> <str>f2_ws</str>
> <str>f3_ws</str>
> <str>f3_ws</str>
> <str>f3_ws</str>
> <str>f3_ws</str>
> <str>f3_ws</str>
> <str>f4_ws</str>
> <str>f4_ws</str>
> <str>f4_ws</str>
> <str>f4_ws</str>
> <str>f4_ws</str>
> <str>f5_ws</str>
> <str>f5_ws</str>
> <str>f5_ws</str>
> <str>f5_ws</str>
> <str>f5_ws</str>
> <str>f6_ws</str>
> <str>f6_ws</str>
> <str>f6_ws</str>
> <str>f6_ws</str>
> <str>f6_ws</str>
> <str>f7_ws</str>
> <str>f7_ws</str>
> <str>f7_ws</str>
> <str>f7_ws</str>
> <str>f7_ws</str>
> <str>f8_ws</str>
> <str>f8_ws</str>
> <str>f8_ws</str>
> <str>f8_ws</str>
> <str>f8_ws</str>
> <str>f9_ws</str>
> <str>f9_ws</str>
> <str>f9_ws</str>
> <str>f9_ws</str>
> <str>f9_ws</str>
> </arr>
> <str name="wt">xml</str>
> <str name="rows">1</str>
> </lst>
> </lst>
> <result name="response" numFound="50" start="0">
> <doc>
> <float name="id">0.0</float></doc>
> </result>
> <lst name="facet_counts">
> <lst name="facet_queries"/>
> <lst name="facet_fields">
> <lst name="f0_ws">
> <int name="zero_1">25</int>
> <int name="zero_2">25</int>
> </lst>
> <lst name="f0_ws">
> <int name="zero_1">25</int>
> <int name="zero_2">25</int>
> </lst>
> <lst name="f0_ws">
> <int name="zero_1">25</int>
> <int name="zero_2">25</int>
> </lst>
> <lst name="f0_ws">
> <int name="zero_1">25</int>
> <int name="zero_2">25</int>
> </lst>
> <lst name="f0_ws">
> <int name="zero_1">25</int>
> <int name="zero_2">25</int>
> </lst>
> <lst name="f1_ws">
> <int name="one_1">33</int>
> <int name="one_3">17</int>
> </lst>
> <lst name="f1_ws">
> <int name="one_1">33</int>
> <int name="one_3">17</int>
> </lst>
> <lst name="f1_ws">
> <int name="one_1">33</int>
> <int name="one_3">17</int>
> </lst>
> <lst name="f1_ws">
> <int name="one_1">33</int>
> <int name="one_3">17</int>
> </lst>
> <lst name="f1_ws">
> <int name="one_1">33</int>
> <int name="one_3">17</int>
> </lst>
> <lst name="f2_ws">
> <int name="two_1">37</int>
> <int name="two_4">13</int>
> </lst>
> <lst name="f2_ws">
> <int name="two_1">37</int>
> <int name="two_4">13</int>
> </lst>
> <lst name="f2_ws">
> <int name="two_1">37</int>
> <int name="two_4">13</int>
> </lst>
> <lst name="f2_ws">
> <int name="two_1">37</int>
> <int name="two_4">13</int>
> </lst>
> <lst name="f2_ws">
> <int name="two_1">37</int>
> <int name="two_4">13</int>
> </lst>
> <lst name="f3_ws">
> <int name="three_1">40</int>
> <int name="three_5">10</int>
> </lst>
> <lst name="f3_ws">
> <int name="three_1">40</int>
> <int name="three_5">10</int>
> </lst>
> <lst name="f3_ws">
> <int name="three_1">40</int>
> <int name="three_5">10</int>
> </lst>
> <lst name="f3_ws">
> <int name="three_1">40</int>
> <int name="three_5">10</int>
> </lst>
> <lst name="f3_ws">
> <int name="three_1">40</int>
> <int name="three_5">10</int>
> </lst>
> <lst name="f4_ws">
> <int name="four_1">41</int>
> <int name="four_6">9</int>
> </lst>
> <lst name="f4_ws">
> <int name="four_1">41</int>
> <int name="four_6">9</int>
> </lst>
> <lst name="f4_ws">
> <int name="four_1">41</int>
> <int name="four_6">9</int>
> </lst>
> <lst name="f4_ws">
> <int name="four_1">41</int>
> <int name="four_6">9</int>
> </lst>
> <lst name="f4_ws">
> <int name="four_1">41</int>
> <int name="four_6">9</int>
> </lst>
> <lst name="f5_ws">
> <int name="five_1">42</int>
> <int name="five_7">8</int>
> </lst>
> <lst name="f5_ws">
> <int name="five_1">42</int>
> <int name="five_7">8</int>
> </lst>
> <lst name="f5_ws">
> <int name="five_1">42</int>
> <int name="five_7">8</int>
> </lst>
> <lst name="f5_ws">
> <int name="five_1">42</int>
> <int name="five_7">8</int>
> </lst>
> <lst name="f5_ws">
> <int name="five_1">42</int>
> <int name="five_7">8</int>
> </lst>
> <lst name="f6_ws">
> <int name="six_1">43</int>
> <int name="six_8">7</int>
> </lst>
> <lst name="f6_ws">
> <int name="six_1">43</int>
> <int name="six_8">7</int>
> </lst>
> <lst name="f6_ws">
> <int name="six_1">43</int>
> <int name="six_8">7</int>
> </lst>
> <lst name="f6_ws">
> <int name="six_1">43</int>
> <int name="six_8">7</int>
> </lst>
> <lst name="f6_ws">
> <int name="six_1">43</int>
> <int name="six_8">7</int>
> </lst>
> <lst name="f7_ws">
> <int name="seven_1">44</int>
> <int name="seven_9">6</int>
> </lst>
> <lst name="f7_ws">
> <int name="seven_1">44</int>
> <int name="seven_9">6</int>
> </lst>
> <lst name="f7_ws">
> <int name="seven_1">44</int>
> <int name="seven_9">6</int>
> </lst>
> <lst name="f7_ws">
> <int name="seven_1">44</int>
> <int name="seven_9">6</int>
> </lst>
> <lst name="f7_ws">
> <int name="seven_1">44</int>
> <int name="seven_9">6</int>
> </lst>
> <lst name="f8_ws">
> <int name="eight_1">45</int>
> <int name="eight_10">5</int>
> </lst>
> <lst name="f8_ws">
> <int name="eight_1">45</int>
> <int name="eight_10">5</int>
> </lst>
> <lst name="f8_ws">
> <int name="eight_1">45</int>
> <int name="eight_10">5</int>
> </lst>
> <lst name="f8_ws">
> <int name="eight_1">45</int>
> <int name="eight_10">5</int>
> </lst>
> <lst name="f8_ws">
> <int name="eight_1">45</int>
> <int name="eight_10">5</int>
> </lst>
> <lst name="f9_ws">
> <int name="nine_1">45</int>
> <int name="nine_11">5</int>
> </lst>
> <lst name="f9_ws">
> <int name="nine_1">45</int>
> <int name="nine_11">5</int>
> </lst>
> <lst name="f9_ws">
> <int name="nine_1">45</int>
> <int name="nine_11">5</int>
> </lst>
> <lst name="f9_ws">
> <int name="nine_1">45</int>
> <int name="nine_11">5</int>
> </lst>
> <lst name="f9_ws">
> <int name="nine_1">45</int>
> <int name="nine_11">5</int>
> </lst>
> </lst>
> <lst name="facet_dates"/>
> <lst name="facet_ranges"/>
> </lst>
> </response>
> {code}
> - *Now, if I create a new collection with 2 shards (>1 shard SolrCloud), the
> same query as above results in a different output. (# facet fields ~ 10 ;
> Expected 50)*
> {code}
> $ curl \
> "http://localhost:8983/solr/collection1/select?facet=true&fl=id&indent=true&q=id%3A*&facet.limit=-1&facet.threads=1000&facet.field=f0_ws&facet.field=f0_ws&facet.field=f0_ws&facet.field=f0_ws&facet.field=f0_ws&facet.field=f1_ws&facet.field=f1_ws&facet.field=f1_ws&facet.field=f1_ws&facet.field=f1_ws&facet.field=f2_ws&facet.field=f2_ws&facet.field=f2_ws&facet.field=f2_ws&facet.field=f2_ws&facet.field=f3_ws&facet.field=f3_ws&facet.field=f3_ws&facet.field=f3_ws&facet.field=f3_ws&facet.field=f4_ws&facet.field=f4_ws&facet.field=f4_ws&facet.field=f4_ws&facet.field=f4_ws&facet.field=f5_ws&facet.field=f5_ws&facet.field=f5_ws&facet.field=f5_ws&facet.field=f5_ws&facet.field=f6_ws&facet.field=f6_ws&facet.field=f6_ws&facet.field=f6_ws&facet.field=f6_ws&facet.field=f7_ws&facet.field=f7_ws&facet.field=f7_ws&facet.field=f7_ws&facet.field=f7_ws&facet.field=f8_ws&facet.field=f8_ws&facet.field=f8_ws&facet.field=f8_ws&facet.field=f8_ws&facet.field=f9_ws&facet.field=f9_ws&facet.field=f9_ws&facet.field=f9_ws&facet.field=f9_ws&rows=1&wt=xml"
>
> <?xml version="1.0" encoding="UTF-8"?>
> <response>
> <lst name="responseHeader">
> <int name="status">0</int>
> <int name="QTime">31</int>
> <lst name="params">
> <str name="facet">true</str>
> <str name="fl">id</str>
> <str name="indent">true</str>
> <str name="q">id:*</str>
> <str name="facet.limit">-1</str>
> <str name="facet.threads">1000</str>
> <arr name="facet.field">
> <str>f0_ws</str>
> <str>f0_ws</str>
> <str>f0_ws</str>
> <str>f0_ws</str>
> <str>f0_ws</str>
> <str>f1_ws</str>
> <str>f1_ws</str>
> <str>f1_ws</str>
> <str>f1_ws</str>
> <str>f1_ws</str>
> <str>f2_ws</str>
> <str>f2_ws</str>
> <str>f2_ws</str>
> <str>f2_ws</str>
> <str>f2_ws</str>
> <str>f3_ws</str>
> <str>f3_ws</str>
> <str>f3_ws</str>
> <str>f3_ws</str>
> <str>f3_ws</str>
> <str>f4_ws</str>
> <str>f4_ws</str>
> <str>f4_ws</str>
> <str>f4_ws</str>
> <str>f4_ws</str>
> <str>f5_ws</str>
> <str>f5_ws</str>
> <str>f5_ws</str>
> <str>f5_ws</str>
> <str>f5_ws</str>
> <str>f6_ws</str>
> <str>f6_ws</str>
> <str>f6_ws</str>
> <str>f6_ws</str>
> <str>f6_ws</str>
> <str>f7_ws</str>
> <str>f7_ws</str>
> <str>f7_ws</str>
> <str>f7_ws</str>
> <str>f7_ws</str>
> <str>f8_ws</str>
> <str>f8_ws</str>
> <str>f8_ws</str>
> <str>f8_ws</str>
> <str>f8_ws</str>
> <str>f9_ws</str>
> <str>f9_ws</str>
> <str>f9_ws</str>
> <str>f9_ws</str>
> <str>f9_ws</str>
> </arr>
> <str name="wt">xml</str>
> <str name="rows">1</str>
> </lst>
> </lst>
> <result name="response" numFound="50" start="0" maxScore="1.0">
> <doc>
> <float name="id">2.0</float></doc>
> </result>
> <lst name="facet_counts">
> <lst name="facet_queries"/>
> <lst name="facet_fields">
> <lst name="f0_ws">
> <int name="zero_1">25</int>
> <int name="zero_2">25</int>
> </lst>
> <lst name="f1_ws">
> <int name="one_1">33</int>
> <int name="one_3">17</int>
> </lst>
> <lst name="f2_ws">
> <int name="two_1">37</int>
> <int name="two_4">13</int>
> </lst>
> <lst name="f3_ws">
> <int name="three_1">40</int>
> <int name="three_5">10</int>
> </lst>
> <lst name="f4_ws">
> <int name="four_1">41</int>
> <int name="four_6">9</int>
> </lst>
> <lst name="f5_ws">
> <int name="five_1">42</int>
> <int name="five_7">8</int>
> </lst>
> <lst name="f6_ws">
> <int name="six_1">43</int>
> <int name="six_8">7</int>
> </lst>
> <lst name="f7_ws">
> <int name="seven_1">44</int>
> <int name="seven_9">6</int>
> </lst>
> <lst name="f8_ws">
> <int name="eight_1">45</int>
> <int name="eight_10">5</int>
> </lst>
> <lst name="f9_ws">
> <int name="nine_1">45</int>
> <int name="nine_11">5</int>
> </lst>
> </lst>
> <lst name="facet_dates"/>
> <lst name="facet_ranges"/>
> </lst>
> </response>
> {code}
> This behavior is quite strange, as it depends on the number of shards in
> SolrCloud. It would be great if someone could shed some light on this.