RE: Ghost Documents or Shards out of Sync
: Let me add some background. A user triggers an operation which under the
: hood needs to update a single field. Atomic update fails with a message
: that one of the mandatory fields is missing (which is strange by
: itself). When I query Solr for the exact document (fq with the document
: id) I sometimes get the expected single result and sometimes zero. Those
: queries are sometimes run a couple of days later, so auto commits have
: necessarily been performed.

I suspect what you are seeing is that the update succeeds on a leader, but for some reason (I'm not really understanding your description of the atomic update failure) it fails on a replica -- leaving them in an inconsistent state. Restarting all the nodes forces the out-of-sync replica to recover.

If I'm correct, then when you see these inconsistent results, you should be able to query each individual *core* that is a replica of the shard this document belongs in, using a "distrib=false" request param, and see that it exists on the "leader" replica, but not on one/some of the other replicas.

Understanding why/how you got into this situation, though, would require understanding what exactly you mean by "a message that one of the mandatory fields is missing". Can you show us some details? solrconfig/schema, example documents, example updates, log messages from the various nodes when these updates "fail", etc.

https://cwiki.apache.org/confluence/display/SOLR/UsingMailingLists

: One more thing that might be important - we're using nested schema, and
: we recently encountered several issues that make me think that this
: combination - nested and atomic updates (of parent documents) - is the
: root cause.

It's very possible that there are some bugs related to atomic updates and nested documents -- the code for dealing with that combination is relatively new, and making it work correctly requires special fields in the schema -- on top of the normal atomic update rules.
The documentation on this was heavily updated in the 8.7 ref-guide...

https://lucene.apache.org/solr/guide/8_7/indexing-nested-documents.html
https://lucene.apache.org/solr/guide/8_7/updating-parts-of-documents.html#updating-child-documents

-Hoss
http://www.lucidworks.com/
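A minimal sketch of the per-core check Hoss describes, assuming Python with only the standard library, a uniqueKey field named `id`, and hypothetical core URLs such as `http://h1:8983/solr/coll_shard1_replica_n1` (the real core names for a shard can be read from the Collections API `CLUSTERSTATUS` response). Each core is queried with `distrib=false` so only that core answers, and any replica whose `numFound` differs from the leader's is reported:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def core_lookup_url(core_url, doc_id, id_field="id"):
    """Build a single-core (distrib=false) lookup for one document id.

    Assumption: the uniqueKey field is called "id" unless told otherwise.
    """
    params = {
        "q": "*:*",
        # {!term} matches the raw id, so special characters need no escaping.
        "fq": "{!term f=%s}%s" % (id_field, doc_id),
        "distrib": "false",  # answer from this core only; no fan-out to other replicas
        "rows": "0",         # only numFound is needed
        "wt": "json",
    }
    return core_url.rstrip("/") + "/select?" + urlencode(params)


def fetch_num_found(url):
    """Run the request against a live Solr node and return numFound."""
    with urlopen(url) as resp:
        return json.load(resp)["response"]["numFound"]


def cores_disagreeing_with_leader(num_found, leader):
    """Names of replica cores whose numFound differs from the leader core's."""
    expected = num_found[leader]
    return sorted(core for core, n in num_found.items() if n != expected)
```

For example, if the leader core reports 1 and one replica reports 0, `cores_disagreeing_with_leader({"n1": 1, "n2": 0}, leader="n1")` returns `["n2"]`, which is the out-of-sync picture Hoss describes.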
RE: Ghost Documents or Shards out of Sync
Hi,

Thank you for your replies - much appreciated! I'm afraid it's neither one...

Let me add some background. A user triggers an operation which under the hood needs to update a single field. Atomic update fails with a message that one of the mandatory fields is missing (which is strange by itself). When I query Solr for the exact document (fq with the document id) I sometimes get the expected single result and sometimes zero. Those queries are sometimes run a couple of days later, so auto commits have necessarily been performed. Only after a restart does the query work, and then the atomic update succeeds.

One more thing that might be important - we're using nested schema, and we recently encountered several issues that make me think that this combination - nested and atomic updates (of parent documents) - is the root cause.

Ronen.

-----Original Message-----
From: Mike Drob
Sent: Monday, 01 February 2021 22:58
To: solr-user@lucene.apache.org
Subject: Re: Ghost Documents or Shards out of Sync

[...]
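For reference, the kind of single-field atomic update Ronen describes looks roughly like the JSON body built below, which would be POSTed to `/solr/<collection>/update` (a sketch: the field name `status` and the uniqueKey `id` are hypothetical). The optional `_version_: 1` uses Solr's optimistic-concurrency rule that the document must already exist, so if the document cannot be found the update is rejected instead of being indexed as a new document containing only these fields. Whether that is related to the "mandatory field is missing" error here is an assumption, not something the thread established.

```python
import json


def atomic_set_body(doc_id, field, value, require_existing=False):
    """JSON body for a Solr atomic update that replaces one field of one document.

    Hypothetical names: uniqueKey "id"; the caller picks the field name.
    """
    doc = {"id": doc_id, field: {"set": value}}
    if require_existing:
        # Optimistic concurrency: _version_ == 1 means "the document must already
        # exist"; otherwise Solr rejects the update with a version conflict.
        doc["_version_"] = 1
    # The update handler accepts a JSON array of documents.
    return json.dumps([doc])
```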
Re: Ghost Documents or Shards out of Sync
To expand on what Jason suggested, if the issue is the non-deterministic ordering due to staggered commits per replica, you may have more consistency with TLOG replicas rather than the NRT replicas. In this case, the underlying segment files should be identical and lead to more predictable results.

On Mon, Feb 1, 2021 at 2:50 PM Jason Gerlowski wrote:
> [...]
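A sketch of the Collections API calls behind Mike's TLOG suggestion, using Python's standard library to build the URLs only (the base URL, collection name `coll`, config set `myconf`, and shard name are hypothetical). For a new collection, `nrtReplicas=0` together with `tlogReplicas` requests TLOG replicas only. For an existing collection, one approach, assuming no in-place type conversion is available in your version, is to add TLOG replicas with `ADDREPLICA` and then remove the NRT ones with `DELETEREPLICA`:

```python
from urllib.parse import urlencode


def add_tlog_replica_url(solr_url, collection, shard):
    """Collections API call adding a TLOG-type replica to one shard."""
    params = {"action": "ADDREPLICA", "collection": collection, "shard": shard, "type": "tlog"}
    return solr_url.rstrip("/") + "/admin/collections?" + urlencode(params)


def create_tlog_collection_url(solr_url, name, num_shards, tlog_replicas, config_name):
    """Collections API call creating a collection whose replicas are all TLOG (no NRT)."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "nrtReplicas": 0,               # no NRT replicas at all
        "tlogReplicas": tlog_replicas,  # TLOG replicas copy segments from the leader
        "collection.configName": config_name,
    }
    return solr_url.rstrip("/") + "/admin/collections?" + urlencode(params)
```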
Re: Ghost Documents or Shards out of Sync
Forgot to answer your second question:

> Can I trigger the "fixing" mechanism that Solr runs at restart by an API call
> or some other method?

It depends on what the cause is. But for at least some possible causes there is an API call that can resolve this, though that API itself (Solr's misnamed "optimize" feature) comes with a lot of warnings and has been discouraged by the community in the past. (I won't get into those specifics until you figure out the cause.)

Before you consider calling "optimize" or taking any other measures to fix this, though, it might be worth revisiting whether this is really an issue. While this quirk of Solr's can bedevil automated tests or other things that rely on repeatability, it's unusual in many applications for end-users to submit identical queries multiple times. Every case is different of course, but it's something to consider.

Best,
Jason

On Mon, Feb 1, 2021 at 3:49 PM Jason Gerlowski wrote:
> [...]
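The two API calls discussed in this thread, sketched as Python URL builders (the collection URL is hypothetical): the explicit commit Jason mentions in his first reply, and the "optimize" request he refers to here. Given his warnings, the optimize builder is shown for completeness rather than as a recommendation; merging down to one segment rewrites the whole index, which is expensive on a large or frequently updated collection.

```python
from urllib.parse import urlencode


def commit_url(collection_url):
    """Explicit commit request; in SolrCloud it is distributed to the collection's replicas."""
    return collection_url.rstrip("/") + "/update?" + urlencode({"commit": "true"})


def optimize_url(collection_url, max_segments=1):
    """'optimize' request (a Lucene forceMerge) merging each index down to max_segments.

    Costly: with max_segments=1 the entire index is rewritten.
    """
    params = {"optimize": "true", "maxSegments": max_segments}
    return collection_url.rstrip("/") + "/update?" + urlencode(params)
```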
Re: Ghost Documents or Shards out of Sync
Hi Ronen,

The first thing I'd figure out in your situation is whether the results are actually different each time, or whether the ordering is what differs (which might push a particular result off the page you're looking at, giving the appearance that it didn't match).

In the case of the former, this can happen briefly if queries come in when some but not all replicas have seen a commit. But usually this is a transient concern - either waiting for the next autocommit or triggering an explicit commit resolves the discrepancy in this case. Since you only see identical results after a restart, this _doesn't_ sound like what you're seeing.

In the case of the latter (same results, differently ordered) this is expected sometimes. Solr sorts on relevance by default, with the internal Lucene document ID being a tiebreaker. Both the relevance statistics and Lucene's document IDs can differ across SolrCloud replicas (due to non-deterministic conditions such as the segment merging and deleted-doc removal that Lucene does under the hood), and this can produce differently-ordered result sets for users that issue the same query repeatedly.

Good luck narrowing things down!

Jason

On Mon, Jan 25, 2021 at 3:32 AM Ronen Nussbaum wrote:
>
> Hi All,
>
> I'm using Solr Cloud (version 8.3.0) with shards and replicas (replication
> factor of 2).
> Recently, I've encountered several times that running the same query
> repeatedly yields different results. Restarting the nodes fixes the problem
> (until next time).
> I assume that some shards are not synchronized and I have several questions:
> 1. What can cause this - many atomic updates? issues with commits?
> 2. Can I trigger the "fixing" mechanism that Solr runs at restart by an API
> call or some other method?
>
> Thanks in advance,
> Ronen.
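Jason's first diagnostic step, telling reordered results apart from genuinely different ones, can be sketched as a small Python helper over two complete lists of document ids from repeated runs of the same query. It assumes the full result set was fetched each time (for example with `rows` at least `numFound`); comparing only one page can make a reordering look like a missing document, which is exactly the confusion Jason warns about.

```python
from collections import Counter


def compare_runs(first_ids, second_ids):
    """Classify two complete result lists (document ids) from repeated runs of one query."""
    if first_ids == second_ids:
        return "identical"
    if Counter(first_ids) == Counter(second_ids):
        # Same matching documents; only the order differs (e.g. score ties broken
        # by per-replica internal Lucene doc ids).
        return "reordered"
    # The set of matching documents itself differs between the runs.
    return "different"
```

A result of `"different"` that persists across autocommits points towards the out-of-sync replica scenario rather than the ordering quirk.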