On the Solr side there's at least https://issues.apache.org/jira/browse/SOLR-9818, which may cause trouble with the queue. I once had the core reload command in the admin UI add more than 200k entries to the overseer queue.

--Ere

Shawn Heisey wrote on 25.10.2017 at 15:57:
On 10/24/2017 8:11 AM, Tarjono, C. A. wrote:
Would like to check if anyone has seen this issue before; we started
having this a few days ago:

The only error I can see in solr console is below:

5960847[main-SendThread(172.16.130.132:2281)] WARN
org.apache.zookeeper.ClientCnxn [ ] – Session 0x65f4e28b7370001 for
server 172.16.130.132/172.16.130.132:2281, unexpected error, closing
socket connection and attempting reconnect java.io.IOException: Packet
len30829010 is out of range!


Combining the last part of what I quoted above with the image you shared
later, I am pretty sure I know what is happening.

The overseer queue in zookeeper (at the ZK path of /overseer/queue) has
a lot of entries in it.  Based on the fact that you are seeing a packet
length beyond 30 million bytes, I am betting that the number of entries
in the queue is between 1.5 million and 2 million.  ZK cannot handle
that packet size without a special startup argument.  That parameter,
jute.maxbuffer, defaults to a little over one million bytes.
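The entry-count estimate can be sanity-checked with quick arithmetic: the oversized packet is ZK's response listing the children of /overseer/queue, and each sequential child name (something like "qn-0000000123") plus per-entry response overhead comes to very roughly 16 bytes. The 16-byte figure is an assumption for illustration, not something stated in the thread:

```shell
# Rough estimate of overseer queue size from the packet length in the log.
# The packet is ZooKeeper's response listing the children of /overseer/queue;
# assume ~16 bytes per entry (child name plus response overhead).
PACKET_LEN=30829010
BYTES_PER_ENTRY=16
echo $((PACKET_LEN / BYTES_PER_ENTRY))   # prints 1926813 (~1.9 million entries)
```

That lands squarely in the 1.5 to 2 million range mentioned above.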

To fix this, you're going to need to wipe out the overseer queue.  ZK
includes a script named zkCli.sh.  Note that Solr includes a script
called zkcli as well, which does very different things.  You need the
one included with zookeeper.

Wiping out the queue when it is that large is not straightforward.  You
need to start the ZkCli script included with zookeeper with a
-Djute.maxbuffer=31000000 argument and the same zkHost value used by
Solr, and then use a command like "rmr /overseer/queue" in that command
shell to completely remove the /overseer/queue path.  Then you can
restart the ZK servers without the jute.maxbuffer setting.  You may need
to restart Solr.  Running this procedure might also require temporarily
restarting the ZK servers with the same jute.maxbuffer argument, but I
am not sure whether that is required.
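The procedure above might look something like the following sketch. The JVMFLAGS environment variable and the exact script path are assumptions that can vary by ZooKeeper version and install layout, and the host:port is taken from the log line in this thread; substitute your own zkHost string:

```shell
# Start ZooKeeper's own CLI (not Solr's zkcli) with an enlarged buffer.
# jute.maxbuffer must exceed the packet length from the log (~31 MB here).
JVMFLAGS="-Djute.maxbuffer=31000000" \
  ./bin/zkCli.sh -server 172.16.130.132:2281

# Then, inside the ZK command shell, remove the whole queue path:
#   rmr /overseer/queue
```

After the queue is gone, the ZK servers can run without the jute.maxbuffer override, since the child list of /overseer/queue will again fit in the default buffer.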

The basic underlying problem here is that ZK allows adding new nodes
even when the size of the parent node exceeds the default buffer size.
That issue is documented here:

https://issues.apache.org/jira/browse/ZOOKEEPER-1162

I can't be sure why your cloud is adding so many entries to the
overseer queue.  I have seen this problem happen when restarting a
server in the cloud, particularly when there are a large number of
collections or shard replicas in the cloud.  Restarting multiple servers
or restarting the same server multiple times without waiting for the
overseer queue to empty could also cause the issue.

Thanks,
Shawn


--
Ere Maijala
Kansalliskirjasto / The National Library of Finland
