[
https://issues.apache.org/jira/browse/SOLR-9030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15270875#comment-15270875
]
Scott Blum commented on SOLR-9030:
----------------------------------
ZkStateWriter is basically a write cache. It should be much simpler than it
is. A few things that bug me in no particular order:
1) Tracking lastStateFormat / lastCollectionName and in general having a
maybeFlushBefore / maybeFlushAfter makes no real sense to me. If ZkStateWriter
were capable of operating as a perfect write cache, the *content* of what's
being written should never force a flush. It should be able to just always
keep queuing operations until the desired time delay is hit, or it's flushed
from the outside.
2) ZkStateWriter's ClusterState liveNodes should probably be a view on
ZkStateReader's ClusterState liveNodes.
3) ZkWriteCallback - the one place this is used is the Overseer
stateUpdateQueue handling. I think the way that loop works with ZkStateWriter
could be done a little better. Ideally, I would want to peek up to N children
at a time from that queue, send them all through ZkStateWriter in succession,
flush, then remove those N items from the stateUpdateQueue. If the flush
failed for some reason, it could return a count of items committed so we could
remove that many items from the stateUpdateQueue. It seems a little nuts to
have a second workQueue in operation the way it is today. I get that in some
situations we'd end up doing more net cluster state writes, but I think we'd
still do fewer net writes to ZK since we do so much queue management.
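To make points 1 and 3 concrete, here is a rough sketch of the loop I have in mind. It is not the actual ZkStateWriter or Overseer code: WriteCache, drainBatch, and the in-memory Deque standing in for the stateUpdateQueue are all made-up names for illustration, and the time-delay trigger is left out so that only the outside caller flushes.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;
import java.util.function.Predicate;

public class BatchDrainSketch {

    /** Hypothetical stand-in for ZkStateWriter: it only queues and flushes. */
    static final class WriteCache {
        private final List<String> pending = new ArrayList<>();
        // Hypothetical sink standing in for a ZK write; returns false on failure.
        private final Predicate<String> sink;

        WriteCache(Predicate<String> sink) {
            this.sink = sink;
        }

        /** The content of an operation never forces a flush; it is just queued. */
        void enqueue(String op) {
            pending.add(op);
        }

        /** Writes pending ops in order and returns how many were committed. */
        int flush() {
            int committed = 0;
            for (String op : pending) {
                if (!sink.test(op)) {
                    break; // stop at the first failed write
                }
                committed++;
            }
            // Uncommitted ops are dropped here; they are still in the source
            // queue and will be read again on the next pass.
            pending.clear();
            return committed;
        }
    }

    /** Peek up to n items, send them through the cache, flush, remove only what committed. */
    static int drainBatch(Deque<String> queue, WriteCache cache, int n) {
        List<String> batch = new ArrayList<>();
        for (String item : queue) { // iterating peeks head to tail without removing
            if (batch.size() == n) {
                break;
            }
            batch.add(item);
        }
        for (String item : batch) {
            cache.enqueue(item);
        }
        int committed = cache.flush();
        for (int i = 0; i < committed; i++) {
            queue.pollFirst();
        }
        return committed;
    }

    public static void main(String[] args) {
        Deque<String> queue = new ArrayDeque<>(Arrays.asList("a", "b", "c", "d", "e"));
        List<String> written = new ArrayList<>();
        // Simulate a write failure on item "d".
        WriteCache cache = new WriteCache(op -> {
            if (op.equals("d")) {
                return false;
            }
            written.add(op);
            return true;
        });
        System.out.println(drainBatch(queue, cache, 2)); // a and b commit
        System.out.println(drainBatch(queue, cache, 3)); // c commits, d fails
        System.out.println(queue);                       // d and e are still queued
    }
}
```

Because items are only removed from the queue after the flush reports them committed, a failed flush leaves the rest in place for the next pass, and there is no need for a second work queue to track in-flight items.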
> The 'downnode' command can trip asserts in ZkStateWriter or cause
> BadVersionException in Overseer
> -------------------------------------------------------------------------------------------------
>
> Key: SOLR-9030
> URL: https://issues.apache.org/jira/browse/SOLR-9030
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud
> Reporter: Shalin Shekhar Mangar
> Fix For: 6.1, master
>
>
> While working on SOLR-9014 I came across a strange test failure.
> {code}
> [junit4] ERROR 16.9s |
> AsyncCallRequestStatusResponseTest.testAsyncCallStatusResponse <<<
> [junit4] > Throwable #1:
> com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an
> uncaught exception in thread: Thread[id=46,
> name=OverseerStateUpdate-95769832112259076-127.0.0.1:51135_z_oeg%2Ft-n_0000000000,
> state=RUNNABLE, group=Overseer state updater.]
> [junit4] > at
> __randomizedtesting.SeedInfo.seed([91F68DA7E10807C3:CBF7E84BCF328A1A]:0)
> [junit4] > Caused by: java.lang.AssertionError
> [junit4] > at
> __randomizedtesting.SeedInfo.seed([91F68DA7E10807C3]:0)
> [junit4] > at
> org.apache.solr.cloud.overseer.ZkStateWriter.writePendingUpdates(ZkStateWriter.java:231)
> [junit4] > at
> org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:240)
> [junit4] > at java.lang.Thread.run(Thread.java:745)
> {code}
> The underlying problem can manifest by tripping the above assert or a
> BadVersionException as well. I found that this was introduced in SOLR-7281
> where a new 'downnode' command was added.