[
https://issues.apache.org/jira/browse/SOLR-13815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16946917#comment-16946917
]
Yonik Seeley commented on SOLR-13815:
-------------------------------------
I dug into
https://issues.apache.org/jira/secure/attachment/12982490/fail.191004_053129
a bit.
Here are some annotated log lines from it:
{code}
# doc_38 finishes on sub-shard leader (was forwarded by leader)
# TODO: what is the state of the leader at this point?
2> 13961 INFO (qtp577294902-45) [n:127.0.0.1:39195_solr c:livesplit1
s:shard1_1 r:core_node6 x:livesplit1_shard1_1_replica_n4 ]
o.a.s.u.p.LogUpdateProcessorFactory [livesplit1_shard1_1_replica_n4]
webapp=/solr path=/update
params={update.distrib=FROMLEADER&distrib.from.parent=shard1&distrib.from=http://127.0.0.1:39195/solr/livesplit1_shard1_replica_n1/&wt=javabin&version=2}{add=[doc_38
(1646454616073699328)]} 0 1
# doc_38 finishes on leader
2> 13963 INFO (qtp577294902-118) [n:127.0.0.1:39195_solr c:livesplit1
s:shard1 r:core_node2 x:livesplit1_shard1_replica_n1 ]
o.a.s.u.p.LogUpdateProcessorFactory [livesplit1_shard1_replica_n1]
webapp=/solr path=/update
params={_stateVer_=livesplit1:4&wt=javabin&version=2}{add=[doc_38
(1646454616073699328)]} 0 11
# The split is finished and the shards change their states
2> 14113 INFO
(OverseerStateUpdate-72062023354613765-127.0.0.1:39195_solr-n_0000000000)
[n:127.0.0.1:39195_solr ] o.a.s.c.o.SliceMutator Update shard state shard1
to inactive
2> 14113 INFO
(OverseerStateUpdate-72062023354613765-127.0.0.1:39195_solr-n_0000000000)
[n:127.0.0.1:39195_solr ] o.a.s.c.o.SliceMutator Update shard state
shard1_1 to active
2> 14113 INFO
(OverseerStateUpdate-72062023354613765-127.0.0.1:39195_solr-n_0000000000)
[n:127.0.0.1:39195_solr ] o.a.s.c.o.SliceMutator Update shard state
shard1_0 to active
# doc_39 finishes on leader (was never forwarded to sub-shard leader!)
2> 14219 INFO (qtp577294902-48) [n:127.0.0.1:39195_solr c:livesplit1
s:shard1 r:core_node2 x:livesplit1_shard1_replica_n1 ]
o.a.s.u.p.LogUpdateProcessorFactory [livesplit1_shard1_replica_n1]
webapp=/solr path=/update
params={_stateVer_=livesplit1:4&wt=javabin&version=2}{add=[doc_39
(1646454616338989056)]} 0 14
# doc_40 finishes on sub-shard leader (was forwarded by leader)
2> 14291 INFO (qtp577294902-47) [n:127.0.0.1:39195_solr c:livesplit1
s:shard1_1 r:core_node6 x:livesplit1_shard1_1_replica_n4 ]
o.a.s.u.p.LogUpdateProcessorFactory [livesplit1_shard1_1_replica_n4]
webapp=/solr path=/update
params={update.distrib=TOLEADER&distrib.from=http://127.0.0.1:39195/solr/livesplit1_shard1_replica_n1/&wt=javabin&version=2}{add=[doc_40
(1646454616426020864)]} 0 3
# doc_40 finishes on leader
2> 14293 INFO (qtp577294902-49) [n:127.0.0.1:39195_solr c:livesplit1
s:shard1 r:core_node2 x:livesplit1_shard1_replica_n1 ]
o.a.s.u.p.LogUpdateProcessorFactory [livesplit1_shard1_replica_n1]
webapp=/solr path=/update
params={_stateVer_=livesplit1:4&wt=javabin&version=2}{add=[doc_40]} 0 16
{code}
doc_39 is the missing doc in this case. Notice that the documents indexed
immediately before *and* after it (doc_38 and doc_40) both go to the original
shard leader and are forwarded to the sub-shard leader, whereas doc_39 is not
forwarded (or at least the forward is not logged). doc_39 is also the first
update in this excerpt to finish after the shard states change. My guess at
this point is that there is some sort of race condition between changing the
states of the shards and forwarding updates. We'll probably need higher levels
of logging (or more instrumentation) to confirm what's going on, though.
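To make this kind of check repeatable across other failure logs, here is a
rough sketch (not part of Solr or its test framework) that scans a log for
LogUpdateProcessorFactory entries and lists docs that were logged on the parent
shard core but on no other core. The function names and the regex are my own
assumptions, fitted to the excerpt above; it only looks at the first doc id in
each add=[...] list and assumes every such entry is an add.

```python
import re
from collections import defaultdict

# One LogUpdateProcessorFactory entry: the core name in brackets, then the
# first doc id of the add=[...] list. The tempered ".*?" stops the match from
# running into the next entry, and \s handles entries wrapped across lines.
# Assumption: this pattern is fitted to the log excerpt, not to a documented
# Solr log format.
ENTRY = re.compile(
    r"LogUpdateProcessorFactory\s+\[(?P<core>[^\]]+)\]"
    r"(?:(?!LogUpdateProcessorFactory).)*?"
    r"add=\[(?P<doc>[^\s\]\(,]+)",
    re.DOTALL,
)


def cores_by_doc(log_text):
    """Map each doc id to the set of cores that logged an add for it."""
    seen = defaultdict(set)
    for match in ENTRY.finditer(log_text):
        seen[match.group("doc")].add(match.group("core"))
    return seen


def docs_missing_from_subshards(log_text, parent_core):
    """Return docs logged on parent_core but on no other core, sorted."""
    missing = []
    for doc, cores in cores_by_doc(log_text).items():
        if parent_core in cores and cores == {parent_core}:
            missing.append(doc)
    return sorted(missing)
```

Run against the excerpt above with
parent_core="livesplit1_shard1_replica_n1", this should flag only doc_39. It
says nothing about *why* the forward is missing; it only narrows down which
docs to look at.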
> Live split can lose data
> ------------------------
>
> Key: SOLR-13815
> URL: https://issues.apache.org/jira/browse/SOLR-13815
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Reporter: Yonik Seeley
> Priority: Major
> Attachments: fail.191004_053129, fail.191004_093307
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> This issue is to investigate potential data loss during a "live" split (i.e.
> a split that happens while updates are flowing).
> This was discovered during the shared storage work, which was based on a
> non-release branch_8x sometime before 8.3. Hence the first step is to try
> to reproduce on the master branch without any shared storage changes.