[
https://issues.apache.org/jira/browse/SOLR-5944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hoss Man updated SOLR-5944:
---------------------------
Attachment: hoss.62D328FA1DEA57FD.fail3.txt
hoss.62D328FA1DEA57FD.fail2.txt
hoss.62D328FA1DEA57FD.fail.txt
hoss.D768DD9443A98DC.fail.txt
hoss.D768DD9443A98DC.pass.txt
I've been reviewing the logs from sarowe's failures -- I won't pretend to
understand half of what i'm looking at here (I'm still not up on most of the
new code) but here's some interesting patterns i've noticed...
* in both failure logs posted, doc "13" was the doc having problems
* the specific docId is probably just a coincidence, but it does mean that the
same egrep command works on both log files to give you the particularly
interesting bits realtive to the failure...{noformat}
egrep add=\\[13\|id=13\|ids=13
TestStressInPlaceUpdates.eb044ac71.beast-167-failure.stdout.txt >
beast-167.important.txt
egrep add=\\[13\|id=13\|ids=13
TestStressInPlaceUpdates.eb044ac71.beast-587-failure.stdout.txt >
beast-587.important.txt
{noformat}
* looking first at beast-587.important.txt:
** the ERROR that failed the test was first logged by READER2 @ (timestamp)
34456:{noformat}
34456 ERROR (READER2) [ ] o.a.s.c.TestStressInPlaceUpdates Realtime=true,
ERROR, id=13 found={response={numFound=1,start=0,docs=[SolrDocument{id=13,
title_s=[title13], val1_i_dvo=3, val2_l_dvo=3000000006,
_version_=1534607778351415296, ratings=0.0, price=0}]}}
model=[1534607780231512064, 3, 3000000012]
{noformat}
** Working backwards, that expected version 1534607780231512064 was logged by
WRITER10 as being returned to a PARTIAL update @ 31219:{noformat}
31219 INFO (WRITER10) [ ] o.a.s.c.TestStressInPlaceUpdates PARTIAL: Writing
id=13, val=[3,3000000012], version=1534607779993485312, Prev
was=[3,3000000009]. Returned version=1534607780231512064
{noformat}
*** WRITER10's logging of this "Returned version=1534607780231512064" came
after core_node1, core_node2, and core_node3 all logged it being written to
their TLOG & reported it via LogUpdateProc:{noformat}
30557 INFO (qtp2010985731-180) [n:127.0.0.1:37972__m c:collection1 s:shard1
r:core_node1 x:collection1] o.a.s.u.UpdateLog TLOG: added id
13(ver=1534607780231512064, prevVersion=1534607779993485312, prevPtr=2343) to
tlog{file=/tmp/beast-tmp-output/587/J0/temp/solr.cloud.TestStressInPlaceUpdates_FFC46C473EC471E6-001/shard-1-001/cores/collection1/data/tlog/tlog.0000000000000000004
refcount=1} LogPtr(2396) map=1977112331, actual doc=SolrInputDocument(fields:
[id=13, val2_l_dvo=3000000012, _version_=1534607780231512064, val1_i_dvo=3])
30589 INFO (qtp1755078679-232) [n:127.0.0.1:38407__m c:collection1 s:shard1
r:core_node2 x:collection1] o.a.s.u.UpdateLog TLOG: added id
13(ver=1534607780231512064, prevVersion=1534607779993485312, prevPtr=2343) to
tlog{file=/tmp/beast-tmp-output/587/J0/temp/solr.cloud.TestStressInPlaceUpdates_FFC46C473EC471E6-001/shard-2-001/cores/collection1/data/tlog/tlog.0000000000000000002
refcount=1} LogPtr(2396) map=1630836284, actual doc=SolrInputDocument(fields:
[id=13, val2_l_dvo=3000000012, _version_=1534607780231512064, val1_i_dvo=3])
30589 INFO (qtp1755078679-232) [n:127.0.0.1:38407__m c:collection1 s:shard1
r:core_node2 x:collection1] o.a.s.u.p.LogUpdateProcessorFactory [collection1]
webapp=/_m path=/update
params={update.distrib=FROMLEADER&distrib.from=http://127.0.0.1:37972/_m/collection1/&distrib.inplace.prevversion=1534607779993485312&wt=javabin&version=2&distrib.inplace.update=true}{add=[13
(1534607780231512064)]} 0 0
31216 INFO (qtp2143623462-144) [n:127.0.0.1:58295__m c:collection1 s:shard1
r:core_node3 x:collection1] o.a.s.u.UpdateLog TLOG: added id
13(ver=1534607780231512064, prevVersion=1534607779993485312, prevPtr=2343) to
tlog{file=/tmp/beast-tmp-output/587/J0/temp/solr.cloud.TestStressInPlaceUpdates_FFC46C473EC471E6-001/shard-3-001/cores/collection1/data/tlog/tlog.0000000000000000002
refcount=1} LogPtr(2396) map=1500522809, actual doc=SolrInputDocument(fields:
[id=13, val2_l_dvo=3000000012, _version_=1534607780231512064, val1_i_dvo=3])
31216 INFO (qtp2143623462-144) [n:127.0.0.1:58295__m c:collection1 s:shard1
r:core_node3 x:collection1] o.a.s.u.p.LogUpdateProcessorFactory [collection1]
webapp=/_m path=/update
params={update.distrib=FROMLEADER&distrib.from=http://127.0.0.1:37972/_m/collection1/&distrib.inplace.prevversion=1534607779993485312&wt=javabin&version=2&distrib.inplace.update=true}{add=[13
(1534607780231512064)]} 0 0
31219 INFO (qtp2010985731-180) [n:127.0.0.1:37972__m c:collection1 s:shard1
r:core_node1 x:collection1] o.a.s.u.p.LogUpdateProcessorFactory [collection1]
webapp=/_m path=/update params={versions=true&wt=javabin&version=2}{add=[13
(1534607780231512064)]} 0 662
{noformat}
** but looking *after* the ERROR was first logged @ 34456, we see that before
the test had a chance to shut down all the nodes, there was some suspicious
looking logging from core_node2 regarding updates out of order, that refer to
the expected 1534607780231512064 version:{noformat}
34976 INFO (qtp1755078679-111) [n:127.0.0.1:38407__m c:collection1 s:shard1
r:core_node2 x:collection1] o.a.s.u.p.DistributedUpdateProcessor Fetched the
update: add{_version_=1534607780231512064,id=13}
35810 INFO (qtp1755078679-111) [n:127.0.0.1:38407__m c:collection1 s:shard1
r:core_node2 x:collection1] o.a.s.u.p.DistributedUpdateProcessor Fetched the
update: add{_version_=1534607779993485312,id=13}
35811 INFO (qtp1755078679-111) [n:127.0.0.1:38407__m c:collection1 s:shard1
r:core_node2 x:collection1] o.a.s.u.UpdateLog TLOG: added id
13(ver=1534607779993485312, prevVersion=1534607778351415296, prevPtr=-1) to
tlog{file=/tmp/beast-tmp-output/587/J0/temp/solr.cloud.TestStressInPlaceUpdates_FFC46C473EC471E6-001/shard-2-001/cores/collection1/data/tlog/tlog.0000000000000000002
refcount=1} LogPtr(3955) map=748019145, actual doc=SolrInputDocument(fields:
[id=13, val2_l_dvo=3000000009, _version_=1534607779993485312, val1_i_dvo=3])
35811 INFO (qtp1755078679-111) [n:127.0.0.1:38407__m c:collection1 s:shard1
r:core_node2 x:collection1] o.a.s.u.p.DistributedUpdateProcessor Fetched
missing dependent updates from leader, which likely succeeded. Current update:
add{_version_=1534607780231512064,id=13}, Missing update:
add{_version_=1534607779993485312,id=13}
35811 INFO (qtp1755078679-111) [n:127.0.0.1:38407__m c:collection1 s:shard1
r:core_node2 x:collection1] o.a.s.u.UpdateLog TLOG: added id
13(ver=1534607780231512064, prevVersion=1534607779993485312, prevPtr=3955) to
tlog{file=/tmp/beast-tmp-output/587/J0/temp/solr.cloud.TestStressInPlaceUpdates_FFC46C473EC471E6-001/shard-2-001/cores/collection1/data/tlog/tlog.0000000000000000002
refcount=1} LogPtr(4015) map=748019145, actual doc=SolrInputDocument(fields:
[id=13, val2_l_dvo=3000000012, _version_=1534607780231512064, val1_i_dvo=3])
35811 INFO (qtp1755078679-111) [n:127.0.0.1:38407__m c:collection1 s:shard1
r:core_node2 x:collection1] o.a.s.u.p.DistributedUpdateProcessor Fetched
missing dependent updates from leader, which likely succeeded. Current update:
add{,id=13}, Missing update: add{_version_=1534607780231512064,id=13}
35811 INFO (qtp1755078679-111) [n:127.0.0.1:38407__m c:collection1 s:shard1
r:core_node2 x:collection1] o.a.s.u.UpdateLog TLOG: added id
13(ver=1534607783219953664, prevVersion=1534607780231512064, prevPtr=4015) to
tlog{file=/tmp/beast-tmp-output/587/J0/temp/solr.cloud.TestStressInPlaceUpdates_FFC46C473EC471E6-001/shard-2-001/cores/collection1/data/tlog/tlog.0000000000000000002
refcount=1} LogPtr(4069) map=748019145, actual doc=SolrInputDocument(fields:
[id=13, val2_l_dvo=3000000015, _version_=1534607783219953664, val1_i_dvo=3])
35812 INFO (qtp1755078679-111) [n:127.0.0.1:38407__m c:collection1 s:shard1
r:core_node2 x:collection1] o.a.s.u.p.LogUpdateProcessorFactory [collection1]
webapp=/_m path=/update
params={update.distrib=FROMLEADER&distrib.from=http://127.0.0.1:37972/_m/collection1/&distrib.inplace.prevversion=1534607780231512064&wt=javabin&version=2&distrib.inplace.update=true}{add=[13
(1534607783219953664)]} 0 1714
{noformat}
*** these might be unrelated to the failure, since (IIRC) realtime get is
always routed to the leader (can somone confirm this?) so this replica needing
to fetch these out of order updates from the leader may not actaully be an
indication of a problem -- but it did catch my eye.
** if we change our focus and direct it at the "actual" version
1534607778351415296 that was returned by the realtime get causing the test
failure, we notice it being recorded by WRITER1 as a partial update @
28769:{noformat}
28769 INFO (WRITER11) [ ] o.a.s.c.TestStressInPlaceUpdates PARTIAL: Writing
id=13, val=[3,3000000006], version=1534607778339880960, Prev
was=[3,3000000003]. Returned version=1534607778351415296
{noformat}
*** This is subsequent to the expected TLOG and LogUpdateProc messages we would
expect for this update
** Starting at timestamp 30330, we see the expected log messages for the update
that should have *replaced* 1534607778351415296, with params indicating that
1534607778351415296 is the previous version:{noformat}
30330 INFO (qtp2010985731-180) [n:127.0.0.1:37972__m c:collection1 s:shard1
r:core_node1 x:collection1] o.a.s.u.UpdateLog TLOG: added id
13(ver=1534607779993485312, prevVersion=1534607778351415296, prevPtr=636) to
tlog{file=/tmp/beast-tmp-output/587/J0/temp/solr.cloud.TestStressInPlaceUpdates_FFC46C473EC471E6-001/shard-1-001/cores/collection1/data/tlog/tlog.0000000000000000004
refcount=1} LogPtr(2343) map=1977112331, actual doc=SolrInputDocument(fields:
[id=13, val2_l_dvo=3000000009, _version_=1534607779993485312, val1_i_dvo=3])
30331 INFO (qtp2143623462-144) [n:127.0.0.1:58295__m c:collection1 s:shard1
r:core_node3 x:collection1] o.a.s.u.UpdateLog TLOG: added id
13(ver=1534607779993485312, prevVersion=1534607778351415296, prevPtr=636) to
tlog{file=/tmp/beast-tmp-output/587/J0/temp/solr.cloud.TestStressInPlaceUpdates_FFC46C473EC471E6-001/shard-3-001/cores/collection1/data/tlog/tlog.0000000000000000002
refcount=1} LogPtr(2343) map=1500522809, actual doc=SolrInputDocument(fields:
[id=13, val2_l_dvo=3000000009, _version_=1534607779993485312, val1_i_dvo=3])
30332 INFO (qtp2143623462-144) [n:127.0.0.1:58295__m c:collection1 s:shard1
r:core_node3 x:collection1] o.a.s.u.p.LogUpdateProcessorFactory [collection1]
webapp=/_m path=/update
params={update.distrib=FROMLEADER&distrib.from=http://127.0.0.1:37972/_m/collection1/&distrib.inplace.prevversion=1534607778351415296&wt=javabin&version=2&distrib.inplace.update=true}{add=[13
(1534607779993485312)]} 0 0
30336 INFO (qtp1755078679-241) [n:127.0.0.1:38407__m c:collection1 s:shard1
r:core_node2 x:collection1] o.a.s.u.UpdateLog TLOG: added id
13(ver=1534607779993485312, prevVersion=1534607778351415296, prevPtr=636) to
tlog{file=/tmp/beast-tmp-output/587/J0/temp/solr.cloud.TestStressInPlaceUpdates_FFC46C473EC471E6-001/shard-2-001/cores/collection1/data/tlog/tlog.0000000000000000002
refcount=1} LogPtr(2343) map=1630836284, actual doc=SolrInputDocument(fields:
[id=13, val2_l_dvo=3000000009, _version_=1534607779993485312, val1_i_dvo=3])
30336 INFO (qtp1755078679-241) [n:127.0.0.1:38407__m c:collection1 s:shard1
r:core_node2 x:collection1] o.a.s.u.p.LogUpdateProcessorFactory [collection1]
webapp=/_m path=/update
params={update.distrib=FROMLEADER&distrib.from=http://127.0.0.1:37972/_m/collection1/&distrib.inplace.prevversion=1534607778351415296&wt=javabin&version=2&distrib.inplace.update=true}{add=[13
(1534607779993485312)]} 0 0
30396 INFO (qtp2143623462-227) [n:127.0.0.1:58295__m c:collection1 s:shard1
r:core_node3 x:collection1] o.a.s.c.S.Request [collection1] webapp=/_m
path=/get params={qt=/get&ids=13&wt=javabin&version=2} status=0 QTime=0
30441 INFO (qtp1755078679-106) [n:127.0.0.1:38407__m c:collection1 s:shard1
r:core_node2 x:collection1] o.a.s.c.S.Request [collection1] webapp=/_m
path=/get params={qt=/get&ids=13&wt=javabin&version=2} status=0 QTime=0
30469 INFO (qtp2010985731-180) [n:127.0.0.1:37972__m c:collection1 s:shard1
r:core_node1 x:collection1] o.a.s.u.p.LogUpdateProcessorFactory [collection1]
webapp=/_m path=/update params={versions=true&wt=javabin&version=2}{add=[13
(1534607779993485312)]} 0 140
30556 INFO (WRITER10) [ ] o.a.s.c.TestStressInPlaceUpdates PARTIAL: Writing
id=13, val=[3,3000000009], version=1534607778351415296, Prev
was=[3,3000000006]. Returned version=1534607779993485312
{noformat}
** BUT! Just before this logging about replacing 1534607778351415296 w/
1534607779993485312 there is a log message mentioning version
1534607778351415296 that jumped out at me @ 30329:{noformat}
30329 INFO (qtp2010985731-180) [n:127.0.0.1:37972__m c:collection1 s:shard1
r:core_node1 x:collection1] o.a.s.u.p.AtomicUpdateDocumentMerger Uncommitted
doc is: SolrInputDocument(fields: [id=13, val2_l_dvo=3000000006,
_version_=1534607778351415296, val1_i_dvo=3])
{noformat}
*** I don't know anything about this AtomicUpdateDocumentMerger class, but this
logging smells particularly suspicious to me because it is the *last* time
AtomicUpdateDocumentMerger logs anything about doc #13
*** is this an indication that the "state" AtomicUpdateDocumentMerger is
tracking for doc #13 isn't getting updated when the (subsequently expected)
version 1534607780231512064 is added?
* switching to beast-167.important.txt:
** the ERROR that failed the test was first logged by READER0 @ 32666:{noformat}
32666 ERROR (READER0) [ ] o.a.s.c.TestStressInPlaceUpdates Realtime=true,
ERROR, id=13 found={response={numFound=1,start=0,docs=[SolrDocument{id=13,
title_s=[title13], val1_i_dvo=2, val2_l_dvo=2000000002,
_version_=1534599975700267008, ratings=0.0, price=0}]}}
model=[1534599977073901568, 2, 2000000004]
{noformat}
** this test didn't linger long enough to see if any interesting "out of order
update" messages were logged by replicas after the failure.
** the expected version 1534599977073901568 is reported as the returned version
by WRITER5 @ 32108:{noformat}
32108 INFO (WRITER5) [ ] o.a.s.c.TestStressInPlaceUpdates PARTIAL: Writing
id=13, val=[2,2000000004], version=1534599975700267008, Prev
was=[2,2000000002]. Returned version=1534599977073901568
{noformat}
*** This is subsequent to the expected TLOG and LogUpdateProc messages we would
expect for this update
** We again see no logging from AtomicUpdateDocumentMerger regarding the
expected version 1534599977073901568
** We *DO* see logging from AtomicUpdateDocumentMerger regarding the _actual_
version return in the request that failed the assertion:{noformat}
32082 INFO (qtp498157202-244) [n:127.0.0.1:55136__nge%2Fr c:collection1
s:shard1 r:core_node1 x:collection1] o.a.s.u.p.AtomicUpdateDocumentMerger
Uncommitted doc is: SolrInputDocument(fields: [id=13, val2_l_dvo=2000000002,
_version_=1534599975700267008])
{noformat}
*** This is again the last (and only) log message from
AtomicUpdateDocumentMerger regarding doc #13
----
In my own hammering of TestStressInPlaceUpdates, i've encountered a handful of
failures that haven't reproduced 100% reliably, but most of the failures i've
seen have followed this pattern {red}The most recent version logged by
AtomicUpdateDocumentMerger matches the 'actual' version logged in an ERROR
assertion failure message{red}
* hoss.62D328FA1DEA57FD.fail.txt - follows pattern:{noformat}
23263 INFO (qtp905061545-62) [n:127.0.0.1:43930_i%2Fop c:collection1 s:shard1
r:core_node1 x:collection1] o.a.s.u.p.AtomicUpdateDocumentMerger Uncommitted
doc is: SolrInputDocument(fields: [id=4, val2_l_dvo=3000000009,
_version_=1534612078380187648, val1_i_dvo=3])
...
23543 ERROR (READER3) [ ] o.a.s.c.TestStressInPlaceUpdates Realtime=true,
ERROR, id=4 found={response={numFound=1,start=0,docs=[SolrDocument{id=4,
title_s=[title4], val1_i_dvo=3, val2_l_dvo=3000000009,
_version_=1534612078380187648, ratings=0.0, price=0}]}}
model=[1534612078708391936, 3, 3000000012]
{noformat}
* hoss.62D328FA1DEA57FD.fail2.txt - does NOT follow the pattern:{noformat}
32481 INFO (qtp1749706169-67) [n:127.0.0.1:48694_i%2Fop c:collection1 s:shard1
r:core_node1 x:collection1] o.a.s.u.p.AtomicUpdateDocumentMerger Uncommitted
doc is: SolrInputDocument(fields: [id=13, val2_l_dvo=23000000023,
_version_=1534622450566823936, val1_i_dvo=23])
...
32581 INFO (qtp1749706169-238) [n:127.0.0.1:48694_i%2Fop c:collection1
s:shard1 r:core_node1 x:collection1] o.a.s.u.p.AtomicUpdateDocumentMerger
Uncommitted doc is: SolrInputDocument(fields: [id=13, val2_l_dvo=23000000046,
_version_=1534622450634981376, val1_i_dvo=23])
...
32661 ERROR (READER7) [ ] o.a.s.c.TestStressInPlaceUpdates Realtime=false,
ERROR, id=13 found={response={numFound=1,start=0,docs=[SolrDocument{id=13,
title_s=[title13], val1_i_dvo=23, val2_l_dvo=23000000023,
_version_=1534622450566823936, ratings=0.0, price=0}]}}
model=[1534622450634981376, 23, 23000000046]
{noformat}
* hoss.62D328FA1DEA57FD.fail3.txt - follows pattern:{noformat}
33869 INFO (qtp1530691049-201) [n:127.0.0.1:33944_i%2Fop c:collection1
s:shard1 r:core_node1 x:collection1] o.a.s.u.p.AtomicUpdateDocumentMerger
Uncommitted doc is: SolrInputDocument(fields: [id=1, val2_l_dvo=7000000035,
_version_=1534622560585515008])
...
34070 ERROR (READER1) [ ] o.a.s.c.TestStressInPlaceUpdates Realtime=true,
ERROR, id=1 found={response={numFound=1,start=0,docs=[SolrDocument{id=1,
title_s=[title1], val1_i_dvo=7, val2_l_dvo=7000000035,
_version_=1534622560585515008, ratings=0.0, price=0}]}}
model=[1534622560698761216, 7, 7000000042]
{noformat}
* hoss.D768DD9443A98DC.fail.txt - follows pattern:{noformat}
29666 INFO (qtp2123501190-71) [n:127.0.0.1:45821_norsg%2Fy c:collection1
s:shard1 r:core_node1 x:collection1] o.a.s.u.p.AtomicUpdateDocumentMerger
Uncommitted doc is: SolrInputDocument(fields: [id=16, val2_l_dvo=7000000014,
_version_=1534604230269075456, val1_i_dvo=7])
...
31018 ERROR (READER9) [ ] o.a.s.c.TestStressInPlaceUpdates Realtime=true,
ERROR, id=16 found={response={numFound=1,start=0,docs=[SolrDocument{id=16,
title_s=[title16], val1_i_dvo=7, val2_l_dvo=7000000014,
_version_=1534604230269075456, ratings=0.0, price=0}]}}
model=[1534604230432653312, 7, 7000000021]
{noformat}
** NOTE: this failure didn't reproduce for me reliably, see
hoss.D768DD9443A98DC.pass.txt
... hoss.62D328FA1DEA57FD.fail2.txt is currently the odd duck out.
my best guess (having not yet looked into AtomicUpdateDocumentMerger and what
exactly it does and it's "Uncommitted doc is: ..." logging and what that refers
to) is that some state tracked/updated by AtomicUpdateDocumentMerger is
happening asyncronously and in the failure cases isn't "real" at the moments
the test failures are happening -- and that hoss.62D328FA1DEA57FD.fail2.txt may
represent a situation where AtomicUpdateDocumentMerger is _logging_ the version
info in the "new" state before it's made it "real" and available for RTG.
...that's my theory anyway.
> Support updates of numeric DocValues
> ------------------------------------
>
> Key: SOLR-5944
> URL: https://issues.apache.org/jira/browse/SOLR-5944
> Project: Solr
> Issue Type: New Feature
> Reporter: Ishan Chattopadhyaya
> Assignee: Shalin Shekhar Mangar
> Attachments: DUP.patch, SOLR-5944.patch, SOLR-5944.patch,
> SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch,
> SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch,
> SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch,
> SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch,
> SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch,
> SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch,
> TestStressInPlaceUpdates.eb044ac71.beast-167-failure.stdout.txt,
> TestStressInPlaceUpdates.eb044ac71.beast-587-failure.stdout.txt,
> hoss.62D328FA1DEA57FD.fail.txt, hoss.62D328FA1DEA57FD.fail2.txt,
> hoss.62D328FA1DEA57FD.fail3.txt, hoss.D768DD9443A98DC.fail.txt,
> hoss.D768DD9443A98DC.pass.txt
>
>
> LUCENE-5189 introduced support for updates to numeric docvalues. It would be
> really nice to have Solr support this.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]