[ https://issues.apache.org/jira/browse/GEODE-10401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jakov Varenina updated GEODE-10401:
-----------------------------------
    Description: 
As we already know, delete operations in a .drf file contain only the OplogEntryID. 
During recovery, the server reads each OplogEntryID (byte by byte) and stores it in a 
hash set that is later used when recovering the .crf files. Two set types are used: 
IntOpenHashSet and LongOpenHashSet. An OplogEntryID that fits in an _integer_ is stored 
in an IntOpenHashSet, and a _long integer_ in a LongOpenHashSet, presumably for memory 
and performance reasons. OplogEntryIDs start at zero and increase over time. Recovery 
speed can differ depending on which set is used, so please take this into account when 
estimating .drf recovery time.
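
A minimal sketch of this split, assuming illustrative class and method names (the 
actual logic lives in Geode's Oplog recovery code, not in this exact form), could look 
as follows:
{code:java}
import it.unimi.dsi.fastutil.ints.IntOpenHashSet;
import it.unimi.dsi.fastutil.longs.LongOpenHashSet;

// Illustrative sketch only: class and method names are assumptions, not the Geode API.
class DeletedEntryIdSet {
  private final IntOpenHashSet intIds = new IntOpenHashSet();
  private final LongOpenHashSet longIds = new LongOpenHashSet();

  void add(long oplogEntryId) {
    // Ids that fit in an int go into the memory-cheaper IntOpenHashSet;
    // ids above Integer.MAX_VALUE go into the LongOpenHashSet.
    if (oplogEntryId <= Integer.MAX_VALUE) {
      intIds.add((int) oplogEntryId);
    } else {
      longIds.add(oplogEntryId);
    }
  }

  boolean contains(long oplogEntryId) {
    return oplogEntryId <= Integer.MAX_VALUE
        ? intIds.contains((int) oplogEntryId)
        : longIds.contains(oplogEntryId);
  }
}
{code}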

We have observed in the logs that more than 4 minutes (sometimes considerably more) 
pass between the warning "There is a large number of deleted entries" and the 
preceding log entry:
{code:java}
{"timestamp":"2022-06-14T21:41:43.772+08:00","severity":"info","message":"Recovering
 oplog#271 /opt/dbservice/data/datastore/BACKUPdataDiskStore_271.drf for disk 
store dataDiskStore.","metadata":
{"timestamp":"2022-06-14T21:46:02.152+08:00","severity":"warning","message":"There
 is a large number of deleted entries within the disk-store, please execute an 
offline
compaction.","metadata":
{code}
When the above warning appears, it means that the limit of 805306401 entries in the 
IntOpenHashSet has been reached. In that case, the server rolls over to a new 
IntOpenHashSet, where the warning and the delay can occur again.
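
The rollover can be pictured roughly as below; the class, the constant name and its 
handling are assumptions for illustration only:
{code:java}
import java.util.ArrayList;
import java.util.List;
import it.unimi.dsi.fastutil.ints.IntOpenHashSet;

// Hedged sketch of the "roll to a new set" behaviour described above.
class RollingIntIdSet {
  private static final int MAX_ENTRIES_PER_SET = 805306401; // limit from the warning above

  private final List<IntOpenHashSet> fullSets = new ArrayList<>();
  private IntOpenHashSet current = new IntOpenHashSet();

  void add(int id) {
    if (current.size() >= MAX_ENTRIES_PER_SET) {
      // Limit reached: the server warns ("There is a large number of deleted
      // entries...") and continues with a fresh set, where the same slow-down
      // can repeat.
      fullSets.add(current);
      current = new IntOpenHashSet();
    }
    current.add(id);
  }

  boolean contains(int id) {
    if (current.contains(id)) {
      return true;
    }
    for (IntOpenHashSet s : fullSets) {
      if (s.contains(id)) {
        return true;
      }
    }
    return false;
  }
}
{code}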

The problem is that, due to a fault in the fastutil dependency (affecting both 
IntOpenHashSet and LongOpenHashSet), unnecessary rehashing happens many times before 
the maximum size is reached: from 805306368 entries onwards, every new entry triggers 
another rehash of the full table until the maximum size is hit. This rehashing adds 
several minutes to .drf oplog recovery.
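
The value 805306368 matches fastutil's fill limit for its largest power-of-two backing 
table (0.75 * 2^30). Assuming the default load factor, the arithmetic can be verified 
with fastutil's own helper:
{code:java}
import it.unimi.dsi.fastutil.HashCommon;

public class RehashThreshold {
  public static void main(String[] args) {
    int largestTableSize = 1 << 30;   // largest power-of-two table fastutil will allocate
    float defaultLoadFactor = 0.75f;  // fastutil's default load factor (Hash.DEFAULT_LOAD_FACTOR)

    // Prints 805306368: per the description above, once the set holds this many
    // entries every further add() rehashes the already maximal table again,
    // until the hard size limit is reached.
    System.out.println(HashCommon.maxFill(largestTableSize, defaultLoadFactor));
  }
}
{code}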


> Oplog recovery takes too long due to fault in fastutil library
> --------------------------------------------------------------
>
>                 Key: GEODE-10401
>                 URL: https://issues.apache.org/jira/browse/GEODE-10401
>             Project: Geode
>          Issue Type: Bug
>            Reporter: Jakov Varenina
>            Assignee: Jakov Varenina
>            Priority: Major
>              Labels: needsTriage
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
