Ok,
Source code review reveals that full sync is marker based, and sync errors
within a marker group *suggest* that data within the marker is re-checked
(I may be wrong about this, but it is consistent with my 304 errors below).
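If the markers work the way I think, something like the following should
dump the per-shard sync markers for a bucket (a sketch only; exact output
varies by version):

> radosgw-admin bucket sync markers --bucket <bucket>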
I do, however, have the following question:
Is there a way to otherwise abort a full sync of a bucket (started as a
result of radosgw-admin bucket sync init --bucket <bucket> and bucket sync
run, or a restart of radosgw) and have it just do incremental sync from
then on (yes, accepting that the objects will not be the same on both
sides prior to the 'restart' of incremental sync)?
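To tell whether such an abort 'took', I assume I could check the per-shard
state with something like this (a sketch; on my 18.2.4 systems the output
distinguishes full sync from incremental sync per shard):

> radosgw-admin bucket sync status --bucket <bucket>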
Would radosgw-admin bucket sync disable --bucket <bucket> followed by
radosgw-admin bucket sync enable --bucket <bucket> do this? Or would that
trigger another full sync rather than an incremental one? Thanks
-Chris
On Thursday, November 14, 2024 at 04:18:34 PM MST, Christopher Durham
<[email protected]> wrote:
Hi,
I have heard nothing on this, but have done some more research.
Again, both sides of a multisite s3 configuration are ceph 18.2.4 on Rocky 9.
For a given bucket, there are thousands of 'missing' objects. I did:
radosgw-admin bucket sync init --bucket <bucket> --src-zone <other side zone>
Sync starts after I restart a radosgw on the source zone that has a sync
thread.
But based on the number and size of objects needing replication, it NEVER
finishes, as more objects are created while I am going. I may need to
increase the number of radosgws and/or the sync threads.
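One thing I am considering (a sketch only; the instance names here are
placeholders, and I am assuming the common pattern of dedicating some
gateways to sync by toggling rgw_run_sync_thread):

# hypothetical instance names: keep client-facing gateways out of sync
# work, leave the sync thread on for dedicated sync gateways
> ceph config set client.rgw.client-facing rgw_run_sync_thread false
> ceph config set client.rgw.sync-dedicated rgw_run_sync_thread true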
What I have discovered is that if a radosgw on the side with missing
objects is restarted, all syncing starts over! In other words, it starts
polling each object, getting a 304 error in the radosgw log on the server
on the multisite side that has the missing objects. It *appears* to do
this sequential object scan in lexicographic order of object and/or prefix
name, although I cannot be sure.
So some questions:
1. Is there a recommendation/rule of thumb/formula for the number of
radosgws/sync threads/etc. based on number of objects, buckets, bandwidth,
etc.?
2. Why does the syncing restart for a bucket when a radosgw is restarted?
Is there a way to tell it to resume where it left off as opposed to
starting over? There may be reasons to restart a bucket sync if a radosgw
restarts, but there should be a way to checkpoint it, or force it to pick
up where it left off rather than starting over (see the sketch after this
list for how I have been watching the sync state).
3. Is there a way to 'abort' the sync and cause the bucket to think it is
up to date, and only replicate new objects from the time it was marked up
to date?
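For question 2, this is how I have been watching where sync stands between
restarts (a sketch; I am assuming the shard markers shown here are the
checkpoint state that gets re-scanned on restart):

> radosgw-admin sync status
> radosgw-admin data sync status --source-zone <other side zone>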
Thanks for any information
-Chris
On Friday, November 8, 2024 at 03:45:05 PM MST, Christopher Durham
<[email protected]> wrote:
I have a 2-site multisite configuration on ceph 18.2.4 on EL9.
After system updates, we discovered that a particular bucket had several
thousand objects missing, which the other side had. Newly created objects were
being replicated just fine.
I decided to 'restart' syncing that bucket. Here is what I did.
On the side with missing objects:
> radosgw-admin bucket sync init --bucket <bucketname> --src-zone <zone>
I restarted the radosgw that is set up to run the sync thread, in the same
zone where I ran the radosgw-admin command.
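For reference, the restart was just the systemd unit. The unit name
depends on the deployment; this is the non-cephadm form, with <instance>
as a placeholder:

> systemctl restart ceph-radosgw@rgw.<instance>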
Logs on the radosgw src-zone side show GETs with http code 200 for objects that
do not exist on the side with missing objects, and GETs with http 304 for
objects that already exist on the side with missing objects.
So far, so good.
As I said, the bucket is active. So on the src-zone side, data is
continually being written to /prefixA/../../. There is also data being
written to /prefixB/../../.
prefixA/ comes lexicographically before prefixB/.
What happens is that all the 304s happen as it scans the bucket, and then
it starts pulling with GETs and http 200s for the objects the side doing
the sync doesn't have. This is on /prefixA. When it 'catches up' with all
data in /prefixA at that moment, the sync seems to START OVER with
/prefixA, giving 304s for everything that existed in the bucket up to the
moment it caught up, then doing GETs with 200s for the remaining newer
objects. This happens over and over again. It NEVER gets to /prefixB. So
it seems to be periodically catching up to /prefixA, but never moving on
to /prefixB, which is also being written to.
There are 1.2 million objects in this bucket, totaling about 35 TiB. There
is a lifecycle expiration of 60 days on the bucket.
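For reference, those numbers come from something like the following (field
names in the output may differ slightly by version):

> radosgw-admin bucket stats --bucket <bucketname>
> radosgw-admin lc list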
Any thoughts would be appreciated.
-Chris
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]