The fast-diff feature is not enabled for the RBD images.
Could that be a reason why trimming is not happening?
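For reference, this is how I am checking the image features, and what I
understand the sequence to enable fast-diff to be (the image spec here is
just an example; fast-diff needs object-map, which needs exclusive-lock):

$ rbd info vm/image1 | grep features
$ rbd feature enable vm/image1 object-map fast-diff
$ rbd object-map rebuild vm/image1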

Karun Josy

On Sat, Jan 27, 2018 at 10:19 PM, Karun Josy <[email protected]> wrote:

> Hi David,
>
> Thank you for your reply! I really appreciate it.
>
> The images are in pool id 55. It is an erasure coded pool.
>
> ---------------
> $ echo $(( $(ceph pg 55.58 query | grep snap_trimq | cut -d[ -f2 | cut -d] -f1 | tr ',' '\n' | wc -l) - 1 ))
> 0
> $ echo $(( $(ceph pg 55.a query | grep snap_trimq | cut -d[ -f2 | cut -d] -f1 | tr ',' '\n' | wc -l) - 1 ))
> 0
> $ echo $(( $(ceph pg 55.65 query | grep snap_trimq | cut -d[ -f2 | cut -d] -f1 | tr ',' '\n' | wc -l) - 1 ))
> 0
> --------------
>
> The current osd_snap_trim_sleep value is the default,
> "osd_snap_trim_sleep": "0.000000", which I assume means there is no delay
> between trims. (I can't find any documentation related to it.)
> Will changing its value initiate snaptrimming? For example:
> ceph tell osd.* injectargs '--osd_snap_trim_sleep 0.05'
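> For reference, this is how I pulled the current value from one OSD (via
> the admin socket, run on that OSD's host):
>
> $ ceph daemon osd.0 config get osd_snap_trim_sleep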
>
> Also, we are using an rbd user with the profile below. It is the user
> used while deleting snapshots.
> -------
>         caps: [mon] profile rbd
>         caps: [osd] profile rbd pool=ecpool, profile rbd pool=vm, profile
> rbd-read-only pool=templates
> -------
>
> Could this be a reason?
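> For reference, this is how I am checking that user's caps (the client
> name here is just an example):
>
> $ ceph auth get client.rbduser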
>
> Also, can you let me know which logs to check while deleting snapshots to
> see if it is snaptrimming?
> I am sorry, I feel like I am pestering you too much.
> But in the mailing lists I can see you have dealt with similar issues with
> snapshots, so I think you can help me figure this mess out.
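> In case it helps, what I was planning to try (just a guess on my part) is
> to raise the OSD debug level and grep the OSD logs for trim-related
> messages while a snapshot is deleted:
>
> $ ceph tell osd.* injectargs '--debug_osd 10/10'
> $ grep -i snap_trim /var/log/ceph/ceph-osd.*.log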
>
>
> Karun Josy
>
> On Sat, Jan 27, 2018 at 7:15 PM, David Turner <[email protected]>
> wrote:
>
>> Prove* a positive
>>
>> On Sat, Jan 27, 2018, 8:45 AM David Turner <[email protected]> wrote:
>>
>>> Unless you have things in your snap_trimq, your problem isn't snap
>>> trimming. That is currently how you can check snap trimming and you say
>>> you're caught up.
>>>
>>> Are you certain that you are querying the correct pool for the images
>>> you are snapshotting? You showed that you tested 4 different pools; you
>>> should only need to check the pool with the images you are dealing with.
>>>
>>> You can inversely prove a positive by changing your snap_trim settings
>>> to not do any cleanup and seeing if the appropriate PGs have anything in
>>> their queue.
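>>> For example (from memory, so double check the flag name on your
>>> version), you could pause trimming cluster-wide, delete a snapshot, and
>>> then look at the queue of one of the pool's PGs:
>>>
>>> ceph osd set nosnaptrim
>>> ceph pg 55.80 query | grep snap_trimq
>>> ceph osd unset nosnaptrim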
>>>
>>> On Sat, Jan 27, 2018, 12:06 AM Karun Josy <[email protected]> wrote:
>>>
>>>> Are scrubbing and deep scrubbing necessary for the snaptrim operation
>>>> to happen?
>>>>
>>>> Karun Josy
>>>>
>>>> On Fri, Jan 26, 2018 at 9:29 PM, Karun Josy <[email protected]>
>>>> wrote:
>>>>
>>>>> Thank you for your quick response!
>>>>>
>>>>> I used the command to fetch the snap_trimq from many PGs; however, it
>>>>> seems they don't have anything in the queue.
>>>>>
>>>>> For eg :
>>>>> ====================
>>>>> $ echo $(( $(ceph pg 55.4a query | grep snap_trimq | cut -d[ -f2 | cut -d] -f1 | tr ',' '\n' | wc -l) - 1 ))
>>>>> 0
>>>>> $ echo $(( $(ceph pg 55.5a query | grep snap_trimq | cut -d[ -f2 | cut -d] -f1 | tr ',' '\n' | wc -l) - 1 ))
>>>>> 0
>>>>> $ echo $(( $(ceph pg 55.88 query | grep snap_trimq | cut -d[ -f2 | cut -d] -f1 | tr ',' '\n' | wc -l) - 1 ))
>>>>> 0
>>>>> $ echo $(( $(ceph pg 55.55 query | grep snap_trimq | cut -d[ -f2 | cut -d] -f1 | tr ',' '\n' | wc -l) - 1 ))
>>>>> 0
>>>>> $ echo $(( $(ceph pg 54.a query | grep snap_trimq | cut -d[ -f2 | cut -d] -f1 | tr ',' '\n' | wc -l) - 1 ))
>>>>> 0
>>>>> $ echo $(( $(ceph pg 34.1d query | grep snap_trimq | cut -d[ -f2 | cut -d] -f1 | tr ',' '\n' | wc -l) - 1 ))
>>>>> 0
>>>>> $ echo $(( $(ceph pg 1.3f query | grep snap_trimq | cut -d[ -f2 | cut -d] -f1 | tr ',' '\n' | wc -l) - 1 ))
>>>>> 0
>>>>> =====================
>>>>>
>>>>>
>>>>> While going through the PG query output, I find that these PGs have no
>>>>> value in the purged_snaps section either.
>>>>> For eg :
>>>>> ceph pg  55.80 query
>>>>> --
>>>>> ---
>>>>> ---
>>>>>  {
>>>>>             "peer": "83(3)",
>>>>>             "pgid": "55.80s3",
>>>>>             "last_update": "43360'15121927",
>>>>>             "last_complete": "43345'15073146",
>>>>>             "log_tail": "43335'15064480",
>>>>>             "last_user_version": 15066124,
>>>>>             "last_backfill": "MAX",
>>>>>             "last_backfill_bitwise": 1,
>>>>>             "purged_snaps": [],
>>>>>             "history": {
>>>>>                 "epoch_created": 5950,
>>>>>                 "epoch_pool_created": 5950,
>>>>>                 "last_epoch_started": 43339,
>>>>>                 "last_interval_started": 43338,
>>>>>                 "last_epoch_clean": 43340,
>>>>>                 "last_interval_clean": 43338,
>>>>>                 "last_epoch_split": 0,
>>>>>                 "last_epoch_marked_full": 42032,
>>>>>                 "same_up_since": 43338,
>>>>>                 "same_interval_since": 43338,
>>>>>                 "same_primary_since": 43276,
>>>>>                 "last_scrub": "35299'13072533",
>>>>>                 "last_scrub_stamp": "2018-01-18 14:01:19.557972",
>>>>>                 "last_deep_scrub": "31372'12176860",
>>>>>                 "last_deep_scrub_stamp": "2018-01-15 12:21:17.025305",
>>>>>                 "last_clean_scrub_stamp": "2018-01-18 14:01:19.557972"
>>>>>             },
>>>>>
>>>>> Not sure if it is related.
>>>>>
>>>>> The cluster is not open to any new clients. However, we see a steady
>>>>> growth of space usage every day.
>>>>> In the worst case, it might grow faster than we can add more space,
>>>>> which would be dangerous.
>>>>>
>>>>> Any help is really appreciated.
>>>>>
>>>>> Karun Josy
>>>>>
>>>>> On Fri, Jan 26, 2018 at 8:23 PM, David Turner <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> "snap_trimq": "[]",
>>>>>>
>>>>>> That is exactly what you're looking for to see how many objects a PG
>>>>>> still has that need to be cleaned up. I think something like this
>>>>>> should give you the number of objects in the snap_trimq for a PG.
>>>>>>
>>>>>> echo $(( $(ceph pg $pg query | grep snap_trimq | cut -d[ -f2 | cut -d] -f1 | tr ',' '\n' | wc -l) - 1 ))
>>>>>>
>>>>>> Note, I'm not at a computer and am typing this from my phone, so it's
>>>>>> not pretty and I know of a few ways to do that better, but it should
>>>>>> work all the same.
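>>>>>> If you have jq available, something like this should pull the same
>>>>>> field more cleanly:
>>>>>>
>>>>>> ceph pg $pg query | jq -r '.snap_trimq'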
>>>>>>
>>>>>> For your needs a visual inspection of several PGs should be
>>>>>> sufficient to see if there is anything in the snap_trimq to begin with.
>>>>>>
>>>>>> On Fri, Jan 26, 2018, 9:18 AM Karun Josy <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>>  Hi David,
>>>>>>>
>>>>>>> Thank you for the response. To be honest, I am afraid this is going
>>>>>>> to be an issue in our cluster.
>>>>>>> It seems snaptrim has not been running for some time now, maybe
>>>>>>> because we were expanding the cluster, adding nodes, for the past
>>>>>>> few weeks.
>>>>>>>
>>>>>>> I would be really glad if you could guide me on how to overcome this.
>>>>>>> The cluster has about 30 TB of data and 11 million objects, with
>>>>>>> about 100 disks spread across 16 nodes. The version is 12.2.2.
>>>>>>> Searching through the mailing lists, I can see many cases where
>>>>>>> performance was affected while snaptrimming.
>>>>>>>
>>>>>>> Can you help me figure out these:
>>>>>>>
>>>>>>> - How do I find the snaptrim queue of a PG?
>>>>>>> - Can snaptrim be started on just one PG?
>>>>>>> - How can I make sure cluster IO performance is not affected?
>>>>>>> I read about osd_snap_trim_sleep; how can it be changed?
>>>>>>> Is this the command: ceph tell osd.* injectargs '--osd_snap_trim_sleep 0.005'
>>>>>>>
>>>>>>> If yes, what is the recommended value that we can use?
>>>>>>>
>>>>>>> Also, which parameters should we be concerned about? I would really
>>>>>>> appreciate any suggestions.
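>>>>>>> In case it is useful, this is the kind of loop I was planning to run
>>>>>>> to check every PG in pool 55 (untested, and I am not sure snap_trimq
>>>>>>> is the right field to look at):
>>>>>>>
>>>>>>> for pg in $(ceph pg dump pgs_brief 2>/dev/null | awk '$1 ~ /^55\./ {print $1}'); do
>>>>>>>     echo -n "$pg: "
>>>>>>>     ceph pg $pg query | grep snap_trimq
>>>>>>> done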
>>>>>>>
>>>>>>>
>>>>>>> Below is a brief extract of a PG queried
>>>>>>> ----------------------------
>>>>>>> ceph pg  55.77 query
>>>>>>> {
>>>>>>>     "state": "active+clean",
>>>>>>>     "snap_trimq": "[]",
>>>>>>> ---
>>>>>>> ----
>>>>>>>
>>>>>>> "pgid": "55.77s7",
>>>>>>>             "last_update": "43353'17222404",
>>>>>>>             "last_complete": "42773'16814984",
>>>>>>>             "log_tail": "42763'16812644",
>>>>>>>             "last_user_version": 16814144,
>>>>>>>             "last_backfill": "MAX",
>>>>>>>             "last_backfill_bitwise": 1,
>>>>>>>             "purged_snaps": [],
>>>>>>>             "history": {
>>>>>>>                 "epoch_created": 5950,
>>>>>>> ---
>>>>>>> ---
>>>>>>> ---
>>>>>>>
>>>>>>>
>>>>>>> Karun Josy
>>>>>>>
>>>>>>> On Fri, Jan 26, 2018 at 6:36 PM, David Turner <[email protected]
>>>>>>> > wrote:
>>>>>>>
>>>>>>>> You may find the information in this ML thread useful.
>>>>>>>> https://www.spinics.net/lists/ceph-users/msg41279.html
>>>>>>>>
>>>>>>>> It talks about a couple ways to track your snaptrim queue.
>>>>>>>>
>>>>>>>> On Fri, Jan 26, 2018 at 2:09 AM Karun Josy <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> We have set the noscrub and nodeep-scrub flags on a Ceph cluster.
>>>>>>>>> When we delete snapshots, we are not seeing any change in used
>>>>>>>>> space.
>>>>>>>>>
>>>>>>>>> I understand that Ceph OSDs delete data asynchronously, so
>>>>>>>>> deleting a snapshot doesn't free up the disk space immediately.
>>>>>>>>> But we have not seen any change for some time.
>>>>>>>>>
>>>>>>>>> What could be the possible reason? Any suggestions would be really
>>>>>>>>> helpful, as the cluster usage seems to be growing each day even
>>>>>>>>> though snapshots are deleted.
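>>>>>>>>> For reference, this is how we are watching the flags and the
>>>>>>>>> overall usage:
>>>>>>>>>
>>>>>>>>> $ ceph osd dump | grep flags
>>>>>>>>> $ ceph df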
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Karun
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>
>>>>
>