Re: [ceph-users] Cleaning Up Failed Multipart Uploads

Yehuda Sadeh-Weinraub Wed, 03 Aug 2016 11:19:46 -0700

On Wed, Aug 3, 2016 at 10:57 AM, Brian Felton <[email protected]> wrote:


> I should clarify:
>
> There doesn't seem to be a problem with list_multipart_parts -- upon
> further review, it seems to be doing the right thing.  What tipped me off
> is that when one aborts a multipart upload where parts have been uploaded
> more than once, the last copy of each part uploaded is successfully removed
> (not just removed from the bucket's stats, as with complete multipart, but
> submitted for garbage collection).  The difference seems to be in the
> following:
>
> In RGWCompleteMultipart::execute, the removal doesn't occur on the entries
> returned from list_multipart_parts; instead, we initialize a 'src_obj'
> rgw_obj structure and grab its index key
> (src_obj.get_index_key(&remove_key)), which is then pushed onto remove_objs.
>

IIRC, we don't actually remove the objects there; we only remove their
entries from the index.


>
> In RGWAbortMultipart::execute, we operate directly on the
> RGWUploadPartInfo value in the obj_parts map, submitting it for deletion
> (gc) if its manifest is empty.
>
> If this is correct, there is no "fix" for list_multipart_parts; instead,
> it would seem that the only fix is to not allow an upload part to generate
> a new prefix in RGWPutObj::execute().
>

The problem is that operations can happen concurrently, so deciding whether
or not to remove an entry is not easy. We have seen applications initiate
multiple uploads of the same part where the upload that completed last was
not the last one started (e.g., due to networking timeouts and retries that
happen in different layers).
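
As a rough illustration of that ordering problem, here is a toy model in
Python. It is not RGW code, and every name in it (ToyMultipart,
finish_upload, abort) is hypothetical. It assumes only what is described in
this thread: each attempt at a part is written under its own prefix, and a
single entry is kept per part.

```python
# Toy model only; not RGW code. All names here are hypothetical.
class ToyMultipart:
    def __init__(self):
        self.objects = set()   # (prefix, part) for every copy written
        self.entry = {}        # part -> prefix of the copy that finished last

    def finish_upload(self, part, prefix):
        # Each attempt writes under its own prefix, so copies never overlap.
        self.objects.add((prefix, part))
        # Only one entry is kept per part; the previous prefix is forgotten.
        self.entry[part] = prefix

    def abort(self):
        # Cleanup can only reach the copies that still have an entry.
        for part, prefix in self.entry.items():
            self.objects.discard((prefix, part))
        self.entry.clear()
        return sorted(self.objects)  # copies nothing points to any more


mp = ToyMultipart()
# Attempt "b" was started after "a" but, because of a retry, "a" finishes last.
mp.finish_upload(1, "b")
mp.finish_upload(1, "a")
leaked = mp.abort()
print(leaked)  # [('b', 1)]
```

In this sketch the copy that survives is decided by completion order, not
start order, which is why the decision cannot be made when an upload begins.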


> Since I don't really have any context on why a new prefix would be
> generated if the object already exists, I'm not the least bit confident
> that changing it will not have all sorts of unforeseen consequences.  That
> said, since all knowledge of an uploaded part seems to vanish from
> existence once it has been replaced, I don't see how the accounting of
> multipart data will ever be correct.
>

Having a mutable part is problematic, since different uploads might step on
each other (as with the example I provided above), and you end up with
corrupted data.
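
A small Python sketch of what "step on each other" can look like. This is a
toy illustration, not RGW code; the object names, the three-stripe part and
the interleaving order are all made up for the example.

```python
# Toy illustration only; not RGW code. Names and the write order are made up.
def replay(writes):
    """Apply (object_name, data) writes in order and return the final store."""
    store = {}
    for name, data in writes:
        store[name] = data
    return store


# Two attempts (A and B) at the same 3-stripe part, with interleaved writes.
order = [("A", 0), ("B", 0), ("B", 1), ("A", 1), ("A", 2), ("B", 2)]

# Mutable part: both attempts write to the same object names.
shared = replay((f"part1.{i}", f"{who}{i}") for who, i in order)
# Per-attempt prefix: each attempt writes to its own object names.
separate = replay((f"{who}.part1.{i}", f"{who}{i}") for who, i in order)

mixed = [shared[f"part1.{i}"] for i in range(3)]
intact = [separate[f"A.part1.{i}"] for i in range(3)]
print(mixed)   # ['B0', 'A1', 'B2']
print(intact)  # ['A0', 'A1', 'A2']
```

With shared names the final part is a mix of both attempts; with a prefix
per attempt each copy stays whole, at the cost of extra copies to clean up.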


>
> And yes, I've tried the orphan find, but I'm not really sure what to do
> with the results.  The post I could find in the mailing list (mostly from
> you), seemed to conclude that no action should be taken on the things that
> it finds are orphaned.  Also, I have removed a significant number of
> multipart and shadow files that are not valid, but none of that actually
>

The tool does not remove any data; it only reports rados objects that may
have been leaked.


> updates the buckets stats to the correct values.  If I had some mechanism
> for forcing that, this would be much less of a big deal.
>

Right, this is a separate issue. Did you try running 'radosgw-admin bucket
check --fix'?

Yehuda


>
>
> Brian
>
> On Wed, Aug 3, 2016 at 12:46 PM, Yehuda Sadeh-Weinraub <[email protected]>
> wrote:
>
>>
>>
>> On Wed, Aug 3, 2016 at 10:10 AM, Brian Felton <[email protected]> wrote:
>>
>>> This may just be me having a conversation with myself, but maybe this
>>> will be helpful to someone else.
>>>
>>> Having dug and dug and dug through the code, I've come to the following
>>> realizations:
>>>
>>>    1. When a multipart upload is completed, the function
>>>    list_multipart_parts in rgw_op.cc is called.  This seems to be the
>>>    start of the problems, as it will only return those parts in the
>>>    'multipart' namespace that include the upload id in the name,
>>>    irrespective of how many copies of parts exist on the system with
>>>    non-upload id prefixes
>>>    2. In the course of writing to the OSDs, a list (remove_objs) is
>>>    processed in cls_rgw.cc:unaccount_entry(), causing bucket stats to be
>>>    decremented
>>>    3. These decremented stats are written to the bucket's index
>>>    entry/entries in .rgw.buckets.index via the CEPH_OSD_OP_OMAPSETHEADER
>>>    case in ReplicatedPG::do_osd_ops
>>>
>>> So this explains why manually removing the multipart entries from
>>> .rgw.buckets and cleaning the shadow entries in .rgw.buckets.index does not
>>> cause the bucket's stats to be updated.  What I don't know how to do is
>>> force an update of the bucket's stats from the CLI.  I can retrieve the
>>> omap header from each of the bucket's shards in .rgw.buckets.index, but I
>>> don't have the first clue how to read the data or rebuild it into something
>>> valid.  I've searched the docs and mailing list archives, but I didn't find
>>> any solution to this problem.  For what it's worth, I've tried 'bucket
>>> check' with all combinations of '--check-objects' and '--fix' after
>>> cleaning up .rgw.buckets and .rgw.buckets.index.
>>>
>>> From a long-term perspective, it seems there are two possible fixes here:
>>>
>>>    1. Update the logic in list_multipart_parts to return all the parts
>>>    for a multipart object, so that *all* parts in the 'multipart' namespace
>>>    can be properly removed
>>>    2. Update the logic in RGWPutObj::execute() to not restart a write
>>>    if the put_data_and_throttle() call returns -EEXIST but instead put the
>>>    data in the original file(s)
>>>
>>> While I think 2 would involve the least amount of yak shaving with the
>>> multipart logic since the MP logic already assumes a happy path where all
>>> objects have a prefix of the multipart upload id, I'm all but certain this
>>> is going to horribly break many other parts of the system that I don't
>>> fully understand.
>>>
>>
>> #2 is dangerous. That was the original behavior, and it is racy and
>> *will* lead to data corruption.  OTOH, I don't think #1 is an easy option.
>> We only keep a single entry per part, so we don't really have a good way to
>> see all the uploaded pieces. We could extend the meta object to keep a
>> record of all the uploaded parts and, at the end, when assembling
>> everything, remove the parts that aren't part of the final assembly.
>>
>>> The good news is that the assembly of the multipart object is being done
>>> correctly; what I can't figure out is how it knows about the non-upload id
>>> prefixes when creating the metadata on the multipart object in
>>> .rgw.buckets.  My best guess is that it's copying the metadata from the
>>> 'meta' object in .rgw.buckets.extra (which is correctly updated with the
>>> new part prefixes after each successful upload), but I haven't absolutely
>>> confirmed that.
>>>
>>
>> Yeah, something along these lines.
>>
>>
>>> If one of the developer folk that are more familiar with this could
>>> weigh in, I would be greatly appreciative.
>>>
>>
>> btw, did you try to run the radosgw-admin orphan find tool?
>>
>> Yehuda
>>
>>> Brian
>>>
>>> On Tue, Aug 2, 2016 at 8:59 AM, Brian Felton <[email protected]> wrote:
>>>
>>>> I am actively working through the code and debugging everything.  I
>>>> figure the issue is with how RGW is listing the parts of a multipart upload
>>>> when it completes or aborts the upload (read: it's not getting *all* the
>>>> parts, just those that are either most recent or tagged with the upload
>>>> id).  As soon as I can figure out a patch, or, more importantly, how to
>>>> manually address the problem, I will respond with instructions.
>>>>
>>>> The reported bug contains detailed instructions on reproducing the
>>>> problem, so it's trivial to reproduce and test on a small and/or new
>>>> cluster.
>>>>
>>>> Brian
>>>>
>>>>
>>>> On Tue, Aug 2, 2016 at 8:53 AM, Tyler Bishop <
>>>> [email protected]> wrote:
>>>>
>>>>> We're having the same issues.  I have a 1200TB pool at 90%
>>>>> utilization; however, disk utilization is only 40%.
>>>>>
>>>>> *Tyler Bishop *Chief Technical Officer
>>>>> 513-299-7108 x10
>>>>>
>>>>> [email protected]
>>>>>
>>>>>
>>>>> ------------------------------
>>>>> *From: *"Brian Felton" <[email protected]>
>>>>> *To: *"ceph-users" <[email protected]>
>>>>> *Sent: *Wednesday, July 27, 2016 9:24:30 AM
>>>>> *Subject: *[ceph-users] Cleaning Up Failed Multipart Uploads
>>>>>
>>>>> Greetings,
>>>>>
>>>>> Background: If an object storage client re-uploads parts to a
>>>>> multipart object, RadosGW does not clean up all of the parts properly when
>>>>> the multipart upload is aborted or completed.  You can read all of the
>>>>> gory details (including reproduction steps) in this bug report:
>>>>> http://tracker.ceph.com/issues/16767.
>>>>>
>>>>> My setup: Hammer 0.94.6 cluster only used for S3-compatible object
>>>>> storage.  RGW stripe size is 4MiB.
>>>>>
>>>>> My problem: I have buckets that are reporting TB more utilization
>>>>> (and, in one case, 200k more objects) than they should report.  I am
>>>>> trying to remove the detritus from the multipart uploads, but removing
>>>>> the leftover parts directly from the .rgw.buckets pool is having no
>>>>> effect on bucket utilization (i.e. neither the object count nor the
>>>>> space used is declining).
>>>>>
>>>>> To give an example, I have a client that uploaded a very large
>>>>> multipart object (8000 15MiB parts).  Due to a bug in the client, it
>>>>> uploaded each of the 8000 parts 6 times.  After the sixth attempt, it gave
>>>>> up and aborted the upload, at which point RGW removed the 8000 parts from
>>>>> the sixth attempt.  When I list the bucket's contents with radosgw-admin
>>>>> (radosgw-admin bucket list --bucket=<bucket> --max-entries=<size of
>>>>> bucket>), I see all of the object's 8000 parts five separate times, each
>>>>> under a namespace of 'multipart'.
>>>>>
>>>>> Since the multipart upload was aborted, I can't remove the object by
>>>>> name via the S3 interface.  Since my RGW stripe size is 4MiB, I know that
>>>>> each part of the object will be stored across 4 entries in the
>>>>> .rgw.buckets pool -- 4 MiB in a 'multipart' file, and 4, 4, and 3 MiB in
>>>>> three successive 'shadow' files.  I've created a script to remove these
>>>>> parts (rados -p .rgw.buckets rm
>>>>> <bucket_id>__multipart_<object+prefix>.<part> and rados -p .rgw.buckets rm
>>>>> <bucket_id>__shadow_<object+prefix>.<part>.[1-3]).  The removes are
>>>>> completing successfully (in that additional attempts to remove the object
>>>>> result in a failure), but I'm not seeing any decrease in the bucket's
>>>>> space used, nor am I seeing a decrease in the bucket's object count.  In
>>>>> fact, if I do another 'bucket list', all of the removed parts are still
>>>>> included.
>>>>>
>>>>> I've looked at the output of 'gc list --include-all', and the removed
>>>>> parts are never showing up for garbage collection.  Garbage collection is
>>>>> otherwise functioning normally and will successfully remove data for any
>>>>> object properly removed via the S3 interface.
>>>>>
>>>>> I've also gone so far as to write a script to list the contents of
>>>>> bucket shards in the .rgw.buckets.index pool, check for the existence of
>>>>> the entry in .rgw.buckets, and remove entries that cannot be found, but
>>>>> that is also failing to decrement the size/object count counters.
>>>>>
>>>>> What am I missing here?  Where, aside from .rgw.buckets and
>>>>> .rgw.buckets.index is RGW looking to determine object count and space used
>>>>> for a bucket?
>>>>>
>>>>> Many thanks to any and all who can assist.
>>>>>
>>>>> Brian Felton
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> ceph-users mailing list
>>>>> [email protected]
>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>
>>>>
>>>>
>>>
>>>
>>
>

