Hey Yehuda,

I work with Sean on the dev side. We thought we should put together a short 
report on what we’ve been seeing in the hopes that the behavior might make some 
sense to you.

We originally noticed these issues a while ago, with our first iteration of 
this particular Ceph deployment. They were characterized by two different 
behaviors:

        • Some objects would appear truncated, returning a different size on 
each request. Repeated attempts would eventually succeed, provided the second 
behavior did not also apply.
        • Some objects would always appear truncated, missing an integer 
multiple of 512KB.

This is where the report that we are encountering ‘truncation’ came from, and 
the term is slightly misleading. We recently verified that we are indeed 
hitting the first behavior, which is true truncation; I believe Sean has 
supplied, or will be supplying, Ceph logs showing the server-side errors. The 
second behavior, however, is not really truncation but missing 512KB chunks, 
as Sean has brought up.

We’ve had some luck identifying patterns that seem to be related to this 
issue. Without going into too much detail, the following appear to hold true 
for all objects affected by the second behavior:

        • The amount of data missing is always an integer multiple of 512KB.
        • The bucket index always reports the expected (full) file size.
        • The Ceph objects themselves do not appear to be missing chunks or to 
have holes in them.
        • The missing 512KB chunks are always at the beginning of multipart 
segments (1GB in our case).
        • For large files missing multiple chunks, the affected segments 
appear to be clustered and contiguous.

The first pattern was identified when we noticed that the bucket index and the 
object manifest differed in reported size. This gives us a quick method of 
identifying affected objects without having to pull down and check each one 
individually. In total, we have 108 affected objects, which translates to 
approximately 0.25% of our S3 objects.
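
In rough terms the check looks like the following (a sketch, not our exact 
script; s3cmd simply stands in for anything that produces an S3 listing, and 
the "obj_size" field name may differ between Ceph versions):

bucket=tcga_cghub_protected

# The S3 listing reports sizes from the bucket index, while radosgw-admin
# object stat reports the size recorded in the object manifest. A mismatch
# flags an affected object without downloading it.
s3cmd ls --recursive s3://${bucket} | while read -r d t size uri; do
    obj=${uri#s3://${bucket}/}
    manifest_size=$(radosgw-admin object stat --bucket=${bucket} --object="${obj}" \
        | grep -iE '"obj_size"' | grep -oE '[0-9]+' | head -1)
    [ "${size}" != "${manifest_size}" ] && \
        echo "MISMATCH ${obj}: index=${size} manifest=${manifest_size}"
done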

We noticed that the bucket index always reports the object size that would be 
expected had the upload gone correctly. Since we only ever report the segment 
sizes to the gateway, this would suggest that the segment sizes were reported 
accurately and aggregated correctly server side.

Sean identified the Ceph objects that compose one of our affected S3 objects. 
We thought we might see the first Ceph object missing some data, but found it 
to be a full 4MB. When we retrieved that first Ceph object and compared it 
against the corresponding bytes of the file, it matched the 4MB of the file 
immediately following the first 512KB. We took this as evidence that the data 
was never getting to Ceph in the first place. However, in our testing, we were 
unable to get the gateway to accept segments with less data than reported.
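
For that object the comparison amounted to something like this (the object and 
file names are placeholders for the ones Sean identified):

# Pull the first rados object backing the affected S3 object.
rados -p .rgw.buckets get <first-ceph-object> /tmp/first_obj

# The rados object is a full 4MB, but it lines up with the file only after
# skipping the first 512KB, i.e. it matches bytes 512K through 4.5M of the file.
dd if=original.bam bs=512K skip=1 count=8 2>/dev/null | cmp - /tmp/first_obj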

Dissecting some of the affected objects, we were able to identify the 
positions of the missing 512KB chunks. The positions align with the starting 
positions of 1GB segments, matching our multipart segment length. 
Additionally, for files missing a large number of chunks, the affected 
segments appeared to cluster (e.g. segments 161-163 and 170-189 in a 
~220-segment file). However, there did not appear to be any pattern as to 
which segments were affected.
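
The offset check itself is trivial; given a hand-assembled list of the byte 
offsets where 512KB went missing (missing_offsets.txt is just an illustrative 
name), something along these lines reports which 1GB segment each offset falls 
in and whether it sits exactly on a segment boundary:

awk '{ seg = int($1 / (1024*1024*1024));
       pos = ($1 % (1024*1024*1024) == 0) ? "segment start" : "mid-segment";
       print "offset " $1 ": segment " seg ", " pos }' missing_offsets.txt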


Some of our testing:

One of our first thoughts was that it might be only specific files, perhaps 
particular file sizes, that caused the behavior to manifest. We generated fresh 
copies of the files in question and re-uploaded them without error. We 
repeatedly uploaded the same object, and always received positive results.
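
The repeat test looked roughly like this (s3cmd with a 1GB chunk size stands 
in for our actual upload client, which uses the same segment length; names are 
placeholders):

for i in $(seq 1 20); do
    s3cmd put --multipart-chunk-size-mb=1024 test.bam s3://tcga_cghub_protected/retest/test.bam
    s3cmd get --force s3://tcga_cghub_protected/retest/test.bam /tmp/test.bam
    cmp test.bam /tmp/test.bam && echo "attempt ${i}: OK" || echo "attempt ${i}: MISMATCH"
done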

We next asked whether this could be connection-related. We simulated a 
connection drop by killing the upload processes early, which left uncompleted 
multipart uploads, as expected. Completing these resulted in truncated objects 
with a mismatch between the bucket index and the object manifest. This is the 
only test that reproduced the size mismatch; however, those objects were 
missing data from the end of the file (specifically the last segment) - actual 
truncation, as opposed to the missing 512KB at the beginning of segments.
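
Completing an abandoned upload just uses the standard S3 multipart calls; 
roughly (the aws CLI is shown for illustration only, and the endpoint, key and 
upload id are placeholders):

# Find the uncompleted uploads left behind by the killed clients.
aws --endpoint-url http://rgw.example.com s3api list-multipart-uploads \
    --bucket tcga_cghub_protected

# List the parts that actually arrived for one of them...
aws --endpoint-url http://rgw.example.com s3api list-parts \
    --bucket tcga_cghub_protected --key retest/test.bam --upload-id <UploadId>

# ...then complete the upload with only those parts (parts.json holds the
# PartNumber/ETag pairs taken from list-parts). The resulting object is short
# at the end, i.e. genuinely truncated.
aws --endpoint-url http://rgw.example.com s3api complete-multipart-upload \
    --bucket tcga_cghub_protected --key retest/test.bam --upload-id <UploadId> \
    --multipart-upload file://parts.json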

We then tried forging false multipart requests to replicate the behavior 
under the assumption of a misbehaving client. Providing data different from 
what was used to produce the MD5 digest always resulted in a BadDigest error. 
Forging the Content-Length header while providing the correct MD5 digest, to 
simulate a portion of a file not being sent, caused a short hang followed by a 
PermissionDenied error. In no case were we able to get a positive response 
from the gateway.


What is confusing about all of this is that the bucket index is correct. That 
is, the index holds the size that would be expected had the upload completed 
correctly. This suggests that the gateway correctly derived the full size from 
the segment sizes, which in turn means our upload reported those segment sizes 
correctly. Yet every time we have tried to send less data than is reported to 
the gateway, as would be the case if the first 512KB were never sent, we get 
an error.

We’re still working to track down where the issue is. We’re hoping it’s just 
something in the client that hiccuped, but any insight into these behaviors 
from a gateway perspective would be greatly appreciated.

Thanks,
Mark


> On May 7, 2015, at 12:09 AM, Yehuda Sadeh-Weinraub <[email protected]> wrote:
> 
> 
> 
> ----- Original Message -----
>> From: "Sean" <[email protected]>
>> To: "Yehuda Sadeh-Weinraub" <[email protected]>
>> Cc: [email protected]
>> Sent: Tuesday, May 5, 2015 12:14:19 PM
>> Subject: Re: [ceph-users] Civet RadosGW S3 not storing complete objects; 
>> civetweb logs stop after rotation
>> 
>> 
>> 
>> Hello Yehuda and the rest of the mailing list.
>> 
>> 
>> My main question currently is why are the bucket index and the object
>> manifest ever different? Based on how we are uploading data I do not think
>> that the rados gateway should ever know the full file size without having
>> all of the objects within ceph at one point in time. So after the multipart
>> is marked as completed Rados gateway should cat through all of the objects
>> and make a complete part, correct?
> 
> That's what *should* happen, but obviously there's some bug there.
> 
>> 
>> 
>> 
>> Secondly,
>> 
>> I think I am not understanding the process to grab all of the parts
>> correctly. To continue to use my example file
>> "86b6fad8-3c53-465f-8758-2009d6df01e9/TCGA-A2-A0T7-01A-21D-A099-09_IlluminaGA-DNASeq_exome.bam"
>> in bucket tcga_cghub_protected. I would be using the following to grab the
>> prefix:
>> 
>> 
>> prefix=$(radosgw-admin object stat --bucket=tcga_cghub_protected
>> --object=86b6fad8-3c53-465f-8758-2009d6df01e9/TCGA-A2-A0T7-01A-21D-A099-09_IlluminaGA-DNASeq_exome.bam
>> | grep -iE '"prefix"' | awk -F"\"" '{print $4}')
>> 
>> 
>> Which should take everything between quotes for the prefix key and give me
>> the value.
>> 
>> 
>> In this case::
>> 
>> "prefix":
>> "86b6fad8-3c53-465f-8758-2009d6df01e9\/TCGA-A2-A0T7-01A-21D-A099-09_IlluminaGA-DNASeq_exome.bam.2\/YAROhWaAm9LPwCHeP55cD4CKlLC0B4S",
>> 
>> 
>> So
>> 
>> lacadmin@kh10-9:~$ echo ${prefix}
>> 
>> 86b6fad8-3c53-465f-8758-2009d6df01e9\/TCGA-A2-A0T7-01A-21D-A099-09_IlluminaGA-DNASeq_exome.bam.2\/YAROhWaAm9LPwCHeP55cD4CKlLC0B4S
>> 
>> 
>> From here I list all of the objects in the .rgw.buckets pool and grep for
>> that said prefix which yields 1335 objects. From here if I cat all of these
>> objects together I only end up with a 5468160 byte file which is 2G short of
>> what the object manifest says it should be. If I grab the file and tail the
>> Rados gateway log I end up with 1849 objects and when I sum them all up I
> 
> How are these objects named?
> 
>> end up with 7744771642 which is the same size that the manifest reports. I
>> understand that this does nothing other than verify the manifests accuracy
>> but I still find it interesting. The missing chunks may still exist in ceph
>> outside of the object manifest and tagged with the same prefix, correct? Or
>> am I misunderstanding something?
> 
> Either it's missing a chunk, or one of the objects is truncated. Can you stat 
> all the parts? I expect most of the objects to have two different sizes 
> (e.g., 4MB, 1MB), but it is likely that the last part is smaller, and 
> maybe another object that is missing 512k. 
> 
>> 
>> 
>> We have over 40384 files in the tcga_cghub_protected bucket and only 66 of
>> these files are suffering from this truncation issue. What I need to know
>> is: is this happening on the gateway side or on the client side? Next I need
>> to know what possible actions can occur where the bucket index and the
>> object manifest would be mismatched like this as 40318 out of 40384 are
>> working without issue.
>> 
>> 
>> The truncated files are of all different sizes (5 megabytes - 980 gigabytes)
>> and the truncation seems to be all over. By "all over" I mean some files are
>> missing the first few bytes that should read "bam" and some are missing
>> parts in the middle.
> 
> Can you give an example of an object manifest for a broken object, and all 
> the rados objects that build it (e.g., the output of 'rados stat' on these 
> objects). A smaller object might be easier.
> 
>> 
>> 
>> So our upload code is using mmap to stream chunks of the file to the Rados
>> gateway via a multipart upload but no where on the client side do we have a
>> direct reference to the files we are using nor do we specify the size in
>> anyway. So where is the gateway getting the correct complete filesize from
>> and how is the bucket index showing the intended file size?
>> 
>> 
>> This implies that, at some point in time, ceph was able to see all of the
>> parts of the file and calculate the correct total size. This to me seems
>> like a rados gateway bug regardless of how the file is being uploaded. I
>> think that the RGW should be able to be fuzzed and still store the data
>> correctly.
>> 
>> 
>> Why is the bucket list not matching the bucket index and how can I verify
>> that the data is not being corrupted by the RGW or worse, after it is
>> committed to ceph ?
> 
> That's what we're trying to find out.
> 
> Thanks,
> Yehuda
> _______________________________________________
> ceph-users mailing list
> [email protected]
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

