Greetings,

We are running a number of Ceph clusters in production to provide object
storage services.  We have stumbled upon an issue where objects of certain
sizes are irretrievable.  The symptoms closely resemble those of the bug
fixed here:
https://www.redhat.com/archives/rhsa-announce/2015-November/msg00060.html.
We can put objects into the cluster via s3/radosgw, but we cannot retrieve
them (cluster closes the connection without delivering all bytes).
Unfortunately, that fix does not apply to us, as we are and have always
been running Hammer; this appears to be a new edge case.

We have reproduced this issue on the 0.94.3, 0.94.4, and 0.94.6 releases
of Hammer.

We have reproduced this issue on three different storage hardware
configurations: five clusters, each running 648 6TB OSDs across nine
physical nodes; one cluster running 30 10GB OSDs across ten VM nodes; and
one cluster running 288 6TB OSDs across four physical nodes.

We have determined that this issue only occurs when using erasure coding
(we've only tested plugin=jerasure technique=reed_sol_van
ruleset-failure-domain=host).
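For reference, the pool setup looked roughly like the following (the
profile and pool names here are illustrative, and k/m varied across our
tests):

```shell
# Illustrative EC profile and pool creation; names and k/m are examples,
# not our exact production values.
ceph osd erasure-code-profile set ecprofile \
    k=6 m=3 plugin=jerasure technique=reed_sol_van \
    ruleset-failure-domain=host
ceph osd pool create .rgw.buckets 2048 2048 erasure ecprofile
```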

Objects of exactly 4.5 MiB (4718592 bytes) can be placed into the cluster
but not retrieved.  Objects at every interval of `rgw object stripe size`
thereafter (in our case, 4 MiB) are similarly irretrievable.  We have
tested this from 4.5 to 24.5 MiB and have spot-checked much larger values
to confirm the pattern holds.  A small range of object sizes just below
each boundary is also irretrievable.  After much testing, we have found
the size of this range to be strongly correlated with the k value of our
erasure coded pool; the m value has no effect on the window size.  We
have tested values of k from 2 to 9 and observed the following windows:

k = 2, m = 1 -> No error
k = 3, m = 1 -> 32 bytes (i.e. errors when objects are inclusively
between 4718561 and 4718592 bytes)
k = 3, m = 2 -> 32 bytes
k = 4, m = 2 -> No error
k = 4, m = 1 -> No error
k = 5, m = 4 -> 128 bytes
k = 6, m = 3 -> 512 bytes
k = 6, m = 2 -> 512 bytes
k = 7, m = 1 -> 800 bytes
k = 7, m = 2 -> 800 bytes
k = 8, m = 1 -> No error
k = 9, m = 1 -> 800 bytes

The "bytes" value is the size of a 'dead zone': a range of object sizes
that can be put into the cluster but not retrieved.  The range runs from
4.5 MiB down to (4.5 MiB - window + 1) bytes, inclusive.  Up through
k = 9, the error occurs only for values of k that are not powers of two,
at which point the "dead zone" window is (k-2)^2 * 32 bytes.  My team has
not been able to determine why the window plateaus at 800 bytes for k = 9
(the formula predicts 1568 bytes there).
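To make the size math above concrete, here is a minimal sketch using only
the figures reported in this message; nothing here is derived from Ceph
source.  `observed` holds our measured windows per k, and
`predicted_window()` encodes the (k-2)^2 * 32 model, which matches
k = 3, 6, and 7 but overshoots our measurements at k = 5 and k = 9:

```python
# Dead-zone size math from our measurements; purely illustrative.
BOUNDARY = 4718592  # 4.5 MiB, the first irretrievable object size

# Measured dead-zone window (bytes) per k value, from the table above.
observed = {2: 0, 3: 32, 4: 0, 5: 128, 6: 512, 7: 800, 8: 0, 9: 800}

def predicted_window(k):
    """(k-2)^2 * 32 model; powers of two showed no error in our tests."""
    if k & (k - 1) == 0:  # k is a power of two
        return 0
    return (k - 2) ** 2 * 32

def dead_zone(k, boundary=BOUNDARY):
    """Inclusive (low, high) irretrievable size range, from measured data."""
    window = observed[k]
    if window == 0:
        return None
    return (boundary - window + 1, boundary)
```

For k = 3 this yields the inclusive range 4718561-4718592 bytes quoted in
the table, and for k = 9 the model's 1568-byte prediction is where our
800-byte plateau diverges.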

This issue cannot be reproduced using rados to place objects directly into
EC pools.  The issue has only been observed when using RadosGW's S3
interface.

The issue can be reproduced with any S3 client (s3cmd, s3curl, CyberDuck,
CloudBerry Backup, and many others have been tested).
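For anyone attempting to verify this independently, here is a hedged
helper that enumerates the object sizes worth probing -- each stripe
boundary and the suspected window just below it.  The 4 MiB stripe size
and 4.5 MiB first boundary are from our configuration, not universal
defaults; the resulting sizes can be uploaded and fetched back with any
S3 client:

```python
# Generate object sizes (bytes) to PUT and then GET back, assuming our
# configuration: first boundary at 4.5 MiB, `rgw object stripe size` = 4 MiB.
STRIPE = 4 * 1024 * 1024   # rgw object stripe size in our clusters
FIRST_BOUNDARY = 4718592   # 4.5 MiB

def probe_sizes(window, intervals=1):
    """For each of the first `intervals` boundaries, return the boundary
    itself plus the `window` - 1 sizes immediately below it."""
    sizes = []
    for n in range(intervals):
        boundary = FIRST_BOUNDARY + n * STRIPE
        sizes.extend(range(boundary - window + 1, boundary + 1))
    return sizes
```

An object of each returned size that uploads cleanly but fails to
download intact would match the behavior we describe.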

At this point, we are evaluating the Ceph codebase in an attempt to patch
the issue.  As this is an issue affecting data retrievability (and possibly
integrity), we wanted to bring this to the attention of the community as
soon as we could reproduce the issue.  We are hoping both that others out
there can independently verify and possibly that some with a more intimate
understanding of the codebase could investigate and propose a fix.  We have
observed this issue in our production clusters, so it is a very high
priority for my team.

Furthermore, we believe the objects are corrupted at the point they are
placed into the cluster.  We copied the .rgw.buckets pool to a
non-erasure-coded pool and swapped the pool names; objects copied from
the EC pool remain irretrievable once RGW is pointed at the non-EC pool.
If we overwrite such an object in the non-EC pool with the original data,
it becomes retrievable again.  We have not tested this as exhaustively,
but we felt it important enough to mention.

I'm sure I've omitted some details here that would aid in an investigation,
so please let me know what other information I can provide.  My team will
be filing an issue shortly.

Many thanks,

Brian Felton
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com