Re: Ceph hard lock Hammer 9.2
Sure, I guess it's actually a soft kernel lock, since it's only the filesystem that is hung with high IO wait. The kernel is 4.0.4-1.el6.elrepo.x86_64. The Ceph version is 0.94.2 (sorry about the confusion; I missed a 4 when I typed the subject line). I was testing copying 100,000 files from directory (dir1) to (dir1-`hostname`) on three separate hosts. Two of the hosts completed the job and the third one hung with the stack trace in /var/log/messages.

On Tue, Jun 23, 2015 at 6:54 AM, Gregory Farnum g...@gregs42.com wrote:
On Mon, Jun 22, 2015 at 9:45 PM, Barclay Jameson almightybe...@gmail.com wrote: Has anyone seen this?
Can you describe the kernel you're using, the workload you were running, the Ceph cluster you're running against, etc?

Jun 22 15:09:27 node kernel: Call Trace:
Jun 22 15:09:27 node kernel: [816803ee] schedule+0x3e/0x90
Jun 22 15:09:27 node kernel: [8168062e] schedule_preempt_disabled+0xe/0x10
Jun 22 15:09:27 node kernel: [81681ce3] __mutex_lock_slowpath+0x93/0x100
Jun 22 15:09:27 node kernel: [a060def8] ? __cap_is_valid+0x58/0x70 [ceph]
Jun 22 15:09:27 node kernel: [81681d73] mutex_lock+0x23/0x40
Jun 22 15:09:27 node kernel: [a0610f2d] ceph_check_caps+0x38d/0x780 [ceph]
Jun 22 15:09:27 node kernel: [812f5a9b] ? __radix_tree_delete_node+0x7b/0x130
Jun 22 15:09:27 node kernel: [a0612637] ceph_put_wrbuffer_cap_refs+0xf7/0x240 [ceph]
Jun 22 15:09:27 node kernel: [a060b170] writepages_finish+0x200/0x290 [ceph]
Jun 22 15:09:27 node kernel: [a05e2731] handle_reply+0x4f1/0x640 [libceph]
Jun 22 15:09:27 node kernel: [a05e3065] dispatch+0x85/0xa0 [libceph]
Jun 22 15:09:27 node kernel: [a05d7ceb] process_message+0xab/0xd0 [libceph]
Jun 22 15:09:27 node kernel: [a05db052] try_read+0x2d2/0x430 [libceph]
Jun 22 15:09:27 node kernel: [a05db7e8] con_work+0x78/0x220 [libceph]
Jun 22 15:09:27 node kernel: [8108c475] process_one_work+0x145/0x460
Jun 22 15:09:27 node kernel: [8108c8b2] worker_thread+0x122/0x420
Jun 22 15:09:27 node kernel: [8167fdb8] ? __schedule+0x398/0x840
Jun 22 15:09:27 node kernel: [8108c790] ? process_one_work+0x460/0x460
Jun 22 15:09:27 node kernel: [8108c790] ? process_one_work+0x460/0x460
Jun 22 15:09:27 node kernel: [8109170e] kthread+0xce/0xf0
Jun 22 15:09:27 node kernel: [81091640] ? kthread_freezable_should_stop+0x70/0x70
Jun 22 15:09:27 node kernel: [81683dd8] ret_from_fork+0x58/0x90
Jun 22 15:09:27 node kernel: [81091640] ? kthread_freezable_should_stop+0x70/0x70
Jun 22 15:11:27 node kernel: INFO: task kworker/2:1:40 blocked for more than 120 seconds.
Jun 22 15:11:27 node kernel: Tainted: G I 4.0.4-1.el6.elrepo.x86_64 #1
Jun 22 15:11:27 node kernel: echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables this message.
Jun 22 15:11:27 node kernel: kworker/2:1 D 881ff279f7f8 0 40 2 0x
Jun 22 15:11:27 node kernel: Workqueue: ceph-msgr con_work [libceph]
Jun 22 15:11:27 node kernel: 881ff279f7f8 881ff261c010 881ff2b67050 88207fd95270
Jun 22 15:11:27 node kernel: 881ff279c010 88207fd15200 7fff 0002
Jun 22 15:11:27 node kernel: 81680ae0 881ff279f818 816803ee 810ae63b
librbd cacher lock protection?
Hi All,

I'm investigating the librbd code related to caching (ObjectCacher). What I cannot find is how data integrity is protected when there is a 'cache miss' (full or partial). It looks like _readx exits with 'defer' and cache_lock is released (and locked again in LibWriteback). The BufferHeads are marked as 'rx' but not protected against writes. Writex neither skips nor checks for any such BH; it just populates data in the cache. That confuses me. So where is the protection? How does the cache integrity protection actually work?

Thanks,
maciej
Re: librbd cacher lock protection?
You are correct that an rx BH can be overwritten by a write request -- this will then mark the BH as dirty. If it's a partial overwrite, the BH will be split and only the affected section's state will be changed to dirty. When the outstanding read request completes, it will invoke ObjectCacher::bh_read_finish, which verifies that the BH is still in the rx state (and that the transaction ids match) before overwriting the data. The pending client read request will then complete and will be provided the latest contents of the BH(s).

-- Jason Dillaman Red Hat dilla...@redhat.com http://www.redhat.com

- Original Message -
From: Maciej Patelczyk maciej.patelc...@intel.com
To: ceph-devel@vger.kernel.org
Sent: Tuesday, June 23, 2015 11:21:32 AM
Subject: librbd cacher lock protection?

Hi All, I'm investigating the librbd code related to caching (ObjectCacher). What I cannot find is how data integrity is protected when there is a 'cache miss' (full or partial). It looks like _readx exits with 'defer' and cache_lock is released (and locked again in LibWriteback). The BufferHeads are marked as 'rx' but not protected against writes. Writex neither skips nor checks for any such BH; it just populates data in the cache. That confuses me. So where is the protection? How does the cache integrity protection actually work? Thanks, maciej
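A rough sketch of the check described above, for illustration only: the BufferHead type, its states, and the last_read_tid field below are simplified stand-ins, not the actual ObjectCacher source.

#include <cstdint>
#include <string>

struct BufferHead {
    enum State { STATE_RX, STATE_DIRTY, STATE_CLEAN } state;
    uint64_t last_read_tid;   // tid of the read issued for this BH
    std::string data;

    bool is_rx() const { return state == STATE_RX; }
};

// Called when an outstanding OSD read completes. If a write raced in, the BH
// was already flipped to dirty (or split), so the stale read result must not
// overwrite the newer data; the waiting reader still sees the latest contents.
void bh_read_finish_sketch(BufferHead& bh, uint64_t tid, const std::string& read_data) {
    if (bh.is_rx() && bh.last_read_tid == tid) {
        bh.data = read_data;
        bh.state = BufferHead::STATE_CLEAN;
    }
    // else: drop the read result; the dirty data written meanwhile wins.
}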
Re: OSD-Based Object Stubs
On Sat, Jun 20, 2015 at 11:18 AM, Marcel Lauhoff m...@irq0.org wrote: Hi, thanks for the comments!

Gregory Farnum g...@gregs42.com writes: On Thu, May 28, 2015 at 3:01 AM, Marcel Lauhoff m...@irq0.org wrote: Gregory Farnum g...@gregs42.com writes: Do you have a shorter summary than the code of how these stub and unstub operations relate to the object redirects? We didn't make a great deal of use of them, but the basic data structures are mostly present in the codebase, are interpreted in at least some of the right places, and were definitely intended to cover this kind of use case. :) -Greg

As far as I understood the redirect feature, it is about pointing to other objects inside the Ceph cluster. The stubs feature allows pointing to anything: an HTTP server in the concept code. Stubs then use an IMHO simpler approach to getting objects back: it's the task of the OSD. Stubbed objects just take longer to access, due to unstubbing them first. Redirects, on the other hand, leave this to the client: the object is redirected, so tell the client to retrieve it elsewhere.

Ah, of course. I got a chance to look at this briefly today. Some notes:

* You're using synchronous reads. That will prevent use of stubbing on EC pools (which only do async reads, as they might need to hit another OSD for the data), which seems sad. Good point. I didn't look at how EC pools work yet. I assumed that a stub feature would be quite different for the two pool types and tried the replicated one first. I'm not sure that will be necessary, actually. The advantage of only doing GET/PUT (unstub/stub) is that you're doing only full-object reads and writes; it doesn't require any of the features EC pools don't provide.

* There seems to be a race if you need to unstub an object for two separate requests that come in simultaneously, with nothing preventing both of them from initiating the unstub. Right. I should probably add some in-flight states there.

* You can inject an unstub for read ops, but that turns them into a write. That will cause problems in various cases where the object isn't writeable yet. I thought I fixed that by doing ctx->op->set_write() in the implicit unstub code. No, the implicit unstub will have to be more involved than that. :( RADOS writes aren't allowed to return any data to the user except for a return code, and I believe that's enforced at the end by clearing out/ignoring any of the return bufferlists we would otherwise pack up. This is because we have to be able to return the exact same stuff on replayed ops, in case the acting set of OSDs changes without the client getting a response. Now, the unstub is a bit different in that the data doesn't change in response to the user requiring an unstub, but I think it still has some parallelism issues in that scenario.

* Why does a delete need the object data? That was just a shortcut: in the quite simplistic remote API there is only put and get. An unstub before delete also deletes the remote object.

* You definitely wouldn't want to unstub data for scrubbing. What's the alternative? Should the remote do the scrubbing, or should we just skip the stubbed object? I think you'd want to scrub both the full and stub metadata for the object, but rely on the stub target to keep the actual bundle of bytes safe.

* There's a CEPH_OSD_OP_STAT which looks at what's in the object info; that is broken here because you're using the normal truncation path. There probably needs to be more cleverness or machinery distinguishing between the local size used and the size of the object represented. Of course.

* I think snapshots are probably busted with this; did you check how they interact? With this implementation I think they really are. Stubs plus snapshots could be a nice thing for backups: just stub a read-only snapshot.

Right, so all of these things will need to be worked out well before we contemplate merging, and some of them are complicated enough that they might require changing the core implementation to handle. You probably don't want to delay it. :) -Greg
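A hypothetical sketch of the "in-flight state" guard Marcel mentions above for the concurrent-unstub race: the first op to hit a stubbed object starts the unstub, later ops queue behind it. None of these names exist in the Ceph tree; this only illustrates the shape of the bookkeeping.

#include <list>
#include <map>
#include <string>

struct PendingOp { /* queued client op; details omitted */ };

class StubTracker {
    std::map<std::string, std::list<PendingOp>> unstub_in_flight;  // oid -> waiters
public:
    // Returns true if the caller should start the unstub; false if one is
    // already running and the op has been queued behind it.
    bool start_or_wait(const std::string& oid, const PendingOp& op) {
        auto it = unstub_in_flight.find(oid);
        if (it != unstub_in_flight.end()) {
            it->second.push_back(op);    // someone else is already fetching the data
            return false;
        }
        unstub_in_flight[oid] = {op};    // we own the unstub; requeue op when done
        return true;
    }

    // Called once the object data is back; returns the ops to requeue.
    std::list<PendingOp> finish(const std::string& oid) {
        std::list<PendingOp> waiters = std::move(unstub_in_flight[oid]);
        unstub_in_flight.erase(oid);
        return waiters;
    }
};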
librados clone_range
ObjectWriteOperations currently allow you to perform a clone_range from another object with the same object locator. Years ago, rgw used this as part of multipart upload. Today, the implementation complicates the OSD considerably, and it doesn't appear to have any users left. Is there anyone who would be sad to see it removed from the librados interface? -Sam
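For readers who haven't seen it used, a hedged sketch of the call being discussed, mainly to make the same-object-locator constraint concrete. The exact clone_range() argument order here is from memory and should be treated as an assumption rather than a reference; the object and pool names are made up.

#include <rados/librados.hpp>

int copy_part(librados::IoCtx& ioctx) {
    // Both objects must hash to the same placement group, which is why the
    // source has to share the destination's object locator.
    ioctx.locator_set_key("final");

    librados::ObjectWriteOperation op;
    // Copy 4 MiB from offset 0 of "upload.part1" into "final" at offset 0
    // (assumed order: dst_off, src_oid, src_off, len).
    op.clone_range(0, "upload.part1", 0, 4 * 1024 * 1024);
    return ioctx.operate("final", &op);
}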
Re: RGW S3 Website hosting, non-clean code for early review
- Original Message -
From: Robin H. Johnson robb...@gentoo.org
To: ceph-devel@vger.kernel.org
Cc: Jonathan LaCour jonathan.lac...@dreamhost.com
Sent: Tuesday, June 23, 2015 2:33:25 AM
Subject: RGW S3 Website hosting, non-clean code for early review

Hi, As an extension of earlier work done by Yehuda [1], I've gotten the great majority of the work done to support static website hosting in RGW, just like AmazonS3 [2]. I need to do some cleanups of the code prior to major review for submission, solve one thorny problem, and have a few discussions about best courses of action, and then I'll be submitting this for more reviews before merging. ceph [3], s3-tests unit tests [4], s3-tests fuzzer tests [5].

The thorny problem:
One of the pieces of functionality in S3Website is the ability to serve any public object in the bucket as the content of a custom error page (think shiny 404 error). In some cases, like trivial 403/404 errors, we can determine this quite early, before we fetch the object, and redirect the request to the error object instead (provided that we also redo the ACL check on the error object). In more complex cases (e.g. 416 Range Unsatisfiable, 412 Precondition Failed), it happens very late in the RGW request processing, and the req_state struct seems to have been mangled/pre-filled with a lot of decisions that aren't solvable. Either I have to repeat a lot of code for it, which I'm not happy about, or I have to refactor RGWGetObj* to more safely make the second GET request for the error object (and make sure range headers etc. are NOT used for the get of the error object). I'm leaning to the latter.

Is generating a new req_state a possibility? E.g., you catch the error at the top level, and restart most of the request processing with a newly created req_state?

Oh, and for added fun, if an error object is configured but is missing or private, you get a similar but different error than without any error object configured, and sometimes the error codes are in the headers, but not always.

Discussion pieces:
RGWRegion presently has both endpoints and hostnames, but doesn't make clear which APIs (S3, Swift, S3Website) might be available at each, or allow combinations to dedicate a specific FQDN to a given API. I'd like to replace both structures with a map structure [6].

Makes sense.

Bucket existence privacy: in general I agree with the goal that we should be closely compatible with AmazonS3, but with an eye to security, I'd like to consider a specific deviation. In AmazonS3, you can enumerate buckets for existence simply by looking for 404 NoSuchBucket vs 403 AccessDenied. I'd like to offer a configuration option that returns 403 Forbidden or 401 Unauthorized on anonymous requests to non-existent buckets.

As long as it's configurable.

Testing some of the functionality against AmazonS3 has been somewhat painful, as AmazonS3 only provides eventual consistency of the website configuration (with the highest time I've seen so far being about 30 seconds).

Yup.

New configuration options/changes:
- rgw_enable_apis: gains 's3website' mode
- rgw_dns_s3website_name: similar to rgw_dns_name, but for the s3website endpoint
- RGWRegion having per-rgw-api hostnames

Patch series breakdown plans:
Here's the breakdown of patch series I'm considering for the changes (net 2kLOC in ceph, 1kLOC in testcases). [TODO marks pieces not in these sets of commits yet, but they will be soon.]

ceph:
- split Formatter.cc: JSON/XML/Table formatters are separate now
- add header/footer support for formatters
- add knowledge of status
- add HTML formatter
- add optional error handler hooks to RGWOp and RGWHandler for abort_early
- add optional retarget handler hooks
- add more flexible redirect handling
- S3website code
- x-amz-website-redirect-location handling (TODO: needs a bit more polish and testing)
- TODO: add more input validations to match S3, on stuff that's NOT documented but was discovered when I applied weirder testcases to AmazonS3:
  - the 'Hostname' field has non-trivial validation (maybe borrow the outcome of wip-bucket_name_restrictions)
  - the 'Protocol' field for a redirect must be http/https, and cannot be gopher or anything else
  - the HttpRedirectCode field must contain one of: 301-305, 307, 308; the docs don't say this, and the error message says 'Any 3XX value except 300'
  - first match in RoutingRules wins; watch out with rules that match 4XX error codes
- documentation (TODO: esp. the parts missing from the S3 docs above)

s3-tests, unit tests:
- refactor for more requests
- add new utilities
- add website tests

s3-tests, fuzzer tests [5]

Links for all the bits above [1]
Re: cephfs obsolescence and object location
On Mon, Jun 22, 2015 at 10:18 PM, Bill Sharer bsha...@sharerland.com wrote: I'm currently running giant on gentoo and was wondering about the stability of the api for mapping MDS files to rados objects. The cephfs binary complains that it is obsolete for getting layout information, but it also provides object location info. AFAICT this is the only way to map files in a cephfs filesystem to object locations if I want to take advantage of the UFO nature of ceph's stores in order to access them via both cephfs and rados methods.

I have a content store that scans files, calculates their sha1 hash and then stores them on a cephfs filesystem tree with their filenames set to their sha1 hash name. I can then build views of this content using an external local filesystem and symlinks pointing into the cephfs store. At the same time, I want to be able to use this store via rados, either through the gateway or my own software that is rados aware. The store is being treated as a write-once, read-many style system. Towards this end, I started writing a QT4 based library that includes this little Location routine (which currently works) to grab the rados object location from a hash object in this store. I'm just wondering whether this is all going to break horribly in the future when ongoing MDS development decides to break the code I borrowed from cephfs :-)

I don't know when things will break exactly, but it will probably be when we remove the ioctl rather than when the MDS stops supporting it. This particular one is implemented entirely on the client without talking to the MDS. :) You can also do this yourself in userspace: get the layout structure information on the file via the virtual xattrs (ceph.layout, I believe?). Use that to map to the specific object you're interested in (you can look at the kernel's fs/ceph/ioctl.c ceph_ioctl_get_dataloc() function, or any of the userspace stuff that does it). The tricky bit is that finding locations does require an up-to-date cluster map, but libcephfs will do that for you (and it looks to me like you really just want object names, not their locations). -Greg

QString Shastore::Location(const QString &hash)
{
    QString result = "";
    QString cache_path = this->dbcache + "/" + hash.left(2) + "/" + hash.mid(2,2) + "/" + hash;
    QFile cache_file(cache_path);
    if (cache_file.exists()) {
        if (cache_file.open(QIODevice::ReadOnly)) {
            /*
             * Ripped from cephfs code: grab the handle and use the ceph version of ioctl to
             * rummage through the file's xattrs for rados location. cephfs whines about being
             * obsolete to get layout this way, but this appears to be the only way to get location.
             * This may all break horribly in a future release since MDS is undergoing heavy development.
             *
             * cephfs lets the user pass file_offset in argv but it defaults to 0. Presumably this is the
             * first extent of the pile of extents (4mb each?) and shards for the file. If the user wants
             * to jump elsewhere with a non-zero offset, the resulting rados object location may be different.
             */
            int fd = cache_file.handle();
            struct ceph_ioctl_dataloc location;
            location.file_offset = 0;
            int err = ioctl(fd, CEPH_IOC_GET_DATALOC, (unsigned long)&location);
            if (err) {
                qDebug() << "Location: Error getting rados location for" << cache_path;
            } else {
                result = QString(location.object_name);
            }
            cache_file.close();
        } else {
            qDebug() << "Location: unable to open" << cache_path << "readonly";
        }
    } else {
        qDebug() << "Location: cache file" << cache_path << "does not exist";
    }
    return result;
}
Re: Ceph hard lock Hammer 9.2
On Mon, Jun 22, 2015 at 9:45 PM, Barclay Jameson almightybe...@gmail.com wrote: Has anyone seen this?
Can you describe the kernel you're using, the workload you were running, the Ceph cluster you're running against, etc?

Jun 22 15:09:27 node kernel: Call Trace:
Jun 22 15:09:27 node kernel: [816803ee] schedule+0x3e/0x90
Jun 22 15:09:27 node kernel: [8168062e] schedule_preempt_disabled+0xe/0x10
Jun 22 15:09:27 node kernel: [81681ce3] __mutex_lock_slowpath+0x93/0x100
Jun 22 15:09:27 node kernel: [a060def8] ? __cap_is_valid+0x58/0x70 [ceph]
Jun 22 15:09:27 node kernel: [81681d73] mutex_lock+0x23/0x40
Jun 22 15:09:27 node kernel: [a0610f2d] ceph_check_caps+0x38d/0x780 [ceph]
Jun 22 15:09:27 node kernel: [812f5a9b] ? __radix_tree_delete_node+0x7b/0x130
Jun 22 15:09:27 node kernel: [a0612637] ceph_put_wrbuffer_cap_refs+0xf7/0x240 [ceph]
Jun 22 15:09:27 node kernel: [a060b170] writepages_finish+0x200/0x290 [ceph]
Jun 22 15:09:27 node kernel: [a05e2731] handle_reply+0x4f1/0x640 [libceph]
Jun 22 15:09:27 node kernel: [a05e3065] dispatch+0x85/0xa0 [libceph]
Jun 22 15:09:27 node kernel: [a05d7ceb] process_message+0xab/0xd0 [libceph]
Jun 22 15:09:27 node kernel: [a05db052] try_read+0x2d2/0x430 [libceph]
Jun 22 15:09:27 node kernel: [a05db7e8] con_work+0x78/0x220 [libceph]
Jun 22 15:09:27 node kernel: [8108c475] process_one_work+0x145/0x460
Jun 22 15:09:27 node kernel: [8108c8b2] worker_thread+0x122/0x420
Jun 22 15:09:27 node kernel: [8167fdb8] ? __schedule+0x398/0x840
Jun 22 15:09:27 node kernel: [8108c790] ? process_one_work+0x460/0x460
Jun 22 15:09:27 node kernel: [8108c790] ? process_one_work+0x460/0x460
Jun 22 15:09:27 node kernel: [8109170e] kthread+0xce/0xf0
Jun 22 15:09:27 node kernel: [81091640] ? kthread_freezable_should_stop+0x70/0x70
Jun 22 15:09:27 node kernel: [81683dd8] ret_from_fork+0x58/0x90
Jun 22 15:09:27 node kernel: [81091640] ? kthread_freezable_should_stop+0x70/0x70
Jun 22 15:11:27 node kernel: INFO: task kworker/2:1:40 blocked for more than 120 seconds.
Jun 22 15:11:27 node kernel: Tainted: G I 4.0.4-1.el6.elrepo.x86_64 #1
Jun 22 15:11:27 node kernel: echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables this message.
Jun 22 15:11:27 node kernel: kworker/2:1 D 881ff279f7f8 0 40 2 0x
Jun 22 15:11:27 node kernel: Workqueue: ceph-msgr con_work [libceph]
Jun 22 15:11:27 node kernel: 881ff279f7f8 881ff261c010 881ff2b67050 88207fd95270
Jun 22 15:11:27 node kernel: 881ff279c010 88207fd15200 7fff 0002
Jun 22 15:11:27 node kernel: 81680ae0 881ff279f818 816803ee 810ae63b
Re: cephfs obsolescence and object location
Since you're only looking up the ID of the first object, it's really simple: it's just the hex-printed inode number followed by "." and the object index, so the first object is <inode>.00000000. That's not guaranteed to always be the case in the future, but it's likely to be true longer than the deprecated ioctls exist. If I were you, I would hard-code the object naming convention rather than writing in a dependency on the ioctl. As Greg says, you can also query all the layout stuff (via supported interfaces) and do the full calculation of object names for arbitrary offsets into the file if you need to.

John

On 22/06/2015 22:18, Bill Sharer wrote: I'm currently running giant on gentoo and was wondering about the stability of the api for mapping MDS files to rados objects. The cephfs binary complains that it is obsolete for getting layout information, but it also provides object location info. AFAICT this is the only way to map files in a cephfs filesystem to object locations if I want to take advantage of the UFO nature of ceph's stores in order to access them via both cephfs and rados methods.

I have a content store that scans files, calculates their sha1 hash and then stores them on a cephfs filesystem tree with their filenames set to their sha1 hash name. I can then build views of this content using an external local filesystem and symlinks pointing into the cephfs store. At the same time, I want to be able to use this store via rados, either through the gateway or my own software that is rados aware. The store is being treated as a write-once, read-many style system. Towards this end, I started writing a QT4 based library that includes this little Location routine (which currently works) to grab the rados object location from a hash object in this store. I'm just wondering whether this is all going to break horribly in the future when ongoing MDS development decides to break the code I borrowed from cephfs :-)

QString Shastore::Location(const QString &hash)
{
    QString result = "";
    QString cache_path = this->dbcache + "/" + hash.left(2) + "/" + hash.mid(2,2) + "/" + hash;
    QFile cache_file(cache_path);
    if (cache_file.exists()) {
        if (cache_file.open(QIODevice::ReadOnly)) {
            /*
             * Ripped from cephfs code: grab the handle and use the ceph version of ioctl to
             * rummage through the file's xattrs for rados location. cephfs whines about being
             * obsolete to get layout this way, but this appears to be the only way to get location.
             * This may all break horribly in a future release since MDS is undergoing heavy development.
             *
             * cephfs lets the user pass file_offset in argv but it defaults to 0. Presumably this is the
             * first extent of the pile of extents (4mb each?) and shards for the file. If the user wants
             * to jump elsewhere with a non-zero offset, the resulting rados object location may be different.
             */
            int fd = cache_file.handle();
            struct ceph_ioctl_dataloc location;
            location.file_offset = 0;
            int err = ioctl(fd, CEPH_IOC_GET_DATALOC, (unsigned long)&location);
            if (err) {
                qDebug() << "Location: Error getting rados location for" << cache_path;
            } else {
                result = QString(location.object_name);
            }
            cache_file.close();
        } else {
            qDebug() << "Location: unable to open" << cache_path << "readonly";
        }
    } else {
        qDebug() << "Location: cache file" << cache_path << "does not exist";
    }
    return result;
}
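A small sketch of the hard-coded naming convention John describes: cephfs data objects are named "<inode in hex>.<object index as 8 hex digits>", so the first object of a file is simply "<ino>.00000000". The 4 MiB object size below is the default layout and is an assumption for illustration; a file with a non-default layout (read it from the ceph.file.layout xattrs) or striping needs the full calculation Greg mentions.

#include <cstdint>
#include <cstdio>
#include <string>

std::string rados_object_name(uint64_t ino, uint64_t file_offset,
                              uint64_t object_size = 4ull * 1024 * 1024) {
    // Simple case only: one stripe per object, no fancy striping.
    uint64_t objno = file_offset / object_size;
    char buf[64];
    std::snprintf(buf, sizeof(buf), "%llx.%08llx",
                  (unsigned long long)ino, (unsigned long long)objno);
    return std::string(buf);
}

// Example: rados_object_name(0x10000000000, 0) == "10000000000.00000000"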
RGW S3 Website hosting, non-clean code for early review
Hi,

As an extension of earlier work done by Yehuda [1], I've gotten the great majority of the work done to support static website hosting in RGW, just like AmazonS3 [2]. I need to do some cleanups of the code prior to major review for submission, solve one thorny problem, and have a few discussions about best courses of action, and then I'll be submitting this for more reviews before merging. ceph [3], s3-tests unit tests [4], s3-tests fuzzer tests [5].

The thorny problem:
One of the pieces of functionality in S3Website is the ability to serve any public object in the bucket as the content of a custom error page (think shiny 404 error). In some cases, like trivial 403/404 errors, we can determine this quite early, before we fetch the object, and redirect the request to the error object instead (provided that we also redo the ACL check on the error object). In more complex cases (e.g. 416 Range Unsatisfiable, 412 Precondition Failed), it happens very late in the RGW request processing, and the req_state struct seems to have been mangled/pre-filled with a lot of decisions that aren't solvable. Either I have to repeat a lot of code for it, which I'm not happy about, or I have to refactor RGWGetObj* to more safely make the second GET request for the error object (and make sure range headers etc. are NOT used for the get of the error object). I'm leaning to the latter. Oh, and for added fun, if an error object is configured but is missing or private, you get a similar but different error than without any error object configured, and sometimes the error codes are in the headers, but not always.

Discussion pieces:
- RGWRegion presently has both endpoints and hostnames, but doesn't make clear which APIs (S3, Swift, S3Website) might be available at each, or allow combinations to dedicate a specific FQDN to a given API. I'd like to replace both structures with a map structure [6].
- Bucket existence privacy: in general I agree with the goal that we should be closely compatible with AmazonS3, but with an eye to security, I'd like to consider a specific deviation. In AmazonS3, you can enumerate buckets for existence simply by looking for 404 NoSuchBucket vs 403 AccessDenied. I'd like to offer a configuration option that returns 403 Forbidden or 401 Unauthorized on anonymous requests to non-existent buckets.
- Testing some of the functionality against AmazonS3 has been somewhat painful, as AmazonS3 only provides eventual consistency of the website configuration (with the highest time I've seen so far being about 30 seconds).

New configuration options/changes:
- rgw_enable_apis: gains 's3website' mode
- rgw_dns_s3website_name: similar to rgw_dns_name, but for the s3website endpoint
- RGWRegion having per-rgw-api hostnames

Patch series breakdown plans:
Here's the breakdown of patch series I'm considering for the changes (net 2kLOC in ceph, 1kLOC in testcases). [TODO marks pieces not in these sets of commits yet, but they will be soon.]

ceph:
- split Formatter.cc: JSON/XML/Table formatters are separate now
- add header/footer support for formatters
- add knowledge of status
- add HTML formatter
- add optional error handler hooks to RGWOp and RGWHandler for abort_early
- add optional retarget handler hooks
- add more flexible redirect handling
- S3website code
- x-amz-website-redirect-location handling (TODO: needs a bit more polish and testing)
- TODO: add more input validations to match S3, on stuff that's NOT documented but was discovered when I applied weirder testcases to AmazonS3:
  - the 'Hostname' field has non-trivial validation (maybe borrow the outcome of wip-bucket_name_restrictions)
  - the 'Protocol' field for a redirect must be http/https, and cannot be gopher or anything else
  - the HttpRedirectCode field must contain one of: 301-305, 307, 308; the docs don't say this, and the error message says 'Any 3XX value except 300'
  - first match in RoutingRules wins; watch out with rules that match 4XX error codes
- documentation (TODO: esp. the parts missing from the S3 docs above)

s3-tests, unit tests:
- refactor for more requests
- add new utilities
- add website tests

s3-tests, fuzzer tests [5]

Links for all the bits above:
[1] https://github.com/ceph/ceph/tree/wip-static-website
[2] http://docs.aws.amazon.com/AmazonS3/latest/dev/WebsiteHosting.html
[3] https://github.com/ceph/ceph/compare/master...robbat2:wip-static-website-robbat2-master
[4] https://github.com/ceph/s3-tests/compare/master...robbat2:wip-static-website
[5] https://github.com/ceph/s3-tests/compare/master...robbat2:wip-website-fuzzy
[6] https://github.com/ceph/ceph/compare/master...robbat2:wip-static-website-robbat2-master#diff-ee7891a35944697538486c9269e0d65bR909

-- Robin Hugh Johnson Gentoo Linux: Developer, Infrastructure Lead E-Mail : robb...@gentoo.org GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
Re: erasure pool with isa plugin
On Mon, Jun 22, 2015 at 10:34 PM, Loic Dachary l...@dachary.org wrote: Hi Tom, On 22/06/2015 17:10, Deneau, Tom wrote: If one has a cluster with some nodes that can run with the ISA plugin and some that cannot, is there a way to define a pool such that the ISA-capable nodes can use the ISA plugin and the others can use say the jerasure plugin? There is no way to do that, because there is no guarantee that an object encoded by jerasure can be decoded by isa and vice versa. Shouldn't we be able to set up something that *does* guarantee that, though? Either by combining them into a single plugin which dynamically configures to the fastest possible for that machine, or by running them against the same object corpus and requiring that they produce the same output? -Greg
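A hedged sketch of the corpus check Greg suggests: encode the same objects with both plugins and require bit-identical chunks. The encoder callables here are assumed to be thin wrappers around the real plugin calls (ErasureCodePluginRegistry / ErasureCodeInterface); this is not an existing Ceph API.

#include <functional>
#include <map>
#include <string>
#include <vector>

// An encoder maps object data to chunk index -> chunk bytes for a fixed k/m profile.
using EncodeFn = std::function<std::map<int, std::string>(const std::string&)>;

// Encode every object in the corpus with both encoders and require
// bit-identical chunks; if any chunk differs, the two plugins cannot
// safely share a pool.
bool plugins_agree(const std::vector<std::string>& corpus,
                   const EncodeFn& encode_jerasure,
                   const EncodeFn& encode_isa) {
    for (const auto& obj : corpus) {
        if (encode_jerasure(obj) != encode_isa(obj))
            return false;
    }
    return true;
}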
Re: erasure pool with isa plugin
Hi, On 23/06/2015 06:06, Gregory Farnum wrote: On Mon, Jun 22, 2015 at 10:34 PM, Loic Dachary l...@dachary.org wrote: Hi Tom, On 22/06/2015 17:10, Deneau, Tom wrote: If one has a cluster with some nodes that can run with the ISA plugin and some that cannot, is there a way to define a pool such that the ISA-capable nodes can use the ISA plugin and the others can use say the jerasure plugin? There is no way to do that, because there is no guarantee that an object encoded by jerasure can be decoded by isa and vice versa. Shouldn't we be able to set up something that *does* guarantee that, though? Either by combining them into a single plugin which dynamically configures to the fastest possible for that machine, or by running them against the same object corpus and requiring that they produce the same output? -Greg

I don't know enough about the maths involved to evaluate how difficult it is. Although both implement Reed-Solomon using Vandermonde or Cauchy matrices, I think there are details that make the output different. Cheers -- Loïc Dachary, Artisan Logiciel Libre
Re: rsyslogd
Did we end up creating a ticket for this? I saw this on one FS run as well. -Greg

On Fri, Jun 19, 2015 at 8:03 AM, Gregory Farnum gfar...@redhat.com wrote: Not that I'm aware of. I don't mess with services at all in the log rotation stuff I did (we already disabled generic log rotation when tests were running). -Greg

On Jun 19, 2015, at 12:13 AM, David Zafman dzaf...@redhat.com wrote: Greg, Have you changed anything (log rotation related?) that would uninstall rsyslog or cause it to not be able to start? I'm sometimes seeing machines fail with this error, probably in teuthology/nuke.py reset_syslog_dir(): CommandFailedError: Command failed on plana94 with status 1: 'sudo rm -f -- /etc/rsyslog.d/80-cephtest.conf && sudo service rsyslog restart' David
Monitor clock skew on Teuthology testing
Hi Zack,

I was testing on a home-made Teuthology cluster. So far I can use teuthology-suite to submit the test cases and start some workers to do the tests. From the Pulpito logs I can see most of the tests passed, except there was an error when aggregating the results in the last step. The error message was like:

2015-06-24 08:41:33.334317 mon.1 192.168.13.117:6789/0 4 : cluster [WRN] message from mon.0 was stamped 0.709253s in the future, clocks not synchronized in cluster log

This was due to the time lag between monitors, to my knowledge. I checked the clock settings in Teuthology and found the NTP servers are defined in ceph-qa-chef/cookbooks/ceph-qa/files/default/ntp.conf. Are there any other settings on the clock side? Thanks, -yuan
Re: RGW S3 Website hosting, non-clean code for early review
On Tue, Jun 23, 2015 at 04:30:19PM -0400, Yehuda Sadeh-Weinraub wrote: Either I have to repeat a lot of code for it, which I'm not happy about, or I have to refactor RGWGetObj* to more safely make the second GET request for the error object (and make sure range headers etc. are NOT used for the get of the error object). I'm leaning to the latter. Is generating a new req_state a possibility? E.g., you catch the error at the top level, and restart most of the request processing with a newly created req_state?

That was the path I was trying, but not completely succeeding. I think I need to step it back further and have a partially customized copy of the RGWEnv from client_io->get_env(), so that I can build the modified req_info for req_state. It isn't a full new GET really; it's really just custom content for the body as well as some headers (mostly Content-Length, Content-Type), but ignore EPERM/EACCES on trying to fetch that custom content, and if they are detected, consider that a success but with different HTML content.

Great! I'll wait for the cleaned up pull request.

Do you want pull requests per logical change of my proposed series split, or rather just one pull request with the full series?

-- Robin Hugh Johnson Gentoo Linux: Developer, Infrastructure Lead E-Mail : robb...@gentoo.org GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
Re: provisioning teuthology targets with OpenStack
Hi Zack,

I think it works and I'll start using it. For the record, https://github.com/ceph/teuthology/compare/master...dachary:wip-6502-openstack?expand=1#diff-88b99bb28683bd5b7e3a204826ead112R318 is what I have right now. There are a few things that need cleaning, but nothing major, I hope. I'll clean them up after you've had time to review the preliminary pull requests at https://github.com/ceph/teuthology/pulls. Cheers

On 09/06/2015 14:58, Loic Dachary wrote: On 09/06/2015 18:25, Zack Cerza wrote: Hi Loic! I'm really happy to hear you have the time and motivation to work on this! It's something I've wanted to get to for a while, but other things keep getting in the way. I'd suggest looking at teuthology.provision.Downburst to investigate maybe pulling some of its functionality into a base class that a potential OpenStack subclass could also benefit from. I'd love to see us using a common interface for all of our provisioning. If that seems like a lot of work, I'd be willing to be the one to create the base class and work with you to make sure its API was useful for OpenStack as well. Thanks for the pointers. I'll write something that works and get back to you to make it acceptable :-) Thanks, Zack

- Original Message - From: Loic Dachary l...@dachary.org To: Zack Cerza zce...@redhat.com Cc: Ceph Development ceph-devel@vger.kernel.org Sent: Sunday, June 7, 2015 11:04:52 AM Subject: provisioning teuthology targets with OpenStack

Hi Zack, I'm motivated by my recent experience with hacking teuthology and containers :-) I'd like to give http://tracker.ceph.com/issues/6502 (provision targets using a cloud API) a shot with OpenStack in mind (because I know nothing of EC2, really). The idea is to make it relatively easy for someone with access to an OpenStack tenant to use it instead of downburst. Not only would that simplify the development process for people without access to the lab, it would also allow the addition of more horsepower to the lab. What do you think? Cheers

-- Loïc Dachary, Artisan Logiciel Libre
Setting up teuthology with OpenStack
Hi Ceph, There is an experimental OpenStack backend[1] for teuthology[2] (the Ceph integration tests toolbox) with instructions to set it up[3] and some integration tests to verify it works as it should. I've run them successfully on two different OpenStack clusters (http://the.re/ and https://entercloudsuite.com/). It would be great if someone was brave enough to give it a try at this very early stage :-) If you don't have access to an OpenStack cluster, https://entercloudsuite.com/ can provide an OpenStack tenant within the hour (I had a few weird glitches during the registration but they went away magically). Cheers [1] teuthology OpenStack backend http://tracker.ceph.com/issues/6502 [2] teuthology https://github.com/ceph/teuthology/ [3] setting up teuthology with OpenStack https://github.com/dachary/teuthology/tree/wip-6502-openstack#openstack-backend -- Loïc Dachary, Artisan Logiciel Libre
Re: RGW S3 Website hosting, non-clean code for early review
- Original Message -
From: Robin H. Johnson robb...@gentoo.org
To: Yehuda Sadeh-Weinraub yeh...@redhat.com
Cc: ceph-devel@vger.kernel.org, Jonathan LaCour jonathan.lac...@dreamhost.com
Sent: Tuesday, June 23, 2015 4:04:49 PM
Subject: Re: RGW S3 Website hosting, non-clean code for early review

On Tue, Jun 23, 2015 at 04:30:19PM -0400, Yehuda Sadeh-Weinraub wrote: Either I have to repeat a lot of code for it, which I'm not happy about, or I have to refactor RGWGetObj* to more safely make the second GET request for the error object (and make sure range headers etc. are NOT used for the get of the error object). I'm leaning to the latter. Is generating a new req_state a possibility? E.g., you catch the error at the top level, and restart most of the request processing with a newly created req_state?

That was the path I was trying, but not completely succeeding. I think I need to step it back further and have a partially customized copy of the RGWEnv from client_io->get_env(), so that I can build the modified req_info for req_state. It isn't a full new GET really; it's really just custom content for the body as well as some headers (mostly Content-Length, Content-Type), but ignore EPERM/EACCES on trying to fetch that custom content, and if they are detected, consider that a success but with different HTML content.

Great! I'll wait for the cleaned up pull request.

Do you want pull requests per logical change of my proposed series split, or rather just one pull request with the full series?

One pull request for the full series. Yehuda