Re: Ceph hard lock Hammer 9.2

2015-06-23 Thread Barclay Jameson
Sure,
I guess it's actually a soft kernel lock since it's only the
filesystem that is hung with high IO wait.
The kernel is 4.0.4-1.el6.elrepo.x86_64.
The Ceph version is 0.94.2. (Sorry about the confusion; I missed a 4
when I typed in the subject line.)
I was testing copying 100,000 files from a directory (dir1) to
(dir1-`hostname`) on three separate hosts.
Two of the hosts completed the job, and the third one hung with the stack
trace in /var/log/messages.

On Tue, Jun 23, 2015 at 6:54 AM, Gregory Farnum g...@gregs42.com wrote:
 On Mon, Jun 22, 2015 at 9:45 PM, Barclay Jameson
 almightybe...@gmail.com wrote:
 Has anyone seen this?

 Can you describe the kernel you're using, the workload you were
 running, the Ceph cluster you're running against, etc?


 Jun 22 15:09:27 node kernel: Call Trace:
 Jun 22 15:09:27 node kernel: [816803ee] schedule+0x3e/0x90
 Jun 22 15:09:27 node kernel: [8168062e]
 schedule_preempt_disabled+0xe/0x10
 Jun 22 15:09:27 node kernel: [81681ce3]
 __mutex_lock_slowpath+0x93/0x100
 Jun 22 15:09:27 node kernel: [a060def8] ?
 __cap_is_valid+0x58/0x70 [ceph]
 Jun 22 15:09:27 node kernel: [81681d73] mutex_lock+0x23/0x40
 Jun 22 15:09:27 node kernel: [a0610f2d]
 ceph_check_caps+0x38d/0x780 [ceph]
 Jun 22 15:09:27 node kernel: [812f5a9b] ?
 __radix_tree_delete_node+0x7b/0x130
 Jun 22 15:09:27 node kernel: [a0612637]
 ceph_put_wrbuffer_cap_refs+0xf7/0x240 [ceph]
 Jun 22 15:09:27 node kernel: [a060b170]
 writepages_finish+0x200/0x290 [ceph]
 Jun 22 15:09:27 node kernel: [a05e2731]
 handle_reply+0x4f1/0x640 [libceph]
 Jun 22 15:09:27 node kernel: [a05e3065] dispatch+0x85/0xa0 
 [libceph]
 Jun 22 15:09:27 node kernel: [a05d7ceb]
 process_message+0xab/0xd0 [libceph]
 Jun 22 15:09:27 node kernel: [a05db052] try_read+0x2d2/0x430 
 [libceph]
 Jun 22 15:09:27 node kernel: [a05db7e8] con_work+0x78/0x220 
 [libceph]
 Jun 22 15:09:27 node kernel: [8108c475] 
 process_one_work+0x145/0x460
 Jun 22 15:09:27 node kernel: [8108c8b2] worker_thread+0x122/0x420
 Jun 22 15:09:27 node kernel: [8167fdb8] ? __schedule+0x398/0x840
 Jun 22 15:09:27 node kernel: [8108c790] ? 
 process_one_work+0x460/0x460
 Jun 22 15:09:27 node kernel: [8108c790] ? 
 process_one_work+0x460/0x460
 Jun 22 15:09:27 node kernel: [8109170e] kthread+0xce/0xf0
 Jun 22 15:09:27 node kernel: [81091640] ?
 kthread_freezable_should_stop+0x70/0x70
 Jun 22 15:09:27 node kernel: [81683dd8] ret_from_fork+0x58/0x90
 Jun 22 15:09:27 node kernel: [81091640] ?
 kthread_freezable_should_stop+0x70/0x70
 Jun 22 15:11:27 node kernel: INFO: task kworker/2:1:40 blocked for
 more than 120 seconds.
 Jun 22 15:11:27 node kernel:  Tainted: G  I
 4.0.4-1.el6.elrepo.x86_64 #1
 Jun 22 15:11:27 node kernel: echo 0 >
 /proc/sys/kernel/hung_task_timeout_secs disables this message.
 Jun 22 15:11:27 node kernel: kworker/2:1 D 881ff279f7f8 0
   40  2 0x
 Jun 22 15:11:27 node kernel: Workqueue: ceph-msgr con_work [libceph]
 Jun 22 15:11:27 node kernel: 881ff279f7f8 881ff261c010
 881ff2b67050 88207fd95270
 Jun 22 15:11:27 node kernel: 881ff279c010 88207fd15200
 7fff 0002
 Jun 22 15:11:27 node kernel: 81680ae0 881ff279f818
 816803ee 810ae63b


librbd cacher lock protection?

2015-06-23 Thread Patelczyk, Maciej
Hi All,

I'm investigating the librbd code related to caching (ObjectCacher).
What I cannot find is the data integrity protection when there is a 'cache
miss' (full or partial). It looks like _readx exits with 'defer' and cache_lock
is released (and locked again in LibrbdWriteback). The BufferHeads are marked as
'rx' but not protected against write. writex is neither skipping nor checking for
any BH; it just populates data in the cache.
That confuses me. So where is the protection? How does the cache integrity 
protection actually work?

Thanks,
maciej


Re: librbd cacher lock protection?

2015-06-23 Thread Jason Dillaman
You are correct that an rx BH can be overwritten by a write request -- this will
then mark the BH as dirty.  If it's a partial overwrite, the BH will be split 
and only the affected section's state will be changed to dirty.  When the 
outstanding read request completes, it will invoke 
'ObjectCacher::bh_read_finish', which verifies that the BH is still in the rx 
state (and that the transaction ids match) before overwriting the data.  The 
pending client read request will then complete and will be provided the latest 
contents of the BH(s).
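
A minimal self-contained sketch of the guard described above (the structure loosely follows ObjectCacher::bh_read_finish, but the stand-in types and field names such as last_read_tid are assumptions, not the actual librbd code):

#include <cstdint>
#include <string>

// Minimal stand-ins so the sketch compiles; the real types live in
// osdc/ObjectCacher.h and differ in detail.
struct BufferHead {
    enum State { STATE_RX, STATE_DIRTY, STATE_CLEAN };
    State state = STATE_RX;
    uint64_t last_read_tid = 0;   // assumed field name
    std::string bl;               // stand-in for ceph::bufferlist
};

// Hedged sketch of the check Jason describes, not the shipped code.
void bh_read_finish_sketch(BufferHead &bh, uint64_t read_tid,
                           const std::string &data)
{
    // Only install the read data if the BH is still waiting on this
    // exact read.  A concurrent writex flips the state to dirty
    // (splitting the BH first on a partial overwrite), and those
    // written bytes must win over the stale read.
    if (bh.state == BufferHead::STATE_RX && bh.last_read_tid == read_tid) {
        bh.bl = data;
        bh.state = BufferHead::STATE_CLEAN;
    }
    // Either way, the pending client read completes afterwards and is
    // handed the latest contents of the BH.
}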

-- 

Jason Dillaman 
Red Hat 
dilla...@redhat.com 
http://www.redhat.com 


- Original Message -
 From: Maciej Patelczyk maciej.patelc...@intel.com
 To: ceph-devel@vger.kernel.org
 Sent: Tuesday, June 23, 2015 11:21:32 AM
 Subject: librbd cacher lock protection?
 
 Hi All,
 
 I'm investigating the librbd code related to caching (ObjectCacher).
 What I cannot find is the data integrity protection when there is a 'cache
 miss' (full or partial). It looks like _readx exits with 'defer' and
 cache_lock is released (and locked again in LibrbdWriteback). The BufferHeads
 are marked as 'rx' but not protected against write. writex is neither skipping
 nor checking for any BH; it just populates data in the cache.
 That confuses me. So where is the protection? How does the cache integrity
 protection actually work?
 
 Thanks,
 maciej


Re: OSD-Based Object Stubs

2015-06-23 Thread Gregory Farnum
On Sat, Jun 20, 2015 at 11:18 AM, Marcel Lauhoff m...@irq0.org wrote:

 Hi,

 thanks for the comments!

 Gregory Farnum g...@gregs42.com writes:

 On Thu, May 28, 2015 at 3:01 AM, Marcel Lauhoff m...@irq0.org wrote:

 Gregory Farnum g...@gregs42.com writes:

 Do you have a shorter summary than the code of how these stub and
 unstub operations relate to the object redirects? We didn't make a
 great deal of use of them but the basic data structures are mostly
 present in the codebase, are interpreted in at least some of the right
 places, and were definitely intended to cover this kind of use case.
 :)
 -Greg

 As far as I understood the redirect feature, it is about pointing to
 other objects inside the Ceph cluster. The stubs feature allows
 pointing to anything -- an HTTP server in the concept code.

 Then stubs use an IMHO simpler approach to getting objects back: it's
 the task of the OSD. Stubbed objects just take longer to access, due to
 unstubbing them first.
 Redirects on the other hand leave this to the client: object redirected
 -> tell the client to retrieve it elsewhere.

 Ah, of course.

 I got a chance to look at this briefly today. Some notes:

 * You're using synchronous reads. That will prevent use of stubbing on
 EC pools (which only do async reads, as they might need to hit another
 OSD for the data), which seems sad.
 Good point. I didn't look at how EC pools work, yet. I assumed that
 a stub feature would be quite different for both pool types and tried
 the replicated first.

I'm not sure that will be necessary, actually. The advantage of only
doing GET/PUT (unstub/stub) is that you're doing only full-object
reads and writes; it doesn't require any of the features EC pools
don't provide.

 * There seems to be a race if you need to unstub an op for two
 separate requests that come in simultaneously, with nothing preventing
 both of them from initiating the unstub.
 Right. I should probably add some in flight states there.
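
A self-contained sketch of such an in-flight state (every name below is hypothetical, nothing here exists in the patch; std::string stands in for the OSD's hobject_t):

#include <mutex>
#include <set>
#include <string>

// Hypothetical guard against two requests both initiating an unstub.
std::mutex unstub_lock;
std::set<std::string> unstub_in_flight;

// Returns true only for the first caller; later requests for the same
// object should wait on the in-flight unstub instead of re-fetching.
bool start_unstub(const std::string &oid)
{
    std::lock_guard<std::mutex> guard(unstub_lock);
    return unstub_in_flight.insert(oid).second;
}

void finish_unstub(const std::string &oid)
{
    std::lock_guard<std::mutex> guard(unstub_lock);
    unstub_in_flight.erase(oid);  // a real implementation would wake waiters here
}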

 * You can inject an unstub for read ops, but that turns them into a
 write. That will cause problems in various cases where the object
 isn't writeable yet.
 I thought I fixed that by doing ctx->op->set_write() in the implicit
 unstub code.

No, the implicit unstub will have to be more involved than that. :(
RADOS writes aren't allowed to return any data to the user except for
a return code, and I believe that's enforced at the end by clearing
out/ignoring any of the return bufferlists we would otherwise pack up.
This is because we have to be able to return the exact same stuff on
replayed ops, in case the acting set of OSDs changes without the
client getting a response. Now, the unstub is a bit different in that
the data doesn't change in response to the user requiring an unstub,
but I think it still has some parallelism issues in that scenario.


 * Why does a delete need the object data?
 That was just a short cut: In the quite simplistic Remote API there is
 only put and get. An unstub before delete also deletes the remote object.

 * You definitely wouldn't want to unstub data for scrubbing.
 What's the alternative? The remote should do scrubbing or just skip the
 stubbed object?

I think you'd want to scrub both the full and stub metadata for
the object, but rely on the stub target to keep the actual bundle of
bytes safe.


 * There's a CEPH_OSD_OP_STAT which looks at what's in the object info;
 that is broken here because you're using the normal truncation path.
 There probably needs to be more cleverness or machinery distinguishing
 between the local size used and the size of the object represented.
 Of course.

 * I think snapshots are probably busted with this; did you check how
 they interact?
 With this implementation I think they really are. Stubs+snapshots could
 be a nice thing for backups: just stub a read-only snapshot.

Right, so all of these things will need to be worked out well before
we contemplate merging, and some of them are complicated enough that
they might require changing the core implementation to handle. You
probably don't want to delay it. :)
-Greg


librados clone_range

2015-06-23 Thread Samuel Just
ObjectWriteOperations currently allow you to perform a clone_range from another 
object with the same object locator.  Years ago, rgw used this as part of 
multipart upload.  Today, the implementation complicates the OSD considerably, 
and it doesn't appear to have any users left.  Is there anyone who would be sad 
to see it removed from the librados interface?
-Sam
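
For readers unfamiliar with the interface in question, a hedged sketch of how it is used (argument order from memory of librados.hpp; worth double-checking before relying on it):

#include <rados/librados.hpp>

// Hedged illustration of the operation under discussion; "src" and
// "dst" must carry the same object locator, which is the restriction
// Sam notes above.  Error handling omitted.
void clone_example(librados::IoCtx &ioctx)
{
    librados::ObjectWriteOperation op;
    // Copy 4 MB from offset 0 of "src" into "dst" at offset 0.
    op.clone_range(0 /* dst_off */, "src", 0 /* src_off */, 4 << 20);
    ioctx.operate("dst", &op);
}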


Re: RGW S3 Website hosting, non-clean code for early review

2015-06-23 Thread Yehuda Sadeh-Weinraub


- Original Message -
 From: Robin H. Johnson robb...@gentoo.org
 To: ceph-devel@vger.kernel.org
 Cc: Jonathan LaCour jonathan.lac...@dreamhost.com
 Sent: Tuesday, June 23, 2015 2:33:25 AM
 Subject: RGW S3 Website hosting, non-clean code for early review
 
 Hi,
 
 As an extension of earlier work done by Yehuda [1], I've gotten the
 great majority of the work done to support static website hosting in
 RGW, just like AmazonS3 [2].
 
 I need to do some cleanups of the code prior to major review for
 submission, and solve one thorny problem first, have a few discussions
 about best courses of action, and then I'll be submitting this for more
 reviews before merging.
 
 ceph [3]
 s3-tests, unit tests [4]
 s3-tests, fuzzer tests [5]
 
 The thorny problem:
 ---
 One of the pieces of functionality in S3Website is the ability to serve
 any public object in the bucket as the content on a custom error page
 (think shiny 404 error). In some cases, like trivial 403/404 errors, we
 can determine this quite early, before we fetch the object, and redirect
 the request to the error object instead (provided that we also redo the
 ACL check on the error object).
 
 In more complex cases (eg 416 Range Unsatisfiable, 412 Precondition
 Failed), it happens very late in the RGW request processing, and the
 req_state struct seems to have been mangled/pre-filled with a lot of
 decisions that aren't solvable.
 
 Either I have to repeat a lot of code for it, which I'm not happy about,
 or I have to refactor RGWGetObj* to more safely make the second GET
 request for the error object (and make sure range headers etc are NOT
 used for the get of the error object). I'm leaning to the latter.

Is generating a new req_state a possibility? E.g., you catch the error at the 
top level, and restart most of the request processing with a newly created 
req_state?

 
 Oh, and for added fun, if an error object is configured, but is missing
 or private, you get a similar but different response than without any error
 object configured, and sometimes the error codes are in the headers, but
 not always.
 
 Discussion pieces:
 --
 RGWRegion
 - presently has both endpoints and hostnames, but doesn't make clear
   which APIs (S3, Swift, S3Website) might be available at each; or allow
   combinations to dedicate a specific FQDN to a given API.
   I'd like to replace both structures with a map structure [6]

Makes sense.

 Bucket existence privacy:
 - In general I agree with the goal that we should be closely compatible
   with AmazonS3, but with an eye to security, I'd like to consider a specific
   deviation:
 - In AmazonS3, you can enumerate buckets for existence, simply looking
   for 404 NoSuchBucket vs 403 AccessDenied. I'd like to offer a
   configuration option that returns 403 Forbidden or 401 Unauthorized on
   anonymous requests to non-existent buckets.

As long as it's configurable.

 - Testing some of the functionality against AmazonS3 has been somewhat
   painful, as AmazonS3 only provides eventual consistency of the website
   configuration (with the highest time I've seen so far being about 30
   seconds).

Yup.

 
 New configuration options/changes:
 --
 rgw_enable_apis: gains 's3website' mode
 rgw_dns_s3website_name: similar to rgw_dns_name, but for s3website endpoint
 RGWRegion having per-rgw-api hostnames
 
 Patch series breakdown plans:
 -
 Here's the breakdown of patch series I'm considering for the changes
 (net 2kLOC in ceph, 1kLOC in testcases).
 [TODO marks pieces not in these sets of commits yet, but will be soon.]
 
 ceph
 - split Formatter.cc
   - JSON/XML/Table formatters are separate now
   - add header & footer support for formatters
   - add knowledge of status
   - add HTML formatter
 - Add optional error handler hooks to RGWOp and RGWHandler for abort_early
 - Add optional retarget handler hooks
 - Add more flexible redirect handling
 - S3website code
 - x-amz-website-redirect-location handling (TODO: needs a bit more polish and
 testing)
 - TODO: Add more input validations to match S3, on stuff that's NOT
   documented but was discovered when I applied weirder testcases to
   AmazonS3:
   - 'Hostname' field has non-trivial validation (maybe borrow the
 outcome of wip-bucket_name_restrictions)
   - The 'Protocol' field for a redirect must be http/https, cannot be
 gopher or anything else.
   - The HttpRedirectCode field must contain one of: 301-305, 307, 308
 The docs don't say this, and the error message says 'Any 3XX value
 except 300'.
   - First-match in RoutingRules wins; watch out with rules that match
 4XX error codes.
 - Documentation
   - TODO: esp the parts missing from the S3 docs above
 
 s3-tests, unit tests
 - refactor for more requests
 - add new utilities
 - add website tests
 s3-tests, fuzzer tests [5]
 
 Links for all the bits above
 
 [1] https://github.com/ceph/ceph/tree/wip-static-website

Re: cephfs obsolescence and object location

2015-06-23 Thread Gregory Farnum
On Mon, Jun 22, 2015 at 10:18 PM, Bill Sharer bsha...@sharerland.com wrote:
 I'm currently running giant on gentoo and was wondering about the stability
 of the api for mapping MDS files to rados objects.  The cephfs binary
 complains that it is obsolete for getting layout information, but it also
 provides object location info.  AFAICT this is the only way to map files in
 a cephfs filesystem to object locations if I want to take advantage of the
 UFO nature of ceph's stores in order to access via both cephfs and rados
 methods.

 I have a content store that scans files, calculates their sha1hash and then
 stores them on a cephfs filesystem tree with their filenames set to their
 sha1hash name.  I can then build views of this content using an external
 local filesystem and symlinks pointing into the cephfs store.  At the same
 time, I want to be able to use this store via rados either through the
 gateway or my own software that is rados aware.  The store is being treated
 as a write-once, read-many style system.

 Towards this end, I started writing a QT4 based library that includes this
 little Location routine (which currently works) to grab the rados object
 location from a hash object in this store. I'm just wondering whether this
 is all going to break horribly in the future when ongoing MDS development
 decides to break the code I borrowed from cephfs :-)

I don't know when things will break exactly, but it will probably be
when we remove the ioctl rather than when the MDS stops supporting it.
This particular one is implemented entirely on the client without
talking to the MDS. :)

You can also do this yourself in userspace: get the layout structure
information on the file via the virtual xattrs (ceph.layout, I
believe?). Use that to map to the specific object you're interested in
(you can look at the kernel's fs/ceph/ioctl.c
ceph_ioctl_get_dataloc() function, or any of the userspace stuff
that does it). The tricky bit is that finding locations does require
an up-to-date cluster map, but libcephfs will do that for you (and it
looks to me like you really just want object names, not their
locations).
-Greg
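
A hedged sketch of that userspace approach, assuming the ceph.file.layout.* virtual xattrs and the naming convention John Spray describes in his reply, and simplifying to stripe_count == 1 (the striped case needs the full calculation from fs/ceph/ioctl.c):

#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/xattr.h>

// Derive an object name from inode + layout instead of the deprecated
// ioctl.  The vxattr name and the "%llx.%08llx" convention are taken
// from this thread, not from a stable interface.
int object_name_for_offset(const char *path, uint64_t offset,
                           char *buf, size_t buflen)
{
    struct stat st;
    char val[32];

    if (stat(path, &st) < 0)
        return -1;
    ssize_t n = getxattr(path, "ceph.file.layout.object_size",
                         val, sizeof(val) - 1);
    if (n < 0)
        return -1;
    val[n] = '\0';
    uint64_t object_size = strtoull(val, nullptr, 10);
    if (object_size == 0)
        return -1;

    // Assumes stripe_count == 1, so the block index is just
    // offset / object_size.
    snprintf(buf, buflen, "%llx.%08llx",
             (unsigned long long)st.st_ino,
             (unsigned long long)(offset / object_size));
    return 0;
}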




 QString Shastore::Location(const QString &hash) {
     QString result = "";
     QString cache_path = this->dbcache + "/" + hash.left(2) + "/" +
                          hash.mid(2,2) + "/" + hash;
     QFile cache_file(cache_path);
     if (cache_file.exists()) {
         if (cache_file.open(QIODevice::ReadOnly)) {
             /*
              * Ripped from cephfs code, grab the handle and use the ceph
              * version of ioctl to rummage through the file's xattrs for
              * rados location.  cephfs whines about being obsolete to get
              * layout this way, but this appears to be the only way to get
              * location.  This may all break horribly in a future release
              * since MDS is undergoing heavy development.
              *
              * cephfs lets user pass file_offset in argv but it defaults to
              * 0.  Presumably this is the first extent of the pile of
              * extents (4mb each?) and shards for the file.  If user wants
              * to jump elsewhere with a non-zero offset, the resulting
              * rados object location may be different.
              */
             int fd = cache_file.handle();
             struct ceph_ioctl_dataloc location;
             location.file_offset = 0;
             int err = ioctl(fd, CEPH_IOC_GET_DATALOC, (unsigned long)&location);
             if (err) {
                 qDebug() << "Location: Error getting rados location for "
                          << cache_path;
             } else {
                 result = QString(location.object_name);
             }
             cache_file.close();
         } else {
             qDebug() << "Location: unable to open " << cache_path
                      << " readonly";
         }
     } else {
         qDebug() << "Location: cache file " << cache_path
                  << " does not exist";
     }
     return result;
 }



Re: Ceph hard lock Hammer 9.2

2015-06-23 Thread Gregory Farnum
On Mon, Jun 22, 2015 at 9:45 PM, Barclay Jameson
almightybe...@gmail.com wrote:
 Has anyone seen this?

Can you describe the kernel you're using, the workload you were
running, the Ceph cluster you're running against, etc?


 Jun 22 15:09:27 node kernel: Call Trace:
 Jun 22 15:09:27 node kernel: [816803ee] schedule+0x3e/0x90
 Jun 22 15:09:27 node kernel: [8168062e]
 schedule_preempt_disabled+0xe/0x10
 Jun 22 15:09:27 node kernel: [81681ce3]
 __mutex_lock_slowpath+0x93/0x100
 Jun 22 15:09:27 node kernel: [a060def8] ?
 __cap_is_valid+0x58/0x70 [ceph]
 Jun 22 15:09:27 node kernel: [81681d73] mutex_lock+0x23/0x40
 Jun 22 15:09:27 node kernel: [a0610f2d]
 ceph_check_caps+0x38d/0x780 [ceph]
 Jun 22 15:09:27 node kernel: [812f5a9b] ?
 __radix_tree_delete_node+0x7b/0x130
 Jun 22 15:09:27 node kernel: [a0612637]
 ceph_put_wrbuffer_cap_refs+0xf7/0x240 [ceph]
 Jun 22 15:09:27 node kernel: [a060b170]
 writepages_finish+0x200/0x290 [ceph]
 Jun 22 15:09:27 node kernel: [a05e2731]
 handle_reply+0x4f1/0x640 [libceph]
 Jun 22 15:09:27 node kernel: [a05e3065] dispatch+0x85/0xa0 [libceph]
 Jun 22 15:09:27 node kernel: [a05d7ceb]
 process_message+0xab/0xd0 [libceph]
 Jun 22 15:09:27 node kernel: [a05db052] try_read+0x2d2/0x430 
 [libceph]
 Jun 22 15:09:27 node kernel: [a05db7e8] con_work+0x78/0x220 
 [libceph]
 Jun 22 15:09:27 node kernel: [8108c475] process_one_work+0x145/0x460
 Jun 22 15:09:27 node kernel: [8108c8b2] worker_thread+0x122/0x420
 Jun 22 15:09:27 node kernel: [8167fdb8] ? __schedule+0x398/0x840
 Jun 22 15:09:27 node kernel: [8108c790] ? 
 process_one_work+0x460/0x460
 Jun 22 15:09:27 node kernel: [8108c790] ? 
 process_one_work+0x460/0x460
 Jun 22 15:09:27 node kernel: [8109170e] kthread+0xce/0xf0
 Jun 22 15:09:27 node kernel: [81091640] ?
 kthread_freezable_should_stop+0x70/0x70
 Jun 22 15:09:27 node kernel: [81683dd8] ret_from_fork+0x58/0x90
 Jun 22 15:09:27 node kernel: [81091640] ?
 kthread_freezable_should_stop+0x70/0x70
 Jun 22 15:11:27 node kernel: INFO: task kworker/2:1:40 blocked for
 more than 120 seconds.
 Jun 22 15:11:27 node kernel:  Tainted: G  I
 4.0.4-1.el6.elrepo.x86_64 #1
 Jun 22 15:11:27 node kernel: echo 0 >
 /proc/sys/kernel/hung_task_timeout_secs disables this message.
 Jun 22 15:11:27 node kernel: kworker/2:1 D 881ff279f7f8 0
   40  2 0x
 Jun 22 15:11:27 node kernel: Workqueue: ceph-msgr con_work [libceph]
 Jun 22 15:11:27 node kernel: 881ff279f7f8 881ff261c010
 881ff2b67050 88207fd95270
 Jun 22 15:11:27 node kernel: 881ff279c010 88207fd15200
 7fff 0002
 Jun 22 15:11:27 node kernel: 81680ae0 881ff279f818
 816803ee 810ae63b


Re: cephfs obsolescence and object location

2015-06-23 Thread John Spray


Since you're only looking up the ID of the first object, it's really 
simple.  It's just the hex-printed inode number followed by 
".00000000".  That's not guaranteed to always be the case in the future, 
but it's likely to be true longer than the deprecated ioctls exist.  If 
I were you, I would hard-code the object naming convention rather than 
writing in a dependency on the ioctl.


As Greg says, you can also query all the layout stuff (via supported 
interfaces) and do the full calculation of object names for arbitrary 
offsets into the file if you need to.


John
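
Concretely, hard-coding the convention described above might look like the following sketch (the inode value in the comment is just an example):

#include <cstdint>
#include <cstdio>

// E.g. inode 0x10000003456 -> first object "10000003456.00000000".
// Hedged: this mirrors the naming convention from this thread, not a
// guaranteed interface.
void first_object_name(uint64_t ino, char *buf, size_t buflen)
{
    snprintf(buf, buflen, "%llx.%08x", (unsigned long long)ino, 0u);
}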


On 22/06/2015 22:18, Bill Sharer wrote:
I'm currently running giant on gentoo and was wondering about the 
stability of the api for mapping MDS files to rados objects.  The 
cephfs binary complains that it is obsolete for getting layout 
information, but it also provides object location info.  AFAICT this 
is the only way to map files in a cephfs filesystem to object 
locations if I want to take advantage of the UFO nature of ceph's 
stores in order to access via both cephfs and rados methods.


I have a content store that scans files, calculates their sha1hash and 
then stores them on a cephfs filesystem tree with their filenames set 
to their sha1hash name.  I can then build views of this content using 
an external local filesystem and symlinks pointing into the cephfs 
store.  At the same time, I want to be able to use this store via 
rados either through the gateway or my own software that is rados 
aware.  The store is being treated as a write-once, read-many style 
system.


Towards this end, I started writing a QT4 based library that includes 
this little Location routine (which currently works) to grab the rados 
object location from a hash object in this store. I'm just wondering 
whether this is all going to break horribly in the future when ongoing 
MDS development decides to break the code I borrowed from cephfs :-)




QString Shastore::Location(const QString &hash) {
    QString result = "";
    QString cache_path = this->dbcache + "/" + hash.left(2) + "/" +
                         hash.mid(2,2) + "/" + hash;

    QFile cache_file(cache_path);
    if (cache_file.exists()) {
        if (cache_file.open(QIODevice::ReadOnly)) {
            /*
             * Ripped from cephfs code, grab the handle and use the ceph
             * version of ioctl to rummage through the file's xattrs for
             * rados location.  cephfs whines about being obsolete to get
             * layout this way, but this appears to be the only way to get
             * location.  This may all break horribly in a future release
             * since MDS is undergoing heavy development.
             *
             * cephfs lets user pass file_offset in argv but it defaults
             * to 0.  Presumably this is the first extent of the pile of
             * extents (4mb each?) and shards for the file.  If user wants
             * to jump elsewhere with a non-zero offset, the resulting
             * rados object location may be different.
             */
            int fd = cache_file.handle();
            struct ceph_ioctl_dataloc location;
            location.file_offset = 0;
            int err = ioctl(fd, CEPH_IOC_GET_DATALOC, (unsigned long)&location);

            if (err) {
                qDebug() << "Location: Error getting rados location for "
                         << cache_path;
            } else {
                result = QString(location.object_name);
            }
            cache_file.close();
        } else {
            qDebug() << "Location: unable to open " << cache_path
                     << " readonly";
        }
    } else {
        qDebug() << "Location: cache file " << cache_path
                 << " does not exist";
    }
    return result;
}





RGW S3 Website hosting, non-clean code for early review

2015-06-23 Thread Robin H. Johnson
Hi,

As an extension of earlier work done by Yehuda [1], I've gotten the
great majority of the work done to support static website hosting in
RGW, just like AmazonS3 [2].

I need to do some cleanups of the code prior to major review for
submission, and solve one thorny problem first, have a few discussions
about best courses of action, and then I'll be submitting this for more
reviews before merging.

ceph [3]
s3-tests, unit tests [4] 
s3-tests, fuzzer tests [5]

The thorny problem:
---
One of the pieces of functionality in S3Website is the ability to serve
any public object in the bucket as the content on a custom error page
(think shiny 404 error). In some cases, like trivial 403/404 errors, we
can determine this quite early, before we fetch the object, and redirect
the request to the error object instead (provided that we also redo the
ACL check on the error object).

In more complex cases (eg 416 Range Unsatisfiable, 412 Precondition
Failed), it happens very late in the RGW request processing, and the
req_state struct seems to have been mangled/pre-filled with a lot of
decisions that aren't solvable.

Either I have to repeat a lot of code for it, which I'm not happy about,
or I have to refactor RGWGetObj* to more safely make the second GET
request for the error object (and make sure range headers etc are NOT
used for the get of the error object). I'm leaning to the latter.

Oh, and for added fun, if an error object is configured, but is missing
or private, you get a similar but different response than without any error
object configured, and sometimes the error codes are in the headers, but
not always.

Discussion pieces:
--
RGWRegion
- presently has both endpoints and hostnames, but doesn't make clear
  which APIs (S3, Swift, S3Website) might be available at each; or allow
  combinations to dedicate a specific FQDN to a given API.
  I'd like to replace both structures with a map structure [6]
Bucket existence privacy:
- In general I agree with the goal that we should be closely compatible
  with AmazonS3, but with an eye to security, I'd like to consider a specific
  deviation:
- In AmazonS3, you can enumerate buckets for existence, simply looking
  for 404 NoSuchBucket vs 403 AccessDenied. I'd like to offer a
  configuration option that returns 403 Forbidden or 401 Unauthorized on
  anonymous requests to non-existent buckets.
- Testing some of the functionality against AmazonS3 has been somewhat
  painful, as AmazonS3 only provides eventual consistency of the website
  configuration (with the highest time I've seen so far being about 30
  seconds).

New configuration options/changes:
--
rgw_enable_apis: gains 's3website' mode
rgw_dns_s3website_name: similar to rgw_dns_name, but for s3website endpoint
RGWRegion having per-rgw-api hostnames
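
A hypothetical ceph.conf fragment showing how the proposed options above might look once merged (the hostname values are invented):

[client.radosgw.gateway]
    rgw enable apis = s3, s3website            ; 's3website' is the proposed new mode
    rgw dns name = s3.example.com
    rgw dns s3website name = website.example.com   ; proposed option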

Patch series breakdown plans:
-
Here's the breakdown of patch series I'm considering for the changes
(net 2kLOC in ceph, 1kLOC in testcases).
[TODO marks pieces not in these sets of commits yet, but will be soon.]

ceph
- split Formatter.cc
  - JSON/XML/Table formatters are separate now
  - add header & footer support for formatters
  - add knowledge of status
  - add HTML formatter
- Add optional error handler hooks to RGWOp and RGWHandler for abort_early
- Add optional retarget handler hooks
- Add more flexible redirect handling
- S3website code
- x-amz-website-redirect-location handling (TODO: needs a bit more polish and 
testing)
- TODO: Add more input validations to match S3, on stuff that's NOT
  documented but was discovered when I applied weirder testcases to
  AmazonS3:
  - 'Hostname' field has non-trivial validation (maybe borrow the
outcome of wip-bucket_name_restrictions)
  - The 'Protocol' field for a redirect must be http/https, cannot be
gopher or anything else.
  - The HttpRedirectCode field must contain one of: 301-305, 307, 308
The docs don't say this, and the error message says 'Any 3XX value
except 300'.
  - First-match in RoutingRules wins; watch out with rules that match
4XX error codes.
- Documentation
  - TODO: esp the parts missing from the S3 docs above

s3-tests, unit tests
- refactor for more requests
- add new utilities
- add website tests
s3-tests, fuzzer tests [5]

Links for all the bits above

[1] https://github.com/ceph/ceph/tree/wip-static-website
[2] http://docs.aws.amazon.com/AmazonS3/latest/dev/WebsiteHosting.html
[3] 
https://github.com/ceph/ceph/compare/master...robbat2:wip-static-website-robbat2-master
[4] https://github.com/ceph/s3-tests/compare/master...robbat2:wip-static-website
[5] https://github.com/ceph/s3-tests/compare/master...robbat2:wip-website-fuzzy
[6] 
https://github.com/ceph/ceph/compare/master...robbat2:wip-static-website-robbat2-master#diff-ee7891a35944697538486c9269e0d65bR909

-- 
Robin Hugh Johnson
Gentoo Linux: Developer, Infrastructure Lead
E-Mail : robb...@gentoo.org
GnuPG FP   : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85

Re: erasure pool with isa plugin

2015-06-23 Thread Gregory Farnum
On Mon, Jun 22, 2015 at 10:34 PM, Loic Dachary l...@dachary.org wrote:
 Hi Tom,

 On 22/06/2015 17:10, Deneau, Tom wrote:
 If one has a cluster with some nodes that can run with the ISA plugin
 and some that cannot, is there a way to define a pool such that the
 ISA-capable nodes can use the ISA plugin and the others can use say
 the jerasure plugin?

 There is no way to do that, because there is no guarantee that an object 
 encoded by jerasure can be decoded by isa and vice versa.

Shouldn't we be able to set up something that *does* guarantee that,
though? Either by combining them into a single plugin which
dynamically configures to the fastest possible for that machine, or by
running them against the same object corpus and requiring that they
produce the same output?
-Greg


Re: erasure pool with isa plugin

2015-06-23 Thread Loic Dachary
Hi,

On 23/06/2015 06:06, Gregory Farnum wrote:
 On Mon, Jun 22, 2015 at 10:34 PM, Loic Dachary l...@dachary.org wrote:
 Hi Tom,

 On 22/06/2015 17:10, Deneau, Tom wrote:
 If one has a cluster with some nodes that can run with the ISA plugin
 and some that cannot, is there a way to define a pool such that the
 ISA-capable nodes can use the ISA plugin and the others can use say
 the jerasure plugin?

 There is no way to do that, because there is no guarantee that an object 
 encoded by jerasure can be decoded by isa and vice versa.
 
 Shouldn't we be able to set up something that *does* guarantee that,
 though? Either by combining them into a single plugin which
 dynamically configures to the fastest possible for that machine, or by
 running them against the same object corpus and requiring that they
 produce the same output?

I don't know enough about the maths involved to evaluate how difficult it is. 
Although both implement Reed Solomon using Vandermonde or Cauchy matrices, I 
think there are details that makes the output different. 

Cheers

 -Greg
 

-- 
Loïc Dachary, Artisan Logiciel Libre
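
Roughly the check Greg is proposing, as a hedged sketch; encode_with() is an invented helper, not the actual erasure-code plugin API:

#include <cassert>
#include <map>
#include <string>
#include <vector>

using Chunks = std::map<int, std::string>;

// Invented helper: encode `data` into k data + m coding chunks with the
// named plugin.  A real version would be an adapter over the plugin
// registry in src/erasure-code/.
Chunks encode_with(const std::string &plugin, const std::string &data);

// If both Reed-Solomon implementations agree chunk-for-chunk over a
// shared corpus, an object encoded by one is decodable by the other.
void check_compatible(const std::vector<std::string> &corpus)
{
    for (const auto &obj : corpus)
        assert(encode_with("jerasure", obj) == encode_with("isa", obj));
}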





Re: rsyslogd

2015-06-23 Thread Gregory Farnum
Did we end up creating a ticket for this? I saw this on one FS run as well.
-Greg

On Fri, Jun 19, 2015 at 8:03 AM, Gregory Farnum gfar...@redhat.com wrote:
 Not that I'm aware of. I don't mess with services at all in the log rotation 
 stuff I did (we already disabled generic log rotation when tests were 
 running).
 -Greg

 On Jun 19, 2015, at 12:13 AM, David Zafman dzaf...@redhat.com wrote:


 Greg,

 Have you changed anything (log rotation related?) that would uninstall or  
 cause rsyslog to not be able to start?

 I'm sometimes seeing machines fail with this error probably in 
 teuthology/nuke.py reset_syslog_dir().

 CommandFailedError: Command failed on plana94 with status 1: 'sudo rm -f -- 
 /etc/rsyslog.d/80-cephtest.conf && sudo service rsyslog restart'


 David





Monitor clock skew on Teuthology testing

2015-06-23 Thread Zhou, Yuan
Hi Zack,

I was testing on a home-made Teuthology cluster. So far I can use 
teuthology-suite to submit the test cases and start some workers to run the 
tests. From the Pulpito logs I can see most of the tests passed, except 
there was some error when aggregating the results in the last step. The error 
message was like:

2015-06-24 08:41:33.334317 mon.1 192.168.13.117:6789/0 4 : cluster [WRN] 
message from mon.0 was stamped 0.709253s in the future, clocks not 
synchronized in cluster log  

To my knowledge this was due to clock skew between the monitors. I checked the 
clock settings in Teuthology and found the NTP servers defined in 
ceph-qa-chef/cookbooks/ceph-qa/files/default/ntp.conf. Are there any other 
clock-related settings there?

Thanks, -yuan
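
For reference, the threshold that triggers this warning is configurable; a hedged ceph.conf sketch (mon_clock_drift_allowed defaults to 0.05 s, and raising it only hides skew rather than fixing synchronization):

[mon]
    ; widen the clock-skew warning threshold while chasing the NTP issue
    mon clock drift allowed = 0.5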



Re: RGW S3 Website hosting, non-clean code for early review

2015-06-23 Thread Robin H. Johnson

On Tue, Jun 23, 2015 at 04:30:19PM -0400, Yehuda Sadeh-Weinraub wrote:
  Either I have to repeat a lot of code for it, which I'm not happy about,
  or I have to refactor RGWGetObj* to more safely make the second GET
  request for the error object (and make sure range headers etc are NOT
  used for the get of the error object). I'm leaning to the latter.
 Is generating a new req_state a possibility? E.g., you catch the error
 at the top level, and restart most of the request processing with a
 newly created req_state?
That was the path I was trying, but not completely succeeding. 
I think I need to step it back further and have a partially customized
copy of the RGWEnv from client_io->get_env(), so that I can build the
modified req_info for req_state.

It isn't a full new GET really; it's just custom content for the
body as well as some headers (mostly Content-Length, Content-Type), but
ignore EPERM/EACCES on trying to fetch that custom content, and if they
are detected, consider that a success but with different HTML content.

 Great! I'll wait for the cleaned up pull request.
Do you want pull requests per logical change of my proposed series
split, or rather just one pull request with the full series?

-- 
Robin Hugh Johnson
Gentoo Linux: Developer, Infrastructure Lead
E-Mail : robb...@gentoo.org
GnuPG FP   : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85


Re: provisioning teuthology targets with OpenStack

2015-06-23 Thread Loic Dachary
Hi Zack,

I think it works and I'll start using it. For the record

https://github.com/ceph/teuthology/compare/master...dachary:wip-6502-openstack?expand=1#diff-88b99bb28683bd5b7e3a204826ead112R318

is what I have right now. There are a few things that need cleaning, but nothing 
major I hope. I'll clean them up after you've had time to review the preliminary 
pull requests at https://github.com/ceph/teuthology/pulls.

Cheers

On 09/06/2015 14:58, Loic Dachary wrote:
 
 
 On 09/06/2015 18:25, Zack Cerza wrote:
 Hi Loic!

 I'm really happy to hear you have the time and motivation to work on this! 
 It's something I've wanted to get to for a while, but other things keep 
 getting in the way.

 I'd suggest looking at teuthology.provision.Downburst to investigate maybe 
 pulling some of its functionality into a base class that a potential 
 OpenStack subclass could also benefit from. I'd love to see us using a 
 common interface for all of our provisioning. If that seems like a lot of 
 work, I'd be willing to be the one to create the base class and work with 
 you to make sure its API was useful for OpenStack as well.

 
 Thanks for the pointers. I'll write something that works and get back to you 
 to make it acceptable :-)
 
 Thanks,
 Zack

 - Original Message -
 From: Loic Dachary l...@dachary.org
 To: Zack Cerza zce...@redhat.com
 Cc: Ceph Development ceph-devel@vger.kernel.org
 Sent: Sunday, June 7, 2015 11:04:52 AM
 Subject: provisioning teuthology targets with OpenStack

 Hi Zack,

 I'm motivated by my recent experience with hacking teuthology and containers
 :-) I'd like to give

 http://tracker.ceph.com/issues/6502 provision targets using a cloud API

 a shot with OpenStack in mind (because I know nothing of EC2 really). The
 idea is to make it relatively easy for someone with access to an OpenStack
 tenant to use it instead of downburst. Not only would that simplify the
 development process for people without access to the lab, it would also
 allow the addition of more horsepower to the lab. What do you think ?

 Cheers
 --
 Loïc Dachary, Artisan Logiciel Libre


 

-- 
Loïc Dachary, Artisan Logiciel Libre





Setting up teuthology with OpenStack

2015-06-23 Thread Loic Dachary
Hi Ceph,

There is an experimental OpenStack backend[1] for teuthology[2] (the Ceph 
integration tests toolbox) with instructions to set it up[3] and some 
integration tests to verify it works as it should. I've run them successfully 
on two different OpenStack clusters (http://the.re/ and 
https://entercloudsuite.com/). It would be great if someone was brave enough to 
give it a try at this very early stage :-) If you don't have access to an 
OpenStack cluster, https://entercloudsuite.com/ can provide an OpenStack tenant 
within the hour (I had a few weird glitches during the registration but they 
went away magically).

Cheers

[1] teuthology OpenStack backend http://tracker.ceph.com/issues/6502
[2] teuthology https://github.com/ceph/teuthology/
[3] setting up teuthology with OpenStack 
https://github.com/dachary/teuthology/tree/wip-6502-openstack#openstack-backend

-- 
Loïc Dachary, Artisan Logiciel Libre





Re: RGW S3 Website hosting, non-clean code for early review

2015-06-23 Thread Yehuda Sadeh-Weinraub


- Original Message -
 From: Robin H. Johnson robb...@gentoo.org
 To: Yehuda Sadeh-Weinraub yeh...@redhat.com
 Cc: ceph-devel@vger.kernel.org, Jonathan LaCour 
 jonathan.lac...@dreamhost.com
 Sent: Tuesday, June 23, 2015 4:04:49 PM
 Subject: Re: RGW S3 Website hosting, non-clean code for early review
 
 
 On Tue, Jun 23, 2015 at 04:30:19PM -0400, Yehuda Sadeh-Weinraub wrote:
   Either I have to repeat a lot of code for it, which I'm not happy about,
   or I have to refactor RGWGetObj* to more safely make the second GET
   request for the error object (and make sure range headers etc are NOT
   used for the get of the error object). I'm leaning to the latter.
  Is generating a new req_state a possibility? E.g., you catch the error
  at the top level, and restart most of the request processing with a
  newly created req_state?
 That was the path I was trying, but not completely succeeding.
 I think I need to step it back further and have a partially customized
 copy of the RGWEnv from client_io->get_env(), so that I can build the
 modified req_info for req_state.
 
 It isn't a full new GET really; it's just custom content for the
 body as well as some headers (mostly Content-Length, Content-Type), but
 ignore EPERM/EACCES on trying to fetch that custom content, and if they
 are detected, consider that a success but with different HTML content.
 
  Great! I'll wait for the cleaned up pull request.
 Do you want pull requests per logical change of my proposed series
 split, or rather just one pull request with the full series?
 

One pull request for the full series.

Yehuda