Re: [PATCH] mark rbd requiring stable pages
On 10/22/15, 11:52 AM, Ilya Dryomov wrote:
> On Thu, Oct 22, 2015 at 5:37 PM, Mike Christie wrote:
>> On 10/22/2015 06:20 AM, Ilya Dryomov wrote:

>> If we are just talking about if stable pages are not used, and someone is re-writing data to a page after the page has already been submitted to the block layer (I mean the page is on some bio which is on a request which is on some request_queue scheduler list or basically anywhere in the block layer), then I was saying this can occur with any block driver. There is nothing that is preventing this from happening with a FC driver or nvme or cciss or in dm or whatever. The app/user can rewrite as late as when we are in the make_request_fn/request_fn.

>> I think I am misunderstanding your question because I thought this is expected behavior, and there is nothing drivers can do if the app is not doing a flush/sync between these types of write sequences.

> I don't see a problem with rewriting as late as when we are in request_fn() (or in a wq after being put there by request_fn()). Where I thought there *might* be an issue is rewriting after sendpage(), if sendpage() is used - perhaps some sneaky sequence similar to that retransmit bug that would cause us to *transmit* incorrect bytes (as opposed to *re*transmit) or something of that nature?

>> Just to make sure we are on the same page: are you concerned about the tcp/net layer retransmitting due to it detecting an issue as part of the tcp protocol, or are you concerned about rbd/libceph initiating a retry like with the nfs issue?

> The former, tcp/net layer. I'm just conjecturing though.

For iscsi, we normally use the sendpage path. Data digests are off by default and some distros do not even allow you to turn them on, so our sendpage path has got a lot of testing and we have not seen any corruptions. Not saying it is not possible, but just saying we have not seen any. It could be due to a recent change. Ronny, tell us about the workload and I will check iscsi.

Oh yeah, for the tcp/net retransmission case, I had said offlist I thought there might be an issue with iscsi, but I guess I was wrong, so I have not seen any issues with that either. iSCSI just has that bug I mentioned offlist where we close the socket and fail commands upwards in the wrong order. That is an iscsi-specific bug though.

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
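The failure mode the thread keeps circling can be seen in miniature without any kernel code: if a checksum (data digest) is computed when a page is submitted, but the application rewrites the page before the bytes actually leave, the receiver's checksum no longer matches. A toy Python sketch, with `zlib.crc32` standing in for Ceph's crc32c — this is an illustration of the race, not libceph or iscsi code:

```python
import zlib

PAGE = 4096

def checksum_at_submit(page: bytearray) -> int:
    """The digest is computed when the I/O is submitted."""
    return zlib.crc32(bytes(page))

# Stable page: nothing touches the buffer between submit and transmit.
page = bytearray(b"A" * PAGE)
crc = checksum_at_submit(page)
wire = bytes(page)                       # what actually goes on the wire
match_stable = zlib.crc32(wire) == crc

# Unstable page: the application rewrites the page while it is "in flight".
page = bytearray(b"A" * PAGE)
crc = checksum_at_submit(page)
page[0:4] = b"BBBB"                      # rewrite after submission
wire = bytes(page)                       # the rewritten bytes hit the wire
match_unstable = zlib.crc32(wire) == crc

print(match_stable, match_unstable)      # True False
```

With digests off (the common iscsi configuration above), there is no mismatch to detect, which is exactly why that path can run for years without visibly tripping over unstable pages.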
Re: newstore direction
On Thu, 2015-10-22 at 02:12 +0000, Allen Samuels wrote:
> One of the biggest changes that flash is making in the storage world is that the basic trade-offs in storage management software architecture are being affected. In the HDD world CPU time per IOP was relatively inconsequential, i.e., it had little effect on overall performance, which was limited by the physics of the hard drive. Flash is now inverting that situation. When you look at the performance levels being delivered by the latest generation of NVMe SSDs you rapidly see that the storage itself is generally no longer the bottleneck (speaking about BW, not latency of course) but rather it's the system sitting in front of the storage that is the bottleneck. Generally it's the CPU cost of an IOP.
>
> When Sandisk first started working with Ceph (Dumpling), the design of librados and the OSD led to a situation where the CPU cost of an IOP was dominated by context switches and network socket handling. Over time, much of that has been addressed. The socket handling code has been re-written (more than once!) and some of the internal queueing in the OSD (and the associated context switches) has been eliminated. As the CPU costs have dropped, performance on flash has improved accordingly.
>
> Because we didn't want to completely re-write the OSD (time-to-market and stability drove that decision), we didn't move it from the current "thread per IOP" model into a truly asynchronous "thread per CPU core" model that essentially eliminates context switches in the IO path. But a fully optimized OSD would go down that path (at least part-way). I believe it's been proposed in the past. Perhaps a hybrid "fast-path" style could get most of the benefits while preserving much of the legacy code.

+1. It's not just about reducing context switches; it's also about removing contention and data copies and getting better cache utilization.

ScyllaDB just did this to Cassandra (using the seastar library): http://www.zdnet.com/article/kvm-creators-open-source-fast-cassandra-drop-in-replacement-scylla/

Orit

> I believe this trend toward thread-per-core software development will also tend to support the "do it in user-space" trend. That's because most of the kernel and file-system interface is architected around the blocking "thread-per-IOP" model and is unlikely to change in the future.
>
> Allen Samuels
> Software Architect, Fellow, Systems and Software Solutions
>
> 2880 Junction Avenue, San Jose, CA 95134
> T: +1 408 801 7030 | M: +1 408 780 6416
> allen.samu...@sandisk.com
>
> -----Original Message-----
> From: Martin Millnert [mailto:mar...@millnert.se]
> Sent: Thursday, October 22, 2015 6:20 AM
> To: Mark Nelson
> Cc: Ric Wheeler; Allen Samuels; Sage Weil; ceph-devel@vger.kernel.org
> Subject: Re: newstore direction
>
> Adding 2c
>
> On Wed, 2015-10-21 at 14:37 -0500, Mark Nelson wrote:
>> My thought is that there is some inflection point where the userland kvstore/block approach is going to be less work, for everyone I think, than trying to quickly discover, understand, fix, and push upstream patches that sometimes only really benefit us. I don't know if we've truly hit that point, but it's tough for me to find flaws with Sage's argument.
>
> Regarding the userland / kernel land aspect of the topic, there are further aspects AFAIK not yet addressed in the thread: in the networking world, there's been development on memory-mapped (multiple approaches exist) userland networking, which for packet management has the benefit of - for very, very specific applications of networking code - avoiding e.g. per-packet context switches etc., and streamlining processor cache management performance. People have gone as far as removing CPU cores from the CPU scheduler to completely dedicate them to the networking task at hand (cache optimizations). There are various latency/throughput (bulking) optimizations applicable, but at the end of the day, it's about keeping the CPU bus busy with "revenue" bus traffic.
>
> Granted, storage IO operations may be much heavier in cycle counts for context switches to ever appear as a problem in themselves, certainly for slower SSDs and HDDs. However, when going for truly high performance IO, *every* hurdle in the data path counts toward the total latency. (And really, high performance random IO characteristics approach the networking, per-packet handling characteristics.) Now, I'm not really suggesting memory-mapping a storage device to user space, not at all, but having better control over the data path for a very specific use case reduces dependency on the code that works as best as possible for the general case, and allows for very purpose-built code to address a narrow set of requirements.
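The "thread per CPU core" model Allen describes (and the seastar design behind Scylla that Orit points to) comes down to sharding: each core owns a private slice of the state, and cross-shard work travels as messages instead of lock acquisitions. A deliberately simplified Python sketch of the idea — this is not how seastar or the OSD is actually written, and the shard count and routing function are invented for illustration:

```python
import queue
import threading

NUM_SHARDS = 4                          # stand-in for one shard per CPU core

def owner(key: str) -> int:
    """Each object is owned by exactly one shard, so shard state needs no locks."""
    return hash(key) % NUM_SHARDS

mailboxes = [queue.Queue() for _ in range(NUM_SHARDS)]
stores = [dict() for _ in range(NUM_SHARDS)]   # per-shard private state

def shard_loop(i: int) -> None:
    while True:
        op = mailboxes[i].get()
        if op is None:                  # shutdown sentinel
            return
        key, val = op
        stores[i][key] = val            # only shard i ever touches stores[i]

threads = [threading.Thread(target=shard_loop, args=(i,)) for i in range(NUM_SHARDS)]
for t in threads:
    t.start()

# Requests are routed to the owning shard as messages, not lock acquisitions.
for k in ("obj1", "obj2", "obj3"):
    mailboxes[owner(k)].put((k, k.upper()))

for mb in mailboxes:
    mb.put(None)
for t in threads:
    t.join()

assert sum(len(s) for s in stores) == 3
```

In a real thread-per-core design each loop would additionally be pinned to its core and would poll rather than block, which is where the context-switch and cache-locality wins come from.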
Re: newstore direction
On Wed, Oct 21, 2015 at 10:30:28AM -0700, Sage Weil wrote:
> For example: we need to do an overwrite of an existing object that is atomic with respect to a larger ceph transaction (we're updating a bunch of other metadata at the same time, possibly overwriting or appending to multiple files, etc.). XFS and ext4 aren't cow file systems, so plugging into the transaction infrastructure isn't really an option (and even after several years of trying to do it with btrfs it proved to be impractical).

Not that I'm disagreeing with most of your points, but we can do things like that with swapext-like hacks. Below is my half-year-old prototype of an O_ATOMIC implementation for XFS that gives you atomic out-of-place writes.

diff --git a/fs/fcntl.c b/fs/fcntl.c
index ee85cd4..001dd49 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -740,7 +740,7 @@ static int __init fcntl_init(void)
 	 * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
 	 * is defined as O_NONBLOCK on some platforms and not on others.
 	 */
-	BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ != HWEIGHT32(
+	BUILD_BUG_ON(22 - 1 /* for O_RDONLY being 0 */ != HWEIGHT32(
 		O_RDONLY	| O_WRONLY	| O_RDWR	|
 		O_CREAT		| O_EXCL	| O_NOCTTY	|
 		O_TRUNC		| O_APPEND	| /* O_NONBLOCK	| */
@@ -748,6 +748,7 @@ static int __init fcntl_init(void)
 		O_DIRECT	| O_LARGEFILE	| O_DIRECTORY	|
 		O_NOFOLLOW	| O_NOATIME	| O_CLOEXEC	|
 		__FMODE_EXEC	| O_PATH	| __O_TMPFILE	|
+		O_ATOMIC	|
 		__FMODE_NONOTIFY
 		));
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index aeffeaa..8eafca6 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -4681,14 +4681,14 @@ xfs_bmap_del_extent(
 	xfs_btree_cur_t		*cur,	/* if null, not a btree */
 	xfs_bmbt_irec_t		*del,	/* data to remove from extents */
 	int			*logflagsp, /* inode logging flags */
-	int			whichfork) /* data or attr fork */
+	int			whichfork, /* data or attr fork */
+	bool			free_blocks) /* free extent at end of routine */
 {
 	xfs_filblks_t		da_new;	/* new delay-alloc indirect blocks */
 	xfs_filblks_t		da_old;	/* old delay-alloc indirect blocks */
 	xfs_fsblock_t		del_endblock=0;	/* first block past del */
 	xfs_fileoff_t		del_endoff;	/* first offset past del */
 	int			delay;	/* current block is delayed allocated */
-	int			do_fx;	/* free extent at end of routine */
 	xfs_bmbt_rec_host_t	*ep;	/* current extent entry pointer */
 	int			error;	/* error return value */
 	int			flags;	/* inode logging flags */
@@ -4712,8 +4712,8 @@ xfs_bmap_del_extent(
 	mp = ip->i_mount;
 	ifp = XFS_IFORK_PTR(ip, whichfork);
-	ASSERT((*idx >= 0) && (*idx < ifp->if_bytes /
-		(uint)sizeof(xfs_bmbt_rec_t)));
+	ASSERT(*idx >= 0);
+	ASSERT(*idx < ifp->if_bytes / sizeof(xfs_bmbt_rec_t));
 	ASSERT(del->br_blockcount > 0);
 	ep = xfs_iext_get_ext(ifp, *idx);
 	xfs_bmbt_get_all(ep, &got);
@@ -4746,10 +4746,13 @@ xfs_bmap_del_extent(
 		len = del->br_blockcount;
 		do_div(bno, mp->m_sb.sb_rextsize);
 		do_div(len, mp->m_sb.sb_rextsize);
-		error = xfs_rtfree_extent(tp, bno, (xfs_extlen_t)len);
-		if (error)
-			goto done;
-		do_fx = 0;
+		if (free_blocks) {
+			error = xfs_rtfree_extent(tp, bno,
+						  (xfs_extlen_t)len);
+			if (error)
+				goto done;
+			free_blocks = 0;
+		}
 		nblks = len * mp->m_sb.sb_rextsize;
 		qfield = XFS_TRANS_DQ_RTBCOUNT;
 	}
@@ -4757,7 +4760,6 @@ xfs_bmap_del_extent(
 	 * Ordinary allocation.
 	 */
 	else {
-		do_fx = 1;
 		nblks = del->br_blockcount;
 		qfield = XFS_TRANS_DQ_BCOUNT;
 	}
@@ -4777,7 +4779,7 @@ xfs_bmap_del_extent(
 		da_old = startblockval(got.br_startblock);
 		da_new = 0;
 		nblks = 0;
-		do_fx = 0;
+		free_blocks = 0;
 	}
 	/*
 	 * Set flag value to use in switch statement.
@@ -4963,7 +4965,7 @@ xfs_bmap_del_extent(
 	/*
 	 * If we
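What the O_ATOMIC prototype provides in-kernel — write the new blocks out of place, then commit them into the file in one atomic step — can be approximated from user space for whole-file overwrites with the classic write-to-temp-then-rename pattern. A sketch of the idea; this is an analogy to the swapext-based commit, not the O_ATOMIC interface itself, and the file name and sizes are invented:

```python
import os
import tempfile

def atomic_overwrite(path: str, data: bytes) -> None:
    """Write the new contents out of place, then atomically swap them in.
    Readers see either all-old or all-new data, never a mix (rename here
    plays the role the extent swap plays in the kernel patch)."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(path)))
    try:
        os.write(fd, data)
        os.fsync(fd)                 # new blocks are durable before the swap
    finally:
        os.close(fd)
    os.replace(tmp, path)            # atomic within a POSIX filesystem

path = os.path.join(tempfile.gettempdir(), "atomic_demo.bin")
with open(path, "wb") as f:
    f.write(b"old" * 1000)

atomic_overwrite(path, b"new" * 1000)

with open(path, "rb") as f:
    assert f.read() == b"new" * 1000
```

The kernel-side version is strictly more powerful: it works for partial overwrites of a large file and keeps the inode identity, which is what a Ceph transaction actually needs.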
Re: [PATCH] mark rbd requiring stable pages
On Thu, Oct 22, 2015 at 7:22 PM, Mike Christie wrote:
> On 10/22/15, 11:52 AM, Ilya Dryomov wrote:
>> On Thu, Oct 22, 2015 at 5:37 PM, Mike Christie wrote:
>>> On 10/22/2015 06:20 AM, Ilya Dryomov wrote:

>>> If we are just talking about if stable pages are not used, and someone is re-writing data to a page after the page has already been submitted to the block layer (I mean the page is on some bio which is on a request which is on some request_queue scheduler list or basically anywhere in the block layer), then I was saying this can occur with any block driver. There is nothing that is preventing this from happening with a FC driver or nvme or cciss or in dm or whatever. The app/user can rewrite as late as when we are in the make_request_fn/request_fn.

>>> I think I am misunderstanding your question because I thought this is expected behavior, and there is nothing drivers can do if the app is not doing a flush/sync between these types of write sequences.

>> I don't see a problem with rewriting as late as when we are in request_fn() (or in a wq after being put there by request_fn()). Where I thought there *might* be an issue is rewriting after sendpage(), if sendpage() is used - perhaps some sneaky sequence similar to that retransmit bug that would cause us to *transmit* incorrect bytes (as opposed to *re*transmit) or something of that nature?

>>> Just to make sure we are on the same page: are you concerned about the tcp/net layer retransmitting due to it detecting an issue as part of the tcp protocol, or are you concerned about rbd/libceph initiating a retry like with the nfs issue?

>> The former, tcp/net layer. I'm just conjecturing though.

> For iscsi, we normally use the sendpage path. Data digests are off by default and some distros do not even allow you to turn them on, so our sendpage path has got a lot of testing and we have not seen any corruptions. Not saying it is not possible, but just saying we have not seen any.

Great, that's reassuring.

> It could be due to a recent change. Ronny, tell us about the workload and I will check iscsi.
>
> Oh yeah, for the tcp/net retransmission case, I had said offlist I thought there might be an issue with iscsi, but I guess I was wrong, so I have not seen any issues with that either.

I'll drop my concerns then. Those corruptions could be a bug in ceph reconnect code or something else - regardless, that's separate from the issue at hand.

Thanks,

                Ilya
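Mike's point about the copying path being safe can be demonstrated from user space: a copying send snapshots the buffer into the kernel at call time, so rewriting the page afterwards cannot change what was transmitted. A toy sketch over an AF_UNIX socketpair — Python's `sendall` goes through the copying path, and this is an illustration only, not iscsi or libceph code:

```python
import socket

a, b = socket.socketpair()
page = bytearray(b"X" * 64)

a.sendall(page)            # copying path: bytes are snapshotted into the
                           # kernel socket buffer at this point
page[:4] = b"YYYY"         # rewriting the "page" afterwards...

received = b.recv(64)
assert received == b"X" * 64   # ...does not change what was transmitted

a.close()
b.close()
```

With true zero-copy transmission (kernel `sendpage()`, or `MSG_ZEROCOPY`) the kernel holds a reference to the page itself rather than a copy, so the same rewrite could end up on the wire or in a retransmit — which is exactly why stable pages matter on that path.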
Re: [PATCH] mark rbd requiring stable pages
On Thu, Oct 22, 2015 at 6:07 AM, Mike Christie wrote:
> On 10/21/2015 03:57 PM, Ilya Dryomov wrote:
>> On Wed, Oct 21, 2015 at 10:51 PM, Ilya Dryomov wrote:
>>> On Fri, Oct 16, 2015 at 1:09 PM, Ilya Dryomov wrote:

>>>> Hmm... On the one hand, yes, we do compute CRCs, but that's optional, so enabling this unconditionally is probably too harsh. OTOH we are talking to the network, which means all sorts of delays, retransmission issues, etc, so I wonder how exactly "unstable" pages behave when, say, added to an skb - you can't write anything to a page until networking is fully done with it and expect it to work. It's particularly alarming that you've seen corruptions. Currently the only users of this flag are block integrity stuff and md-raid5, which makes me wonder what iscsi, nfs and others do in this area. There's an old ticket on this topic somewhere on the tracker, so I'll need to research this. Thanks for bringing this up!

>>> Hi Mike,
>>>
>>> I was hoping to grab you for a few minutes, but you weren't there...
>>>
>>> I spent a better part of today reading code and mailing lists on this topic. It is of course a bug that we use sendpage(), which inlines pages into an skb, and do nothing to keep those pages stable. We have csums enabled by default, so setting BDI_CAP_STABLE_WRITES in the crc case is an obvious fix.
>>>
>>> I looked at drbd and iscsi and I think iscsi could do the same - ditch the fallback to sock_no_sendpage() in the datadgst_en case (and get rid of the iscsi_sw_tcp_conn::sendpage member while at it). Using stable pages rather than having a roll-your-own implementation which doesn't close the race but only narrows it sounds like a win, unless copying through sendmsg() is for some reason cheaper than stable-waiting?

> Yeah, that is what I was saying on the call the other day, but the reception was bad. We only have the sendmsg code path when digests are on because that code came before stable pages. When stable pages were created, they were on by default but did not cover all the cases, so we left the code. They then handled most scenarios, but I just never got around to removing the old code. However, stable pages were later set to off by default, so I left the code and made this patch for iscsi to turn on stable pages:
>
> [this patch only enables stable pages when digests/crcs are on and does not remove the old code yet]
> https://groups.google.com/forum/#!topic/open-iscsi/n4jvWK7BPYM
>
> I did not really like the layering so I have not posted it for inclusion.

Good to know I got it right ;)

>>> drbd still needs the non-zero-copy version for its async protocol, for when they free the pages before the NIC has a chance to put them on the wire. md-raid5, it turns out, has an option to essentially disable most of its stripe cache, and it sets BDI_CAP_STABLE_WRITES to compensate if that option is enabled.
>>>
>>> What I'm worried about is the !crc (!datadgst_en) case. I'm failing to convince myself that mucking with sendpage()ed pages while they sit in the TCP queue (or anywhere in the networking stack, really) is safe - there is nothing to prevent pages from being modified after sendpage() returned, and Ronny reports data corruptions that pretty much went away with BDI_CAP_STABLE_WRITES set. I may be, after prolonged staring at this, starting to confuse fs with block, though. How does that work in iscsi land?

> This is what I was trying to ask about in the call the other day. Where is the corruption that Ronny was seeing? Was it checksum mismatches on data being written, or incorrect metadata being written, etc?

Well, checksum mismatches are to be expected given what we are doing now, but I wouldn't expect any data corruptions. Ronny writes that he saw frequent ext4 corruptions on krbd devices before he enabled stable pages, which leads me to believe that the !crc case, for which we won't be setting BDI_CAP_STABLE_WRITES, is going to be/remain broken. Ronny, could you describe it in more detail and maybe share some of those osd logs with bad crc messages?

> If we are just talking about if stable pages are not used, and someone is re-writing data to a page after the page has already been submitted to the block layer (I mean the page is on some bio which is on a request which is on some request_queue scheduler list or basically anywhere in the block layer), then I was saying this can occur with any block driver. There is nothing that is preventing this from happening with a FC driver or nvme or cciss or in dm or whatever. The app/user can rewrite as late as when we are in the make_request_fn/request_fn.
>
> I think I am misunderstanding your question because I thought this is expected behavior, and there is nothing drivers can do if the app is not doing a
keyring issues, 9.1.0
My current situation as I upgrade to v9.1.0 is that the client.admin keyring seems to work fine, for instance for the ceph status command. But commands that use client.bootstrap-osd, such as

  /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring osd create --concise a428120d-99ec-4a73-999f-75d8a6bfcb2e

are getting "EACCES: access denied", with log entries in ceph.audit.log such as

  2015-10-22 13:50:24.070249 mon.0 10.0.2.132:6789/0 33 : audit [INF] from='client.? 10.0.2.132:0/263577121' entity='client.bootstrap-osd' cmd=[{"prefix": "osd create", "uuid": "a428120d-99ec-4a73-999f-75d8a6bfcb2e"}]: access denied

I tried setting debug auth = 0 in ceph.conf but couldn't tell anything from that output. Is there anything special I should look for here?

Note: I do have /var/lib/ceph and subdirectories owned by ceph:ceph

--
Tom
RE: newstore direction
Hi Sage and other fellow cephers,

I truly share the pains with you all about filesystems while I am working on the objectstore to improve performance. As mentioned, there is nothing wrong with the filesystem. It is just that Ceph, as one use case, needs more support than filesystems are going to provide in the near future, for whatever reasons.

There are so many techniques popping out which can help improve the performance of the OSD. A user space driver (DPDK from Intel) is one of them. It not only gives you the storage allocator, it also gives you thread scheduling support, CPU affinity, NUMA friendliness, and polling, which might fundamentally change the performance of the objectstore. It should not be hard to improve CPU utilization 3x~5x and get higher IOPS, etc.

I totally agree that the goal of filestore is to give enough support for filesystems with either the 1, 1b, or 2 solutions. In my humble opinion, the new design goal of the objectstore should focus on giving the best performance for the OSD with new techniques. These two goals are not going to conflict with each other. They are just for different purposes, to make Ceph not only more stable but also better.

Scylla, mentioned by Orit, is a good example.

Thanks all.

Regards,
James

-----Original Message-----
From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
Sent: Thursday, October 22, 2015 5:50 AM
To: Ric Wheeler
Cc: Orit Wasserman; ceph-devel@vger.kernel.org
Subject: Re: newstore direction

On Wed, 21 Oct 2015, Ric Wheeler wrote:
> You will have to trust me on this as the Red Hat person who spoke to pretty much all of our key customers about local file systems and storage - customers have all migrated over to using normal file systems under Oracle/DB2. Typically, they use XFS or ext4. I don't know of any non-standard file systems and have only seen one account running on a raw block store in 8 years :)
>
> If you have a pre-allocated file and write using O_DIRECT, your IO path is identical in terms of IO's sent to the device.
>
> If we are causing additional IO's, then we really need to spend some time talking to the local file system gurus about this in detail. I can help with that conversation.

If the file is truly preallocated (that is, prewritten with zeros... fallocate doesn't help here because the extents are marked unwritten), then sure: there is very little change in the data path.

But at that point, what is the point? This only works if you have one (or a few) huge files and the user space app already has all the complexity of a filesystem-like thing (with its own internal journal, allocators, garbage collection, etc.). Do they just do this to ease administrative tasks like backup?

This is the fundamental tradeoff:

1) We have a file per object. We fsync like crazy and the fact that there are two independent layers journaling and managing different types of consistency penalizes us.

1b) We get clever and start using obscure and/or custom ioctls in the file system to work around what it is used to: we swap extents to avoid write-ahead (see Christoph's patch), O_NOMTIME, unprivileged open-by-handle, batch fsync, O_ATOMIC, setext ioctl, etc.

2) We preallocate huge files and write a user-space object system that lives within it (pretending the file is a block device). The file system rarely gets in the way (assuming the file is prewritten and we don't do anything stupid). But it doesn't give us anything a block device wouldn't, and it doesn't save us any complexity in our code.

At the end of the day, 1 and 1b are always going to be slower than 2. And although 1b performs a bit better than 1, it has similar (user-space) complexity to 2. On the other hand, if you step back and view the entire stack (ceph-osd + XFS), 1 and 1b are *significantly* more complex than 2... and yet still slower. Given we ultimately have to support both (both as an upstream and as a distro), that's not very attractive.

Also note that every time we have strayed from the beaten path (1) to anything mildly exotic (1b) we have been bitten by obscure file system bugs. And that assumes we get everything we need upstream... which is probably a year's endeavour.

Don't get me wrong: I'm all for making changes to file systems to better support systems like Ceph. Things like O_NOCMTIME and O_ATOMIC make a huge amount of sense for a ton of different systems. But our situation is a bit different: we always own the entire device (and often the server), so there is no need to share with other users or apps (and when you do, you just use the existing FileStore backend). And as you know performance is a huge pain point. We are already handicapped by virtue of being distributed and strongly consistent; we can't afford to give away more to a storage layer that isn't providing us much (or the right) value.

And I'm tired of half
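Sage's option (2) — one huge preallocated file with a user-space object system living inside it — can be caricatured in a few lines. A toy sketch only: a bump allocator with no free-space management, no journal, and no crash safety beyond a single fsync; every name and size here is invented for illustration:

```python
import os
import tempfile

class TinyStore:
    """Toy object store inside one big preallocated file: the file is
    treated like a block device, and extents are handed out by a bump
    allocator (real code needs free lists, a journal, etc.)."""

    def __init__(self, path: str, size: int) -> None:
        self.f = open(path, "w+b")
        self.f.truncate(size)          # stand-in for real preallocation
        self.next_off = 0              # bump allocator cursor
        self.index = {}                # object name -> (offset, length)

    def put(self, name: str, data: bytes) -> None:
        off = self.next_off
        self.f.seek(off)
        self.f.write(data)
        self.f.flush()
        os.fsync(self.f.fileno())      # one layer of consistency, not two
        self.index[name] = (off, len(data))
        self.next_off = off + len(data)

    def get(self, name: str) -> bytes:
        off, length = self.index[name]
        self.f.seek(off)
        return self.f.read(length)

store = TinyStore(os.path.join(tempfile.gettempdir(), "tinystore.img"), 1 << 20)
store.put("obj_a", b"hello")
store.put("obj_b", b"world")
assert store.get("obj_a") == b"hello"
```

The point of the sketch is the tradeoff Sage names: all the filesystem-like complexity (index, allocator, durability) has simply moved into the application, which is why at that stage a raw block device gives up nothing.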
tracker.ceph.com downtime today
tracker.ceph.com will be brought down today for upgrade and move to a new host. I plan to do this at about 4PM PST (40 minutes from now). Expect a downtime of about 15-20 minutes. More notification to follow.

--
Dan Mick
Red Hat, Inc.
Ceph docs: http://ceph.com/docs
Re: tracker.ceph.com downtime today
It's back. New DNS info is propagating its way around. If you absolutely must get to it, newtracker.ceph.com is the new address, but please don't bookmark that, as it will be going away after the transition. Please let me know of any problems you have.

On 10/22/2015 04:09 PM, Dan Mick wrote:
> tracker.ceph.com down now
>
> On 10/22/2015 03:20 PM, Dan Mick wrote:
>> tracker.ceph.com will be brought down today for upgrade and move to a new host. I plan to do this at about 4PM PST (40 minutes from now). Expect a downtime of about 15-20 minutes. More notification to follow.

--
Dan Mick
Red Hat, Inc.
Ceph docs: http://ceph.com/docs
Re: newstore direction
Since the changes which moved the pg log and the pg info into the pg object space, I think it's now the case that any transaction submitted to the objectstore updates a disjoint range of objects determined by the sequencer. It might be easier to exploit that parallelism if we control allocation and allocation-related metadata. We could split the store into N pieces which partition the pg space (one additional one for the meta sequencer?) with one rocksdb instance for each. Space could then be parcelled out in large pieces (small frequency of global allocation decisions) and managed more finely within each partition. The main challenge would be avoiding internal fragmentation of those, but at least defragmentation can be managed on a per-partition basis. Such parallelism is probably necessary to exploit the full throughput of some ssds.

-Sam

On Thu, Oct 22, 2015 at 10:42 AM, James (Fei) Liu-SSI wrote:
> Hi Sage and other fellow cephers,
>
> I truly share the pains with you all about filesystems while I am working on the objectstore to improve performance. As mentioned, there is nothing wrong with the filesystem. It is just that Ceph, as one use case, needs more support than filesystems are going to provide in the near future, for whatever reasons.
>
> There are so many techniques popping out which can help improve the performance of the OSD. A user space driver (DPDK from Intel) is one of them. It not only gives you the storage allocator, it also gives you thread scheduling support, CPU affinity, NUMA friendliness, and polling, which might fundamentally change the performance of the objectstore. It should not be hard to improve CPU utilization 3x~5x and get higher IOPS, etc.
>
> I totally agree that the goal of filestore is to give enough support for filesystems with either the 1, 1b, or 2 solutions. In my humble opinion, the new design goal of the objectstore should focus on giving the best performance for the OSD with new techniques. These two goals are not going to conflict with each other. They are just for different purposes, to make Ceph not only more stable but also better.
>
> Scylla, mentioned by Orit, is a good example.
>
> Thanks all.
>
> Regards,
> James
>
> -----Original Message-----
> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
> Sent: Thursday, October 22, 2015 5:50 AM
> To: Ric Wheeler
> Cc: Orit Wasserman; ceph-devel@vger.kernel.org
> Subject: Re: newstore direction
>
> On Wed, 21 Oct 2015, Ric Wheeler wrote:
>> You will have to trust me on this as the Red Hat person who spoke to pretty much all of our key customers about local file systems and storage - customers have all migrated over to using normal file systems under Oracle/DB2. Typically, they use XFS or ext4. I don't know of any non-standard file systems and have only seen one account running on a raw block store in 8 years :)
>>
>> If you have a pre-allocated file and write using O_DIRECT, your IO path is identical in terms of IO's sent to the device.
>>
>> If we are causing additional IO's, then we really need to spend some time talking to the local file system gurus about this in detail. I can help with that conversation.
>
> If the file is truly preallocated (that is, prewritten with zeros... fallocate doesn't help here because the extents are marked unwritten), then sure: there is very little change in the data path.
>
> But at that point, what is the point? This only works if you have one (or a few) huge files and the user space app already has all the complexity of a filesystem-like thing (with its own internal journal, allocators, garbage collection, etc.). Do they just do this to ease administrative tasks like backup?
>
> This is the fundamental tradeoff:
>
> 1) We have a file per object. We fsync like crazy and the fact that there are two independent layers journaling and managing different types of consistency penalizes us.
>
> 1b) We get clever and start using obscure and/or custom ioctls in the file system to work around what it is used to: we swap extents to avoid write-ahead (see Christoph's patch), O_NOMTIME, unprivileged open-by-handle, batch fsync, O_ATOMIC, setext ioctl, etc.
>
> 2) We preallocate huge files and write a user-space object system that lives within it (pretending the file is a block device). The file system rarely gets in the way (assuming the file is prewritten and we don't do anything stupid). But it doesn't give us anything a block device wouldn't, and it doesn't save us any complexity in our code.
>
> At the end of the day, 1 and 1b are always going to be slower than 2. And although 1b performs a bit better than 1, it has similar (user-space) complexity to 2. On the other hand, if you step back and view the entire stack (ceph-osd + XFS), 1 and 1b are *significantly* more complex than 2... and yet still
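Sam's proposal — partition the pg space into N pieces, give each its own kv instance, hand out space in large chunks globally and manage it finely per partition — might look schematically like this. Plain dicts stand in for the rocksdb instances, and the partition count, routing function, and chunk size are invented for illustration:

```python
NUM_PARTITIONS = 8                     # N pieces of the pg space
CHUNK = 1 << 20                        # coarse unit of global allocation

# One "rocksdb" and one fine-grained allocator cursor per partition.
parts = [{"base": 0, "used": CHUNK, "kv": {}} for _ in range(NUM_PARTITIONS)]
global_next = 0                        # rare, coarse global allocator

def partition(pg_id: int) -> int:
    """A transaction's sequencer maps it to exactly one partition."""
    return pg_id % NUM_PARTITIONS

def alloc(pg_id: int, size: int) -> int:
    """Fine-grained allocation inside one partition; a global decision is
    needed only when the partition has to grab a fresh chunk."""
    global global_next
    p = parts[partition(pg_id)]
    if p["used"] + size > CHUNK:       # partition ran dry: one global grab
        p["base"], global_next = global_next, global_next + CHUNK
        p["used"] = 0
    off = p["base"] + p["used"]
    p["used"] += size
    return off

def write(pg_id: int, obj: str, data: bytes) -> int:
    off = alloc(pg_id, len(data))
    parts[partition(pg_id)]["kv"][obj] = (off, len(data))  # metadata stays local
    return off

o1 = write(3, "obj_a", b"x" * 4096)
o2 = write(11, "obj_b", b"y" * 4096)   # pg 11 -> same partition as pg 3
o3 = write(4, "obj_c", b"z" * 4096)    # different partition, disjoint chunk

assert partition(3) == partition(11)
assert o2 == o1 + 4096                 # packed within the shared partition
assert o3 == CHUNK                     # second partition got the next chunk
```

Because transactions for different partitions never touch the same allocator cursor or kv instance, they can proceed in parallel; the internal-fragmentation risk Sam mentions shows up here as the unused tail of each partition's current chunk.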
Re: newstore direction
Ah, except for the snapmapper. We can split the snapmapper in the same way, though, as long as we are careful with the name. -Sam On Thu, Oct 22, 2015 at 4:42 PM, Samuel Justwrote: > Since the changes which moved the pg log and the pg info into the pg > object space, I think it's now the case that any transaction submitted > to the objectstore updates a disjoint range of objects determined by > the sequencer. It might be easier to exploit that parallelism if we > control allocation and allocation related metadata. We could split > the store into N pieces which partition the pg space (one additional > one for the meta sequencer?) with one rocksdb instance for each. > Space could then be parcelled out in large pieces (small frequency of > global allocation decisions) and managed more finely within each > partition. The main challenge would be avoiding internal > fragmentation of those, but at least defragmentation can be managed on > a per-partition basis. Such parallelism is probably necessary to > exploit the full throughput of some ssds. > -Sam > > On Thu, Oct 22, 2015 at 10:42 AM, James (Fei) Liu-SSI > wrote: >> Hi Sage and other fellow cephers, >> I truly share the pains with you all about filesystem while I am working >> on objectstore to improve the performance. As mentioned , there is nothing >> wrong with filesystem. Just the Ceph as one of use case need more supports >> but not provided in near future by filesystem no matter what reasons. >> >>There are so many techniques pop out which can help to improve >> performance of OSD. User space driver(DPDK from Intel) is one of them. It >> not only gives you the storage allocator, also gives you the thread >> scheduling support, CPU affinity , NUMA friendly, polling which might >> fundamentally change the performance of objectstore. It should not be hard >> to improve CPU utilization 3x~5x times, higher IOPS etc. 
>> I totally agree that the goal of filestore is to give enough support for >> a filesystem with either the 1, 1b, or 2 solutions. In my humble opinion, the new >> design goal of the objectstore should focus on giving the best performance for >> the OSD with new techniques. These two goals are not going to conflict with each >> other. They are just for different purposes, to make Ceph not only more >> stable but also better. >> >> Scylla, mentioned by Orit, is a good example. >> >> Thanks all. >> >> Regards, >> James >> >> -Original Message- >> From: ceph-devel-ow...@vger.kernel.org >> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil >> Sent: Thursday, October 22, 2015 5:50 AM >> To: Ric Wheeler >> Cc: Orit Wasserman; ceph-devel@vger.kernel.org >> Subject: Re: newstore direction >> >> On Wed, 21 Oct 2015, Ric Wheeler wrote: >>> You will have to trust me on this as the Red Hat person who spoke to >>> pretty much all of our key customers about local file systems and >>> storage - customers all have migrated over to using normal file systems >>> under Oracle/DB2. >>> Typically, they use XFS or ext4. I don't know of any non-standard >>> file systems and have only seen one account running on a raw block >>> store in 8 years >>> :) >>> >>> If you have a pre-allocated file and write using O_DIRECT, your IO >>> path is identical in terms of IO's sent to the device. >>> >>> If we are causing additional IO's, then we really need to spend some >>> time talking to the local file system gurus about this in detail. I >>> can help with that conversation. >> >> If the file is truly preallocated (that is, prewritten with zeros... >> fallocate doesn't help here because the extents are marked unwritten), then >> sure: there is very little change in the data path. >> >> But at that point, what is the point? 
This only works if you have one (or a >> few) huge files and the user space app already has all the complexity of a >> filesystem-like thing (with its own internal journal, allocators, garbage >> collection, etc.). Do they just do this to ease administrative tasks like >> backup? >> >> >> This is the fundamental tradeoff: >> >> 1) We have a file per object. We fsync like crazy and the fact that there >> are two independent layers journaling and managing different types of >> consistency penalizes us. >> >> 1b) We get clever and start using obscure and/or custom ioctls in the file >> system to work around what it is used to: we swap extents to avoid >> write-ahead (see Christoph's patch), O_NOMTIME, unprivileged open-by-handle, >> batch fsync, O_ATOMIC, setext ioctl, etc. >> >> 2) We preallocate huge files and write a user-space object system that lives >> within it (pretending the file is a block device). The file system rarely >> gets in the way (assuming the file is prewritten and we don't do anything >> stupid). But it doesn't give us anything a block device wouldn't, and it >> doesn't save us any complexity in our code. >> >> At the
Re: [PATCH] mark rbd requiring stable pages
On Thursday 22 October 2015, Ilya Dryomov wrote: > Well, checksum mismatches are to be expected given what we are doing > now, but I wouldn't expect any data corruptions. Ronny writes that he > saw frequent ext4 corruptions on krbd devices before he enabled stable > pages, which leads me to believe that the !crc case, for which we won't > be setting BDI_CAP_STABLE_WRITES, is going to be/remain broken. Ronny, > could you describe it in more detail and maybe share some of those osd > logs with bad crc messages? > This is from a 10 minute period from one of the OSDs. 23:11:02.423728 ce5dfb70 0 bad crc in data 1657725429 != exp 496797267 23:11:37.586411 ce5dfb70 0 bad crc in data 1216602498 != exp 111161 23:12:07.805675 cc3ffb70 0 bad crc in data 3140625666 != exp 2614069504 23:12:44.485713 c96ffb70 0 bad crc in data 1712148977 != exp 3239079328 23:13:24.746217 ce5dfb70 0 bad crc in data 144620426 != exp 3156694286 23:13:52.792367 ce5dfb70 0 bad crc in data 4033880920 != exp 4159672481 23:14:22.958999 c96ffb70 0 bad crc in data 847688321 != exp 1551499144 23:16:35.015629 ce5dfb70 0 bad crc in data 2790209714 != exp 3779604715 23:17:48.482049 c96ffb70 0 bad crc in data 1563466764 != exp 528198494 23:19:28.925357 cc3ffb70 0 bad crc in data 1764275395 != exp 2075504274 23:19:59.039843 cc3ffb70 0 bad crc in data 2960172683 != exp 1215950691 The filesystem corruptions are usually ones with messages like EXT4-fs error (device rbd4): ext4_mb_generate_buddy:757: group 155, block bitmap and bg descriptor inconsistent: 23625 vs 23660 free clusters These were pretty common, at least every other day, often multiple times a day. Sometimes there was an additional JBD2: Spotted dirty metadata buffer (dev = rbd4, blocknr = 0). There's a risk of filesystem corruption in case of system crash. Another type of filesystem corruption I experienced during kernel compilations led to the following messages. 
EXT4-fs warning (device rbd3): empty_dir:2488: bad directory (dir #282221) - no `.' or `..' EXT4-fs warning (device rbd3): empty_dir:2488: bad directory (dir #273062) - no `.' or `..' EXT4-fs warning (device rbd3): empty_dir:2488: bad directory (dir #272270) - no `.' or `..' EXT4-fs warning (device rbd3): empty_dir:2488: bad directory (dir #282254) - no `.' or `..' EXT4-fs warning (device rbd3): empty_dir:2488: bad directory (dir #273070) - no `.' or `..' EXT4-fs warning (device rbd3): empty_dir:2488: bad directory (dir #272308) - no `.' or `..' EXT4-fs error (device rbd3): ext4_lookup:1417: inode #270033: comm rm: deleted inode referenced: 270039 last message repeated 2 times EXT4-fs warning (device rbd3): empty_dir:2488: bad directory (dir #271534) - no `.' or `..' EXT4-fs warning (device rbd3): empty_dir:2488: bad directory (dir #271275) - no `.' or `..' EXT4-fs warning (device rbd3): empty_dir:2488: bad directory (dir #282290) - no `.' or `..' EXT4-fs warning (device rbd3): empty_dir:2488: bad directory (dir #281914) - no `.' or `..' 
EXT4-fs error (device rbd3): ext4_lookup:1417: inode #270033: comm rm: deleted inode referenced: 270039 last message repeated 2 times kernel: EXT4-fs error (device rbd3): ext4_lookup:1417: inode #273018: comm rm: deleted inode referenced: 282221 EXT4-fs error (device rbd3): ext4_lookup:1417: inode #273018: comm rm: deleted inode referenced: 282221 EXT4-fs error (device rbd3): ext4_lookup:1417: inode #273018: comm rm: deleted inode referenced: 281914 EXT4-fs error (device rbd3): ext4_lookup:1417: inode #273018: comm rm: deleted inode referenced: 281914 EXT4-fs error: 243 callbacks suppressed EXT4-fs error (device rbd3): ext4_lookup:1417: inode #282002: comm cp: deleted inode referenced: 45375 kernel: EXT4-fs error (device rbd3): ext4_lookup:1417: inode #282002: comm cp: deleted inode referenced: 45371 The result was that various files and directories in the kernel source dir couldn't be accessed anymore, and even fsck couldn't repair it, so I finally had to delete it. But these cases were pretty rare. Another issue was data corruption within the files themselves, which happened independently of the filesystem corruptions. This happened on most days, sometimes only once, sometimes multiple times a day. Newly written files that contained corrupted data seemed to always have it in only one place. The corrupt data replaced the original data in the file, but never changed the file size. The position of the corruption within the files was always different. The interesting part is that the corrupted regions always followed the same pattern: first a few hundred 0x0 bytes, then a few KB (10-30) of random binary data, finished again with a few hundred bytes of 0x0. In a few cases I could trace this data back to another file that was read at the same time by the same program. But that might be accidental, because other corruptions that happened in the same scenario couldn't be traced back this way. In other cases that
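The "bad crc" messages above are consistent with the unstable-pages mechanism discussed in this thread: the sender checksums a page when the IO is queued, the application rewrites the page before the bytes actually go out, and the receiver's checksum of what arrived no longer matches. A minimal illustration of that race (collapsed into sequential steps) using zlib's CRC32:

```python
# Sketch of why rewriting an in-flight page breaks a data digest.
# This just compresses the race into sequential steps for clarity.
import zlib

page = bytearray(b"A" * 4096)           # a dirty page queued for writeout

crc_at_submit = zlib.crc32(page)        # digest computed when IO is queued

# Without stable pages, nothing stops the application from rewriting
# the page while the network/block layer still references it...
page[100:110] = b"new bytes!"

crc_on_wire = zlib.crc32(page)          # what the receiver ends up computing

# ...producing exactly an "osd: bad crc in data X != exp Y" mismatch.
assert crc_on_wire != crc_at_submit
```

With BDI_CAP_STABLE_WRITES set, the kernel instead blocks the rewrite until writeback completes, so the digest and the transmitted bytes always agree.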
Re: [PATCH] mark rbd requiring stable pages
On Thursday 22 October 2015, you wrote: > It could be due to a recent change. Ronny, tell us about the workload > and I will check iscsi. I guess the best testcase is a kernel compilation in a make clean; make -j (> 1); loop. The data corruptions usually happen in the generated .cmd files, which breaks the build immediately and makes the corruption easy to spot. Besides that, I have seen data corruptions in other simple circumstances: copying data from a non-rbd to an rbd device, from one rbd device to another, scp-ing data from another machine to the rbd. Also, I have mounted the rbds on the same machines I'm running the OSDs on, which might be a contributing factor. Unfortunately there seems to be nothing that increases the likelihood of the corruption happening; I tried all kinds of things with no success. Another factor in the corruption might have been the amount of free memory. Before I added the flag for stable pages I regularly had warnings like the one below. Since the use of stable pages for rbd these warnings are gone too. kernel: swapper/1: page allocation failure: order:0, mode:0x20 kernel: 88012fc83b68 8143f171 kernel: 0020 88012fc83bf8 81127fda 88012fff9838 kernel: 880109bc7100 01ff88012fc83be8 8164aa40 0020 kernel: Call Trace: kernel:[] dump_stack+0x48/0x5f kernel: [] warn_alloc_failed+0xea/0x130 kernel: [] __alloc_pages_nodemask+0x69a/0x910 kernel: [] ? br_handle_frame_finish+0x500/0x500 [bridge] kernel: [] alloc_pages_current+0xa7/0x170 kernel: [] atl1c_alloc_rx_buffer+0x36c/0x430 [atl1c] kernel: [] atl1c_clean+0x212/0x3b0 [atl1c] kernel: [] net_rx_action+0x15f/0x320 kernel: [] __do_softirq+0x123/0x2e0 kernel: [] irq_exit+0x96/0xc0 kernel: [] do_IRQ+0x65/0x110 kernel: [] common_interrupt+0x72/0x72 kernel:[] ? retint_restore_args+0x13/0x13 kernel: [] ? mwait_idle+0x72/0xb0 kernel: [] ? 
mwait_idle+0x69/0xb0 kernel: [] arch_cpu_idle+0xf/0x20 kernel: [] cpu_startup_entry+0x22b/0x3e0 kernel: [] start_secondary+0x156/0x180 kernel: Mem-Info: kernel: Node 0 DMA per-cpu: kernel: CPU0: hi:0, btch: 1 usd: 0 kernel: CPU1: hi:0, btch: 1 usd: 0 kernel: CPU2: hi:0, btch: 1 usd: 0 kernel: CPU3: hi:0, btch: 1 usd: 0 kernel: Node 0 DMA32 per-cpu: kernel: CPU0: hi: 186, btch: 31 usd: 182 kernel: CPU1: hi: 186, btch: 31 usd: 179 kernel: CPU2: hi: 186, btch: 31 usd: 156 kernel: CPU3: hi: 186, btch: 31 usd: 170 kernel: Node 0 Normal per-cpu: kernel: CPU0: hi: 186, btch: 31 usd: 138 kernel: CPU1: hi: 186, btch: 31 usd: 130 kernel: CPU2: hi: 186, btch: 31 usd: 73 kernel: CPU3: hi: 186, btch: 31 usd: 122 kernel: active_anon:499711 inactive_anon:128139 isolated_anon:0 kernel: active_file:132181 inactive_file:145093 isolated_file:22 kernel: unevictable:4083 dirty:1526 writeback:15597 unstable:0 kernel: free:5225 slab_reclaimable:23735 slab_unreclaimable:29775 kernel: mapped:11742 shmem:18846 pagetables:3946 bounce:0 kernel: free_cma:0 kernel: Node 0 DMA free:15284kB min:32kB low:40kB high:48kB active_anon:0kB inactive_anon:96kB active_file:232kB inactive_file:80kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:0kB dirty:0kB writeback:0kB mapped:12kB shmem:0kB slab_reclaimable:52kB slab_unreclaimable:80kB kernel_stack:16kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:88 all_unreclaimable? 
no kernel: lowmem_reserve[]: 0 3107 3818 3818 kernel: Node 0 DMA32 free:5064kB min:6420kB low:8024kB high:9628kB active_anon:1718524kB inactive_anon:365504kB active_file:418964kB inactive_file:469748kB unevictable:0kB isolated(anon):0kB isolated(file):88kB present:3257216kB managed:3183616kB mlocked:0kB dirty:5900kB writeback:48264kB mapped:39204kB shmem:54364kB slab_reclaimable:76256kB slab_unreclaimable:93456kB kernel_stack:6240kB pagetables:12280kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no kernel: lowmem_reserve[]: 0 0 710 710 kernel: Node 0 Normal free:552kB min:1468kB low:1832kB high:2200kB active_anon:280320kB inactive_anon:146956kB active_file:109528kB inactive_file:110544kB unevictable:16332kB isolated(anon):0kB isolated(file):0kB present:786432kB managed:728012kB mlocked:0kB dirty:204kB writeback:14124kB mapped:7752kB shmem:21020kB slab_reclaimable:18632kB slab_unreclaimable:25564kB kernel_stack:2432kB pagetables:3504kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:608 all_unreclaimable? no kernel: lowmem_reserve[]: 0 0 0 0 kernel: Node 0 DMA: 4*4kB (UE) 4*8kB (UEM) 2*16kB (UE) 5*32kB (UEM) 3*64kB (UM) 2*128kB (UE) 1*256kB (E) 2*512kB (EM) 3*1024kB (UEM) 3*2048kB (UEM) 1*4096kB (R) = 15280kB kernel: Node 0 DMA32: 0*4kB 1*8kB (R) 0*16kB 0*32kB 1*64kB (R) 1*128kB (R) 1*256kB (R) 3*512kB (R) 1*1024kB (R) 1*2048kB (R) 0*4096kB = 5064kB kernel:
Re: tracker.ceph.com downtime today
Fixed a configuration problem preventing updating issues, and switched the mailer to use ipv4; if you updated and failed, or missed an email notification, that may have been why. On 10/22/2015 04:51 PM, Dan Mick wrote: > It's back. New DNS info is propagating its way around. If you > absolutely must get to it, newtracker.ceph.com is the new address, but > please don't bookmark that, as it will be going away after the transition. > > Please let me know of any problems you have. -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: newstore direction
How would this kind of split affect small transactions? Will each split be separately transactionally consistent or is there some kind of meta-transaction that synchronizes each of the splits? Allen Samuels Software Architect, Fellow, Systems and Software Solutions 2880 Junction Avenue, San Jose, CA 95134 T: +1 408 801 7030 | M: +1 408 780 6416 allen.samu...@sandisk.com -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Samuel Just Sent: Friday, October 23, 2015 8:42 AM To: James (Fei) Liu-SSI Cc: Sage Weil; Ric Wheeler; Orit Wasserman; ceph-devel@vger.kernel.org Subject: Re: newstore direction Since the changes which moved the pg log and the pg info into the pg object space, I think it's now the case that any transaction submitted to the objectstore updates a disjoint range of objects determined by the sequencer. It might be easier to exploit that parallelism if we control allocation and allocation related metadata. We could split the store into N pieces which partition the pg space (one additional one for the meta sequencer?) with one rocksdb instance for each. Space could then be parcelled out in large pieces (small frequency of global allocation decisions) and managed more finely within each partition. The main challenge would be avoiding internal fragmentation of those, but at least defragmentation can be managed on a per-partition basis. Such parallelism is probably necessary to exploit the full throughput of some ssds. -Sam On Thu, Oct 22, 2015 at 10:42 AM, James (Fei) Liu-SSI wrote: > Hi Sage and other fellow cephers, > I truly share the pains with you all about filesystem while I am working > on objectstore to improve the performance. As mentioned , there is nothing > wrong with filesystem. Just the Ceph as one of use case need more supports > but not provided in near future by filesystem no matter what reasons. 
> >There are so many techniques pop out which can help to improve > performance of OSD. User space driver(DPDK from Intel) is one of them. It > not only gives you the storage allocator, also gives you the thread > scheduling support, CPU affinity , NUMA friendly, polling which might > fundamentally change the performance of objectstore. It should not be hard > to improve CPU utilization 3x~5x times, higher IOPS etc. > I totally agreed that goal of filestore is to gives enough support for > filesystem with either 1 ,1b, or 2 solutions. In my humble opinion , The new > design goal of objectstore should focus on giving the best performance for > OSD with new techniques. These two goals are not going to conflict with each > other. They are just for different purposes to make Ceph not only more > stable but also better. > > Scylla mentioned by Orit is a good example . > > Thanks all. > > Regards, > James > > -Original Message- > From: ceph-devel-ow...@vger.kernel.org > [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil > Sent: Thursday, October 22, 2015 5:50 AM > To: Ric Wheeler > Cc: Orit Wasserman; ceph-devel@vger.kernel.org > Subject: Re: newstore direction > > On Wed, 21 Oct 2015, Ric Wheeler wrote: >> You will have to trust me on this as the Red Hat person who spoke to >> pretty much all of our key customers about local file systems and >> storage - customers all have migrated over to using normal file systems >> under Oracle/DB2. >> Typically, they use XFS or ext4. I don't know of any non-standard >> file systems and only have seen one account running on a raw block >> store in 8 years >> :) >> >> If you have a pre-allocated file and write using O_DIRECT, your IO >> path is identical in terms of IO's sent to the device. >> >> If we are causing additional IO's, then we really need to spend some >> time talking to the local file system gurus about this in detail. I >> can help with that conversation. 
> > If the file is truly preallocated (that is, prewritten with zeros... > fallocate doesn't help here because the extents is marked unwritten), > then > sure: there is very little change in the data path. > > But at that point, what is the point? This only works if you have one (or a > few) huge files and the user space app already has all the complexity of a > filesystem-like thing (with its own internal journal, allocators, garbage > collection, etc.). Do they just do this to ease administrative tasks like > backup? > > > This is the fundamental tradeoff: > > 1) We have a file per object. We fsync like crazy and the fact that there > are two independent layers journaling and managing different types of > consistency penalizes us. > > 1b) We get clever and start using obscure and/or custom ioctls in the file > system to work around what it is used to: we swap extents to avoid > write-ahead (see
Re: tracker.ceph.com downtime today
tracker.ceph.com down now On 10/22/2015 03:20 PM, Dan Mick wrote: > tracker.ceph.com will be brought down today for upgrade and move to a > new host. I plan to do this at about 4PM PST (40 minutes from now). > Expect a downtime of about 15-20 minutes. More notification to follow. > -- Dan Mick Red Hat, Inc. Ceph docs: http://ceph.com/docs
Re: tracker.ceph.com downtime today
I tried to open a new issue and got this error: Internal error An error occurred on the page you were trying to access. If you continue to experience problems please contact your Redmine administrator for assistance. If you are the Redmine administrator, check your log files for details about the error. On Thu, Oct 22, 2015 at 6:15 PM, Dan Mick wrote: > Fixed a configuration problem preventing updating issues, and switched > the mailer to use ipv4; if you updated and failed, or missed an email > notification, that may have been why. > > On 10/22/2015 04:51 PM, Dan Mick wrote: >> It's back. New DNS info is propagating its way around. If you >> absolutely must get to it, newtracker.ceph.com is the new address, but >> please don't bookmark that, as it will be going away after the transition. >> >> Please let me know of any problems you have. > > --- > Note: This list is intended for discussions relating to Red Hat Storage > products, customers and/or support. Discussions on GlusterFS and Ceph > architecture, design and engineering should go to relevant upstream mailing > lists. -- Kyle Bader - Red Hat Senior Solution Architect Ceph Storage Architectures
when an osd is started up, IO will be blocked
Hi all, When an OSD is started, the related IO will be blocked. According to the test results, the higher the IOPS the clients send, the longer this takes. Adjusting all the parameters associated with recovery operations was also found to be useless. How can we reduce the impact of this process on the IO? Thanks and Regards, WangSongbo
Re: tracker.ceph.com downtime today
Found that issue; reverted the database to the non-backlog-plugin state, created a test bug. Retry? On 10/22/2015 06:54 PM, Dan Mick wrote: > I see that too. I suspect this is because of leftover database columns > from the backlogs plugin, which is removed. Looking into it. > > On 10/22/2015 06:43 PM, Kyle Bader wrote: >> I tried to open a new issue and got this error: >> >> Internal error >> >> An error occurred on the page you were trying to access. >> If you continue to experience problems please contact your Redmine >> administrator for assistance. >> >> If you are the Redmine administrator, check your log files for details >> about the error. >> >> >> On Thu, Oct 22, 2015 at 6:15 PM, Dan Mick wrote: >>> Fixed a configuration problem preventing updating issues, and switched >>> the mailer to use ipv4; if you updated and failed, or missed an email >>> notification, that may have been why. >>> >>> On 10/22/2015 04:51 PM, Dan Mick wrote: It's back. New DNS info is propagating its way around. If you absolutely must get to it, newtracker.ceph.com is the new address, but please don't bookmark that, as it will be going away after the transition. Please let me know of any problems you have. >>> >>> --- >>> Note: This list is intended for discussions relating to Red Hat Storage >>> products, customers and/or support. Discussions on GlusterFS and Ceph >>> architecture, design and engineering should go to relevant upstream mailing >>> lists.
Re: tracker.ceph.com downtime today
I see that too. I suspect this is because of leftover database columns from the backlogs plugin, which is removed. Looking into it. On 10/22/2015 06:43 PM, Kyle Bader wrote: > I tried to open a new issue and got this error: > > Internal error > > An error occurred on the page you were trying to access. > If you continue to experience problems please contact your Redmine > administrator for assistance. > > If you are the Redmine administrator, check your log files for details > about the error. > > > On Thu, Oct 22, 2015 at 6:15 PM, Dan Mick wrote: >> Fixed a configuration problem preventing updating issues, and switched >> the mailer to use ipv4; if you updated and failed, or missed an email >> notification, that may have been why. >> >> On 10/22/2015 04:51 PM, Dan Mick wrote: >>> It's back. New DNS info is propagating its way around. If you >>> absolutely must get to it, newtracker.ceph.com is the new address, but >>> please don't bookmark that, as it will be going away after the transition. >>> >>> Please let me know of any problems you have. >> >> --- >> Note: This list is intended for discussions relating to Red Hat Storage >> products, customers and/or support. Discussions on GlusterFS and Ceph >> architecture, design and engineering should go to relevant upstream mailing >> lists.
Re: newstore direction
I disagree with your point still - your argument was that customers don't like to update their code so we cannot rely on them moving to better file system code. Those same customers would be *just* as reluctant to upgrade OSD code. Been there, done that in pure block storage, pure object storage and in file system code (customers just don't care about the protocol, the conservative nature is consistent). Not a casual observation, I have been building storage systems since the mid-80's. Regards, Ric On 10/21/2015 09:22 PM, Allen Samuels wrote: I agree. My only point was that you still have to factor this time into the argument that by continuing to put NewStore on top of a file system you'll get to a stable system much sooner than the longer development path of doing your own raw storage allocator. IMO, once you factor that into the equation the "on top of an FS" path doesn't look like such a clear winner. Allen Samuels Software Architect, Fellow, Systems and Software Solutions 2880 Junction Avenue, San Jose, CA 95134 T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com -Original Message- From: Ric Wheeler [mailto:rwhee...@redhat.com] Sent: Thursday, October 22, 2015 10:17 AM To: Allen Samuels; Sage Weil ; ceph-devel@vger.kernel.org Subject: Re: newstore direction On 10/21/2015 08:53 PM, Allen Samuels wrote: Fixing the bug doesn't take a long time. Getting it deployed is where the delay is. Many companies standardize on a particular release of a particular distro. Getting them to switch to a new release -- even a "bug fix" point release -- is a major undertaking that often is a complete roadblock. Just my experience. YMMV. Customers do control the pace that they upgrade their machines, but we put out fixes on a very regular pace. A lot of customers will get fixes without having to qualify a full new release (i.e., fixes come out between major and minor releases are easy). 
If someone is deploying a critical server for storage, then it falls back on the storage software team to help guide them and encourage them to update when needed (and no promises of success, but people move if the win is big. If it is not, they can wait). ric
Re: newstore direction
On 10/22/2015 08:50 AM, Sage Weil wrote: On Wed, 21 Oct 2015, Ric Wheeler wrote: You will have to trust me on this as the Red Hat person who spoke to pretty much all of our key customers about local file systems and storage - customers all have migrated over to using normal file systems under Oracle/DB2. Typically, they use XFS or ext4. I don't know of any non-standard file systems and only have seen one account running on a raw block store in 8 years :) If you have a pre-allocated file and write using O_DIRECT, your IO path is identical in terms of IO's sent to the device. If we are causing additional IO's, then we really need to spend some time talking to the local file system gurus about this in detail. I can help with that conversation. If the file is truly preallocated (that is, prewritten with zeros... fallocate doesn't help here because the extents are marked unwritten), then sure: there is very little change in the data path. But at that point, what is the point? This only works if you have one (or a few) huge files and the user space app already has all the complexity of a filesystem-like thing (with its own internal journal, allocators, garbage collection, etc.). Do they just do this to ease administrative tasks like backup? I think that the key here is that if we fsync() like crazy - regardless of writing to a file system or to some new, yet to be defined block device primitive store - we are limited to the IOP's of that particular block device. Ignoring exotic hardware configs for people who can ignore all SSD devices, we will have rotating, high capacity, slow spinning drives for *a long time* as the eventual tier. Given that assumption, we need to do better than to be limited to synchronous IOP's for a slow drive. When we have commodity pricing for things like persistent DRAM, then I agree that writing directly to that medium makes sense (but you can do that with DAX by effectively mapping that into the process address space). 
Specifically, moving from a file system with some inefficiencies will only boost performance from say 20-30 IOP's to roughly 40-50 IOP's. The way this has been handled traditionally for things like databases, etc is: * batch up the transactions that need to be destaged * issue an O_DIRECT async IO for all of the elements that need to be written (bypassing the page cache, going direct to the backing store) * wait for completion We should probably add to that sequence an fsync() of the directory (or a file in the file system) to ensure that any volatile write cache is flushed, but there is *no* reason to fsync() each file. I think that we need to look at why the write pattern is so heavily synchronous and single threaded if we are hoping to extract from any given storage tier its maximum performance. Doing this can raise your file creations per second (or allocations per second) from a few dozen to a few hundred or more per second. The complexity that writing a new block-level allocation strategy would take on (and which the file system currently saves you from) is: * if you lay out a lot of small objects on the block store that can grow, we will quickly end up doing very complicated techniques that file systems solved a long time ago (pre-allocation, etc) * multi-stream aware allocation if you have multiple processes writing to the same store * tracking things like allocated but unwritten (can happen if some process "pokes" a hole in an object, common with things like virtual machine images) Once we end up handling all of that in new, untested code, I think that we end up with a lot of pain and only minimal gain in terms of performance. ric
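The three-step destage pattern above can be sketched in a few lines. This is a hedged illustration, not Ceph code (`destage` and its arguments are invented for the sketch); the mail suggests O_DIRECT async IO, but plain pwrite() is used here so the sketch runs on any filesystem (O_DIRECT additionally requires aligned buffers and sizes, and is not supported everywhere):

```python
# Sketch of the batched destage pattern: write a whole batch of dirty
# extents, then pay for ONE fsync covering the batch, instead of one
# fsync per object.
import os
import tempfile

def destage(path, extents):
    """extents: iterable of (offset, data) pairs making up one batch."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        for offset, data in extents:
            os.pwrite(fd, data, offset)   # no per-write fsync here
        os.fsync(fd)                      # a single flush covers the batch
    finally:
        os.close(fd)

path = os.path.join(tempfile.mkdtemp(), "objects")
destage(path, [(0, b"obj-a"), (4096, b"obj-b")])
```

The point of the pattern is the ratio: N writes amortized over one cache flush, rather than N synchronous flushes against a spinning drive.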
1b) We get clever and start using obscure and/or custom ioctls in the file system to work around what it is used to: we swap extents to avoid write-ahead (see Christoph's patch), O_NOMTIME, unprivileged open-by-handle, batch fsync, O_ATOMIC, setext ioctl, etc. 2) We preallocate huge files and write a user-space object system that lives within it (pretending the file is a block device). The file system rarely gets in the way (assuming the file is prewritten and we don't do anything stupid). But it doesn't give us anything a block device wouldn't, and it doesn't save us any complexity in our code. At the end of the day, 1 and 1b are always going to be slower than 2. And although 1b performs a bit better than 1, it has similar (user-space) complexity to 2. On the other hand, if you step back and view the entire stack (ceph-osd + XFS), 1 and 1b are *significantly* more complex than 2... and yet still slower. Given we ultimately have to support both (both as an upstream and as a distro), that's not very attractive. Also note that every time we have strayed off the reservation from the beaten path (1) to anything mildly exotic
Re: newstore direction
Milosz Tanski adfin.com> writes: > > On Tue, Oct 20, 2015 at 4:00 PM, Sage Weil redhat.com> wrote: > > On Tue, 20 Oct 2015, John Spray wrote: > >> On Mon, Oct 19, 2015 at 8:49 PM, Sage Weil redhat.com> wrote: > >> > - We have to size the kv backend storage (probably still an XFS > >> > partition) vs the block storage. Maybe we do this anyway (put metadata on > >> > SSD!) so it won't matter. But what happens when we are storing gobs of > >> > rgw index data or cephfs metadata? Suddenly we are pulling storage out of > >> > a different pool and those aren't currently fungible. > >> > >> This is the concerning bit for me -- the other parts one "just" has to > >> get the code right, but this problem could linger and be something we > >> have to keep explaining to users indefinitely. It reminds me of cases > >> in other systems where users had to make an educated guess about inode > >> size up front, depending on whether you're expecting to efficiently > >> store a lot of xattrs. > >> > >> In practice it's rare for users to make these kinds of decisions well > >> up-front: it really needs to be adjustable later, ideally > >> automatically. That could be pretty straightforward if the KV part > >> was stored directly on block storage, instead of having XFS in the > >> mix. I'm not quite up with the state of the art in this area: are > >> there any reasonable alternatives for the KV part that would consume > >> some defined range of a block device from userspace, instead of > >> sitting on top of a filesystem? > > > > I agree: this is my primary concern with the raw block approach. > > > > There are some KV alternatives that could consume block, but the problem > > would be similar: we need to dynamically size up or down the kv portion of > > the device. > > > > I see two basic options: > > > > 1) Wire into the Env abstraction in rocksdb to provide something just > > smart enough to let rocksdb work. 
It isn't much: named files (not that > > many--we could easily keep the file table in ram), always written > > sequentially, to be read later with random access. All of the code is > > written around abstractions of SequentialFileWriter so that everything > > posix is neatly hidden in env_posix (and there are various other env > > implementations for in-memory mock tests etc.). > > > > 2) Use something like dm-thin to sit between the raw block device and XFS > > (for rocksdb) and the block device consumed by newstore. As long as XFS > > doesn't fragment horrifically (it shouldn't, given we *always* write ~4mb > > files in their entirety) we can fstrim and size down the fs portion. If > > we similarly make newstore's allocator stick to large blocks only we would > > be able to size down the block portion as well. Typical dm-thin block > > sizes seem to range from 64KB to 512KB, which seems reasonable enough to > > me. In fact, we could likely just size the fs volume at something > > conservatively large (like 90%) and rely on -o discard or periodic fstrim > > to keep its actual utilization in check. > > > > I think you could prototype a raw block device OSD store using LMDB as > a starting point. I know there have been some experiments using LMDB as > KV store before with positive read numbers and not great write > numbers. > > 1. It mmaps, just mmap the raw disk device / partition. I've done this > as an experiment before, I can dig up a patch for LMDB. > 2. It already has a free space management strategy. It's probably not > right for the OSDs in the long term, but there's something to start > with there. > 3. It already supports transactions / COW. > 4. LMDB isn't a huge code base so it might be a good place to start / > evolve code from. > 5. You're not starting a multi-year effort at the 0 point. > > As to the not great write performance, that could be addressed by > write transaction merging (what MySQL implemented a few years ago).
We have a heavily hacked version of LMDB contributed by VMware that implements a WAL. In my preliminary testing it performs synchronous writes 30x faster (on average) than current LMDB. Their version unfortunately slashed'n'burned a lot of LMDB features that other folks actually need, so we can't use it as-is. Currently working on rationalizing the approach and merging it into mdb.master. The reasons for the WAL approach: 1) obviously sequential writes are cheaper than random writes. 2) fsync() of a small log file will always be faster than fsync() of a large DB. I.e., fsync() latency is proportional to the total number of pages in the file, not just the number of dirty pages. LMDB on a raw block device is a simpler proposition, and one we intend to integrate soon as well. (Milosz, did you ever submit your changes?) > Here you have an opportunity to do it two ways. One, you can do it in > the application layer while waiting for the fsync from transaction to > complete. This is probably the easier route. Two, you can do it in the > DB layer (the LMDB
Re: MDS stuck in a crash loop
On Wed, Oct 21, 2015 at 5:33 PM, John Spray wrote: > On Wed, Oct 21, 2015 at 10:33 PM, John Spray wrote: >>> John, I know you've got >>> https://github.com/ceph/ceph-qa-suite/pull/647. I think that's >>> supposed to be for this, but I'm not sure if you spotted any issues >>> with it or if we need to do some more diagnosing? >> >> That test path is just verifying that we do handle dirs without dying >> in at least one case -- it passes with the existing ceph code, so it's >> not reproducing this issue. > > Clicked send too soon, I was about to add... > > Milosz mentioned that they don't have the data from the system in the > broken state, so I don't have any bright ideas about learning more > about what went wrong here unfortunately. > Sorry about that, wasn't thinking at the time and just wanted to get this up and going as quickly as possible :( If this happens next time I'll be more careful to keep more evidence. I think multi-fs in the same rados namespace support would actually have helped here, since it makes it easier to create a newfs and leave the other one around (for investigation) But it makes me wonder whether the broken dir scenario could be replicated by hand using rados calls. There's a pretty generic ticket there for "don't die on dir errors", but I imagine the code can be audited and steps to cause a synthetic error can be produced. -- Milosz Tanski CTO 16 East 34th Street, 15th floor New York, NY 10016 p: 646-253-9055 e: mil...@adfin.com -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: newstore direction
On Wed, 21 Oct 2015, Ric Wheeler wrote: > You will have to trust me on this as the Red Hat person who spoke to pretty > much all of our key customers about local file systems and storage - customers > all have migrated over to using normal file systems under Oracle/DB2. > Typically, they use XFS or ext4. I don't know of any non-standard file > systems and only have seen one account running on a raw block store in 8 years > :) > > If you have a pre-allocated file and write using O_DIRECT, your IO path is > identical in terms of IO's sent to the device. > > If we are causing additional IO's, then we really need to spend some time > talking to the local file system gurus about this in detail. I can help with > that conversation. If the file is truly preallocated (that is, prewritten with zeros... fallocate doesn't help here because the extents are marked unwritten), then sure: there is very little change in the data path. But at that point, what is the point? This only works if you have one (or a few) huge files and the user space app already has all the complexity of a filesystem-like thing (with its own internal journal, allocators, garbage collection, etc.). Do they just do this to ease administrative tasks like backup? This is the fundamental tradeoff: 1) We have a file per object. We fsync like crazy and the fact that there are two independent layers journaling and managing different types of consistency penalizes us. 1b) We get clever and start using obscure and/or custom ioctls in the file system to work around what it is used to: we swap extents to avoid write-ahead (see Christoph's patch), O_NOMTIME, unprivileged open-by-handle, batch fsync, O_ATOMIC, setext ioctl, etc. 2) We preallocate huge files and write a user-space object system that lives within it (pretending the file is a block device). The file system rarely gets in the way (assuming the file is prewritten and we don't do anything stupid).
But it doesn't give us anything a block device wouldn't, and it doesn't save us any complexity in our code. At the end of the day, 1 and 1b are always going to be slower than 2. And although 1b performs a bit better than 1, it has similar (user-space) complexity to 2. On the other hand, if you step back and view the entire stack (ceph-osd + XFS), 1 and 1b are *significantly* more complex than 2... and yet still slower. Given we ultimately have to support both (both as an upstream and as a distro), that's not very attractive. Also note that every time we have strayed from the beaten path (1) to anything mildly exotic (1b) we have been bitten by obscure file system bugs. And that assumes we get everything we need upstream... which is probably a year's endeavour. Don't get me wrong: I'm all for making changes to file systems to better support systems like Ceph. Things like O_NOCMTIME and O_ATOMIC make a huge amount of sense for a ton of different systems. But our situation is a bit different: we always own the entire device (and often the server), so there is no need to share with other users or apps (and when you do, you just use the existing FileStore backend). And as you know performance is a huge pain point. We are already handicapped by virtue of being distributed and strongly consistent; we can't afford to give away more to a storage layer that isn't providing us much (or the right) value. And I'm tired of half measures. I want the OSD to be as fast as we can make it given the architectural constraints (RADOS consistency and ordering semantics). This is truly low-hanging fruit: it's modular, self-contained, pluggable, and this will be my third time around this particular block. sage
Re: MDS stuck in a crash loop
On Thu, 22 Oct 2015, John Spray wrote: > On Thu, Oct 22, 2015 at 1:43 PM, Milosz Tanski wrote: > > On Wed, Oct 21, 2015 at 5:33 PM, John Spray wrote: > >> On Wed, Oct 21, 2015 at 10:33 PM, John Spray wrote: > John, I know you've got > https://github.com/ceph/ceph-qa-suite/pull/647. I think that's > supposed to be for this, but I'm not sure if you spotted any issues > with it or if we need to do some more diagnosing? > >>> > >>> That test path is just verifying that we do handle dirs without dying > >>> in at least one case -- it passes with the existing ceph code, so it's > >>> not reproducing this issue. > >> > >> Clicked send too soon, I was about to add... > >> > >> Milosz mentioned that they don't have the data from the system in the > >> broken state, so I don't have any bright ideas about learning more > >> about what went wrong here unfortunately. > >> > > > > Sorry about that, wasn't thinking at the time and just wanted to get > > this up and going as quickly as possible :( > > > > If this happens next time I'll be more careful to keep more evidence. > > I think multi-fs in the same rados namespace support would actually have > > helped here, since it makes it easier to create a newfs and leave the > > other one around (for investigation) > > Yep, good point. I am a known enthusiast for multi-filesystem support :-) A rados pool export on the metadata pool would have helped, too. That doesn't include data object backtrace metadata, though. I wonder if we should make a cephfs metadata imager tool to capture the metadata state of the file system (similar to the tools that are available for xfs) that captures both. On the data pool side it'd just record the object names, xattrs, and object size, ignoring the data. It wouldn't anonymize filenames (that is tricky without breaking the mds dir hashing), but it excludes data and would probably be sufficient for most users...
sage
Re: newstore direction
On Tue, Oct 20, 2015 at 4:00 PM, Sage Weil wrote: > On Tue, 20 Oct 2015, John Spray wrote: >> On Mon, Oct 19, 2015 at 8:49 PM, Sage Weil wrote: >> > - We have to size the kv backend storage (probably still an XFS >> > partition) vs the block storage. Maybe we do this anyway (put metadata on >> > SSD!) so it won't matter. But what happens when we are storing gobs of >> > rgw index data or cephfs metadata? Suddenly we are pulling storage out of >> > a different pool and those aren't currently fungible. >> >> This is the concerning bit for me -- the other parts one "just" has to >> get the code right, but this problem could linger and be something we >> have to keep explaining to users indefinitely. It reminds me of cases >> in other systems where users had to make an educated guess about inode >> size up front, depending on whether you're expecting to efficiently >> store a lot of xattrs. >> >> In practice it's rare for users to make these kinds of decisions well >> up-front: it really needs to be adjustable later, ideally >> automatically. That could be pretty straightforward if the KV part >> was stored directly on block storage, instead of having XFS in the >> mix. I'm not quite up with the state of the art in this area: are >> there any reasonable alternatives for the KV part that would consume >> some defined range of a block device from userspace, instead of >> sitting on top of a filesystem? > > I agree: this is my primary concern with the raw block approach. > > There are some KV alternatives that could consume block, but the problem > would be similar: we need to dynamically size up or down the kv portion of > the device. > > I see two basic options: > > 1) Wire into the Env abstraction in rocksdb to provide something just > smart enough to let rocksdb work. It isn't much: named files (not that > many--we could easily keep the file table in ram), always written > sequentially, to be read later with random access.
All of the code is > written around abstractions of SequentialFileWriter so that everything > posix is neatly hidden in env_posix (and there are various other env > implementations for in-memory mock tests etc.). > > 2) Use something like dm-thin to sit between the raw block device and XFS > (for rocksdb) and the block device consumed by newstore. As long as XFS > doesn't fragment horrifically (it shouldn't, given we *always* write ~4mb > files in their entirety) we can fstrim and size down the fs portion. If > we similarly make newstore's allocator stick to large blocks only we would > be able to size down the block portion as well. Typical dm-thin block > sizes seem to range from 64KB to 512KB, which seems reasonable enough to > me. In fact, we could likely just size the fs volume at something > conservatively large (like 90%) and rely on -o discard or periodic fstrim > to keep its actual utilization in check. > I think you could prototype a raw block device OSD store using LMDB as a starting point. I know there have been some experiments using LMDB as KV store before with positive read numbers and not great write numbers. 1. It mmaps, just mmap the raw disk device / partition. I've done this as an experiment before, I can dig up a patch for LMDB. 2. It already has a free space management strategy. It's probably not right for the OSDs in the long term, but there's something to start with there. 3. It already supports transactions / COW. 4. LMDB isn't a huge code base so it might be a good place to start / evolve code from. 5. You're not starting a multi-year effort at the 0 point. As to the not great write performance, that could be addressed by write transaction merging (what MySQL implemented a few years ago). Here you have an opportunity to do it two ways. One, you can do it in the application layer while waiting for the fsync from transaction to complete. This is probably the easier route.
Two, you can do it in the DB layer (the LMDB transaction handling / locking) where you've already started processing the following transactions using the currently committing transaction (COW) as a starting point. This is harder mostly because of the synchronization needed or involved. I've actually spent some time thinking about doing LMDB write transaction merging outside the OSD context. This was for another project. My 2 cents. -- Milosz Tanski CTO 16 East 34th Street, 15th floor New York, NY 10016 p: 646-253-9055 e: mil...@adfin.com
Re: MDS stuck in a crash loop
On Thu, Oct 22, 2015 at 1:43 PM, Milosz Tanski wrote: > On Wed, Oct 21, 2015 at 5:33 PM, John Spray wrote: >> On Wed, Oct 21, 2015 at 10:33 PM, John Spray wrote: John, I know you've got https://github.com/ceph/ceph-qa-suite/pull/647. I think that's supposed to be for this, but I'm not sure if you spotted any issues with it or if we need to do some more diagnosing? >>> >>> That test path is just verifying that we do handle dirs without dying >>> in at least one case -- it passes with the existing ceph code, so it's >>> not reproducing this issue. >> >> Clicked send too soon, I was about to add... >> >> Milosz mentioned that they don't have the data from the system in the >> broken state, so I don't have any bright ideas about learning more >> about what went wrong here unfortunately. >> > > Sorry about that, wasn't thinking at the time and just wanted to get > this up and going as quickly as possible :( > > If this happens next time I'll be more careful to keep more evidence. > I think multi-fs in the same rados namespace support would actually have > helped here, since it makes it easier to create a newfs and leave the > other one around (for investigation) Yep, good point. I am a known enthusiast for multi-filesystem support :-) > But it makes me wonder whether the broken dir scenario could be > replicated by hand using rados calls. There's a pretty generic ticket > there for "don't die on dir errors", but I imagine the code can be > audited and steps to cause a synthetic error can be produced. Yes, that part I have done (and will build into the automated tests in due course) -- the bit that is still a mystery is how the damage occurred to begin with. John > > -- > Milosz Tanski > CTO > 16 East 34th Street, 15th floor > New York, NY 10016 > > p: 646-253-9055 > e: mil...@adfin.com
[PATCH 0/3] Minor cleanup for locks API
NFS has recently been moving things around to cope with the situation where a struct file may not be available during an unlock. That work has presented an opportunity to do a minor cleanup on the locks API. Users of posix_lock_file_wait() (for FL_POSIX style locks) and flock_lock_file_wait() (for FL_FLOCK style locks) can instead call locks_lock_file_wait() for both lock types. Because the passed-in file_lock specifies its own type, the correct function can be selected on behalf of the user. This work allows further cleanup within NFS and lockd which will be submitted separately.

Benjamin Coddington (3):
  locks: introduce locks_lock_inode_wait()
  Move locks API users to locks_lock_inode_wait()
  locks: cleanup posix_lock_inode_wait and flock_lock_inode_wait

 drivers/staging/lustre/lustre/llite/file.c |    8 +-
 fs/9p/vfs_file.c                           |    4 +-
 fs/ceph/locks.c                            |    4 +-
 fs/cifs/file.c                             |    2 +-
 fs/dlm/plock.c                             |    4 +-
 fs/fuse/file.c                             |    2 +-
 fs/gfs2/file.c                             |    8 +++---
 fs/lockd/clntproc.c                        |   13 +--
 fs/locks.c                                 |   31 +++
 fs/nfs/file.c                              |   13 +--
 fs/nfs/nfs4proc.c                          |   13 +--
 fs/ocfs2/locks.c                           |    8 +++---
 include/linux/fs.h                         |   21 +++---
 13 files changed, 51 insertions(+), 80 deletions(-)
[PATCH 3/3] locks: cleanup posix_lock_inode_wait and flock_lock_inode_wait
All callers use locks_lock_inode_wait() instead.

Signed-off-by: Benjamin Coddington
---
 fs/locks.c         |    5 +
 include/linux/fs.h |   24
 2 files changed, 1 insertions(+), 28 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index 94d50d3..b6f3c92 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -1167,8 +1167,7 @@ EXPORT_SYMBOL(posix_lock_file);
  * @inode: inode of file to which lock request should be applied
  * @fl: The lock to be applied
  *
- * Variant of posix_lock_file_wait that does not take a filp, and so can be
- * used after the filp has already been torn down.
+ * Apply a POSIX style lock request to an inode.
  */
 int posix_lock_inode_wait(struct inode *inode, struct file_lock *fl)
 {
@@ -1187,7 +1186,6 @@ int posix_lock_inode_wait(struct inode *inode, struct file_lock *fl)
 	}
 	return error;
 }
-EXPORT_SYMBOL(posix_lock_inode_wait);
 
 /**
  * locks_mandatory_locked - Check for an active lock
@@ -1873,7 +1871,6 @@ int flock_lock_inode_wait(struct inode *inode, struct file_lock *fl)
 	}
 	return error;
 }
-EXPORT_SYMBOL(flock_lock_inode_wait);
 
 /**
  * locks_lock_inode_wait - Apply a lock to an inode
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2e283b7..05b07c9 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1053,12 +1053,10 @@ extern void locks_remove_file(struct file *);
 extern void locks_release_private(struct file_lock *);
 extern void posix_test_lock(struct file *, struct file_lock *);
 extern int posix_lock_file(struct file *, struct file_lock *, struct file_lock *);
-extern int posix_lock_inode_wait(struct inode *, struct file_lock *);
 extern int posix_unblock_lock(struct file_lock *);
 extern int vfs_test_lock(struct file *, struct file_lock *);
 extern int vfs_lock_file(struct file *, unsigned int, struct file_lock *, struct file_lock *);
 extern int vfs_cancel_lock(struct file *filp, struct file_lock *fl);
-extern int flock_lock_inode_wait(struct inode *inode, struct file_lock *fl);
 extern int locks_lock_inode_wait(struct inode *inode, struct file_lock *fl);
 extern int __break_lease(struct inode *inode, unsigned int flags, unsigned int type);
 extern void lease_get_mtime(struct inode *, struct timespec *time);
@@ -1145,12 +1143,6 @@ static inline int posix_lock_file(struct file *filp, struct file_lock *fl,
 	return -ENOLCK;
 }
 
-static inline int posix_lock_inode_wait(struct inode *inode,
-					struct file_lock *fl)
-{
-	return -ENOLCK;
-}
-
 static inline int posix_unblock_lock(struct file_lock *waiter)
 {
 	return -ENOENT;
@@ -1172,12 +1164,6 @@ static inline int vfs_cancel_lock(struct file *filp, struct file_lock *fl)
 	return 0;
 }
 
-static inline int flock_lock_inode_wait(struct inode *inode,
-					struct file_lock *request)
-{
-	return -ENOLCK;
-}
-
 static inline int locks_lock_file_wait(struct file *filp, struct file_lock *fl)
 {
 	return -ENOLCK;
@@ -1221,16 +1207,6 @@ static inline struct inode *file_inode(const struct file *f)
 	return f->f_inode;
 }
 
-static inline int posix_lock_file_wait(struct file *filp, struct file_lock *fl)
-{
-	return posix_lock_inode_wait(file_inode(filp), fl);
-}
-
-static inline int flock_lock_file_wait(struct file *filp, struct file_lock *fl)
-{
-	return flock_lock_inode_wait(file_inode(filp), fl);
-}
-
 static inline int locks_lock_file_wait(struct file *filp, struct file_lock *fl)
 {
 	return locks_lock_inode_wait(file_inode(filp), fl);
 }
-- 
1.7.1
Re: MDS stuck in a crash loop
On Thu, Oct 22, 2015 at 8:48 AM, John Spray wrote: > On Thu, Oct 22, 2015 at 1:43 PM, Milosz Tanski wrote: >> On Wed, Oct 21, 2015 at 5:33 PM, John Spray wrote: >>> On Wed, Oct 21, 2015 at 10:33 PM, John Spray wrote: > John, I know you've got > https://github.com/ceph/ceph-qa-suite/pull/647. I think that's > supposed to be for this, but I'm not sure if you spotted any issues > with it or if we need to do some more diagnosing? That test path is just verifying that we do handle dirs without dying in at least one case -- it passes with the existing ceph code, so it's not reproducing this issue. >>> >>> Clicked send too soon, I was about to add... >>> >>> Milosz mentioned that they don't have the data from the system in the >>> broken state, so I don't have any bright ideas about learning more >>> about what went wrong here unfortunately. >>> >> >> Sorry about that, wasn't thinking at the time and just wanted to get >> this up and going as quickly as possible :( >> >> If this happens next time I'll be more careful to keep more evidence. >> I think multi-fs in the same rados namespace support would actually have >> helped here, since it makes it easier to create a newfs and leave the >> other one around (for investigation) > > Yep, good point. I am a known enthusiast for multi-filesystem support :-) > >> But it makes me wonder whether the broken dir scenario could be >> replicated by hand using rados calls. There's a pretty generic ticket >> there for "don't die on dir errors", but I imagine the code can be >> audited and steps to cause a synthetic error can be produced. > > Yes, that part I have done (and will build into the automated tests in > due course) -- the bit that is still a mystery is how the damage > occurred to begin with. John, my money is on me somehow fumbling the recovery process. And, with the bash history having rolled off, I'm going to assume that's what happened.
-- Milosz Tanski CTO 16 East 34th Street, 15th floor New York, NY 10016 p: 646-253-9055 e: mil...@adfin.com
Re: [PATCH] mark rbd requiring stable pages
On 10/22/2015 06:20 AM, Ilya Dryomov wrote: > >> > >> > If we are just talking about if stable pages are not used, and someone >> > is re-writing data to a page after the page has already been submitted >> > to the block layer (I mean the page is on some bio which is on a request >> > which is on some request_queue scheduler list or basically anywhere in >> > the block layer), then I was saying this can occur with any block >> > driver. There is nothing that is preventing this from happening with a >> > FC driver or nvme or cciss or in dm or whatever. The app/user can >> > rewrite as late as when we are in the make_request_fn/request_fn. >> > >> > I think I am misunderstanding your question because I thought this is >> > expected behavior, and there is nothing drivers can do if the app is not >> > doing a flush/sync between these types of write sequences. > I don't see a problem with rewriting as late as when we are in > request_fn() (or in a wq after being put there by request_fn()). Where > I thought there *might* be an issue is rewriting after sendpage(), if > sendpage() is used - perhaps some sneaky sequence similar to that > retransmit bug that would cause us to *transmit* incorrect bytes (as > opposed to *re*transmit) or something of that nature? Just to make sure we are on the same page. Are you concerned about the tcp/net layer retransmitting due to it detecting an issue as part of the tcp protocol, or are you concerned about rbd/libceph initiating a retry like with the nfs issue?
[PATCH] net: ceph: osd_client: change osd_req_op_data() macro
This patch changes the osd_req_op_data() macro to not evaluate parameters more than once in order to follow the kernel coding style.

Signed-off-by: Ioana Ciornei
Reviewed-by: Alex Elder
---
 net/ceph/osd_client.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index a362d7e..856e8f8 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -120,10 +120,12 @@ static void ceph_osd_data_bio_init(struct ceph_osd_data *osd_data,
 }
 #endif /* CONFIG_BLOCK */
 
-#define osd_req_op_data(oreq, whch, typ, fld)		\
-	({						\
-		BUG_ON(whch >= (oreq)->r_num_ops);	\
-		&(oreq)->r_ops[whch].typ.fld;		\
+#define osd_req_op_data(oreq, whch, typ, fld)			\
+	({							\
+		struct ceph_osd_request *__oreq = (oreq);	\
+		unsigned int __whch = (whch);			\
+		BUG_ON(__whch >= __oreq->r_num_ops);		\
+		&__oreq->r_ops[__whch].typ.fld;			\
 	})
 
 static struct ceph_osd_data *
-- 
2.1.4
[PATCH 2/3] Move locks API users to locks_lock_inode_wait()
Instead of having users check for FL_POSIX or FL_FLOCK to call the correct locks API function, use the check within locks_lock_inode_wait(). This allows for some later cleanup.

Signed-off-by: Benjamin Coddington
---
 drivers/staging/lustre/lustre/llite/file.c |    8 ++--
 fs/9p/vfs_file.c                           |    4 ++--
 fs/ceph/locks.c                            |    4 ++--
 fs/cifs/file.c                             |    2 +-
 fs/dlm/plock.c                             |    4 ++--
 fs/fuse/file.c                             |    2 +-
 fs/gfs2/file.c                             |    8
 fs/lockd/clntproc.c                        |   13 +
 fs/locks.c                                 |    2 +-
 fs/nfs/file.c                              |   13 +
 fs/nfs/nfs4proc.c                          |   13 +
 fs/ocfs2/locks.c                           |    8
 12 files changed, 22 insertions(+), 59 deletions(-)

diff --git a/drivers/staging/lustre/lustre/llite/file.c b/drivers/staging/lustre/lustre/llite/file.c
index dcd0c6d..4edbf46 100644
--- a/drivers/staging/lustre/lustre/llite/file.c
+++ b/drivers/staging/lustre/lustre/llite/file.c
@@ -2763,13 +2763,9 @@ ll_file_flock(struct file *file, int cmd, struct file_lock *file_lock)
 	rc = md_enqueue(sbi->ll_md_exp, , NULL, op_data, , , 0, NULL /* req */, flags);
 
-	if ((file_lock->fl_flags & FL_FLOCK) &&
-	    (rc == 0 || file_lock->fl_type == F_UNLCK))
-		rc2 = flock_lock_file_wait(file, file_lock);
-	if ((file_lock->fl_flags & FL_POSIX) &&
-	    (rc == 0 || file_lock->fl_type == F_UNLCK) &&
+	if ((rc == 0 || file_lock->fl_type == F_UNLCK) &&
 	    !(flags & LDLM_FL_TEST_LOCK))
-		rc2 = posix_lock_file_wait(file, file_lock);
+		rc2 = locks_lock_file_wait(file, file_lock);
 
 	if (rc2 && file_lock->fl_type != F_UNLCK) {
 		einfo.ei_mode = LCK_NL;
diff --git a/fs/9p/vfs_file.c b/fs/9p/vfs_file.c
index 3abc447..f23fd86 100644
--- a/fs/9p/vfs_file.c
+++ b/fs/9p/vfs_file.c
@@ -161,7 +161,7 @@ static int v9fs_file_do_lock(struct file *filp, int cmd, struct file_lock *fl)
 	if ((fl->fl_flags & FL_POSIX) != FL_POSIX)
 		BUG();
 
-	res = posix_lock_file_wait(filp, fl);
+	res = locks_lock_file_wait(filp, fl);
 	if (res < 0)
 		goto out;
@@ -231,7 +231,7 @@ out_unlock:
 	if (res < 0 && fl->fl_type != F_UNLCK) {
 		fl_type = fl->fl_type;
 		fl->fl_type = F_UNLCK;
-		res = posix_lock_file_wait(filp, fl);
+		res = locks_lock_file_wait(filp, fl);
 		fl->fl_type = fl_type;
 	}
 out:
diff --git a/fs/ceph/locks.c b/fs/ceph/locks.c
index 6706bde..a2cb0c2 100644
--- a/fs/ceph/locks.c
+++ b/fs/ceph/locks.c
@@ -228,12 +228,12 @@ int ceph_flock(struct file *file, int cmd, struct file_lock *fl)
 	err = ceph_lock_message(CEPH_LOCK_FLOCK, CEPH_MDS_OP_SETFILELOCK, file, lock_cmd, wait, fl);
 	if (!err) {
-		err = flock_lock_file_wait(file, fl);
+		err = locks_lock_file_wait(file, fl);
 		if (err) {
 			ceph_lock_message(CEPH_LOCK_FLOCK, CEPH_MDS_OP_SETFILELOCK, file, CEPH_LOCK_UNLOCK, 0, fl);
-			dout("got %d on flock_lock_file_wait, undid lock", err);
+			dout("got %d on locks_lock_file_wait, undid lock", err);
 		}
 	}
 	return err;
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index e2a6af1..6afdad7 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -1553,7 +1553,7 @@ cifs_setlk(struct file *file, struct file_lock *flock, __u32 type,
 out:
 	if (flock->fl_flags & FL_POSIX && !rc)
-		rc = posix_lock_file_wait(file, flock);
+		rc = locks_lock_file_wait(file, flock);
 	return rc;
 }
diff --git a/fs/dlm/plock.c b/fs/dlm/plock.c
index 5532f09..3585cc0 100644
--- a/fs/dlm/plock.c
+++ b/fs/dlm/plock.c
@@ -172,7 +172,7 @@ int dlm_posix_lock(dlm_lockspace_t *lockspace, u64 number, struct file *file,
 	rv = op->info.rv;
 
 	if (!rv) {
-		if (posix_lock_file_wait(file, fl) < 0)
+		if (locks_lock_file_wait(file, fl) < 0)
 			log_error(ls, "dlm_posix_lock: vfs lock error %llx", (unsigned long long)number);
 	}
@@ -262,7 +262,7 @@ int dlm_posix_unlock(dlm_lockspace_t *lockspace, u64 number, struct file *file,
 	/* cause the vfs unlock to return ENOENT if lock is not found */
 	fl->fl_flags |= FL_EXISTS;
 
-	rv = posix_lock_file_wait(file, fl);
+	rv = locks_lock_file_wait(file, fl);
 	if (rv == -ENOENT) {
 		rv = 0;
 		goto
[PATCH 1/3] locks: introduce locks_lock_inode_wait()
Users of the locks API commonly call either posix_lock_file_wait() or
flock_lock_file_wait() depending upon the lock type.  Add a new function
locks_lock_inode_wait() which will check and call the correct function for
the type of lock passed in.

Signed-off-by: Benjamin Coddington
---
 fs/locks.c         | 24
 include/linux/fs.h | 11 +++
 2 files changed, 35 insertions(+), 0 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index 2a54c80..68b1784 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -1876,6 +1876,30 @@ int flock_lock_inode_wait(struct inode *inode, struct file_lock *fl)
 EXPORT_SYMBOL(flock_lock_inode_wait);

 /**
+ * locks_lock_inode_wait - Apply a lock to an inode
+ * @inode: inode of the file to apply to
+ * @fl: The lock to be applied
+ *
+ * Apply a POSIX or FLOCK style lock request to an inode.
+ */
+int locks_lock_inode_wait(struct inode *inode, struct file_lock *fl)
+{
+	int res = 0;
+	switch (fl->fl_flags & (FL_POSIX|FL_FLOCK)) {
+	case FL_POSIX:
+		res = posix_lock_inode_wait(inode, fl);
+		break;
+	case FL_FLOCK:
+		res = flock_lock_inode_wait(inode, fl);
+		break;
+	default:
+		BUG();
+	}
+	return res;
+}
+EXPORT_SYMBOL(locks_lock_inode_wait);
+
+/**
  * sys_flock: - flock() system call.
  * @fd: the file descriptor to lock.
  * @cmd: the type of lock to apply.

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 72d8a84..2e283b7 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1059,6 +1059,7 @@ extern int vfs_test_lock(struct file *, struct file_lock *);
 extern int vfs_lock_file(struct file *, unsigned int, struct file_lock *, struct file_lock *);
 extern int vfs_cancel_lock(struct file *filp, struct file_lock *fl);
 extern int flock_lock_inode_wait(struct inode *inode, struct file_lock *fl);
+extern int locks_lock_inode_wait(struct inode *inode, struct file_lock *fl);
 extern int __break_lease(struct inode *inode, unsigned int flags, unsigned int type);
 extern void lease_get_mtime(struct inode *, struct timespec *time);
 extern int generic_setlease(struct file *, long, struct file_lock **, void **priv);
@@ -1177,6 +1178,11 @@ static inline int flock_lock_inode_wait(struct inode *inode,
 	return -ENOLCK;
 }

+static inline int locks_lock_file_wait(struct file *filp, struct file_lock *fl)
+{
+	return -ENOLCK;
+}
+
 static inline int __break_lease(struct inode *inode, unsigned int mode, unsigned int type)
 {
 	return 0;
@@ -1225,6 +1231,11 @@ static inline int flock_lock_file_wait(struct file *filp, struct file_lock *fl)
 	return flock_lock_inode_wait(file_inode(filp), fl);
 }

+static inline int locks_lock_file_wait(struct file *filp, struct file_lock *fl)
+{
+	return locks_lock_inode_wait(file_inode(filp), fl);
+}
+
 struct fasync_struct {
 	spinlock_t fa_lock;
 	int magic;
--
1.7.1
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
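[Editorial note: the FL_POSIX/FL_FLOCK split that locks_lock_inode_wait() dispatches on corresponds to the two userspace locking interfaces, fcntl(2) byte-range locks and flock(2) whole-file locks. The sketch below is not kernel code; it is a hypothetical userspace illustration (function and path names invented) showing that the two lock families are taken through different system calls, which is why the VFS keeps separate posix_* and flock_* paths.]

```c
#include <fcntl.h>
#include <string.h>
#include <sys/file.h>
#include <unistd.h>

/* Take a BSD-style lock (kernel FL_FLOCK) and then a POSIX-style lock
 * (kernel FL_POSIX) on the same descriptor.  The two live in separate
 * namespaces and do not conflict with each other. */
static int lock_both(const char *path)
{
	int fd = open(path, O_CREAT | O_RDWR, 0600);
	if (fd < 0)
		return -1;

	/* FL_FLOCK path: whole-file advisory lock via flock(2). */
	if (flock(fd, LOCK_EX) < 0) {
		close(fd);
		return -1;
	}

	/* FL_POSIX path: byte-range record lock via fcntl(2). */
	struct flock fl;
	memset(&fl, 0, sizeof(fl));
	fl.l_type = F_WRLCK;
	fl.l_whence = SEEK_SET;
	fl.l_start = 0;
	fl.l_len = 0;		/* l_len == 0 means "to end of file" */
	if (fcntl(fd, F_SETLKW, &fl) < 0) {
		close(fd);
		return -1;
	}

	close(fd);		/* closing the descriptor drops both locks */
	return 0;
}
```

Both calls succeed on the same file at once: the kernel tracks them as distinct lock types, and locks_lock_inode_wait() simply routes each request to the matching implementation.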
Re: [PATCH 3/3] locks: cleanup posix_lock_inode_wait and flock_lock_inode_wait
Hi Benjamin,

[auto build test WARNING on jlayton/linux-next -- if it's an inappropriate
base, please suggest rules for selecting a more suitable one]

url: https://github.com/0day-ci/linux/commits/Benjamin-Coddington/locks-introduce-locks_lock_inode_wait/20151022-233848
reproduce:
        # apt-get install sparse
        make ARCH=x86_64 allmodconfig
        make C=1 CF=-D__CHECK_ENDIAN__

sparse warnings: (new ones prefixed by >>)

>> fs/locks.c:1176:5: sparse: symbol 'posix_lock_inode_wait' was not declared. Should it be static?
>> fs/locks.c:1863:5: sparse: symbol 'flock_lock_inode_wait' was not declared. Should it be static?

Please review and possibly fold the followup patch.

---
0-DAY kernel test infrastructure              Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                 Intel Corporation
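[Editorial note: the sparse warning above fires for any function with external linkage that has no visible declaration. A minimal self-contained illustration (invented names, not the fs/locks.c code): `helper()` below would trigger "symbol 'helper' was not declared. Should it be static?" if it lacked the `static`, because nothing outside this translation unit can legitimately call it.]

```c
/* Without 'static', helper() gets external linkage but no prototype in
 * any header -- exactly the situation sparse flags.  Marking it static
 * limits it to this file and silences the warning. */
static int helper(int x)
{
	return x * 2;
}

int use_helper(int x)
{
	return helper(x);
}
```

The followup [RFC PATCH] below applies the same fix to posix_lock_inode_wait() and flock_lock_inode_wait(), which are only called from within fs/locks.c once locks_lock_inode_wait() becomes the exported entry point.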
[RFC PATCH] locks: posix_lock_inode_wait() can be static
Signed-off-by: Fengguang Wu
---
 locks.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index daf4664..0d2b326 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -1173,7 +1173,7 @@ EXPORT_SYMBOL(posix_lock_file);
  *
  * Apply a POSIX style lock request to an inode.
  */
-int posix_lock_inode_wait(struct inode *inode, struct file_lock *fl)
+static int posix_lock_inode_wait(struct inode *inode, struct file_lock *fl)
 {
 	int error;
 	might_sleep();
@@ -1860,7 +1860,7 @@ int fcntl_setlease(unsigned int fd, struct file *filp, long arg)
  *
  * Apply a FLOCK style lock request to an inode.
  */
-int flock_lock_inode_wait(struct inode *inode, struct file_lock *fl)
+static int flock_lock_inode_wait(struct inode *inode, struct file_lock *fl)
 {
 	int error;
 	might_sleep();
Ceph erasure coding
Hi,

I have a question about the capabilities of the erasure coding API in
Ceph.  Let's say that I have 10 data disks and 4 parity disks: is it
possible to create an erasure coding plugin which creates 20 data chunks
and 8 parity chunks, and then places two chunks on each osd?

Or, said a bit more simply: is it possible for two or more chunks from
the same encode operation to be placed on the same osd?

- Kjetil Babington
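[Editorial note: for readers unfamiliar with the encode operation being discussed, here is a toy illustration, not Ceph's actual jerasure/isa plugin code. It uses the simplest possible code (K data chunks plus one XOR parity chunk, all names invented) to show what "creating chunks from one encode operation" means; the question above is about the separate placement step, where CRUSH maps each resulting chunk to an OSD.]

```c
#include <string.h>

#define K 4		/* data chunks per object */
#define CHUNK 8		/* bytes per chunk */

/* encode: split an object into K data chunks and derive one parity
 * chunk by XOR.  Real Ceph plugins use Reed-Solomon codes with m > 1
 * parity chunks, but the shape is the same: one object in, K + m
 * chunks out, each chunk then placed on some OSD by CRUSH. */
static void encode_xor(const unsigned char *obj,	/* K * CHUNK bytes */
		       unsigned char data[K][CHUNK],
		       unsigned char parity[CHUNK])
{
	memset(parity, 0, CHUNK);
	for (int i = 0; i < K; i++) {
		memcpy(data[i], obj + i * CHUNK, CHUNK);
		for (int j = 0; j < CHUNK; j++)
			parity[j] ^= data[i][j];
	}
}

/* decode: rebuild one lost data chunk from the survivors plus parity. */
static void recover_xor(unsigned char data[K][CHUNK],
			const unsigned char parity[CHUNK], int lost)
{
	memcpy(data[lost], parity, CHUNK);
	for (int i = 0; i < K; i++)
		if (i != lost)
			for (int j = 0; j < CHUNK; j++)
				data[lost][j] ^= data[i][j];
}

/* Round trip: encode, "lose" chunk 2 (one failed OSD), recover it.
 * Returns 0 when the recovered chunk matches the original bytes. */
static int demo_roundtrip(void)
{
	unsigned char obj[K * CHUNK];
	unsigned char data[K][CHUNK], parity[CHUNK];

	for (int i = 0; i < K * CHUNK; i++)
		obj[i] = (unsigned char)(i * 7 + 3);
	encode_xor(obj, data, parity);
	memset(data[2], 0, CHUNK);	/* simulate the lost chunk */
	recover_xor(data, parity, 2);
	return memcmp(data[2], obj + 2 * CHUNK, CHUNK);
}
```

Note why placement matters: if two of these chunks land on the same OSD, one disk failure loses two chunks at once, which a code dimensioned for one-chunk-per-OSD failures may not survive.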
Re: [PATCH 1/3] locks: introduce locks_lock_inode_wait()
On Thu, 22 Oct 2015, Benjamin Coddington wrote:

> Users of the locks API commonly call either posix_lock_file_wait() or
> flock_lock_file_wait() depending upon the lock type.  Add a new function
> locks_lock_inode_wait() which will check and call the correct function for
> the type of lock passed in.
>
> Signed-off-by: Benjamin Coddington
> ---
>  fs/locks.c         | 24
>  include/linux/fs.h | 11 +++
>  2 files changed, 35 insertions(+), 0 deletions(-)
>
> diff --git a/fs/locks.c b/fs/locks.c
> index 2a54c80..68b1784 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -1876,6 +1876,30 @@ int flock_lock_inode_wait(struct inode *inode, struct file_lock *fl)
>  EXPORT_SYMBOL(flock_lock_inode_wait);
>
>  /**
> + * locks_lock_inode_wait - Apply a lock to an inode
> + * @inode: inode of the file to apply to
> + * @fl: The lock to be applied
> + *
> + * Apply a POSIX or FLOCK style lock request to an inode.
> + */
> +int locks_lock_inode_wait(struct inode *inode, struct file_lock *fl)
> +{
> +	int res = 0;
> +	switch (fl->fl_flags & (FL_POSIX|FL_FLOCK)) {
> +	case FL_POSIX:
> +		res = posix_lock_inode_wait(inode, fl);
> +		break;
> +	case FL_FLOCK:
> +		res = flock_lock_inode_wait(inode, fl);
> +		break;
> +	default:
> +		BUG();
> +	}
> +	return res;
> +}
> +EXPORT_SYMBOL(locks_lock_inode_wait);
> +
> +/**
>  * sys_flock: - flock() system call.
>  * @fd: the file descriptor to lock.
>  * @cmd: the type of lock to apply.
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 72d8a84..2e283b7 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1059,6 +1059,7 @@ extern int vfs_test_lock(struct file *, struct file_lock *);
>  extern int vfs_lock_file(struct file *, unsigned int, struct file_lock *, struct file_lock *);
>  extern int vfs_cancel_lock(struct file *filp, struct file_lock *fl);
>  extern int flock_lock_inode_wait(struct inode *inode, struct file_lock *fl);
> +extern int locks_lock_inode_wait(struct inode *inode, struct file_lock *fl);
>  extern int __break_lease(struct inode *inode, unsigned int flags, unsigned int type);
>  extern void lease_get_mtime(struct inode *, struct timespec *time);
>  extern int generic_setlease(struct file *, long, struct file_lock **, void **priv);
> @@ -1177,6 +1178,11 @@ static inline int flock_lock_inode_wait(struct inode *inode,
>  	return -ENOLCK;
>  }
>
> +static inline int locks_lock_file_wait(struct file *filp, struct file_lock *fl)
> +{
> +	return -ENOLCK;
> +}
> +

So, this is obviously wrong - thank you 0-day robot.  Yes, I did build and
test against these patches, but went back and added this after I realized
it should work w/o CONFIG_FILE_LOCKING.  I'll re-send.
Ben

> static inline int __break_lease(struct inode *inode, unsigned int mode,
> 				unsigned int type)
> {
> 	return 0;
> @@ -1225,6 +1231,11 @@ static inline int flock_lock_file_wait(struct file *filp, struct file_lock *fl)
> 	return flock_lock_inode_wait(file_inode(filp), fl);
> }
>
> +static inline int locks_lock_file_wait(struct file *filp, struct file_lock *fl)
> +{
> +	return locks_lock_inode_wait(file_inode(filp), fl);
> +}
> +
> struct fasync_struct {
> 	spinlock_t fa_lock;
> 	int magic;
> --
> 1.7.1
Re: [PATCH 1/3] locks: introduce locks_lock_inode_wait()
Hi Benjamin,

[auto build test ERROR on jlayton/linux-next -- if it's an inappropriate
base, please suggest rules for selecting a more suitable one]

url: https://github.com/0day-ci/linux/commits/Benjamin-Coddington/locks-introduce-locks_lock_inode_wait/20151022-233848
config: x86_64-allnoconfig (attached as .config)
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64

All errors (new ones prefixed by >>):

   In file included from include/linux/cgroup.h:17:0,
                    from include/linux/memcontrol.h:22,
                    from include/linux/swap.h:8,
                    from include/linux/suspend.h:4,
                    from arch/x86/kernel/asm-offsets.c:12:
>> include/linux/fs.h:1234:19: error: redefinition of 'locks_lock_file_wait'
    static inline int locks_lock_file_wait(struct file *filp, struct file_lock *fl)
                      ^
   include/linux/fs.h:1181:19: note: previous definition of 'locks_lock_file_wait' was here
    static inline int locks_lock_file_wait(struct file *filp, struct file_lock *fl)
                      ^
   include/linux/fs.h: In function 'locks_lock_file_wait':
>> include/linux/fs.h:1236:9: error: implicit declaration of function 'locks_lock_inode_wait' [-Werror=implicit-function-declaration]
      return locks_lock_inode_wait(file_inode(filp), fl);
            ^
   cc1: some warnings being treated as errors
   make[2]: *** [arch/x86/kernel/asm-offsets.s] Error 1
   make[2]: Target '__build' not remade because of errors.
   make[1]: *** [prepare0] Error 2
   make[1]: Target 'prepare' not remade because of errors.
   make: *** [sub-make] Error 2

vim +/locks_lock_file_wait +1234 include/linux/fs.h

  1228
  1229	static inline int flock_lock_file_wait(struct file *filp, struct file_lock *fl)
  1230	{
  1231		return flock_lock_inode_wait(file_inode(filp), fl);
  1232	}
  1233
> 1234	static inline int locks_lock_file_wait(struct file *filp, struct file_lock *fl)
  1235	{
> 1236	return locks_lock_inode_wait(file_inode(filp), fl);
  1237	}
  1238
  1239	struct fasync_struct {

---
0-DAY kernel test infrastructure              Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                 Intel Corporation
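[Editorial note: the redefinition error above comes from the same `static inline locks_lock_file_wait()` body being visible twice on allnoconfig (CONFIG_FILE_LOCKING off). A minimal self-contained illustration of the pattern the header needs (macro and function names invented, not the kernel's): exactly one definition must be selected by the config option, never both.]

```c
/* Stand-in for CONFIG_FILE_LOCKING; comment out to compile the stub
 * branch instead.  The point is that the #if/#else makes the two
 * definitions mutually exclusive, so no redefinition can occur. */
#define DEMO_FILE_LOCKING

#ifdef DEMO_FILE_LOCKING
/* "real" implementation: succeed for any valid descriptor */
static inline int demo_lock_wait(int fd)
{
	return fd >= 0 ? 0 : -22;	/* -EINVAL for a bad fd */
}
#else
/* locking compiled out: always refuse, like the kernel's -ENOLCK stub */
static inline int demo_lock_wait(int fd)
{
	(void)fd;
	return -37;			/* -ENOLCK */
}
#endif
```

With both definitions inside one #ifdef/#else pair, flipping the option swaps implementations cleanly; the broken v1 patch instead left one copy outside the conditional, so both became visible at once.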
Re: [PATCH] mark rbd requiring stable pages
On Thu, Oct 22, 2015 at 5:37 PM, Mike Christie wrote:
> On 10/22/2015 06:20 AM, Ilya Dryomov wrote:
>>> If we are just talking about if stable pages are not used, and someone
>>> is re-writing data to a page after the page has already been submitted
>>> to the block layer (I mean the page is on some bio which is on a request
>>> which is on some request_queue scheduler list or basically anywhere in
>>> the block layer), then I was saying this can occur with any block
>>> driver.  There is nothing that is preventing this from happening with a
>>> FC driver or nvme or cciss or in dm or whatever.  The app/user can
>>> rewrite as late as when we are in the make_request_fn/request_fn.
>>>
>>> I think I am misunderstanding your question because I thought this is
>>> expected behavior, and there is nothing drivers can do if the app is not
>>> doing a flush/sync between these types of write sequences.
>>
>> I don't see a problem with rewriting as late as when we are in
>> request_fn() (or in a wq after being put there by request_fn()).  Where
>> I thought there *might* be an issue is rewriting after sendpage(), if
>> sendpage() is used - perhaps some sneaky sequence similar to that
>> retransmit bug that would cause us to *transmit* incorrect bytes (as
>> opposed to *re*transmit) or something of that nature?
>
> Just to make sure we are on the same page.
>
> Are you concerned about the tcp/net layer retransmitting due to it
> detecting an issue as part of the tcp protocol, or are you concerned
> about rbd/libceph initiating a retry like with the nfs issue?

The former, tcp/net layer.  I'm just conjecturing though.

(We don't have the nfs issue, because even if the client sends such a
retransmit (which it won't), the primary OSD will reject it as a dup.)

Thanks,

                Ilya
Re: Ceph erasure coding
Not on purpose... out of curiosity, why do you want to do that?
-Sam

On Thu, Oct 22, 2015 at 9:44 AM, Kjetil Babington wrote:
> Hi,
>
> I have a question about the capabilities of the erasure coding API in
> Ceph. Let's say that I have 10 data disks and 4 parity disks, is it
> possible to create an erasure coding plugin which creates 20 data
> chunks and 8 parity chunks, and then places two chunks on each osd?
>
> Or said maybe a bit simpler is it possible for two or more chunks from
> the same encode operation to be placed on the same osd?
>
> - Kjetil Babington
Re: Ceph erasure coding
Hi,

On 22/10/2015 18:44, Kjetil Babington wrote:
> Hi,
>
> I have a question about the capabilities of the erasure coding API in
> Ceph. Let's say that I have 10 data disks and 4 parity disks, is it
> possible to create an erasure coding plugin which creates 20 data
> chunks and 8 parity chunks, and then places two chunks on each osd?
>
> Or said maybe a bit simpler is it possible for two or more chunks from
> the same encode operation to be placed on the same osd?

This is more a question of creating a crush ruleset that does it. The
erasure code plugin encodes chunks but the crush ruleset decides where
they are placed.

Cheers

--
Loïc Dachary, Artisan Logiciel Libre