Re: [PATCH] mark rbd requiring stable pages

2015-10-22 Thread Mike Christie

On 10/22/15, 11:52 AM, Ilya Dryomov wrote:

On Thu, Oct 22, 2015 at 5:37 PM, Mike Christie  wrote:

On 10/22/2015 06:20 AM, Ilya Dryomov wrote:




If we are just talking about if stable pages are not used, and someone
is re-writing data to a page after the page has already been submitted
to the block layer (I mean the page is on some bio which is on a request
which is on some request_queue scheduler list or basically anywhere in
the block layer), then I was saying this can occur with any block
driver. There is nothing that is preventing this from happening with a
FC driver or nvme or cciss or in dm or whatever. The app/user can
rewrite as late as when we are in the make_request_fn/request_fn.

I think I am misunderstanding your question because I thought this is
expected behavior, and there is nothing drivers can do if the app is not
doing a flush/sync between these types of write sequences.

I don't see a problem with rewriting as late as when we are in
request_fn() (or in a wq after being put there by request_fn()).  Where
I thought there *might* be an issue is rewriting after sendpage(), if
sendpage() is used - perhaps some sneaky sequence similar to that
retransmit bug that would cause us to *transmit* incorrect bytes (as
opposed to *re*transmit) or something of that nature?



Just to make sure we are on the same page.

Are you concerned about the tcp/net layer retransmitting due to it
detecting an issue as part of the tcp protocol, or are you concerned
about rbd/libceph initiating a retry like with the nfs issue?


The former, tcp/net layer.  I'm just conjecturing though.



For iscsi, we normally use the sendpage path. Data digests are off by 
default and some distros do not even allow you to turn them on, so our 
sendpage path has got a lot of testing and we have not seen any 
corruptions. Not saying it is not possible, but just saying we have not 
seen any.


It could be due to a recent change. Ronny, tell us about the workload 
and I will check iscsi.


Oh yeah, for the tcp/net retransmission case, I had said offlist, I 
thought there might be an issue with iscsi but I guess I was wrong, so I 
have not seen any issues with that either.


iSCSI just has that bug I mentioned offlist where we close the socket 
and fail commands upwards in the wrong order. That is an iscsi-specific 
bug though.



Re: newstore direction

2015-10-22 Thread Orit Wasserman
On Thu, 2015-10-22 at 02:12 +, Allen Samuels wrote:
> One of the biggest changes that flash is making in the storage world is the 
> way basic trade-offs in storage management software architecture are 
> being affected. In the HDD world CPU time per IOP was relatively 
> inconsequential, i.e., it had little effect on overall performance which was 
> limited by the physics of the hard drive. Flash is now inverting that 
> situation. When you look at the performance levels being delivered in the 
> latest generation of NVMe SSDs you rapidly see that the storage itself is 
> generally no longer the bottleneck (speaking about BW, not latency of course) 
> but rather it's the system sitting in front of the storage that is the 
> bottleneck. Generally it's the CPU cost of an IOP.
> 
> When Sandisk first started working with Ceph (Dumpling) the design of 
> librados and the OSD led to a situation where the CPU cost of an IOP was 
> dominated by context switches and network socket handling. Over time, much of 
> that has been addressed. The socket handling code has been re-written (more 
> than once!) and some of the internal queueing in the OSD (and the associated 
> context switches) have been eliminated. As the CPU costs have dropped, 
> performance on flash has improved accordingly.
> 
> Because we didn't want to completely re-write the OSD (time-to-market and 
> stability drove that decision), we didn't move it from the current "thread 
> per IOP" model into a truly asynchronous "thread per CPU core" model that 
> essentially eliminates context switches in the IO path. But a fully optimized 
> OSD would go down that path (at least part-way). I believe it's been proposed 
> in the past. Perhaps a hybrid "fast-path" style could get most of the 
> benefits while preserving much of the legacy code.
> 

+1
It's not just about reducing context switches but also about removing contention
and data copies and getting better cache utilization.

ScyllaDB just did this to Cassandra (using the Seastar library):
http://www.zdnet.com/article/kvm-creators-open-source-fast-cassandra-drop-in-replacement-scylla/
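
To make the shard-per-core idea concrete, a tiny sketch (not Ceph or Seastar
code; the shard layout and names are made up):

/*
 * Sketch of the "thread per CPU core" model: one shard per core, each pinned
 * to its core with its own private state, so requests are processed without
 * cross-core locking, data copies, or context switches.
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

#define NSHARDS 4

struct shard {
	int core;
	/* per-shard state (queues, allocator, cache) would live here */
};

static void *shard_loop(void *arg)
{
	struct shard *s = arg;
	cpu_set_t set;

	/* pin this shard's thread to its dedicated core */
	CPU_ZERO(&set);
	CPU_SET(s->core, &set);
	pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

	/* a real shard would poll its own request queue here instead of exiting */
	printf("shard running on core %d\n", s->core);
	return NULL;
}

int main(void)
{
	pthread_t tids[NSHARDS];
	struct shard shards[NSHARDS];
	int i;

	for (i = 0; i < NSHARDS; i++) {
		shards[i].core = i;
		pthread_create(&tids[i], NULL, shard_loop, &shards[i]);
	}
	for (i = 0; i < NSHARDS; i++)
		pthread_join(tids[i], NULL);
	return 0;
}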

Orit

> I believe this trend toward thread-per-core software development will also 
> tend to support the "do it in user-space" trend. That's because most of the 
> kernel and file-system interface is architected around the blocking 
> "thread-per-IOP" model and is unlikely to change in the future.
> 
> 
> Allen Samuels
> Software Architect, Fellow, Systems and Software Solutions
> 
> 2880 Junction Avenue, San Jose, CA 95134
> T: +1 408 801 7030| M: +1 408 780 6416
> allen.samu...@sandisk.com
> 
> -Original Message-
> From: Martin Millnert [mailto:mar...@millnert.se]
> Sent: Thursday, October 22, 2015 6:20 AM
> To: Mark Nelson 
> Cc: Ric Wheeler ; Allen Samuels 
> ; Sage Weil ; 
> ceph-devel@vger.kernel.org
> Subject: Re: newstore direction
> 
> Adding 2c
> 
> On Wed, 2015-10-21 at 14:37 -0500, Mark Nelson wrote:
> > My thought is that there is some inflection point where the userland
> > kvstore/block approach is going to be less work, for everyone I think,
> > than trying to quickly discover, understand, fix, and push upstream
> > patches that sometimes only really benefit us.  I don't know if we've
> > truly hit that point, but it's tough for me to find flaws with
> > Sage's argument.
> 
> Regarding the userland / kernel land aspect of the topic, there are further 
> aspects AFAIK not yet addressed in the thread:
> In the networking world, there's been development on memory mapped (multiple 
> approaches exist) userland networking, which for packet management has the 
> benefit of - for very, very specific applications of networking code - 
> avoiding e.g. per-packet context switches etc, and streamlining processor 
> cache management performance. People have gone as far as removing CPU cores 
> from CPU scheduler to completely dedicate them to the networking task at hand 
> (cache optimizations). There are various latency/throughput (bulking) 
> optimizations applicable, but at the end of the day, it's about keeping the 
> CPU bus busy with "revenue" bus traffic.
> 
> Granted, storage IO operations may be much heavier in cycle counts for 
> context switches to ever appear as a problem in themselves, certainly for 
> slower SSDs and HDDs. However, when going for truly high performance IO, 
> *every* hurdle in the data path counts toward the total latency.
> (And really, high performance random IO characteristics approach the 
> networking, per-packet handling characteristics).  Now, I'm not really 
> suggesting memory-mapping a storage device to user space, not at all, but 
> having better control over the data path for a very specific use case, 
> reduces dependency on the code that works as best as possible for the general 
> case, and allows for very purpose-built code, to address a narrow set of 
> requirements. 

Re: newstore direction

2015-10-22 Thread Christoph Hellwig
On Wed, Oct 21, 2015 at 10:30:28AM -0700, Sage Weil wrote:
> For example: we need to do an overwrite of an existing object that is 
> atomic with respect to a larger ceph transaction (we're updating a bunch 
> of other metadata at the same time, possibly overwriting or appending to 
> multiple files, etc.).  XFS and ext4 aren't cow file systems, so plugging 
> into the transaction infrastructure isn't really an option (and even after 
> several years of trying to do it with btrfs it proved to be impractical).  

Not that I'm disagreeing with most of your points, but we can do things
like that with swapext-like hacks.  Below is my half year old prototype
of an O_ATOMIC implementation for XFS that gives you atomic out of place
writes.

diff --git a/fs/fcntl.c b/fs/fcntl.c
index ee85cd4..001dd49 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -740,7 +740,7 @@ static int __init fcntl_init(void)
 * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
 * is defined as O_NONBLOCK on some platforms and not on others.
 */
-   BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ != HWEIGHT32(
+   BUILD_BUG_ON(22 - 1 /* for O_RDONLY being 0 */ != HWEIGHT32(
O_RDONLY| O_WRONLY  | O_RDWR|
O_CREAT | O_EXCL| O_NOCTTY  |
O_TRUNC | O_APPEND  | /* O_NONBLOCK | */
@@ -748,6 +748,7 @@ static int __init fcntl_init(void)
O_DIRECT| O_LARGEFILE   | O_DIRECTORY   |
O_NOFOLLOW  | O_NOATIME | O_CLOEXEC |
__FMODE_EXEC| O_PATH| __O_TMPFILE   |
+   O_ATOMIC|
__FMODE_NONOTIFY
));
 
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index aeffeaa..8eafca6 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -4681,14 +4681,14 @@ xfs_bmap_del_extent(
xfs_btree_cur_t *cur,   /* if null, not a btree */
xfs_bmbt_irec_t *del,   /* data to remove from extents */
int *logflagsp, /* inode logging flags */
-   int whichfork) /* data or attr fork */
+   int whichfork, /* data or attr fork */
+   boolfree_blocks) /* free extent at end of routine */
 {
xfs_filblks_t   da_new; /* new delay-alloc indirect blocks */
xfs_filblks_t   da_old; /* old delay-alloc indirect blocks */
xfs_fsblock_t   del_endblock=0; /* first block past del */
xfs_fileoff_t   del_endoff; /* first offset past del */
int delay;  /* current block is delayed allocated */
-   int do_fx;  /* free extent at end of routine */
xfs_bmbt_rec_host_t *ep;/* current extent entry pointer */
int error;  /* error return value */
int flags;  /* inode logging flags */
@@ -4712,8 +4712,8 @@ xfs_bmap_del_extent(
 
mp = ip->i_mount;
ifp = XFS_IFORK_PTR(ip, whichfork);
-   ASSERT((*idx >= 0) && (*idx < ifp->if_bytes /
-   (uint)sizeof(xfs_bmbt_rec_t)));
+   ASSERT(*idx >= 0);
+   ASSERT(*idx < ifp->if_bytes / sizeof(xfs_bmbt_rec_t));
ASSERT(del->br_blockcount > 0);
ep = xfs_iext_get_ext(ifp, *idx);
xfs_bmbt_get_all(ep, &got);
@@ -4746,10 +4746,13 @@ xfs_bmap_del_extent(
len = del->br_blockcount;
do_div(bno, mp->m_sb.sb_rextsize);
do_div(len, mp->m_sb.sb_rextsize);
-   error = xfs_rtfree_extent(tp, bno, (xfs_extlen_t)len);
-   if (error)
-   goto done;
-   do_fx = 0;
+   if (free_blocks) {
+   error = xfs_rtfree_extent(tp, bno,
+   (xfs_extlen_t)len);
+   if (error)
+   goto done;
+   free_blocks = 0;
+   }
nblks = len * mp->m_sb.sb_rextsize;
qfield = XFS_TRANS_DQ_RTBCOUNT;
}
@@ -4757,7 +4760,6 @@ xfs_bmap_del_extent(
 * Ordinary allocation.
 */
else {
-   do_fx = 1;
nblks = del->br_blockcount;
qfield = XFS_TRANS_DQ_BCOUNT;
}
@@ -4777,7 +4779,7 @@ xfs_bmap_del_extent(
da_old = startblockval(got.br_startblock);
da_new = 0;
nblks = 0;
-   do_fx = 0;
+   free_blocks = 0;
}
/*
 * Set flag value to use in switch statement.
@@ -4963,7 +4965,7 @@ xfs_bmap_del_extent(
/*
 * If we 

Re: [PATCH] mark rbd requiring stable pages

2015-10-22 Thread Ilya Dryomov
On Thu, Oct 22, 2015 at 7:22 PM, Mike Christie  wrote:
> On 10/22/15, 11:52 AM, Ilya Dryomov wrote:
>>
>> On Thu, Oct 22, 2015 at 5:37 PM, Mike Christie 
>> wrote:
>>>
>>> On 10/22/2015 06:20 AM, Ilya Dryomov wrote:


>>
>> If we are just talking about if stable pages are not used, and someone
>> is re-writing data to a page after the page has already been submitted
>> to the block layer (I mean the page is on some bio which is on a
>> request
>> which is on some request_queue scheduler list or basically anywhere in
>> the block layer), then I was saying this can occur with any block
>> driver. There is nothing that is preventing this from happening with a
>> FC driver or nvme or cciss or in dm or whatever. The app/user can
>> rewrite as late as when we are in the make_request_fn/request_fn.
>>
>> I think I am misunderstanding your question because I thought this is
>> expected behavior, and there is nothing drivers can do if the app is
>> not
>> doing a flush/sync between these types of write sequences.

 I don't see a problem with rewriting as late as when we are in
 request_fn() (or in a wq after being put there by request_fn()).  Where
 I thought there *might* be an issue is rewriting after sendpage(), if
 sendpage() is used - perhaps some sneaky sequence similar to that
 retransmit bug that would cause us to *transmit* incorrect bytes (as
 opposed to *re*transmit) or something of that nature?
>>>
>>>
>>>
>>> Just to make sure we are on the same page.
>>>
>>> Are you concerned about the tcp/net layer retransmitting due to it
>>> detecting an issue as part of the tcp protocol, or are you concerned
>>> about rbd/libceph initiating a retry like with the nfs issue?
>>
>>
>> The former, tcp/net layer.  I'm just conjecturing though.
>>
>
> For iscsi, we normally use the sendpage path. Data digests are off by
> default and some distros do not even allow you to turn them on, so our
> sendpage path has got a lot of testing and we have not seen any corruptions.
> Not saying it is not possible, but just saying we have not seen any.

Great, that's reassuring.

>
> It could be due to a recent change. Ronny, tell us about the workload and I
> will check iscsi.
>
> Oh yeah, for the tcp/net retransmission case, I had said offlist, I thought
> there might be an issue with iscsi but I guess I was wrong, so I have not
> seen any issues with that either.

I'll drop my concerns then.  Those corruptions could be a bug in ceph
reconnect code or something else - regardless, that's separate from the
issue at hand.

Thanks,

Ilya


Re: [PATCH] mark rbd requiring stable pages

2015-10-22 Thread Ilya Dryomov
On Thu, Oct 22, 2015 at 6:07 AM, Mike Christie  wrote:
> On 10/21/2015 03:57 PM, Ilya Dryomov wrote:
>> On Wed, Oct 21, 2015 at 10:51 PM, Ilya Dryomov  wrote:
>>> On Fri, Oct 16, 2015 at 1:09 PM, Ilya Dryomov  wrote:
 Hmm...  On the one hand, yes, we do compute CRCs, but that's optional,
 so enabling this unconditionally is probably too harsh.  OTOH we are
 talking to the network, which means all sorts of delays, retransmission
 issues, etc, so I wonder how exactly "unstable" pages behave when, say,
 added to an skb - you can't write anything to a page until networking
 is fully done with it and expect it to work.  It's particularly
 alarming that you've seen corruptions.

 Currently the only users of this flag are block integrity stuff and
 md-raid5, which makes me wonder what iscsi, nfs and others do in this
 area.  There's an old ticket on this topic somewhere on the tracker, so
 I'll need to research this.  Thanks for bringing this up!
>>>
>>> Hi Mike,
>>>
>>> I was hoping to grab you for a few minutes, but you weren't there...
>>>
>>> I spent a better part of today reading code and mailing lists on this
>>> topic.  It is of course a bug that we use sendpage() which inlines
>>> pages into an skb and do nothing to keep those pages stable.  We have
>>> csums enabled by default, so setting BDI_CAP_STABLE_WRITES in the crc
>>> case is an obvious fix.
>>>
>>> I looked at drbd and iscsi and I think iscsi could do the same - ditch
>>> the fallback to sock_no_sendpage() in the datadgst_en case (and get rid
>>> of iscsi_sw_tcp_conn::sendpage member while at it).  Using stable pages
>>> rather than having a roll-your-own implementation which doesn't close
>>> the race but only narrows it sounds like a win, unless copying through
>>> sendmsg() is for some reason cheaper than stable-waiting?
>
> Yeah, that is what I was saying on the call the other day, but the
> reception was bad. We only have the sendmsg code path when digests are on
> because that code came before stable pages. When stable pages were
> created, it was on by default but did not cover all the cases, so we
> left the code. It then handled most scenarios, but I just never got
> around to removing old the code. However, it was set to off by default
> so I left it and made this patch for iscsi to turn on stable pages:
>
> [this patch only enabled stable pages when digests/crcs are on and did
> not remove the code yet]
> https://groups.google.com/forum/#!topic/open-iscsi/n4jvWK7BPYM
>
> I did not really like the layering so I have not posted it for inclusion.

Good to know I got it right ;)
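
For reference, a minimal sketch of what the crc-case fix discussed above could
look like on the rbd side (illustrative only; where exactly this lands in rbd's
queue setup is an assumption):

/*
 * Sketch: require stable pages on the rbd queue when message data CRCs are
 * enabled, so a page cannot change between sendpage() and crc calculation.
 */
	if (!ceph_test_opt(rbd_dev->rbd_client->client, NOCRC))
		q->backing_dev_info.capabilities |= BDI_CAP_STABLE_WRITES;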

>
>
>
>>>
>>> drbd still needs the non-zero-copy version for its async protocol for
>>> when they free the pages before the NIC has chance to put them on the
>>> wire.  md-raid5 it turns out has an option to essentially disable most
>>> of its stripe cache and so it sets BDI_CAP_STABLE_WRITES to compensate
>>> if that option is enabled.
>>>
>>> What I'm worried about is the !crc (!datadgst_en) case.  I'm failing to
>>> convince myself that mucking with sendpage()ed pages while they sit in
>>> the TCP queue (or anywhere in the networking stack, really), is safe -
>>> there is nothing to prevent pages from being modified after sendpage()
>>> returned and Ronny reports data corruptions that pretty much went away
>>> with BDI_CAP_STABLE_WRITES set.  I may be, after prolonged staring at
>>> this, starting to confuse fs with block, though.  How does that work in
>>> iscsi land?
>
> This is what I was trying to ask about in the call the other day. Where
> is the corruption that Ronny was seeing. Was it checksum mismatches on
> data being written, or is incorrect meta data being written, etc?

Well, checksum mismatches are to be expected given what we are doing
now, but I wouldn't expect any data corruptions.  Ronny writes that he
saw frequent ext4 corruptions on krbd devices before he enabled stable
pages, which leads me to believe that the !crc case, for which we won't
be setting BDI_CAP_STABLE_WRITES, is going to be/remain broken.  Ronny,
could you describe it in more detail and maybe share some of those osd
logs with bad crc messages?
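
(For reference, the zero-copy hazard being discussed is roughly the following;
the helper is purely illustrative, not Ceph code:)

/*
 * Illustrative fragment: after kernel_sendpage() returns, the TCP stack still
 * holds a reference to the page and may read it again later (checksumming on
 * transmit, retransmits), so the caller must keep the page contents stable
 * until the data has been acked -- otherwise the bytes that reach the wire
 * can differ from what was originally submitted.
 */
static int send_one_page(struct socket *sock, struct page *page,
			 int offset, size_t len)
{
	/* zero-copy: the page itself is attached to the skb, not copied */
	return kernel_sendpage(sock, page, offset, len,
			       MSG_DONTWAIT | MSG_MORE);
}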

>
> If we are just talking about if stable pages are not used, and someone
> is re-writing data to a page after the page has already been submitted
> to the block layer (I mean the page is on some bio which is on a request
> which is on some request_queue scheduler list or basically anywhere in
> the block layer), then I was saying this can occur with any block
> driver. There is nothing that is preventing this from happening with a
> FC driver or nvme or cciss or in dm or whatever. The app/user can
> rewrite as late as when we are in the make_request_fn/request_fn.
>
> I think I am misunderstanding your question because I thought this is
> expected behavior, and there is nothing drivers can do if the app is not
> doing a 

keyring issues, 9.1.0

2015-10-22 Thread Deneau, Tom
My current situation as I upgrade to v9.1.0 is that the client.admin keyring seems 
to work fine, for instance for the ceph status command.  But commands that use 
client.bootstrap-osd, such as

/usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring 
/var/lib/ceph/bootstrap-osd/ceph.keyring osd create --concise 
a428120d-99ec-4a73-999f-75d8a6bfcb2e

are getting "EACCES: access denied"

with log entries in ceph.audit.log such as

2015-10-22 13:50:24.070249 mon.0 10.0.2.132:6789/0 33 : audit [INF] 
from='client.? 10.0.2.132:0/263577121' entity='client.bootstrap-osd' 
cmd=[{"prefix": "osd create", "uuid": "a428120d-99ec-4a73-999f-75d8a6bfcb2e"}]: 
 access denied

I tried setting
debug auth = 0
in ceph.conf but couldn't tell anything from that output.

Is there anything special I should look for here?
Note: I do have /var/lib/ceph and subdirectories owned by ceph:ceph
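
One thing worth checking (assuming the default bootstrap-osd capability profile
is in use) is whether the bootstrap key still has its mon caps after the
upgrade, e.g.:

ceph auth get client.bootstrap-osd
ceph auth caps client.bootstrap-osd mon 'allow profile bootstrap-osd'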

-- Tom




RE: newstore direction

2015-10-22 Thread James (Fei) Liu-SSI
Hi Sage and other fellow cephers,
  I truly share the pain with you all about filesystems while I am working on 
the objectstore to improve performance. As mentioned, there is nothing wrong 
with filesystems; it is just that Ceph, as one use case, needs more support than 
filesystems will provide in the near future, for whatever reason.

   There are many techniques popping up which can help to improve the performance 
of the OSD.  A user-space driver (DPDK from Intel) is one of them. It not only gives 
you the storage allocator, it also gives you thread scheduling support, CPU 
affinity, NUMA friendliness, and polling, which might fundamentally change the 
performance of the objectstore.  It should not be hard to improve CPU utilization 
3x~5x and achieve higher IOPS, etc.
I totally agree that the goal of filestore is to give enough support for 
filesystems with either the 1, 1b, or 2 solutions. In my humble opinion, the new 
design goal of the objectstore should focus on giving the best performance for 
the OSD with new techniques. These two goals are not going to conflict with each 
other; they just serve different purposes, to make Ceph not only more stable but 
also better.

  Scylla, mentioned by Orit, is a good example.

  Thanks all.

  Regards,
  James   

-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
Sent: Thursday, October 22, 2015 5:50 AM
To: Ric Wheeler
Cc: Orit Wasserman; ceph-devel@vger.kernel.org
Subject: Re: newstore direction

On Wed, 21 Oct 2015, Ric Wheeler wrote:
> You will have to trust me on this as the Red Hat person who spoke to 
> pretty much all of our key customers about local file systems and 
> storage - customers all have migrated over to using normal file systems under 
> Oracle/DB2.
> Typically, they use XFS or ext4.  I don't know of any non-standard 
> file systems and only have seen one account running on a raw block 
> store in 8 years
> :)
> 
> If you have a pre-allocated file and write using O_DIRECT, your IO 
> path is identical in terms of IO's sent to the device.
> 
> If we are causing additional IO's, then we really need to spend some 
> time talking to the local file system gurus about this in detail.  I 
> can help with that conversation.

If the file is truly preallocated (that is, prewritten with zeros... 
fallocate doesn't help here because the extents are marked unwritten), then
sure: there is very little change in the data path.
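
As a concrete illustration of "truly preallocated" (a sketch only; the file
name and sizes are made up):

/*
 * posix_fallocate() alone leaves extents marked unwritten, so the first
 * overwrite of each extent still goes through an unwritten->written
 * conversion in the fs journal; prewriting zeros avoids that, and later
 * O_DIRECT overwrites then see essentially the same IOs a block device would.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	const off_t file_size = (off_t)1024 * 1024 * 1024;	/* 1 GiB */
	const size_t chunk = 1024 * 1024;			/* 1 MiB of zeros */
	char *buf;
	off_t off;
	int fd = open("osd-data.img", O_CREAT | O_WRONLY, 0600);

	if (fd < 0 || posix_memalign((void **)&buf, 4096, chunk))
		return 1;
	memset(buf, 0, chunk);
	for (off = 0; off < file_size; off += chunk)
		if (pwrite(fd, buf, chunk, off) != (ssize_t)chunk)
			return 1;
	fsync(fd);
	close(fd);

	/* Reopen with O_DIRECT; overwrites now hit already-written extents. */
	fd = open("osd-data.img", O_WRONLY | O_DIRECT);
	close(fd);
	free(buf);
	return 0;
}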

But at that point, what is the point?  This only works if you have one (or a 
few) huge files and the user space app already has all the complexity of a 
filesystem-like thing (with its own internal journal, allocators, garbage 
collection, etc.).  Do they just do this to ease administrative tasks like 
backup?


This is the fundamental tradeoff:

1) We have a file per object.  We fsync like crazy and the fact that there are 
two independent layers journaling and managing different types of consistency 
penalizes us.

1b) We get clever and start using obscure and/or custom ioctls in the file 
system to work around what it is used to: we swap extents to avoid write-ahead 
(see Christoph's patch), O_NOMTIME, unprivileged open-by-handle, batch fsync, 
O_ATOMIC, setext ioctl, etc.

2) We preallocate huge files and write a user-space object system that lives 
within it (pretending the file is a block device).  The file system rarely gets 
in the way (assuming the file is prewritten and we don't do anything stupid).  
But it doesn't give us anything a block device wouldn't, and it doesn't save us 
any complexity in our code.

At the end of the day, 1 and 1b are always going to be slower than 2.  
And although 1b performs a bit better than 1, it has similar (user-space) 
complexity to 2.  On the other hand, if you step back and view the entire stack 
(ceph-osd + XFS), 1 and 1b are *significantly* more complex than 2... and yet 
still slower.  Given we ultimately have to support both (both as an upstream 
and as a distro), that's not very attractive.

Also note that every time we have strayed off the reservation from the beaten 
path (1) to anything mildly exotic (1b) we have been bitten by obscure file 
systems bugs.  And that's assuming we get everything we need upstream... which is 
probably a year's endeavour.

Don't get me wrong: I'm all for making changes to file systems to better 
support systems like Ceph.  Things like O_NOCMTIME and O_ATOMIC make a huge 
amount of sense for a ton of different systems.  But our situation is a bit 
different: we always own the entire device (and often the server), so there is 
no need to share with other users or apps (and when you do, you just use the 
existing FileStore backend).  And as you know performance is a huge pain point. 
 We are already handicapped by virtue of being distributed and strongly 
consistent; we can't afford to give away more to a storage layer that isn't 
providing us much (or the right) value.

And I'm tired of half 

tracker.ceph.com downtime today

2015-10-22 Thread Dan Mick
tracker.ceph.com will be brought down today for upgrade and move to a
new host.  I plan to do this at about 4PM PST (40 minutes from now).
Expect a downtime of about 15-20 minutes.  More notification to follow.

-- 
Dan Mick
Red Hat, Inc.
Ceph docs: http://ceph.com/docs


Re: tracker.ceph.com downtime today

2015-10-22 Thread Dan Mick
It's back.  New DNS info is propagating its way around.  If you
absolutely must get to it, newtracker.ceph.com is the new address, but
please don't bookmark that, as it will be going away after the transition.

Please let me know of any problems you have.

On 10/22/2015 04:09 PM, Dan Mick wrote:
> tracker.ceph.com down now
> 
> On 10/22/2015 03:20 PM, Dan Mick wrote:
>> tracker.ceph.com will be brought down today for upgrade and move to a
>> new host.  I plan to do this at about 4PM PST (40 minutes from now).
>> Expect a downtime of about 15-20 minutes.  More notification to follow.
>>
> 
> 


-- 
Dan Mick
Red Hat, Inc.
Ceph docs: http://ceph.com/docs


Re: newstore direction

2015-10-22 Thread Samuel Just
Since the changes which moved the pg log and the pg info into the pg
object space, I think it's now the case that any transaction submitted
to the objectstore updates a disjoint range of objects determined by
the sequencer.  It might be easier to exploit that parallelism if we
control allocation and allocation related metadata.  We could split
the store into N pieces which partition the pg space (one additional
one for the meta sequencer?) with one rocksdb instance for each.
Space could then be parcelled out in large pieces (small frequency of
global allocation decisions) and managed more finely within each
partition.  The main challenge would be avoiding internal
fragmentation of those, but at least defragmentation can be managed on
a per-partition basis.  Such parallelism is probably necessary to
exploit the full throughput of some ssds.
-Sam
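
As a rough illustration of the partition mapping described above (not Ceph
code; the names and the partition count are made up):

/*
 * Sketch: route each sequencer (pg) to one of N independently managed
 * partitions, each with its own kv instance and allocator, so transactions
 * for different pgs don't contend on allocation metadata.
 */
#include <stdint.h>

#define NUM_PARTITIONS 16	/* N pieces partitioning the pg space (plus one for meta) */

struct store_partition {
	void *kv;		/* one rocksdb-like instance per piece */
	void *allocator;	/* space parcelled out and managed per partition */
};

static struct store_partition partitions[NUM_PARTITIONS];

/* Every object touched by a transaction shares a sequencer, so hashing the
 * sequencer id picks exactly one partition per transaction. */
static struct store_partition *partition_for(uint64_t sequencer_id)
{
	return &partitions[sequencer_id % NUM_PARTITIONS];
}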

On Thu, Oct 22, 2015 at 10:42 AM, James (Fei) Liu-SSI
 wrote:
> Hi Sage and other fellow cephers,
>   I truly share the pains with you  all about filesystem while I am working 
> on  objectstore to improve the performance. As mentioned , there is nothing 
> wrong with filesystem. Just the Ceph as one of  use case need more supports 
> but not provided in near future by filesystem no matter what reasons.
>
>There are so many techniques  pop out which can help to improve 
> performance of OSD.  User space driver(DPDK from Intel) is one of them. It 
> not only gives you the storage allocator,  also gives you the thread 
> scheduling support,  CPU affinity , NUMA friendly, polling  which  might 
> fundamentally change the performance of objectstore.  It should not be hard 
> to improve CPU utilization 3x~5x times, higher IOPS etc.
> I totally agreed that goal of filestore is to gives enough support for 
> filesystem with either 1 ,1b, or 2 solutions. In my humble opinion , The new 
> design goal of objectstore should focus on giving the best  performance for 
> OSD with new techniques. These two goals are not going to conflict with each 
> other.  They are just for different purposes to make Ceph not only more 
> stable but also better.
>
>   Scylla mentioned by Orit is a good example .
>
>   Thanks all.
>
>   Regards,
>   James
>
> -Original Message-
> From: ceph-devel-ow...@vger.kernel.org 
> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
> Sent: Thursday, October 22, 2015 5:50 AM
> To: Ric Wheeler
> Cc: Orit Wasserman; ceph-devel@vger.kernel.org
> Subject: Re: newstore direction
>
> On Wed, 21 Oct 2015, Ric Wheeler wrote:
>> You will have to trust me on this as the Red Hat person who spoke to
>> pretty much all of our key customers about local file systems and
>> storage - customers all have migrated over to using normal file systems 
>> under Oracle/DB2.
>> Typically, they use XFS or ext4.  I don't know of any non-standard
>> file systems and only have seen one account running on a raw block
>> store in 8 years
>> :)
>>
>> If you have a pre-allocated file and write using O_DIRECT, your IO
>> path is identical in terms of IO's sent to the device.
>>
>> If we are causing additional IO's, then we really need to spend some
>> time talking to the local file system gurus about this in detail.  I
>> can help with that conversation.
>
> If the file is truly preallocated (that is, prewritten with zeros...
> fallocate doesn't help here because the extents are marked unwritten), then
> sure: there is very little change in the data path.
>
> But at that point, what is the point?  This only works if you have one (or a 
> few) huge files and the user space app already has all the complexity of a 
> filesystem-like thing (with its own internal journal, allocators, garbage 
> collection, etc.).  Do they just do this to ease administrative tasks like 
> backup?
>
>
> This is the fundamental tradeoff:
>
> 1) We have a file per object.  We fsync like crazy and the fact that there 
> are two independent layers journaling and managing different types of 
> consistency penalizes us.
>
> 1b) We get clever and start using obscure and/or custom ioctls in the file 
> system to work around what it is used to: we swap extents to avoid 
> write-ahead (see Christoph's patch), O_NOMTIME, unprivileged open-by-handle, 
> batch fsync, O_ATOMIC, setext ioctl, etc.
>
> 2) We preallocate huge files and write a user-space object system that lives 
> within it (pretending the file is a block device).  The file system rarely 
> gets in the way (assuming the file is prewritten and we don't do anything 
> stupid).  But it doesn't give us anything a block device wouldn't, and it 
> doesn't save us any complexity in our code.
>
> At the end of the day, 1 and 1b are always going to be slower than 2.
> And although 1b performs a bit better than 1, it has similar (user-space) 
>> complexity to 2.  On the other hand, if you step back and view the entire 
> stack (ceph-osd + XFS), 1 and 1b are *significantly* more complex than 2... 
> and yet still 

Re: newstore direction

2015-10-22 Thread Samuel Just
Ah, except for the snapmapper.  We can split the snapmapper in the
same way, though, as long as we are careful with the name.
-Sam

On Thu, Oct 22, 2015 at 4:42 PM, Samuel Just  wrote:
> Since the changes which moved the pg log and the pg info into the pg
> object space, I think it's now the case that any transaction submitted
> to the objectstore updates a disjoint range of objects determined by
> the sequencer.  It might be easier to exploit that parallelism if we
> control allocation and allocation related metadata.  We could split
> the store into N pieces which partition the pg space (one additional
> one for the meta sequencer?) with one rocksdb instance for each.
> Space could then be parcelled out in large pieces (small frequency of
> global allocation decisions) and managed more finely within each
> partition.  The main challenge would be avoiding internal
> fragmentation of those, but at least defragmentation can be managed on
> a per-partition basis.  Such parallelism is probably necessary to
> exploit the full throughput of some ssds.
> -Sam
>
> On Thu, Oct 22, 2015 at 10:42 AM, James (Fei) Liu-SSI
>  wrote:
>> Hi Sage and other fellow cephers,
>>   I truly share the pains with you  all about filesystem while I am working 
>> on  objectstore to improve the performance. As mentioned , there is nothing 
>> wrong with filesystem. Just the Ceph as one of  use case need more supports 
>> but not provided in near future by filesystem no matter what reasons.
>>
>>There are so many techniques  pop out which can help to improve 
>> performance of OSD.  User space driver(DPDK from Intel) is one of them. It 
>> not only gives you the storage allocator,  also gives you the thread 
>> scheduling support,  CPU affinity , NUMA friendly, polling  which  might 
>> fundamentally change the performance of objectstore.  It should not be hard 
>> to improve CPU utilization 3x~5x times, higher IOPS etc.
>> I totally agreed that goal of filestore is to gives enough support for 
>> filesystem with either 1 ,1b, or 2 solutions. In my humble opinion , The new 
>> design goal of objectstore should focus on giving the best  performance for 
>> OSD with new techniques. These two goals are not going to conflict with each 
>> other.  They are just for different purposes to make Ceph not only more 
>> stable but also better.
>>
>>   Scylla mentioned by Orit is a good example .
>>
>>   Thanks all.
>>
>>   Regards,
>>   James
>>
>> -Original Message-
>> From: ceph-devel-ow...@vger.kernel.org 
>> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
>> Sent: Thursday, October 22, 2015 5:50 AM
>> To: Ric Wheeler
>> Cc: Orit Wasserman; ceph-devel@vger.kernel.org
>> Subject: Re: newstore direction
>>
>> On Wed, 21 Oct 2015, Ric Wheeler wrote:
>>> You will have to trust me on this as the Red Hat person who spoke to
>>> pretty much all of our key customers about local file systems and
>>> storage - customers all have migrated over to using normal file systems 
>>> under Oracle/DB2.
>>> Typically, they use XFS or ext4.  I don't know of any non-standard
>>> file systems and only have seen one account running on a raw block
>>> store in 8 years
>>> :)
>>>
>>> If you have a pre-allocated file and write using O_DIRECT, your IO
>>> path is identical in terms of IO's sent to the device.
>>>
>>> If we are causing additional IO's, then we really need to spend some
>>> time talking to the local file system gurus about this in detail.  I
>>> can help with that conversation.
>>
>> If the file is truly preallocated (that is, prewritten with zeros...
>> fallocate doesn't help here because the extents are marked unwritten), then
>> sure: there is very little change in the data path.
>>
>> But at that point, what is the point?  This only works if you have one (or a 
>> few) huge files and the user space app already has all the complexity of a 
>> filesystem-like thing (with its own internal journal, allocators, garbage 
>> collection, etc.).  Do they just do this to ease administrative tasks like 
>> backup?
>>
>>
>> This is the fundamental tradeoff:
>>
>> 1) We have a file per object.  We fsync like crazy and the fact that there 
>> are two independent layers journaling and managing different types of 
>> consistency penalizes us.
>>
>> 1b) We get clever and start using obscure and/or custom ioctls in the file 
>> system to work around what it is used to: we swap extents to avoid 
>> write-ahead (see Christoph's patch), O_NOMTIME, unprivileged open-by-handle, 
>> batch fsync, O_ATOMIC, setext ioctl, etc.
>>
>> 2) We preallocate huge files and write a user-space object system that lives 
>> within it (pretending the file is a block device).  The file system rarely 
>> gets in the way (assuming the file is prewritten and we don't do anything 
>> stupid).  But it doesn't give us anything a block device wouldn't, and it 
>> doesn't save us any complexity in our code.
>>
>> At the 

Re: [PATCH] mark rbd requiring stable pages

2015-10-22 Thread Ronny Hegewald
On Thursday 22 October 2015, Ilya Dryomov wrote:
> Well, checksum mismatches are to be expected given what we are doing
> now, but I wouldn't expect any data corruptions.  Ronny writes that he
> saw frequent ext4 corruptions on krbd devices before he enabled stable
> pages, which leads me to believe that the !crc case, for which we won't
> be setting BDI_CAP_STABLE_WRITES, is going to be/remain broken.  Ronny,
> could you describe it in more detail and maybe share some of those osd
> logs with bad crc messages?
> 
This is from a 10 minute period from one of the OSDs. 

23:11:02.423728 ce5dfb70  0 bad crc in data 1657725429 != exp 496797267
23:11:37.586411 ce5dfb70  0 bad crc in data 1216602498 != exp 111161
23:12:07.805675 cc3ffb70  0 bad crc in data 3140625666 != exp 2614069504
23:12:44.485713 c96ffb70  0 bad crc in data 1712148977 != exp 3239079328
23:13:24.746217 ce5dfb70  0 bad crc in data 144620426 != exp 3156694286
23:13:52.792367 ce5dfb70  0 bad crc in data 4033880920 != exp 4159672481
23:14:22.958999 c96ffb70  0 bad crc in data 847688321 != exp 1551499144
23:16:35.015629 ce5dfb70  0 bad crc in data 2790209714 != exp 3779604715
23:17:48.482049 c96ffb70  0 bad crc in data 1563466764 != exp 528198494
23:19:28.925357 cc3ffb70  0 bad crc in data 1764275395 != exp 2075504274
23:19:59.039843 cc3ffb70  0 bad crc in data 2960172683 != exp 1215950691

The filesystem corruptions usually come with messages like

EXT4-fs error (device rbd4): ext4_mb_generate_buddy:757: group 155, block 
bitmap and bg descriptor inconsistent: 23625 vs 23660 free clusters

These were pretty common, at least every other day, often multiple times a 
day.

Sometimes there was an additional 

JBD2: Spotted dirty metadata buffer (dev = rbd4, blocknr = 0). There's a risk 
of filesystem corruption in case of system crash.

Another type of filesystem corruption I experienced during kernel 
compilations led to the following messages.

EXT4-fs warning (device rbd3): empty_dir:2488: bad directory (dir #282221) - 
no `.' or `..'
EXT4-fs warning (device rbd3): empty_dir:2488: bad directory (dir #273062) - 
no `.' or `..'
EXT4-fs warning (device rbd3): empty_dir:2488: bad directory (dir #272270) - 
no `.' or `..'
EXT4-fs warning (device rbd3): empty_dir:2488: bad directory (dir #282254) - 
no `.' or `..'
EXT4-fs warning (device rbd3): empty_dir:2488: bad directory (dir #273070) - 
no `.' or `..'
EXT4-fs warning (device rbd3): empty_dir:2488: bad directory (dir #272308) - 
no `.' or `..'
EXT4-fs error (device rbd3): ext4_lookup:1417: inode #270033: comm rm: deleted 
inode referenced: 270039
last message repeated 2 times
EXT4-fs warning (device rbd3): empty_dir:2488: bad directory (dir #271534) - 
no `.' or `..'
EXT4-fs warning (device rbd3): empty_dir:2488: bad directory (dir #271275) - 
no `.' or `..'
EXT4-fs warning (device rbd3): empty_dir:2488: bad directory (dir #282290) - 
no `.' or `..'
EXT4-fs warning (device rbd3): empty_dir:2488: bad directory (dir #281914) - 
no `.' or `..'
EXT4-fs error (device rbd3): ext4_lookup:1417: inode #270033: comm rm: deleted 
inode referenced: 270039
last message repeated 2 times
kernel: EXT4-fs error (device rbd3): ext4_lookup:1417: inode #273018: comm rm: 
deleted inode referenced: 282221
EXT4-fs error (device rbd3): ext4_lookup:1417: inode #273018: comm rm: deleted 
inode referenced: 282221
EXT4-fs error (device rbd3): ext4_lookup:1417: inode #273018: comm rm: deleted 
inode referenced: 281914
EXT4-fs error (device rbd3): ext4_lookup:1417: inode #273018: comm rm: deleted 
inode referenced: 281914
EXT4-fs error: 243 callbacks suppressed 
EXT4-fs error (device rbd3): ext4_lookup:1417: inode #282002: comm cp: deleted 
inode referenced: 45375
kernel: EXT4-fs error (device rbd3): ext4_lookup:1417: inode #282002: comm cp: 
deleted inode referenced: 45371

The result was that various files and directories in the kernel source dir 
couldn't be accessed anymore and even fsck couldn't repair it, so I finally 
had to delete it. But these cases were pretty rare.

Another issue was data corruption in the files themselves, which happened 
independently of the filesystem corruptions.  These happened on most days, 
sometimes only once, sometimes multiple times a day.

Newly written files that contained corrupted data always seemed to have it in 
only one place. The corrupt data replaced the original data in the file, but 
never changed the file size. The position of the corruption in the files 
was always different.

The interesting part is that these corrupted parts always followed the same 
pattern: first a few hundred 0x0 bytes, then a few KB (10-30) of random 
binary data, finishing again with a few hundred bytes of 0x0.

In a few cases I could trace this data back to another file that was read at 
the same time by the same program. But that might be coincidental, because 
other corruptions that happened in the same scenario I couldn't trace back 
this way.

In other cases that 

Re: [PATCH] mark rbd requiring stable pages

2015-10-22 Thread Ronny Hegewald
On Thursday 22 October 2015, you wrote:
> It could be due to a recent change. Ronny, tell us about the workload
> and I will check iscsi.
 
I guess the best testcase is a kernel compilation in a make clean; make -j (> 
1); loop. The data corruptions usually happen in the generated .cmd files, 
which break the build immediately and make the corruption easy to spot.

Besides that, I have seen data corruption in other simple circumstances: 
copying data from a non-rbd to an rbd device, from rbd device to rbd device, 
and scp'ing data from another machine to the rbd.

Also, I have mounted the rbds on the same machines I'm running the OSDs on, 
which might be a contributing factor.

Unfortunately there seems to be nothing that increases the likelihood of the 
corruption happening. I tried all kinds of things with no success.

Another factor in the corruption might have been the amount of free memory. 
Before I added the flag for stable pages I regularly had warnings like the one 
below. Since the use of stable pages for rbd these warnings are gone too.


kernel: swapper/1: page allocation failure: order:0, mode:0x20
kernel:   88012fc83b68 8143f171 
kernel:  0020 88012fc83bf8 81127fda 88012fff9838
kernel:  880109bc7100 01ff88012fc83be8 8164aa40 0020
kernel: Call Trace:
kernel:[] dump_stack+0x48/0x5f
kernel:  [] warn_alloc_failed+0xea/0x130
kernel:  [] __alloc_pages_nodemask+0x69a/0x910
kernel:  [] ? br_handle_frame_finish+0x500/0x500 [bridge]
kernel:  [] alloc_pages_current+0xa7/0x170
kernel:  [] atl1c_alloc_rx_buffer+0x36c/0x430 [atl1c]
kernel:  [] atl1c_clean+0x212/0x3b0 [atl1c]
kernel:  [] net_rx_action+0x15f/0x320
kernel:  [] __do_softirq+0x123/0x2e0
kernel:  [] irq_exit+0x96/0xc0
kernel:  [] do_IRQ+0x65/0x110
kernel:  [] common_interrupt+0x72/0x72
kernel:[] ? retint_restore_args+0x13/0x13
kernel:  [] ? mwait_idle+0x72/0xb0
kernel:  [] ? mwait_idle+0x69/0xb0
kernel:  [] arch_cpu_idle+0xf/0x20
kernel:  [] cpu_startup_entry+0x22b/0x3e0
kernel:  [] start_secondary+0x156/0x180
kernel: Mem-Info:
kernel: Node 0 DMA per-cpu:
kernel: CPU0: hi:0, btch:   1 usd:   0
kernel: CPU1: hi:0, btch:   1 usd:   0
kernel: CPU2: hi:0, btch:   1 usd:   0
kernel: CPU3: hi:0, btch:   1 usd:   0
kernel: Node 0 DMA32 per-cpu:
kernel: CPU0: hi:  186, btch:  31 usd: 182
kernel: CPU1: hi:  186, btch:  31 usd: 179
kernel: CPU2: hi:  186, btch:  31 usd: 156
kernel: CPU3: hi:  186, btch:  31 usd: 170
kernel: Node 0 Normal per-cpu:
kernel: CPU0: hi:  186, btch:  31 usd: 138
kernel: CPU1: hi:  186, btch:  31 usd: 130
kernel: CPU2: hi:  186, btch:  31 usd:  73
kernel: CPU3: hi:  186, btch:  31 usd: 122
kernel: active_anon:499711 inactive_anon:128139 isolated_anon:0
kernel:  active_file:132181 inactive_file:145093 isolated_file:22
kernel:  unevictable:4083 dirty:1526 writeback:15597 unstable:0
kernel:  free:5225 slab_reclaimable:23735 slab_unreclaimable:29775
kernel:  mapped:11742 shmem:18846 pagetables:3946 bounce:0
kernel:  free_cma:0
kernel: Node 0 DMA free:15284kB min:32kB low:40kB high:48kB active_anon:0kB 
inactive_anon:96kB active_file:232kB inactive_file:80kB unevictable:0kB 
isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB 
mlocked:0kB dirty:0kB writeback:0kB mapped:12kB shmem:0kB 
slab_reclaimable:52kB slab_unreclaimable:80kB kernel_stack:16kB pagetables:0kB 
unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:88 
all_unreclaimable? no
kernel: lowmem_reserve[]: 0 3107 3818 3818
kernel: Node 0 DMA32 free:5064kB min:6420kB low:8024kB high:9628kB 
active_anon:1718524kB inactive_anon:365504kB active_file:418964kB 
inactive_file:469748kB unevictable:0kB isolated(anon):0kB isolated(file):88kB 
present:3257216kB managed:3183616kB mlocked:0kB dirty:5900kB writeback:48264kB 
mapped:39204kB shmem:54364kB slab_reclaimable:76256kB 
slab_unreclaimable:93456kB kernel_stack:6240kB pagetables:12280kB unstable:0kB 
bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? 
no
kernel: lowmem_reserve[]: 0 0 710 710
kernel: Node 0 Normal free:552kB min:1468kB low:1832kB high:2200kB 
active_anon:280320kB inactive_anon:146956kB active_file:109528kB 
inactive_file:110544kB unevictable:16332kB isolated(anon):0kB 
isolated(file):0kB present:786432kB managed:728012kB mlocked:0kB dirty:204kB 
writeback:14124kB mapped:7752kB shmem:21020kB slab_reclaimable:18632kB 
slab_unreclaimable:25564kB kernel_stack:2432kB pagetables:3504kB unstable:0kB 
bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:608 all_unreclaimable? 
no
kernel: lowmem_reserve[]: 0 0 0 0
kernel: Node 0 DMA: 4*4kB (UE) 4*8kB (UEM) 2*16kB (UE) 5*32kB (UEM) 3*64kB 
(UM) 2*128kB (UE) 1*256kB (E) 2*512kB (EM) 3*1024kB (UEM) 3*2048kB (UEM) 
1*4096kB (R) = 15280kB
kernel: Node 0 DMA32: 0*4kB 1*8kB (R) 0*16kB 0*32kB 1*64kB (R) 1*128kB (R) 
1*256kB (R) 3*512kB (R) 1*1024kB (R) 1*2048kB (R) 0*4096kB = 5064kB
kernel: 

Re: tracker.ceph.com downtime today

2015-10-22 Thread Dan Mick
Fixed a configuration problem preventing updating issues, and switched
the mailer to use ipv4; if you updated and failed, or missed an email
notification, that may have been why.

On 10/22/2015 04:51 PM, Dan Mick wrote:
> It's back.  New DNS info is propagating its way around.  If you
> absolutely must get to it, newtracker.ceph.com is the new address, but
> please don't bookmark that, as it will be going away after the transition.
> 
> Please let me know of any problems you have.


RE: newstore direction

2015-10-22 Thread Allen Samuels
How would this kind of split affect small transactions? Will each split be 
separately transactionally consistent or is there some kind of meta-transaction 
that synchronizes each of the splits?


Allen Samuels
Software Architect, Fellow, Systems and Software Solutions

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samu...@sandisk.com

-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Samuel Just
Sent: Friday, October 23, 2015 8:42 AM
To: James (Fei) Liu-SSI 
Cc: Sage Weil ; Ric Wheeler ; Orit 
Wasserman ; ceph-devel@vger.kernel.org
Subject: Re: newstore direction

Since the changes which moved the pg log and the pg info into the pg object 
space, I think it's now the case that any transaction submitted to the 
objectstore updates a disjoint range of objects determined by the sequencer.  
It might be easier to exploit that parallelism if we control allocation and 
allocation related metadata.  We could split the store into N pieces which 
partition the pg space (one additional one for the meta sequencer?) with one 
rocksdb instance for each.
Space could then be parcelled out in large pieces (small frequency of global 
allocation decisions) and managed more finely within each partition.  The main 
challenge would be avoiding internal fragmentation of those, but at least 
defragmentation can be managed on a per-partition basis.  Such parallelism is 
probably necessary to exploit the full throughput of some ssds.
-Sam

On Thu, Oct 22, 2015 at 10:42 AM, James (Fei) Liu-SSI 
 wrote:
> Hi Sage and other fellow cephers,
>   I truly share the pains with you  all about filesystem while I am working 
> on  objectstore to improve the performance. As mentioned , there is nothing 
> wrong with filesystem. Just the Ceph as one of  use case need more supports 
> but not provided in near future by filesystem no matter what reasons.
>
>There are so many techniques  pop out which can help to improve 
> performance of OSD.  User space driver(DPDK from Intel) is one of them. It 
> not only gives you the storage allocator,  also gives you the thread 
> scheduling support,  CPU affinity , NUMA friendly, polling  which  might 
> fundamentally change the performance of objectstore.  It should not be hard 
> to improve CPU utilization 3x~5x times, higher IOPS etc.
> I totally agreed that goal of filestore is to gives enough support for 
> filesystem with either 1 ,1b, or 2 solutions. In my humble opinion , The new 
> design goal of objectstore should focus on giving the best  performance for 
> OSD with new techniques. These two goals are not going to conflict with each 
> other.  They are just for different purposes to make Ceph not only more 
> stable but also better.
>
>   Scylla mentioned by Orit is a good example .
>
>   Thanks all.
>
>   Regards,
>   James
>
> -Original Message-
> From: ceph-devel-ow...@vger.kernel.org
> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
> Sent: Thursday, October 22, 2015 5:50 AM
> To: Ric Wheeler
> Cc: Orit Wasserman; ceph-devel@vger.kernel.org
> Subject: Re: newstore direction
>
> On Wed, 21 Oct 2015, Ric Wheeler wrote:
>> You will have to trust me on this as the Red Hat person who spoke to
>> pretty much all of our key customers about local file systems and
>> storage - customers all have migrated over to using normal file systems 
>> under Oracle/DB2.
>> Typically, they use XFS or ext4.  I don't know of any non-standard
>> file systems and only have seen one account running on a raw block
>> store in 8 years
>> :)
>>
>> If you have a pre-allocated file and write using O_DIRECT, your IO
>> path is identical in terms of IO's sent to the device.
>>
>> If we are causing additional IO's, then we really need to spend some
>> time talking to the local file system gurus about this in detail.  I
>> can help with that conversation.
>
> If the file is truly preallocated (that is, prewritten with zeros...
> fallocate doesn't help here because the extents are marked unwritten),
> then
> sure: there is very little change in the data path.
>
> But at that point, what is the point?  This only works if you have one (or a 
> few) huge files and the user space app already has all the complexity of a 
> filesystem-like thing (with its own internal journal, allocators, garbage 
> collection, etc.).  Do they just do this to ease administrative tasks like 
> backup?
>
>
> This is the fundamental tradeoff:
>
> 1) We have a file per object.  We fsync like crazy and the fact that there 
> are two independent layers journaling and managing different types of 
> consistency penalizes us.
>
> 1b) We get clever and start using obscure and/or custom ioctls in the file 
> system to work around what it is used to: we swap extents to avoid 
> write-ahead (see 

Re: tracker.ceph.com downtime today

2015-10-22 Thread Dan Mick
tracker.ceph.com down now

On 10/22/2015 03:20 PM, Dan Mick wrote:
> tracker.ceph.com will be brought down today for upgrade and move to a
> new host.  I plan to do this at about 4PM PST (40 minutes from now).
> Expect a downtime of about 15-20 minutes.  More notification to follow.
> 


-- 
Dan Mick
Red Hat, Inc.
Ceph docs: http://ceph.com/docs


Re: tracker.ceph.com downtime today

2015-10-22 Thread Kyle Bader
I tried to open a new issue and got this error:

Internal error

An error occurred on the page you were trying to access.
If you continue to experience problems please contact your Redmine
administrator for assistance.

If you are the Redmine administrator, check your log files for details
about the error.


On Thu, Oct 22, 2015 at 6:15 PM, Dan Mick  wrote:
> Fixed a configuration problem preventing updating issues, and switched
> the mailer to use ipv4; if you updated and failed, or missed an email
> notification, that may have been why.
>
> On 10/22/2015 04:51 PM, Dan Mick wrote:
>> It's back.  New DNS info is propagating its way around.  If you
>> absolutely must get to it, newtracker.ceph.com is the new address, but
>> please don't bookmark that, as it will be going away after the transition.
>>
>> Please let me know of any problems you have.
>



-- 
Kyle Bader - Red Hat
Senior Solution Architect
Ceph Storage Architectures


when an osd is started up, IO will be blocked

2015-10-22 Thread wangsongbo

Hi all,

When an OSD is started, the related IO will be blocked.
According to the test results, the more IOPS the clients send, the 
longer this blocking takes to elapse.
Adjusting all the parameters associated with recovery operations was 
also found to be useless.


How can I reduce the impact of this process on the IO?

Thanks and Regards,
WangSongbo



Re: tracker.ceph.com downtime today

2015-10-22 Thread Dan Mick
Found that issue; reverted the database to the non-backlog-plugin state,
created a test bug.  Retry?

On 10/22/2015 06:54 PM, Dan Mick wrote:
> I see that too.  I suspect this is because of leftover database columns
> from the backlogs plugin, which is removed.  Looking into it.
> 
> On 10/22/2015 06:43 PM, Kyle Bader wrote:
>> I tried to open a new issue and got this error:
>>
>> Internal error
>>
>> An error occurred on the page you were trying to access.
>> If you continue to experience problems please contact your Redmine
>> administrator for assistance.
>>
>> If you are the Redmine administrator, check your log files for details
>> about the error.
>>
>>
>> On Thu, Oct 22, 2015 at 6:15 PM, Dan Mick  wrote:
>>> Fixed a configuration problem preventing updating issues, and switched
>>> the mailer to use ipv4; if you updated and failed, or missed an email
>>> notification, that may have been why.
>>>
>>> On 10/22/2015 04:51 PM, Dan Mick wrote:
 It's back.  New DNS info is propagating its way around.  If you
 absolutely must get to it, newtracker.ceph.com is the new address, but
 please don't bookmark that, as it will be going away after the transition.

 Please let me know of any problems you have.
>>>
>>
>>
>>
> 



Re: tracker.ceph.com downtime today

2015-10-22 Thread Dan Mick
I see that too.  I suspect this is because of leftover database columns
from the backlogs plugin, which is removed.  Looking into it.

On 10/22/2015 06:43 PM, Kyle Bader wrote:
> I tried to open a new issue and got this error:
> 
> Internal error
> 
> An error occurred on the page you were trying to access.
> If you continue to experience problems please contact your Redmine
> administrator for assistance.
> 
> If you are the Redmine administrator, check your log files for details
> about the error.
> 
> 
> On Thu, Oct 22, 2015 at 6:15 PM, Dan Mick  wrote:
>> Fixed a configuration problem preventing updating issues, and switched
>> the mailer to use ipv4; if you updated and failed, or missed an email
>> notification, that may have been why.
>>
>> On 10/22/2015 04:51 PM, Dan Mick wrote:
>>> It's back.  New DNS info is propagating its way around.  If you
>>> absolutely must get to it, newtracker.ceph.com is the new address, but
>>> please don't bookmark that, as it will be going away after the transition.
>>>
>>> Please let me know of any problems you have.
>>
>> ---
>> Note: This list is intended for discussions relating to Red Hat Storage 
>> products, customers and/or support. Discussions on GlusterFS and Ceph 
>> architecture, design and engineering should go to relevant upstream mailing 
>> lists.
> 
> 
> 

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: newstore direction

2015-10-22 Thread Ric Wheeler
I still disagree with your point - your argument was that customers don't like
to update their code, so we cannot rely on them moving to better file system
code.  Those same customers would be *just* as reluctant to upgrade OSD code.
Been there, done that in pure block storage, pure object storage and in file
system code (customers just don't care about the protocol; the conservative
nature is consistent).


Not a casual observation, I have been building storage systems since the 
mid-80's.

Regards,

Ric

On 10/21/2015 09:22 PM, Allen Samuels wrote:

I agree. My only point was that you still have to factor this time into the argument that 
by continuing to put NewStore on top of a file system you'll get to a stable system much 
sooner than the longer development path of doing your own raw storage allocator. IMO, 
once you factor that into the equation the "on top of an FS" path doesn't look 
like such a clear winner.


Allen Samuels
Software Architect, Fellow, Systems and Software Solutions

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samu...@sandisk.com

-Original Message-
From: Ric Wheeler [mailto:rwhee...@redhat.com]
Sent: Thursday, October 22, 2015 10:17 AM
To: Allen Samuels ; Sage Weil ; 
ceph-devel@vger.kernel.org
Subject: Re: newstore direction

On 10/21/2015 08:53 PM, Allen Samuels wrote:

Fixing the bug doesn't take a long time. Getting it deployed is where the delay is. Many 
companies standardize on a particular release of a particular distro. Getting them to 
switch to a new release -- even a "bug fix" point release -- is a major 
undertaking that often is a complete roadblock. Just my experience. YMMV.


Customers do control the pace at which they upgrade their machines, but we put
out fixes at a very regular pace.  A lot of customers will get fixes without
having to qualify a full new release (i.e., fixes that come out between major
and minor releases are easy to take).

If someone is deploying a critical server for storage, then it falls back on 
the storage software team to help guide them and encourage them to update when 
needed (and no promises of success, but people move if the win is big. If it is 
not, they can wait).

ric







--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: newstore direction

2015-10-22 Thread Ric Wheeler

On 10/22/2015 08:50 AM, Sage Weil wrote:

On Wed, 21 Oct 2015, Ric Wheeler wrote:

You will have to trust me on this as the Red Hat person who spoke to pretty
much all of our key customers about local file systems and storage - customers
all have migrated over to using normal file systems under Oracle/DB2.
Typically, they use XFS or ext4.  I don't know of any non-standard file
systems and only have seen one account running on a raw block store in 8 years
:)

If you have a pre-allocated file and write using O_DIRECT, your IO path is
identical in terms of IO's sent to the device.

If we are causing additional IO's, then we really need to spend some time
talking to the local file system gurus about this in detail.  I can help with
that conversation.

If the file is truly preallocated (that is, prewritten with zeros...
fallocate doesn't help here because the extents are marked unwritten), then
sure: there is very little change in the data path.

But at that point, what is the point?  This only works if you have one (or
a few) huge files and the user space app already has all the complexity of
a filesystem-like thing (with its own internal journal, allocators,
garbage collection, etc.).  Do they just do this to ease administrative
tasks like backup?


I think that the key here is that if we fsync() like crazy - regardless of
whether we write to a file system or to some new, yet-to-be-defined block
device primitive store - we are limited to the IOPS of that particular block device.


Ignoring exotic hardware configs for people who can go all-SSD, we will have
rotating, high-capacity, slow spinning drives as the eventual tier for *a long
time*.  Given that assumption, we need to do better than to be limited to
synchronous IOPS for a slow drive.  When we have commodity pricing for things
like persistent DRAM, then I agree that writing directly to that medium makes
sense (but you can do that with DAX by effectively mapping it into the process
address space).


Specifically, moving away from a file system with some inefficiencies will only
boost performance from, say, 20-30 IOPS to roughly 40-50 IOPS.


The way this has been handled traditionally for things like databases, etc. is:

* batch up the transactions that need to be destaged
* issue an O_DIRECT async IO for all of the elements that need to be written
(bypassing the page cache, direct to the backing store)

* wait for completion

We should probably add to that sequence an fsync() of the directory (or of a
file in the file system) to ensure that any volatile write cache is flushed, but
there is *no* reason to fsync() each file.
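
As a rough illustration of that pattern, a minimal sketch assuming libaio and a
preallocated data file (paths are made up, error handling omitted, build with
-laio):

#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NBATCH 16
#define BLK    4096

int main(void)
{
	io_context_t ctx = 0;
	struct iocb cbs[NBATCH], *cbp[NBATCH];
	struct io_event events[NBATCH];
	void *buf[NBATCH];
	int i, fd, dirfd;

	/* O_DIRECT bypasses the page cache; buffers must be block aligned */
	fd = open("/data/objects.img", O_WRONLY | O_DIRECT);
	dirfd = open("/data", O_RDONLY | O_DIRECTORY);
	io_setup(NBATCH, &ctx);

	/* batch up the writes that need to be destaged */
	for (i = 0; i < NBATCH; i++) {
		posix_memalign(&buf[i], BLK, BLK);
		memset(buf[i], 0xab, BLK);
		io_prep_pwrite(&cbs[i], fd, buf[i], BLK, (long long)i * BLK);
		cbp[i] = &cbs[i];
	}

	/* submit the whole batch at once, then wait for completion */
	io_submit(ctx, NBATCH, cbp);
	io_getevents(ctx, NBATCH, NBATCH, events, NULL);

	/* one fsync (here of the directory, per the sequence above) instead
	 * of an fsync per file */
	fsync(dirfd);

	io_destroy(ctx);
	close(fd);
	close(dirfd);
	return 0;
}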


I think that we need to look at why the write pattern is so heavily synchronous 
and single threaded if we are hoping to extract from any given storage tier its 
maximum performance.


Doing this can raise your file creations per second (or allocations per second) 
from a few dozen to a few hundred or more per second.


The complexity you take on by writing a new block-level allocation layer
(instead of leaving it to the file system) is:

* if you lay out a lot of small objects on the block store that can grow, we
will quickly end up reimplementing very complicated techniques that file systems
solved a long time ago (pre-allocation, etc.)
* multi-stream aware allocation if you have multiple processes writing to the
same store
* tracking things like allocated-but-unwritten extents (can happen if some process
"pokes" a hole in an object, common with things like virtual machine images)


Once we end up handling all of that in new, untested code, I think we end up
with a lot of pain and only minimal gain in terms of performance.


ric




This is the fundamental tradeoff:

1) We have a file per object.  We fsync like crazy and the fact that
there are two independent layers journaling and managing different types
of consistency penalizes us.

1b) We get clever and start using obscure and/or custom ioctls in the file
system to work around what it is used to: we swap extents to avoid
write-ahead (see Christoph's patch), O_NOMTIME, unprivileged
open-by-handle, batch fsync, O_ATOMIC, setext ioctl, etc.

2) We preallocate huge files and write a user-space object system that
lives within it (pretending the file is a block device).  The file system
rarely gets in the way (assuming the file is prewritten and we don't do
anything stupid).  But it doesn't give us anything a block device
wouldn't, and it doesn't save us any complexity in our code.

At the end of the day, 1 and 1b are always going to be slower than 2.
And although 1b performs a bit better than 1, it has similar (user-space)
complexity to 2.  On the other hand, if you step back and view the
entire stack (ceph-osd + XFS), 1 and 1b are *significantly* more complex
than 2... and yet still slower.  Given we ultimately have to support both
(both as an upstream and as a distro), that's not very attractive.

Also note that every time we have strayed off the reservation from the
beaten path (1) to anything mildly exotic 

Re: newstore direction

2015-10-22 Thread Howard Chu
Milosz Tanski  adfin.com> writes:

> 
> On Tue, Oct 20, 2015 at 4:00 PM, Sage Weil  redhat.com> wrote:
> > On Tue, 20 Oct 2015, John Spray wrote:
> >> On Mon, Oct 19, 2015 at 8:49 PM, Sage Weil  redhat.com> wrote:
> >> >  - We have to size the kv backend storage (probably still an XFS
> >> > partition) vs the block storage.  Maybe we do this anyway (put
metadata on
> >> > SSD!) so it won't matter.  But what happens when we are storing gobs of
> >> > rgw index data or cephfs metadata?  Suddenly we are pulling storage
out of
> >> > a different pool and those aren't currently fungible.
> >>
> >> This is the concerning bit for me -- the other parts one "just" has to
> >> get the code right, but this problem could linger and be something we
> >> have to keep explaining to users indefinitely.  It reminds me of cases
> >> in other systems where users had to make an educated guess about inode
> >> size up front, depending on whether you're expecting to efficiently
> >> store a lot of xattrs.
> >>
> >> In practice it's rare for users to make these kinds of decisions well
> >> up-front: it really needs to be adjustable later, ideally
> >> automatically.  That could be pretty straightforward if the KV part
> >> was stored directly on block storage, instead of having XFS in the
> >> mix.  I'm not quite up with the state of the art in this area: are
> >> there any reasonable alternatives for the KV part that would consume
> >> some defined range of a block device from userspace, instead of
> >> sitting on top of a filesystem?
> >
> > I agree: this is my primary concern with the raw block approach.
> >
> > There are some KV alternatives that could consume block, but the problem
> > would be similar: we need to dynamically size up or down the kv portion of
> > the device.
> >
> > I see two basic options:
> >
> > 1) Wire into the Env abstraction in rocksdb to provide something just
> > smart enough to let rocksdb work.  It isn't much: named files (not that
> > many--we could easily keep the file table in ram), always written
> > sequentially, to be read later with random access. All of the code is
> > written around abstractions of SequentialFileWriter so that everything
> > posix is neatly hidden in env_posix (and there are various other env
> > implementations for in-memory mock tests etc.).
> >
> > 2) Use something like dm-thin to sit between the raw block device and XFS
> > (for rocksdb) and the block device consumed by newstore.  As long as XFS
> > doesn't fragment horrifically (it shouldn't, given we *always* write ~4mb
> > files in their entirety) we can fstrim and size down the fs portion.  If
> > we similarly make newstores allocator stick to large blocks only we would
> > be able to size down the block portion as well.  Typical dm-thin block
> > sizes seem to range from 64KB to 512KB, which seems reasonable enough to
> > me.  In fact, we could likely just size the fs volume at something
> > conservatively large (like 90%) and rely on -o discard or periodic fstrim
> > to keep its actual utilization in check.
> >
> 
> I think you could prototype a raw block device OSD store using LMDB as
> a starting point. I know there's been some experiments using LMDB as
> KV store before with positive read numbers and not great write
> numbers.
> 
> 1. It mmaps, just mmap the raw disk device / partition. I've done this
> as an experiment before, I can dig up a patch for LMDB.
> 2. It already has a free space management strategy. I'm prob it's not
> right for the OSDs in the long term but there's something to start
> there with.
> 3. It's already supports transactions / COW.
> 4. LMDB isn't a huge code base so it might be a good place to start /
> evolve code from.
> 5. You're not starting a multi-year effort at the 0 point.
> 
> As to the not great write performance, that could be addressed by
> write transaction merging (what mysql implemented a few years ago).

We have a heavily hacked version of LMDB contributed by VMware that
implements a WAL. In my preliminary testing it performs synchronous writes
30x faster (on average) than current LMDB. Their version unfortunately
slashed'n'burned a lot of LMDB features that other folks actually need, so
we can't use it as-is. Currently working on rationalizing the approach and
merging it into mdb.master.

The reasons for the WAL approach:
  1) obviously sequential writes are cheaper than random writes.
  2) fsync() of a small log file will always be faster than fsync() of a
large DB. I.e., fsync() latency is proportional to the total number of pages
in the file, not just the number of dirty pages.

LMDB on a raw block device is a simpler proposition, and one we intend to
integrate soon as well. (Milosz, did you ever submit your changes?)

> Here you have an opportunity to do it two days. One, you can do it in
> the application layer while waiting for the fsync from transaction to
> complete. This is probably the easier route. Two, you can do it in the
> DB layer (the LMDB 

Re: MDS stuck in a crash loop

2015-10-22 Thread Milosz Tanski
On Wed, Oct 21, 2015 at 5:33 PM, John Spray  wrote:
> On Wed, Oct 21, 2015 at 10:33 PM, John Spray  wrote:
>>> John, I know you've got
>>> https://github.com/ceph/ceph-qa-suite/pull/647. I think that's
>>> supposed to be for this, but I'm not sure if you spotted any issues
>>> with it or if we need to do some more diagnosing?
>>
>> That test path is just verifying that we do handle dirs without dying
>> in at least one case -- it passes with the existing ceph code, so it's
>> not reproducing this issue.
>
> Clicked send to soon, I was about to add...
>
> Milosz mentioned that they don't have the data from the system in the
> broken state, so I don't have any bright ideas about learning more
> about what went wrong here unfortunately.
>

Sorry about that, wasn't thinking at the time and just wanted to get
this up and going as quickly as possible :(

If this happens next time I'll be more careful to keep more evidence.
I think support for multiple filesystems in the same rados namespace would
actually have helped here, since it makes it easier to create a new fs and
leave the other one around (for investigation).

But it makes me wonder whether the broken dir scenario can be replicated by
hand using rados calls. There's a pretty generic "don't die on dir errors"
ticket, but I imagine the code can be audited and steps to cause a synthetic
error can be produced.

-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: mil...@adfin.com
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: newstore direction

2015-10-22 Thread Sage Weil
On Wed, 21 Oct 2015, Ric Wheeler wrote:
> You will have to trust me on this as the Red Hat person who spoke to pretty
> much all of our key customers about local file systems and storage - customers
> all have migrated over to using normal file systems under Oracle/DB2.
> Typically, they use XFS or ext4.  I don't know of any non-standard file
> systems and only have seen one account running on a raw block store in 8 years
> :)
> 
> If you have a pre-allocated file and write using O_DIRECT, your IO path is
> identical in terms of IO's sent to the device.
> 
> If we are causing additional IO's, then we really need to spend some time
> talking to the local file system gurus about this in detail.  I can help with
> that conversation.

If the file is truly preallocated (that is, prewritten with zeros... 
fallocate doesn't help here because the extents are marked unwritten), then 
sure: there is very little change in the data path.

But at that point, what is the point?  This only works if you have one (or 
a few) huge files and the user space app already has all the complexity of 
a filesystem-like thing (with its own internal journal, allocators, 
garbage collection, etc.).  Do they just do this to ease administrative 
tasks like backup?


This is the fundamental tradeoff:

1) We have a file per object.  We fsync like crazy and the fact that 
there are two independent layers journaling and managing different types 
of consistency penalizes us.

1b) We get clever and start using obscure and/or custom ioctls in the file 
system to work around what it is used to: we swap extents to avoid 
write-ahead (see Christoph's patch), O_NOMTIME, unprivileged 
open-by-handle, batch fsync, O_ATOMIC, setext ioctl, etc.

2) We preallocate huge files and write a user-space object system that 
lives within it (pretending the file is a block device).  The file system 
rarely gets in the way (assuming the file is prewritten and we don't do 
anything stupid).  But it doesn't give us anything a block device 
wouldn't, and it doesn't save us any complexity in our code.

At the end of the day, 1 and 1b are always going to be slower than 2.  
And although 1b performs a bit better than 1, it has similar (user-space) 
complexity to 2.  On the other hand, if you step back and view the 
entire stack (ceph-osd + XFS), 1 and 1b are *significantly* more complex 
than 2... and yet still slower.  Given we ultimately have to support both 
(both as an upstream and as a distro), that's not very attractive.

Also note that every time we have strayed off the reservation from the 
beaten path (1) to anything mildly exotic (1b) we have been bitten by 
obscure file system bugs.  And that's assuming we get everything we need 
upstream... which is probably a year's endeavour.

Don't get me wrong: I'm all for making changes to file systems to better 
support systems like Ceph.  Things like O_NOCMTIME and O_ATOMIC make a 
huge amount of sense for a ton of different systems.  But our situation is 
a bit different: we always own the entire device (and often the server), 
so there is no need to share with other users or apps (and when you do, 
you just use the existing FileStore backend).  And as you know performance 
is a huge pain point.  We are already handicapped by virtue of being 
distributed and strongly consistent; we can't afford to give away more to 
a storage layer that isn't providing us much (or the right) value.

And I'm tired of half measures.  I want the OSD to be as fast as we can 
make it given the architectural constraints (RADOS consistency and 
ordering semantics).  This is truly low-hanging fruit: it's modular, 
self-contained, pluggable, and this will be my third time around this 
particular block.

sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: MDS stuck in a crash loop

2015-10-22 Thread Sage Weil
On Thu, 22 Oct 2015, John Spray wrote:
> On Thu, Oct 22, 2015 at 1:43 PM, Milosz Tanski  wrote:
> > On Wed, Oct 21, 2015 at 5:33 PM, John Spray  wrote:
> >> On Wed, Oct 21, 2015 at 10:33 PM, John Spray  wrote:
>  John, I know you've got
>  https://github.com/ceph/ceph-qa-suite/pull/647. I think that's
>  supposed to be for this, but I'm not sure if you spotted any issues
>  with it or if we need to do some more diagnosing?
> >>>
> >>> That test path is just verifying that we do handle dirs without dying
> >>> in at least one case -- it passes with the existing ceph code, so it's
> >>> not reproducing this issue.
> >>
> >> Clicked send to soon, I was about to add...
> >>
> >> Milosz mentioned that they don't have the data from the system in the
> >> broken state, so I don't have any bright ideas about learning more
> >> about what went wrong here unfortunately.
> >>
> >
> > Sorry about that, wasn't thinking at the time and just wanted to get
> > this up and going as quickly as possible :(
> >
> > If this happens next time I'll be more careful to keep more evidence.
> > I think multi-fs in the same rados namespace support would actually
> > helped here, since it makes it easier to create a newfs and leave the
> > other one around (for investigation)
> 
> Yep, good point.  I am a known enthusiast for multi-filesystem support :-)

A rados pool export on the metadata pool would have helped, too.  That 
doesn't include data object backtrace metadata, though.  I wonder if we 
should make a cephfs metadata imaging tool (similar to the tools that are 
available for xfs) that captures both: the metadata state of the file system 
and the backtrace metadata.  On the data pool side it'd just record the object 
names, xattrs, and object size, ignoring the data.

It wouldn't anonymize filenames (that is tricky without breaking the mds 
dir hashing), but it excludes data and would probably be 
sufficient for most users...

sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: newstore direction

2015-10-22 Thread Milosz Tanski
On Tue, Oct 20, 2015 at 4:00 PM, Sage Weil  wrote:
> On Tue, 20 Oct 2015, John Spray wrote:
>> On Mon, Oct 19, 2015 at 8:49 PM, Sage Weil  wrote:
>> >  - We have to size the kv backend storage (probably still an XFS
>> > partition) vs the block storage.  Maybe we do this anyway (put metadata on
>> > SSD!) so it won't matter.  But what happens when we are storing gobs of
>> > rgw index data or cephfs metadata?  Suddenly we are pulling storage out of
>> > a different pool and those aren't currently fungible.
>>
>> This is the concerning bit for me -- the other parts one "just" has to
>> get the code right, but this problem could linger and be something we
>> have to keep explaining to users indefinitely.  It reminds me of cases
>> in other systems where users had to make an educated guess about inode
>> size up front, depending on whether you're expecting to efficiently
>> store a lot of xattrs.
>>
>> In practice it's rare for users to make these kinds of decisions well
>> up-front: it really needs to be adjustable later, ideally
>> automatically.  That could be pretty straightforward if the KV part
>> was stored directly on block storage, instead of having XFS in the
>> mix.  I'm not quite up with the state of the art in this area: are
>> there any reasonable alternatives for the KV part that would consume
>> some defined range of a block device from userspace, instead of
>> sitting on top of a filesystem?
>
> I agree: this is my primary concern with the raw block approach.
>
> There are some KV alternatives that could consume block, but the problem
> would be similar: we need to dynamically size up or down the kv portion of
> the device.
>
> I see two basic options:
>
> 1) Wire into the Env abstraction in rocksdb to provide something just
> smart enough to let rocksdb work.  It isn't much: named files (not that
> many--we could easily keep the file table in ram), always written
> sequentially, to be read later with random access. All of the code is
> written around abstractions of SequentialFileWriter so that everything
> posix is neatly hidden in env_posix (and there are various other env
> implementations for in-memory mock tests etc.).
>
> 2) Use something like dm-thin to sit between the raw block device and XFS
> (for rocksdb) and the block device consumed by newstore.  As long as XFS
> doesn't fragment horrifically (it shouldn't, given we *always* write ~4mb
> files in their entirety) we can fstrim and size down the fs portion.  If
> we similarly make newstores allocator stick to large blocks only we would
> be able to size down the block portion as well.  Typical dm-thin block
> sizes seem to range from 64KB to 512KB, which seems reasonable enough to
> me.  In fact, we could likely just size the fs volume at something
> conservatively large (like 90%) and rely on -o discard or periodic fstrim
> to keep its actual utilization in check.
>

I think you could prototype a raw block device OSD store using LMDB as
a starting point. I know there have been some experiments using LMDB as
a KV store before, with positive read numbers and not-so-great write
numbers.

1. It mmaps, so just mmap the raw disk device / partition. I've done this
as an experiment before; I can dig up a patch for LMDB.
2. It already has a free space management strategy. It's probably not
right for the OSDs in the long term, but there's something to start
with there.
3. It already supports transactions / COW.
4. LMDB isn't a huge code base, so it might be a good place to start /
evolve code from.
5. You're not starting a multi-year effort from zero.

As to the not-so-great write performance, that could be addressed by
write transaction merging (what mysql implemented a few years ago).
Here you have an opportunity to do it two ways. One, you can do it in
the application layer while waiting for the fsync of a transaction to
complete. This is probably the easier route. Two, you can do it in the
DB layer (the LMDB transaction handling / locking), where you start
processing the following transactions using the currently committing
transaction (COW) as a starting point. This is harder, mostly because
of the synchronization involved.

I've actually spent some time thinking about doing LMDB write
transaction merging outside the OSD context. This was for another
project.
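
For reference, the basic write-transaction shape that such merging would batch
up - a minimal sketch against the stock LMDB C API (env path and key/value are
made up, error handling omitted, build with -llmdb):

#include <lmdb.h>
#include <string.h>

int main(void)
{
	MDB_env *env;
	MDB_txn *txn;
	MDB_dbi dbi;
	MDB_val key, val;

	mdb_env_create(&env);
	mdb_env_set_mapsize(env, 1UL << 30);    /* 1 GB map */
	mdb_env_open(env, "./testdb", 0, 0644); /* directory must exist */

	/* every write is a COW transaction; with default flags the commit
	 * is synced, which is where the fsync cost sits */
	mdb_txn_begin(env, NULL, 0, &txn);
	mdb_dbi_open(txn, NULL, 0, &dbi);

	key.mv_data = "object_0001";
	key.mv_size = strlen(key.mv_data);
	val.mv_data = "some object bytes";
	val.mv_size = strlen(val.mv_data);
	mdb_put(txn, dbi, &key, &val, 0);

	mdb_txn_commit(txn);
	mdb_env_close(env);
	return 0;
}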

My 2 cents.

-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: mil...@adfin.com
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: MDS stuck in a crash loop

2015-10-22 Thread John Spray
On Thu, Oct 22, 2015 at 1:43 PM, Milosz Tanski  wrote:
> On Wed, Oct 21, 2015 at 5:33 PM, John Spray  wrote:
>> On Wed, Oct 21, 2015 at 10:33 PM, John Spray  wrote:
 John, I know you've got
 https://github.com/ceph/ceph-qa-suite/pull/647. I think that's
 supposed to be for this, but I'm not sure if you spotted any issues
 with it or if we need to do some more diagnosing?
>>>
>>> That test path is just verifying that we do handle dirs without dying
>>> in at least one case -- it passes with the existing ceph code, so it's
>>> not reproducing this issue.
>>
>> Clicked send to soon, I was about to add...
>>
>> Milosz mentioned that they don't have the data from the system in the
>> broken state, so I don't have any bright ideas about learning more
>> about what went wrong here unfortunately.
>>
>
> Sorry about that, wasn't thinking at the time and just wanted to get
> this up and going as quickly as possible :(
>
> If this happens next time I'll be more careful to keep more evidence.
> I think multi-fs in the same rados namespace support would actually
> helped here, since it makes it easier to create a newfs and leave the
> other one around (for investigation)

Yep, good point.  I am a known enthusiast for multi-filesystem support :-)

> But makes me wonder that the broken dir scenario can probably be
> replicated by hand using rados calls. There's a pretty generic ticket
> there for don't die on dir errors, but I imagine the code can be
> audited and steps to cause a synthetic error can be produced.

Yes, that part I have done (and will build into the automated tests in
due course) -- the bit that is still a mystery is how the damage
occurred to begin with.

John

>
> --
> Milosz Tanski
> CTO
> 16 East 34th Street, 15th floor
> New York, NY 10016
>
> p: 646-253-9055
> e: mil...@adfin.com
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/3] Minor cleanup for locks API

2015-10-22 Thread Benjamin Coddington
NFS has recently been moving things around to cope with the situation where
a struct file may not be available during an unlock.  That work has
presented an opportunity to do a minor cleanup on the locks API.

Users of posix_lock_file_wait() (for FL_POSIX style locks) and
flock_lock_file_wait() (for FL_FLOCK style locks) can instead call
locks_lock_file_wait() for both lock types.  Because the passed-in file_lock
specifies its own type, the correct function can be selected on behalf of
the user.

This work allows further cleanup within NFS and lockd which will be
submitted separately.

Benjamin Coddington (3):
  locks: introduce locks_lock_inode_wait()
  Move locks API users to locks_lock_inode_wait()
  locks: cleanup posix_lock_inode_wait and flock_lock_inode_wait

 drivers/staging/lustre/lustre/llite/file.c |8 +-
 fs/9p/vfs_file.c   |4 +-
 fs/ceph/locks.c|4 +-
 fs/cifs/file.c |2 +-
 fs/dlm/plock.c |4 +-
 fs/fuse/file.c |2 +-
 fs/gfs2/file.c |8 +++---
 fs/lockd/clntproc.c|   13 +--
 fs/locks.c |   31 +++
 fs/nfs/file.c  |   13 +--
 fs/nfs/nfs4proc.c  |   13 +--
 fs/ocfs2/locks.c   |8 +++---
 include/linux/fs.h |   21 +++---
 13 files changed, 51 insertions(+), 80 deletions(-)
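
For illustration, the shape of the change at a typical call site - a sketch
modeled on the conversions in patch 2/3, not lifted from any one filesystem:

	/* before: each caller picks the helper based on the lock type */
	if (fl->fl_flags & FL_FLOCK)
		err = flock_lock_file_wait(file, fl);
	else if (fl->fl_flags & FL_POSIX)
		err = posix_lock_file_wait(file, fl);

	/* after: locks_lock_file_wait() keys off fl->fl_flags itself */
	err = locks_lock_file_wait(file, fl);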

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/3] locks: cleanup posix_lock_inode_wait and flock_lock_inode_wait

2015-10-22 Thread Benjamin Coddington
All callers use locks_lock_inode_wait() instead.

Signed-off-by: Benjamin Coddington 
---
 fs/locks.c |5 +
 include/linux/fs.h |   24 
 2 files changed, 1 insertions(+), 28 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index 94d50d3..b6f3c92 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -1167,8 +1167,7 @@ EXPORT_SYMBOL(posix_lock_file);
  * @inode: inode of file to which lock request should be applied
  * @fl: The lock to be applied
  *
- * Variant of posix_lock_file_wait that does not take a filp, and so can be
- * used after the filp has already been torn down.
+ * Apply a POSIX style lock request to an inode.
  */
 int posix_lock_inode_wait(struct inode *inode, struct file_lock *fl)
 {
@@ -1187,7 +1186,6 @@ int posix_lock_inode_wait(struct inode *inode, struct 
file_lock *fl)
}
return error;
 }
-EXPORT_SYMBOL(posix_lock_inode_wait);
 
 /**
  * locks_mandatory_locked - Check for an active lock
@@ -1873,7 +1871,6 @@ int flock_lock_inode_wait(struct inode *inode, struct 
file_lock *fl)
}
return error;
 }
-EXPORT_SYMBOL(flock_lock_inode_wait);
 
 /**
  * locks_lock_inode_wait - Apply a lock to an inode
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2e283b7..05b07c9 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1053,12 +1053,10 @@ extern void locks_remove_file(struct file *);
 extern void locks_release_private(struct file_lock *);
 extern void posix_test_lock(struct file *, struct file_lock *);
 extern int posix_lock_file(struct file *, struct file_lock *, struct file_lock 
*);
-extern int posix_lock_inode_wait(struct inode *, struct file_lock *);
 extern int posix_unblock_lock(struct file_lock *);
 extern int vfs_test_lock(struct file *, struct file_lock *);
 extern int vfs_lock_file(struct file *, unsigned int, struct file_lock *, 
struct file_lock *);
 extern int vfs_cancel_lock(struct file *filp, struct file_lock *fl);
-extern int flock_lock_inode_wait(struct inode *inode, struct file_lock *fl);
 extern int locks_lock_inode_wait(struct inode *inode, struct file_lock *fl);
 extern int __break_lease(struct inode *inode, unsigned int flags, unsigned int 
type);
 extern void lease_get_mtime(struct inode *, struct timespec *time);
@@ -1145,12 +1143,6 @@ static inline int posix_lock_file(struct file *filp, 
struct file_lock *fl,
return -ENOLCK;
 }
 
-static inline int posix_lock_inode_wait(struct inode *inode,
-   struct file_lock *fl)
-{
-   return -ENOLCK;
-}
-
 static inline int posix_unblock_lock(struct file_lock *waiter)
 {
return -ENOENT;
@@ -1172,12 +1164,6 @@ static inline int vfs_cancel_lock(struct file *filp, 
struct file_lock *fl)
return 0;
 }
 
-static inline int flock_lock_inode_wait(struct inode *inode,
-   struct file_lock *request)
-{
-   return -ENOLCK;
-}
-
 static inline int locks_lock_file_wait(struct file *filp, struct file_lock *fl)
 {
return -ENOLCK;
@@ -1221,16 +1207,6 @@ static inline struct inode *file_inode(const struct file 
*f)
return f->f_inode;
 }
 
-static inline int posix_lock_file_wait(struct file *filp, struct file_lock *fl)
-{
-   return posix_lock_inode_wait(file_inode(filp), fl);
-}
-
-static inline int flock_lock_file_wait(struct file *filp, struct file_lock *fl)
-{
-   return flock_lock_inode_wait(file_inode(filp), fl);
-}
-
 static inline int locks_lock_file_wait(struct file *filp, struct file_lock *fl)
 {
return locks_lock_inode_wait(file_inode(filp), fl);
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: MDS stuck in a crash loop

2015-10-22 Thread Milosz Tanski
On Thu, Oct 22, 2015 at 8:48 AM, John Spray  wrote:
> On Thu, Oct 22, 2015 at 1:43 PM, Milosz Tanski  wrote:
>> On Wed, Oct 21, 2015 at 5:33 PM, John Spray  wrote:
>>> On Wed, Oct 21, 2015 at 10:33 PM, John Spray  wrote:
> John, I know you've got
> https://github.com/ceph/ceph-qa-suite/pull/647. I think that's
> supposed to be for this, but I'm not sure if you spotted any issues
> with it or if we need to do some more diagnosing?

 That test path is just verifying that we do handle dirs without dying
 in at least one case -- it passes with the existing ceph code, so it's
 not reproducing this issue.
>>>
>>> Clicked send to soon, I was about to add...
>>>
>>> Milosz mentioned that they don't have the data from the system in the
>>> broken state, so I don't have any bright ideas about learning more
>>> about what went wrong here unfortunately.
>>>
>>
>> Sorry about that, wasn't thinking at the time and just wanted to get
>> this up and going as quickly as possible :(
>>
>> If this happens next time I'll be more careful to keep more evidence.
>> I think multi-fs in the same rados namespace support would actually
>> helped here, since it makes it easier to create a newfs and leave the
>> other one around (for investigation)
>
> Yep, good point.  I am a known enthusiast for multi-filesystem support :-)
>
>> But makes me wonder that the broken dir scenario can probably be
>> replicated by hand using rados calls. There's a pretty generic ticket
>> there for don't die on dir errors, but I imagine the code can be
>> audited and steps to cause a synthetic error can be produced.
>
> Yes, that part I have done (and will build into the automated tests in
> due course) -- the bit that is still a mystery is how the damage
> occurred to begin with.

John, my money is on me somehow fumbling the recovery process. And,
with the bash history having rolled off, I'm going to assume that.

-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: mil...@adfin.com
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] mark rbd requiring stable pages

2015-10-22 Thread Mike Christie
On 10/22/2015 06:20 AM, Ilya Dryomov wrote:
> 
>> >
>> > If we are just talking about if stable pages are not used, and someone
>> > is re-writing data to a page after the page has already been submitted
>> > to the block layer (I mean the page is on some bio which is on a request
>> > which is on some request_queue scheduler list or basically anywhere in
>> > the block layer), then I was saying this can occur with any block
>> > driver. There is nothing that is preventing this from happening with a
>> > FC driver or nvme or cciss or in dm or whatever. The app/user can
>> > rewrite as late as when we are in the make_request_fn/request_fn.
>> >
>> > I think I am misunderstanding your question because I thought this is
>> > expected behavior, and there is nothing drivers can do if the app is not
>> > doing a flush/sync between these types of write sequences.
> I don't see a problem with rewriting as late as when we are in
> request_fn() (or in a wq after being put there by request_fn()).  Where
> I thought there *might* be an issue is rewriting after sendpage(), if
> sendpage() is used - perhaps some sneaky sequence similar to that
> retransmit bug that would cause us to *transmit* incorrect bytes (as
> opposed to *re*transmit) or something of that nature?


Just to make sure we are on the same page.

Are you concerned about the tcp/net layer retransmitting due to it
detecting an issue as part of the tcp protocol, or are you concerned
about rbd/libceph initiating a retry like with the nfs issue?
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] net: ceph: osd_client: change osd_req_op_data() macro

2015-10-22 Thread Ioana Ciornei
This patch changes the osd_req_op_data() macro to not evaluate
parameters more than once in order to follow the kernel coding style.

Signed-off-by: Ioana Ciornei 
Reviewed-by: Alex Elder 
---
 net/ceph/osd_client.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index a362d7e..856e8f8 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -120,10 +120,12 @@ static void ceph_osd_data_bio_init(struct ceph_osd_data 
*osd_data,
 }
 #endif /* CONFIG_BLOCK */
 
-#define osd_req_op_data(oreq, whch, typ, fld)  \
-   ({  \
-   BUG_ON(whch >= (oreq)->r_num_ops);  \
-   &(oreq)->r_ops[whch].typ.fld;   \
+#define osd_req_op_data(oreq, whch, typ, fld)\
+   ({\
+   struct ceph_osd_request *__oreq = (oreq); \
+   unsigned int __whch = (whch);   \
+   BUG_ON(__whch >= __oreq->r_num_ops);  \
+   &__oreq->r_ops[__whch].typ.fld;   \
})
 
 static struct ceph_osd_data *
-- 
2.1.4
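
For what it's worth, a hypothetical caller (not from the ceph tree) shows why
the single-evaluation form matters:

	/* hypothetical: advance 'which' while walking the ops array */
	struct ceph_osd_data *d = osd_req_op_data(req, which++, extent, osd_data);

	/* with the old macro, 'which++' is expanded twice -- once in the
	 * BUG_ON() and once in the r_ops[] access -- so 'which' is bumped
	 * twice and the bounds check can test a different index than the
	 * one actually used; the new macro evaluates each argument once */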

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/3] Move locks API users to locks_lock_inode_wait()

2015-10-22 Thread Benjamin Coddington
Instead of having users check for FL_POSIX or FL_FLOCK to call the correct
locks API function, use the check within locks_lock_inode_wait().  This
allows for some later cleanup.

Signed-off-by: Benjamin Coddington 
---
 drivers/staging/lustre/lustre/llite/file.c |8 ++--
 fs/9p/vfs_file.c   |4 ++--
 fs/ceph/locks.c|4 ++--
 fs/cifs/file.c |2 +-
 fs/dlm/plock.c |4 ++--
 fs/fuse/file.c |2 +-
 fs/gfs2/file.c |8 
 fs/lockd/clntproc.c|   13 +
 fs/locks.c |2 +-
 fs/nfs/file.c  |   13 +
 fs/nfs/nfs4proc.c  |   13 +
 fs/ocfs2/locks.c   |8 
 12 files changed, 22 insertions(+), 59 deletions(-)

diff --git a/drivers/staging/lustre/lustre/llite/file.c 
b/drivers/staging/lustre/lustre/llite/file.c
index dcd0c6d..4edbf46 100644
--- a/drivers/staging/lustre/lustre/llite/file.c
+++ b/drivers/staging/lustre/lustre/llite/file.c
@@ -2763,13 +2763,9 @@ ll_file_flock(struct file *file, int cmd, struct 
file_lock *file_lock)
rc = md_enqueue(sbi->ll_md_exp, , NULL,
op_data, , , 0, NULL /* req */, flags);
 
-   if ((file_lock->fl_flags & FL_FLOCK) &&
-   (rc == 0 || file_lock->fl_type == F_UNLCK))
-   rc2  = flock_lock_file_wait(file, file_lock);
-   if ((file_lock->fl_flags & FL_POSIX) &&
-   (rc == 0 || file_lock->fl_type == F_UNLCK) &&
+   if ((rc == 0 || file_lock->fl_type == F_UNLCK) &&
!(flags & LDLM_FL_TEST_LOCK))
-   rc2  = posix_lock_file_wait(file, file_lock);
+   rc2  = locks_lock_file_wait(file, file_lock);
 
if (rc2 && file_lock->fl_type != F_UNLCK) {
einfo.ei_mode = LCK_NL;
diff --git a/fs/9p/vfs_file.c b/fs/9p/vfs_file.c
index 3abc447..f23fd86 100644
--- a/fs/9p/vfs_file.c
+++ b/fs/9p/vfs_file.c
@@ -161,7 +161,7 @@ static int v9fs_file_do_lock(struct file *filp, int cmd, 
struct file_lock *fl)
if ((fl->fl_flags & FL_POSIX) != FL_POSIX)
BUG();
 
-   res = posix_lock_file_wait(filp, fl);
+   res = locks_lock_file_wait(filp, fl);
if (res < 0)
goto out;
 
@@ -231,7 +231,7 @@ out_unlock:
if (res < 0 && fl->fl_type != F_UNLCK) {
fl_type = fl->fl_type;
fl->fl_type = F_UNLCK;
-   res = posix_lock_file_wait(filp, fl);
+   res = locks_lock_file_wait(filp, fl);
fl->fl_type = fl_type;
}
 out:
diff --git a/fs/ceph/locks.c b/fs/ceph/locks.c
index 6706bde..a2cb0c2 100644
--- a/fs/ceph/locks.c
+++ b/fs/ceph/locks.c
@@ -228,12 +228,12 @@ int ceph_flock(struct file *file, int cmd, struct 
file_lock *fl)
err = ceph_lock_message(CEPH_LOCK_FLOCK, CEPH_MDS_OP_SETFILELOCK,
file, lock_cmd, wait, fl);
if (!err) {
-   err = flock_lock_file_wait(file, fl);
+   err = locks_lock_file_wait(file, fl);
if (err) {
ceph_lock_message(CEPH_LOCK_FLOCK,
  CEPH_MDS_OP_SETFILELOCK,
  file, CEPH_LOCK_UNLOCK, 0, fl);
-   dout("got %d on flock_lock_file_wait, undid lock", err);
+   dout("got %d on locks_lock_file_wait, undid lock", err);
}
}
return err;
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index e2a6af1..6afdad7 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -1553,7 +1553,7 @@ cifs_setlk(struct file *file, struct file_lock *flock, 
__u32 type,
 
 out:
if (flock->fl_flags & FL_POSIX && !rc)
-   rc = posix_lock_file_wait(file, flock);
+   rc = locks_lock_file_wait(file, flock);
return rc;
 }
 
diff --git a/fs/dlm/plock.c b/fs/dlm/plock.c
index 5532f09..3585cc0 100644
--- a/fs/dlm/plock.c
+++ b/fs/dlm/plock.c
@@ -172,7 +172,7 @@ int dlm_posix_lock(dlm_lockspace_t *lockspace, u64 number, 
struct file *file,
rv = op->info.rv;
 
if (!rv) {
-   if (posix_lock_file_wait(file, fl) < 0)
+   if (locks_lock_file_wait(file, fl) < 0)
log_error(ls, "dlm_posix_lock: vfs lock error %llx",
  (unsigned long long)number);
}
@@ -262,7 +262,7 @@ int dlm_posix_unlock(dlm_lockspace_t *lockspace, u64 
number, struct file *file,
/* cause the vfs unlock to return ENOENT if lock is not found */
fl->fl_flags |= FL_EXISTS;
 
-   rv = posix_lock_file_wait(file, fl);
+   rv = locks_lock_file_wait(file, fl);
if (rv == -ENOENT) {
rv = 0;
goto 

[PATCH 1/3] locks: introduce locks_lock_inode_wait()

2015-10-22 Thread Benjamin Coddington
Users of the locks API commonly call either posix_lock_file_wait() or
flock_lock_file_wait() depending upon the lock type.  Add a new function
locks_lock_inode_wait() which will check and call the correct function for
the type of lock passed in.

Signed-off-by: Benjamin Coddington 
---
 fs/locks.c |   24 
 include/linux/fs.h |   11 +++
 2 files changed, 35 insertions(+), 0 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index 2a54c80..68b1784 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -1876,6 +1876,30 @@ int flock_lock_inode_wait(struct inode *inode, struct 
file_lock *fl)
 EXPORT_SYMBOL(flock_lock_inode_wait);
 
 /**
+ * locks_lock_inode_wait - Apply a lock to an inode
+ * @inode: inode of the file to apply to
+ * @fl: The lock to be applied
+ *
+ * Apply a POSIX or FLOCK style lock request to an inode.
+ */
+int locks_lock_inode_wait(struct inode *inode, struct file_lock *fl)
+{
+   int res = 0;
+   switch (fl->fl_flags & (FL_POSIX|FL_FLOCK)) {
+   case FL_POSIX:
+   res = posix_lock_inode_wait(inode, fl);
+   break;
+   case FL_FLOCK:
+   res = flock_lock_inode_wait(inode, fl);
+   break;
+   default:
+   BUG();
+   }
+   return res;
+}
+EXPORT_SYMBOL(locks_lock_inode_wait);
+
+/**
  * sys_flock: - flock() system call.
  * @fd: the file descriptor to lock.
  * @cmd: the type of lock to apply.
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 72d8a84..2e283b7 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1059,6 +1059,7 @@ extern int vfs_test_lock(struct file *, struct file_lock 
*);
 extern int vfs_lock_file(struct file *, unsigned int, struct file_lock *, 
struct file_lock *);
 extern int vfs_cancel_lock(struct file *filp, struct file_lock *fl);
 extern int flock_lock_inode_wait(struct inode *inode, struct file_lock *fl);
+extern int locks_lock_inode_wait(struct inode *inode, struct file_lock *fl);
 extern int __break_lease(struct inode *inode, unsigned int flags, unsigned int 
type);
 extern void lease_get_mtime(struct inode *, struct timespec *time);
 extern int generic_setlease(struct file *, long, struct file_lock **, void 
**priv);
@@ -1177,6 +1178,11 @@ static inline int flock_lock_inode_wait(struct inode 
*inode,
return -ENOLCK;
 }
 
+static inline int locks_lock_file_wait(struct file *filp, struct file_lock *fl)
+{
+   return -ENOLCK;
+}
+
 static inline int __break_lease(struct inode *inode, unsigned int mode, 
unsigned int type)
 {
return 0;
@@ -1225,6 +1231,11 @@ static inline int flock_lock_file_wait(struct file 
*filp, struct file_lock *fl)
return flock_lock_inode_wait(file_inode(filp), fl);
 }
 
+static inline int locks_lock_file_wait(struct file *filp, struct file_lock *fl)
+{
+   return locks_lock_inode_wait(file_inode(filp), fl);
+}
+
 struct fasync_struct {
spinlock_t  fa_lock;
int magic;
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] locks: cleanup posix_lock_inode_wait and flock_lock_inode_wait

2015-10-22 Thread kbuild test robot
Hi Benjamin,

[auto build test WARNING on jlayton/linux-next -- if it's inappropriate base, 
please suggest rules for selecting the more suitable base]

url:
https://github.com/0day-ci/linux/commits/Benjamin-Coddington/locks-introduce-locks_lock_inode_wait/20151022-233848
reproduce:
# apt-get install sparse
make ARCH=x86_64 allmodconfig
make C=1 CF=-D__CHECK_ENDIAN__


sparse warnings: (new ones prefixed by >>)

>> fs/locks.c:1176:5: sparse: symbol 'posix_lock_inode_wait' was not declared. 
>> Should it be static?
>> fs/locks.c:1863:5: sparse: symbol 'flock_lock_inode_wait' was not declared. 
>> Should it be static?

Please review and possibly fold the followup patch.

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH] locks: posix_lock_inode_wait() can be static

2015-10-22 Thread kbuild test robot

Signed-off-by: Fengguang Wu 
---
 locks.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index daf4664..0d2b326 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -1173,7 +1173,7 @@ EXPORT_SYMBOL(posix_lock_file);
  *
  * Apply a POSIX style lock request to an inode.
  */
-int posix_lock_inode_wait(struct inode *inode, struct file_lock *fl)
+static int posix_lock_inode_wait(struct inode *inode, struct file_lock *fl)
 {
int error;
might_sleep ();
@@ -1860,7 +1860,7 @@ int fcntl_setlease(unsigned int fd, struct file *filp, 
long arg)
  *
  * Apply a FLOCK style lock request to an inode.
  */
-int flock_lock_inode_wait(struct inode *inode, struct file_lock *fl)
+static int flock_lock_inode_wait(struct inode *inode, struct file_lock *fl)
 {
int error;
might_sleep();
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Ceph erasure coding

2015-10-22 Thread Kjetil Babington
Hi,

I have a question about the capabilities of the erasure coding API in
Ceph. Let's say that I have 10 data disks and 4 parity disks: is it
possible to create an erasure coding plugin which creates 20 data
chunks and 8 parity chunks, and then places two chunks on each OSD?

Or, put a bit more simply: is it possible for two or more chunks from
the same encode operation to be placed on the same OSD?

- Kjetil Babington
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] locks: introduce locks_lock_inode_wait()

2015-10-22 Thread Benjamin Coddington
On Thu, 22 Oct 2015, Benjamin Coddington wrote:

> Users of the locks API commonly call either posix_lock_file_wait() or
> flock_lock_file_wait() depending upon the lock type.  Add a new function
> locks_lock_inode_wait() which will check and call the correct function for
> the type of lock passed in.
>
> Signed-off-by: Benjamin Coddington 
> ---
>  fs/locks.c |   24 
>  include/linux/fs.h |   11 +++
>  2 files changed, 35 insertions(+), 0 deletions(-)
>
> diff --git a/fs/locks.c b/fs/locks.c
> index 2a54c80..68b1784 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -1876,6 +1876,30 @@ int flock_lock_inode_wait(struct inode *inode, struct 
> file_lock *fl)
>  EXPORT_SYMBOL(flock_lock_inode_wait);
>
>  /**
> + * locks_lock_inode_wait - Apply a lock to an inode
> + * @inode: inode of the file to apply to
> + * @fl: The lock to be applied
> + *
> + * Apply a POSIX or FLOCK style lock request to an inode.
> + */
> +int locks_lock_inode_wait(struct inode *inode, struct file_lock *fl)
> +{
> + int res = 0;
> + switch (fl->fl_flags & (FL_POSIX|FL_FLOCK)) {
> + case FL_POSIX:
> + res = posix_lock_inode_wait(inode, fl);
> + break;
> + case FL_FLOCK:
> + res = flock_lock_inode_wait(inode, fl);
> + break;
> + default:
> + BUG();
> + }
> + return res;
> +}
> +EXPORT_SYMBOL(locks_lock_inode_wait);
> +
> +/**
>   *   sys_flock: - flock() system call.
>   *   @fd: the file descriptor to lock.
>   *   @cmd: the type of lock to apply.
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 72d8a84..2e283b7 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1059,6 +1059,7 @@ extern int vfs_test_lock(struct file *, struct 
> file_lock *);
>  extern int vfs_lock_file(struct file *, unsigned int, struct file_lock *, 
> struct file_lock *);
>  extern int vfs_cancel_lock(struct file *filp, struct file_lock *fl);
>  extern int flock_lock_inode_wait(struct inode *inode, struct file_lock *fl);
> +extern int locks_lock_inode_wait(struct inode *inode, struct file_lock *fl);
>  extern int __break_lease(struct inode *inode, unsigned int flags, unsigned 
> int type);
>  extern void lease_get_mtime(struct inode *, struct timespec *time);
>  extern int generic_setlease(struct file *, long, struct file_lock **, void 
> **priv);
> @@ -1177,6 +1178,11 @@ static inline int flock_lock_inode_wait(struct inode 
> *inode,
>   return -ENOLCK;
>  }
>
> +static inline int locks_lock_file_wait(struct file *filp, struct file_lock 
> *fl)
> +{
> + return -ENOLCK;
> +}
> +

So, this is obviously wrong - thank you 0-day robot.  Yes, I did build and
test against these patches, but went back and added this after I realized it
should work w/o CONFIG_FILE_LOCKING.  I'll re-send.

Ben

>  static inline int __break_lease(struct inode *inode, unsigned int mode, 
> unsigned int type)
>  {
>   return 0;
> @@ -1225,6 +1231,11 @@ static inline int flock_lock_file_wait(struct file 
> *filp, struct file_lock *fl)
>   return flock_lock_inode_wait(file_inode(filp), fl);
>  }
>
> +static inline int locks_lock_file_wait(struct file *filp, struct file_lock 
> *fl)
> +{
> + return locks_lock_inode_wait(file_inode(filp), fl);
> +}
> +
>  struct fasync_struct {
>   spinlock_t  fa_lock;
>   int magic;
> --
> 1.7.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] locks: introduce locks_lock_inode_wait()

2015-10-22 Thread kbuild test robot
Hi Benjamin,

[auto build test ERROR on jlayton/linux-next -- if it's inappropriate base, 
please suggest rules for selecting the more suitable base]

url:
https://github.com/0day-ci/linux/commits/Benjamin-Coddington/locks-introduce-locks_lock_inode_wait/20151022-233848
config: x86_64-allnoconfig (attached as .config)
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64 

All errors (new ones prefixed by >>):

   In file included from include/linux/cgroup.h:17:0,
from include/linux/memcontrol.h:22,
from include/linux/swap.h:8,
from include/linux/suspend.h:4,
from arch/x86/kernel/asm-offsets.c:12:
>> include/linux/fs.h:1234:19: error: redefinition of 'locks_lock_file_wait'
static inline int locks_lock_file_wait(struct file *filp, struct file_lock 
*fl)
  ^
   include/linux/fs.h:1181:19: note: previous definition of 
'locks_lock_file_wait' was here
static inline int locks_lock_file_wait(struct file *filp, struct file_lock 
*fl)
  ^
   include/linux/fs.h: In function 'locks_lock_file_wait':
>> include/linux/fs.h:1236:9: error: implicit declaration of function 
>> 'locks_lock_inode_wait' [-Werror=implicit-function-declaration]
 return locks_lock_inode_wait(file_inode(filp), fl);
^
   cc1: some warnings being treated as errors
   make[2]: *** [arch/x86/kernel/asm-offsets.s] Error 1
   make[2]: Target '__build' not remade because of errors.
   make[1]: *** [prepare0] Error 2
   make[1]: Target 'prepare' not remade because of errors.
   make: *** [sub-make] Error 2

vim +/locks_lock_file_wait +1234 include/linux/fs.h

  1228  
  1229  static inline int flock_lock_file_wait(struct file *filp, struct 
file_lock *fl)
  1230  {
  1231  return flock_lock_inode_wait(file_inode(filp), fl);
  1232  }
  1233  
> 1234  static inline int locks_lock_file_wait(struct file *filp, struct 
> file_lock *fl)
  1235  {
> 1236  return locks_lock_inode_wait(file_inode(filp), fl);
  1237  }
  1238  
  1239  struct fasync_struct {

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: Binary data


Re: [PATCH] mark rbd requiring stable pages

2015-10-22 Thread Ilya Dryomov
On Thu, Oct 22, 2015 at 5:37 PM, Mike Christie  wrote:
> On 10/22/2015 06:20 AM, Ilya Dryomov wrote:
>>
>>> >
>>> > If we are just talking about if stable pages are not used, and someone
>>> > is re-writing data to a page after the page has already been submitted
>>> > to the block layer (I mean the page is on some bio which is on a request
>>> > which is on some request_queue scheduler list or basically anywhere in
>>> > the block layer), then I was saying this can occur with any block
>>> > driver. There is nothing that is preventing this from happening with a
>>> > FC driver or nvme or cciss or in dm or whatever. The app/user can
>>> > rewrite as late as when we are in the make_request_fn/request_fn.
>>> >
>>> > I think I am misunderstanding your question because I thought this is
>>> > expected behavior, and there is nothing drivers can do if the app is not
>>> > doing a flush/sync between these types of write sequences.
>> I don't see a problem with rewriting as late as when we are in
>> request_fn() (or in a wq after being put there by request_fn()).  Where
>> I thought there *might* be an issue is rewriting after sendpage(), if
>> sendpage() is used - perhaps some sneaky sequence similar to that
>> retransmit bug that would cause us to *transmit* incorrect bytes (as
>> opposed to *re*transmit) or something of that nature?
>
>
> Just to make sure we are on the same page.
>
> Are you concerned about the tcp/net layer retransmitting due to it
> detecting a issue as part of the tcp protocol, or are you concerned
> about rbd/libceph initiating a retry like with the nfs issue?

The former, tcp/net layer.  I'm just conjecturing though.

(We don't have the nfs issue, because even if the client sends such
a retransmit (which it won't), the primary OSD will reject it as
a dup.)

Thanks,

Ilya
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Ceph erasure coding

2015-10-22 Thread Samuel Just
Not on purpose... out of curiosity, why do you want to do that?
-Sam

On Thu, Oct 22, 2015 at 9:44 AM, Kjetil Babington  wrote:
> Hi,
>
> I have a question about the capabilities of the erasure coding API in
> Ceph. Let's say that I have 10 data disks and 4 parity disks, is it
> possible to create an erasure coding plugin which creates 20 data
> chunks and 8 parity chunks, and then places two chunks on each osd?
>
> Or said maybe a bit simpler is it possible for two or more chunks from
> the same encode operation to be placed on the same osd?
>
> - Kjetil Babington
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Ceph erasure coding

2015-10-22 Thread Loic Dachary
Hi,

On 22/10/2015 18:44, Kjetil Babington wrote:
> Hi,
> 
> I have a question about the capabilities of the erasure coding API in
> Ceph. Let's say that I have 10 data disks and 4 parity disks, is it
> possible to create an erasure coding plugin which creates 20 data
> chunks and 8 parity chunks, and then places two chunks on each osd?
> 
> Or said maybe a bit simpler is it possible for two or more chunks from
> the same encode operation to be placed on the same osd?

This is more a question of creating a crush ruleset that does it. The erasure 
code plugin encodes chunks but the crush ruleset decides where they are placed.
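
For illustration only (an untested sketch; the rule name is made up): standard
CRUSH won't place two chunks of one object on the same OSD, but a ruleset along
these lines gives you two chunks per *host*, which is the usual way to run k+m
larger than the number of failure domains:

rule ec_k20_m8 {
	ruleset 1
	type erasure
	min_size 3
	max_size 28
	step set_chooseleaf_tries 5
	step take default
	step choose indep 14 type host
	step choose indep 2 type osd
	step emit
}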

Cheers

> 
> - Kjetil Babington
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

-- 
Loïc Dachary, Artisan Logiciel Libre



signature.asc
Description: OpenPGP digital signature