[ceph-users] Re: Re: Re: Re: Pipe "deadlock" in Hammer, 0.94.5

2017-03-16 Thread 许雪寒
Hi, Gregory.

On the other hand, I checked the fix 63e44e32974c9bae17bb1bfd4261dcb024ad845c; it 
should be the one that we need. However, I notice that this fix has only been 
backported down to v11.0.0. Can we simply apply it to our Hammer 
version (0.94.5)?

-----Original Message-----
From: 许雪寒
Sent: March 17, 2017 10:09
To: 'Gregory Farnum'
Cc: ceph-users@lists.ceph.com; jiajia zhong
Subject: Re: Re: [ceph-users] Re: Re: Pipe "deadlock" in Hammer, 0.94.5

I got it. Thanks very much:-)

From: Gregory Farnum [mailto:gfar...@redhat.com]
Sent: March 17, 2017 2:10
To: 许雪寒
Cc: ceph-users@lists.ceph.com; jiajia zhong
Subject: Re: Re: [ceph-users] Re: Re: Pipe "deadlock" in Hammer, 0.94.5


On Thu, Mar 16, 2017 at 3:36 AM 许雪寒  wrote:
Hi, Gregory, is it possible to unlock Connection::lock in Pipe::read_message 
before tcp_read_nonblocking is called? I checked the code again, it seems that 
the code in tcp_read_nonblocking doesn't need to be locked by Connection::lock.

Unfortunately it does. You'll note the memory buffers it's grabbing via the 
Connection? Those need to be protected from changing (either being canceled, or 
being set up) while the read is being processed.
Now, you could probably do something more complicated around the buffer update 
mechanism, or if you know your applications don't make use of it you could just 
rip them out entirely. But while that mechanism exists it needs to be 
synchronized.
-Greg
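
For readers following the locking question: below is a minimal, self-contained C++ 
sketch of the race Greg describes. It is illustrative only (the names Connection, 
rx_buffers, read_message and cancel_rx_buffer here are stand-ins, not the actual 
SimpleMessenger code); it shows why the lookup in the registered receive-buffer map 
and the copy into that buffer must happen under the same lock that the 
cancellation/setup path takes.

// Minimal sketch (not Ceph code) of the reader vs. buffer-cancel race.
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <map>
#include <mutex>
#include <vector>

struct Connection {
  std::mutex lock;                                   // plays the role of Connection::lock
  std::map<uint64_t, std::vector<char>> rx_buffers;  // tid -> caller-supplied buffer
};

// Reader path: copy received bytes into the caller's buffer for this tid, if any.
void read_message(Connection& con, uint64_t tid, const char* data, std::size_t len) {
  std::lock_guard<std::mutex> guard(con.lock);       // held for the lookup AND the copy
  auto p = con.rx_buffers.find(tid);
  if (p != con.rx_buffers.end() && p->second.size() >= len) {
    std::memcpy(p->second.data(), data, len);        // safe: cancel cannot run concurrently
  }
  // else: fall back to an internally allocated buffer (omitted)
}

// Cancellation path (e.g. an op cancel in librados): the buffer must never be
// written after this returns, which only holds because both paths take the lock.
void cancel_rx_buffer(Connection& con, uint64_t tid) {
  std::lock_guard<std::mutex> guard(con.lock);
  con.rx_buffers.erase(tid);
}

int main() {
  Connection con;
  con.rx_buffers[42] = std::vector<char>(16);        // caller registers a buffer for tid 42
  const char payload[] = "hello";
  read_message(con, 42, payload, sizeof(payload));
  cancel_rx_buffer(con, 42);                         // tid 42's buffer may now be freed
  return 0;
}

If read_message dropped the lock between the find() and the memcpy(), 
cancel_rx_buffer() could erase (and free) the buffer in that window, which is 
exactly the hazard of the buffers "being canceled, or being set up" while the read 
is processed.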




-----Original Message-----
From: Gregory Farnum [mailto:gfar...@redhat.com]
Sent: January 17, 2017 7:14
To: 许雪寒
Cc: jiajia zhong; ceph-users@lists.ceph.com
Subject: Re: Re: [ceph-users] Re: Re: Pipe "deadlock" in Hammer, 0.94.5

On Sat, Jan 14, 2017 at 7:54 PM, 许雪寒  wrote:
> Thanks for your help:-)
>
> I checked the source code again, and in read_message, it does hold the 
> Connection::lock:

You're correct of course; I wasn't looking and forgot about this bit.
This was added to deal with client-allocated buffers and/or op cancellation in 
librados, IIRC, and unfortunately definitely does need to be synchronized — I'm 
not sure about with pipe lookups, but probably even that. :/

Unfortunately it looks like you're running a version that didn't come from 
upstream (I see hash 81d4ad40d0c2a4b73529ff0db3c8f22acd15c398 in another email, 
which I can't find), so there's not much we can do to help with the specifics 
of this case — it's fiddly and my guess would be the same as Sage's, which you 
say is not the case.
-Greg

>
>   while (left > 0) {
>     // wait for data
>     if (tcp_read_wait() < 0)
>       goto out_dethrottle;
>
>     // get a buffer
>     connection_state->lock.Lock();
>     map<ceph_tid_t,pair<bufferlist,int> >::iterator p =
>       connection_state->rx_buffers.find(header.tid);
>     if (p != connection_state->rx_buffers.end()) {
>       if (rxbuf.length() == 0 || p->second.second != rxbuf_version) {
>         ldout(msgr->cct,10) << "reader seleting rx buffer v "
>                             << p->second.second << " at offset " << offset
>                             << " len " << p->second.first.length() << dendl;
>         rxbuf = p->second.first;
>         rxbuf_version = p->second.second;
>         // make sure it's big enough
>         if (rxbuf.length() < data_len)
>           rxbuf.push_back(buffer::create(data_len - rxbuf.length()));
>         blp = p->second.first.begin();
>         blp.advance(offset);
>       }
>     } else {
>       if (!newbuf.length()) {
>         ldout(msgr->cct,20) << "reader allocating new rx buffer at offset "
>                             << offset << dendl;
>         alloc_aligned_buffer(newbuf, data_len, data_off);
>         blp = newbuf.begin();
>         blp.advance(offset);
>       }
>     }
>     bufferptr bp = blp.get_current_ptr();
>     int 

Re: [ceph-users] pgs stuck inactive

2017-03-16 Thread Brad Hubbard
So I've tested this procedure locally and it works successfully for me.

$ ./ceph -v
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
ceph version 0.94.10 (b1e0532418e4631af01acbc0cedd426f1905f4af)

$ ./ceph-objectstore-tool import-rados rbd 0.3.export
Importing from pgid 0.3
Write 0/409a1413/benchmark_data_boxen1.example.com_27596_object59/head
Write 0/377af323/benchmark_data_boxen1.example.com_27596_object35/head
Write 0/6448d043/benchmark_data_boxen1.example.com_27596_object73/head
Write 0/410ee8a3/benchmark_data_boxen1.example.com_27596_object52/head
Write 0/c50d4ba3/benchmark_data_boxen1.example.com_27596_object43/head
Write 0/978edd4b/benchmark_data_boxen1.example.com_27596_object56/head
Write 0/97d8967b/benchmark_data_boxen1.example.com_27596_object33/head
Write 0/9c52a59b/benchmark_data_boxen1.example.com_27596_object34/head
Write 0/d762a5db/benchmark_data_boxen1.example.com_27596_object44/head
Import successful

This is, of course, after deleting all copies of the pg on the
relevant OSDs and running force_create_pg to recreate the pg.

From what I can see in the stack trace the rados client connection
does not seem to be completing correctly. I'm hoping we can get more
information on the problem by adding the following to the client
section of your ceph.conf file on the machine you are running
ceph-objectstore-tool on.

[client]
debug objecter = 20
debug rados = 20
debug ms = 5
log file = /var/log/ceph/$cluster-$name.$pid.log

Then run the ceph-objectstore-tool again taking careful note of what
file is created in /var/log/ceph/ and upload that.

On Thu, Mar 16, 2017 at 5:21 PM, Laszlo Budai  wrote:
> My mistake, I've run it on a wrong system ...
>
> I've attached the terminal output.
>
> I've run this on a test system where I was getting the same segfault when
> trying import-rados.
>
> Kind regards,
> Laszlo
>
> On 16.03.2017 07:41, Laszlo Budai wrote:
>>
>>
>> [root@storage2 ~]# gdb -ex 'r' -ex 't a a bt full' -ex 'q' --args
>> ceph-objectstore-tool import-rados volumes pg.3.367.export.OSD.35
>> GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-94.el7
>> Copyright (C) 2013 Free Software Foundation, Inc.
>> License GPLv3+: GNU GPL version 3 or later
>> 
>> This is free software: you are free to change and redistribute it.
>> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
>> and "show warranty" for details.
>> This GDB was configured as "x86_64-redhat-linux-gnu".
>> For bug reporting instructions, please see:
>> ...
>> Reading symbols from /usr/bin/ceph-objectstore-tool...Reading symbols from
>> /usr/lib/debug/usr/bin/ceph-objectstore-tool.debug...done.
>> done.
>> Starting program: /usr/bin/ceph-objectstore-tool import-rados volumes
>> pg.3.367.export.OSD.35
>> [Thread debugging using libthread_db enabled]
>> Using host libthread_db library "/lib64/libthread_db.so.1".
>> open: No such file or directory
>> [Inferior 1 (process 23735) exited with code 01]
>> [root@storage2 ~]#
>>
>>
>>
>>
>> Just checked:
>> [root@storage2 lib64]# ls -l /lib64/libthread_db*
>> -rwxr-xr-x. 1 root root 38352 May 12  2016 /lib64/libthread_db-1.0.so
>> lrwxrwxrwx. 1 root root 19 Jun  7  2016 /lib64/libthread_db.so.1 -> libthread_db-1.0.so
>> [root@storage2 lib64]#
>>
>>
>> Kind regards,
>> Laszlo
>>
>>
>> On 16.03.2017 05:26, Brad Hubbard wrote:
>>>
>>> Can you install the debuginfo for ceph (how this works depends on your
>>> distro) and run the following?
>>>
>>> # gdb -ex 'r' -ex 't a a bt full' -ex 'q' --args ceph-objectstore-tool
>>> import-rados volumes pg.3.367.export.OSD.35
>>>
>>> On Thu, Mar 16, 2017 at 12:02 AM, Laszlo Budai 
>>> wrote:

 Hello,


 the ceph-objectstore-tool import-rados volumes pg.3.367.export.OSD.35
 command crashes.

 ~# ceph-objectstore-tool import-rados volumes pg.3.367.export.OSD.35
 *** Caught signal (Segmentation fault) **
  in thread 7f85b60e28c0
  ceph version 0.94.10 (b1e0532418e4631af01acbc0cedd426f1905f4af)
  1: ceph-objectstore-tool() [0xaeeaba]
  2: (()+0x10330) [0x7f85b4dca330]
  3: (()+0xa2324) [0x7f85b1cd7324]
  4: (()+0x7d23e) [0x7f85b1cb223e]
  5: (()+0x7d478) [0x7f85b1cb2478]
  6: (rados_ioctx_create()+0x32) [0x7f85b1c89f92]
  7: (librados::Rados::ioctx_create(char const*, librados::IoCtx&)+0x15)
 [0x7f85b1c8a0e5]
  8: (do_import_rados(std::string, bool)+0xb7c) [0x68199c]
  9: (main()+0x1294) [0x651134]
  10: (__libc_start_main()+0xf5) [0x7f85b0c69f45]
  11: ceph-objectstore-tool() [0x66f8b7]
 2017-03-15 14:57:05.567987 7f85b60e28c0 -1 *** Caught signal
 (Segmentation
 fault) **
  in thread 7f85b60e28c0

  ceph version 0.94.10 (b1e0532418e4631af01acbc0cedd426f1905f4af)
  1: ceph-objectstore-tool() [0xaeeaba]
  2: (()+0x10330) [0x7f85b4dca330]
  

Re: [ceph-users] Ceph Cluster Failures

2017-03-16 Thread Christian Balzer

Hello,

On Fri, 17 Mar 2017 02:51:48 + Rich Rocque wrote:

> Hi,
> 
> 
> I talked with the person in charge about your initial feedback and questions. 
> The thought is to switch to a new setup and I was asked to pass it on and ask 
> for thoughts on whether this would be sufficient or not.
>
I assume from the new setup that the current problematic one is also on
AWS, so I'd advise doing a proper analysis there before moving to
something "new".

If you search the ML archives you'll find (a few) others that have done
similar things, and as far as I can recall none were particularly successful.

A virtualized Ceph is going to be harder to get "right" than a HW based
one, doubly so when dealing with AWS network vagaries. 
I'm unsure whether an AWS region can consist of multiple DCs; if so, the
latencies when doing writes would be bad, but then again it seems your use
case is very read-heavy.

That all said, the specs for your proposal look good from a (virtual) HW
perspective. 

Christian
 
> 
> Use case:
> Overview: Need to provide shared storage/high-availability for (usually) 
> low-volume web server instances using a distributed, POSIX-compliant 
> filesystem, running in Amazon Web Services. Database storage is not part of 
> the cluster.
> Logic: We know Ceph is probably overkill for our current use (and probably 
> also for my future use), so why Ceph? It’s performance, when using CephFS, 
> and its ability to support RBD (if we ever move to a container approach for 
> web servers). I’ve tried Amazon EFS (NFS-as-a-service) and GlusterFS (both 
> NFS and native client), and because of the number of small files we’re 
> working with, something that takes ~15sec. in Ceph takes several minutes 
> using other NFS or GlusterFS solutions.
> Current Load: ~100 connected clients accessing ~20GB data of e-commerce 
> related website source software.
> Expected Future Load: ~5,000 connected clients accessing ~1TB of data
> 
> Ceph Clients:
> Primary Role: Web server & load balancer w/ SSL termination
> Hardware Configuration: 1vCPU, 512MB ram, Ubuntu 16.04 LTS (per 
> website/domain/subdomain: 2ea t2.nano instances, load balanced behind 
> haproxy, rarely manually-scaling up with new instances during expected load 
> spikes. After initial “hits,” most of the website stays in local cache, 
> resulting in generally-few iops against the Ceph cluster.)
> 
> Ceph Clusters:
> Overall: 3 Co-located Clusters across 9 servers, spanning 3 AWS Availability 
> Zones in a single region. 3 MDS per-cluster, 3 MON per cluster, 2 OSD per 
> cluster.
> Hardware Configuration (MON/MDS): r4.large instance class, 2vCPU, ~15GB ram, 
> “up to 10Gbit” network (“Enhanced Networking” enabled), EBS / SSD for root 
> (not provisioned-IOPS), Ubuntu 16.04 LTS
> Hardware Configuration (OSD): i3.large instance class, 2vCPU, ~15GB ram, “up 
> to 10Gbit” network (“Enhanced Networking” enabled), EBS/SSD for root (not 
> provisioned-IOPS, but “EBS optimized” for bandwidth), ~475GB NVMe attached, 
> ephemeral storage for OSD (co-locating journal and data)
> 
> Proposed Layout:
> AZ “A”:
> 
>   *   Server A-MM (r4.large instance):
>  *   Mon.A & MDS.A for Cluster X
>  *   Mon.A & MDS.A for Cluster Y
>  *   Mon.A & MDS.A for Cluster Z
>   *   Server A-OSD-1 (i3.large instance):
>  *   OSD.0 for Cluster X
>   *   Server A-OSD-2 (i3.large instance):
>  *   OSD.0 for Cluster Z
> 
> 
> AZ “B”:
> 
>   *   Server B-MM (r4.large instance):
>  *   Mon.B & MDS.B for Cluster X
>  *   Mon.B & MDS.B for Cluster Y
>  *   Mon.B & MDS.B for Cluster Z
>   *   Server B-OSD-1 (i3.large instance):
>  *   OSD.1 for Cluster X
>   *   Server B-OSD-2 (i3.large instance):
>  *   OSD.0 for Cluster Y
> 
> 
> AZ “C”:
> 
>   *   Server C-MM (r4.large instance):
>  *   Mon.B & MDS.B for Cluster X
>  *   Mon.B & MDS.B for Cluster Y
>  *   Mon.B & MDS.B for Cluster Z
>   *   Server C-OSD-1 (i3.large instance):
>  *   OSD.1 for Cluster Y
>   *   Server C-OSD-2 (i3.large instance):
>  *   OSD.1 for Cluster Z
> 
> 
> Alternative Layout:
> Split, by half, the NVMe storage between 2 OSDs, and provide 3ea OSDs per 
> cluster for higher availability at the expense of disk read-write 
> performance, and increase the number of clusters to 4.
> 
> 
> Thank you for your time,
> 
> Rich
> 
> 
> From: Christian Balzer 
> Sent: Thursday, March 16, 2017 2:30:49 AM
> To: Ceph Users
> Cc: Robin H. Johnson; Rich Rocque
> Subject: Re: [ceph-users] Ceph Cluster Failures
> 
> 
> Hello,
> 
> On Thu, 16 Mar 2017 02:44:29 + Robin H. Johnson wrote:
> 
> > On Thu, Mar 16, 2017 at 02:22:08AM +, Rich Rocque wrote:  
> > > Has anyone else run into this or have any suggestions on how to remedy 
> > > it?  
> > We need a LOT more info.
> >  
> Indeed.
> 
> > > After a couple months of almost no issues, our Ceph cluster has
> > > started to have frequent failures. Just this week it's failed about
> > > three 

Re: [ceph-users] Ceph-deploy and git.ceph.com

2017-03-16 Thread Shain Miley
I ended up using a newer version of ceph-deploy and things went more smoothly 
after that.

Thanks again to everyone for all the help!

Shain 

> On Mar 16, 2017, at 10:29 AM, Shain Miley  wrote:
> 
> 
> It looks like things are working a bit better today…however now I am getting 
> the following error:
> 
> [hqosd6][DEBUG ] detect platform information from remote host
> [hqosd6][DEBUG ] detect machine type
> [ceph_deploy.install][INFO  ] Distro info: Ubuntu 14.04 trusty
> [hqosd6][INFO  ] installing ceph on hqosd6
> [hqosd6][INFO  ] Running command: env DEBIAN_FRONTEND=noninteractive apt-get 
> -q install --assume-yes ca-certificates
> [hqosd6][DEBUG ] Reading package lists...
> [hqosd6][DEBUG ] Building dependency tree...
> [hqosd6][DEBUG ] Reading state information...
> [hqosd6][DEBUG ] ca-certificates is already the newest version.
> [hqosd6][DEBUG ] 0 upgraded, 0 newly installed, 0 to remove and 3 not 
> upgraded.
> [hqosd6][INFO  ] Running command: wget -O release.asc https://ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/release.asc
> [hqosd6][WARNIN] --2017-03-16 10:25:17--  https://ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/release.asc
> [hqosd6][WARNIN] Resolving ceph.com (ceph.com)... 158.69.68.141
> [hqosd6][WARNIN] Connecting to ceph.com (ceph.com)|158.69.68.141|:443... connected.
> [hqosd6][WARNIN] HTTP request sent, awaiting response... 301 Moved Permanently
> [hqosd6][WARNIN] Location: https://git.ceph.com/?p=ceph.git;a=blob_plain;f=keys/release.asc [following]
> [hqosd6][WARNIN] --2017-03-16 10:25:17--  https://git.ceph.com/?p=ceph.git;a=blob_plain;f=keys/release.asc
> [hqosd6][WARNIN] Resolving git.ceph.com (git.ceph.com)... 8.43.84.132
> [hqosd6][WARNIN] Connecting to git.ceph.com (git.ceph.com)|8.43.84.132|:443... connected.
> [hqosd6][WARNIN] HTTP request sent, awaiting response... 200 OK
> [hqosd6][WARNIN] Length: 1645 (1.6K) [text/plain]
> [hqosd6][WARNIN] Saving to: ‘release.asc’
> [hqosd6][WARNIN] 
> [hqosd6][WARNIN]  0K .          100%  219M=0s
> [hqosd6][WARNIN] 
> [hqosd6][WARNIN] 2017-03-16 10:25:17 (219 MB/s) - ‘release.asc’ saved [1645/1645]
> [hqosd6][WARNIN] 
> [hqosd6][INFO  ] Running command: apt-key add release.asc
> [hqosd6][DEBUG ] OK
> [hqosd6][DEBUG ] add deb repo to sources.list
> [hqosd6][INFO  ] Running command: apt-get -q update
> [hqosd6][DEBUG ] Ign http://us.archive.ubuntu.com trusty InRelease
> [hqosd6][DEBUG ] Hit http://us.archive.ubuntu.com trusty-updates InRelease
> [hqosd6][DEBUG ] Hit http://us.archive.ubuntu.com trusty-backports InRelease
> [hqosd6][DEBUG ] Get:1 http://us.archive.ubuntu.com trusty Release.gpg [933 B]
> [hqosd6][DEBUG ] Hit http://security.ubuntu.com trusty-security InRelease
> [hqosd6][DEBUG ] Hit http://us.archive.ubuntu.com trusty-updates/main Sources
> [hqosd6][DEBUG ] Hit http://us.archive.ubuntu.com trusty-updates/restricted Sources
> [hqosd6][DEBUG ] Get:2 http://us.archive.ubuntu.com trusty-updates/universe Sources [175 kB]
> [hqosd6][DEBUG ] Hit http://security.ubuntu.com trusty-security/main Sources
> [hqosd6][DEBUG ] Get:3 http://ceph.com trusty InRelease
> [hqosd6][WARNIN] Splitting up /var/lib/apt/lists/partial/ceph.com_debian-hammer_dists_trusty_InRelease into data and signature failedE: GPG error: http://ceph.com trusty InRelease: Clearsigned file isn't valid, got 'NODATA' (does the network require authentication?)
> [hqosd6][DEBUG ] Ign http://ceph.com trusty InRelease
> [hqosd6][ERROR ] RuntimeError: command returned non-zero exit status: 100
> [ceph_deploy][ERROR ] RuntimeError: Failed to execute command: apt-get -q update
> 
> Does anyone know if there is still an ongoing issue…or is this 
> something that should be working at this point?
> 
> Thanks again,
> Shain
> 
> 
> 
>> On Mar 15, 2017, at 2:08 PM, Shain Miley wrote:
>> 

Re: [ceph-users] noout, nodown and blocked requests

2017-03-16 Thread Shain Miley
Hi,
Thanks for the link.  

I unset the nodown config option and things did seem to improve, although we 
did still get a few reports from users about issues related to filesystem (rbd) 
access, even after that action was taken.

Thanks again,

Shain 

> On Mar 13, 2017, at 2:43 AM, Alexandre DERUMIER  wrote:
> 
> Hi,
> 
>>> Currently I have the noout and nodown flags set while doing the 
>>> maintenance work.
> 
> you only need noout to avoid rebalancing
> 
> see documentation:
> http://docs.ceph.com/docs/kraken/rados/troubleshooting/troubleshooting-osd/
> "STOPPING W/OUT REBALANCING".
> 
> 
> Your clients are hanging because of the nodown flag
> 
> 
> See this blog for no-out, no-down flags experiements
> 
> https://www.sebastien-han.fr/blog/2013/04/17/some-ceph-experiments/
> 
> - Mail original -
> De: "Shain Miley" 
> À: "ceph-users" 
> Envoyé: Lundi 13 Mars 2017 04:58:08
> Objet: [ceph-users] noout, nodown and blocked requests
> 
> Hello, 
> One of the nodes in our 14 node cluster is offline and before I totally 
> commit to fully removing the node from the cluster (there is a chance I can 
> get the node back in working order in the next few days) I would like to run 
> the cluster with that single node out for a few days. 
> 
> Currently I have the noout and nodown flags set while doing the maintenance 
> work. 
> 
> Some users are complaining about disconnects and other oddities when trying to 
> save and access files currently on the cluster. 
> 
> I am also seeing some blocked requests when viewing the cluster status (at 
> this point I see 160 block requests spread over 15 to 20 osd’s). 
> 
> Currently I have a replication level of 3 on this pool and a min_size of 1. 
> 
> My question is this…is there a better method to use (other than using noout 
> and nodown) in this scenario where I do not want data movement yet…but I do 
> want the reads and writes to the cluster to respond as normally as 
> possible for the end users? 
> 
> Thanks in advance, 
> 
> Shain 
> 
> 
> ___ 
> ceph-users mailing list 
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Unable to boot OS on cluster node

2017-03-16 Thread Shain Miley
Thanks all for the help,

I was able to reinstall Ubuntu and Ceph, and after a server reboot the OSD’s 
are once again part of the cluster.

Thanks again,
Shain


> On Mar 10, 2017, at 2:55 PM, Lincoln Bryant  wrote:
> 
> Hi Shain,
> 
> As long as you don’t nuke the OSDs or the journals, you should be OK. I think 
> the keyring and such are typically stored on the OSD itself. If you have lost 
> track of what physical device maps to what OSD, you can always mount the OSDs 
> in a temporary spot and cat the “whoami” file.
> 
> —Lincoln
> 
>> On Mar 10, 2017, at 11:33 AM, Shain Miley  wrote:
>> 
>> Hello,
>> 
>> We had an issue with one of our Dell 720xd servers and now the raid card 
>> cannot seem to boot from the Ubuntu OS drive volume.
>> 
>> I would like to know...if I reload the OS...is there an easy way to get the 
>> 12 OSD's disks back into the cluster without just having to remove them from 
>> the cluster, wipe the drives and then re-add them?
>> 
>> Right now I have the 'noout' and 'nodown' flags set on the cluster so there 
>> has been no data movement yet as a result of this node being down.
>> 
>> Thanks in advance for any help.
>> 
>> Shain
>> 
>> 
>> -- 
>> NPR | Shain Miley | Manager of Infrastructure, Digital Media | 
>> smi...@npr.org | 202.513.3649
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Cluster Failures

2017-03-16 Thread Rich Rocque
Hi,


I talked with the person in charge about your initial feedback and questions. 
The thought is to switch to a new setup and I was asked to pass it on and ask 
for thoughts on whether this would be sufficient or not.


Use case:
Overview: Need to provide shared storage/high-availability for (usually) 
low-volume web server instances using a distributed, POSIX-compliant filesystem, 
running in Amazon Web Services. Database storage is not part of the cluster.
Logic: We know Ceph is probably overkill for our current use (and probably also 
for my future use), so why Ceph? Its performance, when using CephFS, and its 
ability to support RBD (if we ever move to a container approach for web 
servers). I’ve tried Amazon EFS (NFS-as-a-service) and GlusterFS (both NFS and 
native client), and because of the number of small files we’re working with, 
something that takes ~15sec. in Ceph takes several minutes using other NFS or 
GlusterFS solutions.
Current Load: ~100 connected clients accessing ~20GB data of e-commerce related 
website source software.
Expected Future Load: ~5,000 connected clients accessing ~1TB of data

Ceph Clients:
Primary Role: Web server & load balancer w/ SSL termination
Hardware Configuration: 1vCPU, 512MB ram, Ubuntu 16.04 LTS (per 
website/domain/subdomain: 2ea t2.nano instances, load balanced behind haproxy, 
rarely manually-scaling up with new instances during expected load spikes. 
After initial “hits,” most of the website stays in local cache, resulting in 
generally-few iops against the Ceph cluster.)

Ceph Clusters:
Overall: 3 Co-located Clusters across 9 servers, spanning 3 AWS Availability 
Zones in a single region. 3 MDS per-cluster, 3 MON per cluster, 2 OSD per 
cluster.
Hardware Configuration (MON/MDS): r4.large instance class, 2vCPU, ~15GB ram, 
“up to 10Gbit” network (“Enhanced Networking” enabled), EBS / SSD for root (not 
provisioned-IOPS), Ubuntu 16.04 LTS
Hardware Configuration (OSD): i3.large instance class, 2vCPU, ~15GB ram, “up to 
10Gbit” network (“Enhanced Networking” enabled), EBS/SSD for root (not 
provisioned-IOPS, but “EBS optimized” for bandwidth), ~475GB NVMe attached, 
ephemeral storage for OSD (co-locating journal and data)

Proposed Layout:
AZ “A”:

  *   Server A-MM (r4.large instance):
 *   Mon.A & MDS.A for Cluster X
 *   Mon.A & MDS.A for Cluster Y
 *   Mon.A & MDS.A for Cluster Z
  *   Server A-OSD-1 (i3.large instance):
 *   OSD.0 for Cluster X
  *   Server A-OSD-2 (i3.large instance):
 *   OSD.0 for Cluster Z


AZ “B”:

  *   Server B-MM (r4.large instance):
 *   Mon.B & MDS.B for Cluster X
 *   Mon.B & MDS.B for Cluster Y
 *   Mon.B & MDS.B for Cluster Z
  *   Server B-OSD-1 (i3.large instance):
 *   OSD.1 for Cluster X
  *   Server B-OSD-2 (i3.large instance):
 *   OSD.0 for Cluster Y


AZ “C”:

  *   Server C-MM (r4.large instance):
 *   Mon.B & MDS.B for Cluster X
 *   Mon.B & MDS.B for Cluster Y
 *   Mon.B & MDS.B for Cluster Z
  *   Server C-OSD-1 (i3.large instance):
 *   OSD.1 for Cluster Y
  *   Server C-OSD-2 (i3.large instance):
 *   OSD.1 for Cluster Z


Alternative Layout:
Split, by half, the NVMe storage between 2 OSDs, and provide 3ea OSDs per 
cluster for higher availability at the expense of disk read-write performance, 
and increase the number of clusters to 4.


Thank you for your time,

Rich


From: Christian Balzer 
Sent: Thursday, March 16, 2017 2:30:49 AM
To: Ceph Users
Cc: Robin H. Johnson; Rich Rocque
Subject: Re: [ceph-users] Ceph Cluster Failures


Hello,

On Thu, 16 Mar 2017 02:44:29 + Robin H. Johnson wrote:

> On Thu, Mar 16, 2017 at 02:22:08AM +, Rich Rocque wrote:
> > Has anyone else run into this or have any suggestions on how to remedy it?
> We need a LOT more info.
>
Indeed.

> > After a couple months of almost no issues, our Ceph cluster has
> > started to have frequent failures. Just this week it's failed about
> > three times.
> >
> > The issue appears to be than an MDS or Monitor will fail and then all
> > clients hang. After that, all clients need to be forcibly restarted.
> - Can you define monitor 'failing' in this case?
> - What do the logs contain?
> - Is it running out of memory?
> - Can you turn up the debug level?
> - Has your cluster experienced continual growth and now might be
>   undersized in some regard?
>
A single MON failure should not cause any problems to boot.

"ceph -s" , "ceph osd tree"  and "ceph osd pool ls detail" as well.

> > The architecture for our setup is:
> Are these virtual machines? The overall specs seem rather like VM
> instances rather than hardware.
>
There are small servers like that, but a valid question indeed.
In particular, if it is dedicated HW, FULL specs.

> > 3 ea MON, MDS instances (co-located) on 2cpu, 4GB RAM servers
> What sort of SSD are the monitor datastores on? ('mon data' in the
> config)
>
He doesn't mention SSDs in the MON/MDS context, so we could 

Re: [ceph-users] CephFS mount shows the entire cluster size as opposed to custom-cephfs-pool-size

2017-03-16 Thread Deepak Naidu
Not sure if this is still true with Jewel CephFS, i.e.:

cephfs does not support any type of quota, df always reports entire cluster 
size.

https://www.spinics.net/lists/ceph-users/msg05623.html

--
Deepak

From: Deepak Naidu
Sent: Thursday, March 16, 2017 6:19 PM
To: 'ceph-users'
Subject: CephFS mount shows the entire cluster size as opposed to 
custom-cephfs-pool-size

Greetings,

I am trying to build a CephFS system. Currently I have created my crush map, 
which uses only certain OSDs, and I have pools created from them. But when I 
mount CephFS, the mount size is my entire Ceph cluster size. How is that?


Ceph cluster & pools

[ceph-admin@storageAdmin ~]$ ceph df
GLOBAL:
    SIZE   AVAIL  RAW USED  %RAW USED
    4722G  4721G  928M      0.02
POOLS:
    NAME              ID  USED  %USED  MAX AVAIL  OBJECTS
    ecpool_disk1      22  0     0      1199G      0
    rcpool_disk2      24  0     0      1499G      0
    rcpool_cepfsMeta  25  4420  0      76682M     20


CephFS volume & pool

Here data0 is the volume/filesystem name
rcpool_cepfsMeta - is the meta-data pool
rcpool_disk2 - is the data pool

[ceph-admin@storageAdmin ~]$ ceph fs ls
name: data0, metadata pool: rcpool_cepfsMeta, data pools: [rcpool_disk2 ]


Command to mount CephFS
sudo mount -t ceph mon1:6789:/ /mnt/cephfs/ -o name=admin,secretfile=admin.secret


Client host df -h output
192.168.1.101:6789:/ 4.7T  928M  4.7T   1% /mnt/cephfs



--
Deepak





---
This email message is for the sole use of the intended recipient(s) and may 
contain
confidential information.  Any unauthorized review, use, disclosure or 
distribution
is prohibited.  If you are not the intended recipient, please contact the 
sender by
reply email and destroy all copies of the original message.
---
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Re: Re: Re: Re: Pipe "deadlock" in Hammer, 0.94.5

2017-03-16 Thread 许雪寒
I got it. Thanks very much:-)

From: Gregory Farnum [mailto:gfar...@redhat.com]
Sent: March 17, 2017 2:10
To: 许雪寒
Cc: ceph-users@lists.ceph.com; jiajia zhong
Subject: Re: Re: [ceph-users] Re: Re: Pipe "deadlock" in Hammer, 0.94.5


On Thu, Mar 16, 2017 at 3:36 AM 许雪寒  wrote:
Hi, Gregory, is it possible to unlock Connection::lock in Pipe::read_message 
before tcp_read_nonblocking is called? I checked the code again, it seems that 
the code in tcp_read_nonblocking doesn't need to be locked by Connection::lock.

Unfortunately it does. You'll note the memory buffers it's grabbing via the 
Connection? Those need to be protected from changing (either being canceled, or 
being set up) while the read is being processed.
Now, you could probably do something more complicated around the buffer update 
mechanism, or if you know your applications don't make use of it you could just 
rip them out entirely. But while that mechanism exists it needs to be 
synchronized.
-Greg




-----Original Message-----
From: Gregory Farnum [mailto:gfar...@redhat.com]
Sent: January 17, 2017 7:14
To: 许雪寒
Cc: jiajia zhong; ceph-users@lists.ceph.com
Subject: Re: Re: [ceph-users] Re: Re: Pipe "deadlock" in Hammer, 0.94.5

On Sat, Jan 14, 2017 at 7:54 PM, 许雪寒  wrote:
> Thanks for your help:-)
>
> I checked the source code again, and in read_message, it does hold the 
> Connection::lock:

You're correct of course; I wasn't looking and forgot about this bit.
This was added to deal with client-allocated buffers and/or op cancellation in 
librados, IIRC, and unfortunately definitely does need to be synchronized — I'm 
not sure about with pipe lookups, but probably even that. :/

Unfortunately it looks like you're running a version that didn't come from 
upstream (I see hash 81d4ad40d0c2a4b73529ff0db3c8f22acd15c398 in another email, 
which I can't find), so there's not much we can do to help with the specifics 
of this case — it's fiddly and my guess would be the same as Sage's, which you 
say is not the case.
-Greg

>
>   while (left > 0) {
>     // wait for data
>     if (tcp_read_wait() < 0)
>       goto out_dethrottle;
>
>     // get a buffer
>     connection_state->lock.Lock();
>     map<ceph_tid_t,pair<bufferlist,int> >::iterator p =
>       connection_state->rx_buffers.find(header.tid);
>     if (p != connection_state->rx_buffers.end()) {
>       if (rxbuf.length() == 0 || p->second.second != rxbuf_version) {
>         ldout(msgr->cct,10) << "reader seleting rx buffer v "
>                             << p->second.second << " at offset " << offset
>                             << " len " << p->second.first.length() << dendl;
>         rxbuf = p->second.first;
>         rxbuf_version = p->second.second;
>         // make sure it's big enough
>         if (rxbuf.length() < data_len)
>           rxbuf.push_back(buffer::create(data_len - rxbuf.length()));
>         blp = p->second.first.begin();
>         blp.advance(offset);
>       }
>     } else {
>       if (!newbuf.length()) {
>         ldout(msgr->cct,20) << "reader allocating new rx buffer at offset "
>                             << offset << dendl;
>         alloc_aligned_buffer(newbuf, data_len, data_off);
>         blp = newbuf.begin();
>         blp.advance(offset);
>       }
>     }
>     bufferptr bp = blp.get_current_ptr();
>     int read = MIN(bp.length(), left);
>     ldout(msgr->cct,20) << "reader reading nonblocking into "
>                         << (void*) bp.c_str() << " len " << bp.length() << dendl;
>     int got = tcp_read_nonblocking(bp.c_str(), read);
> 

[ceph-users] CephFS mount shows the entire cluster size as opposed to custom-cephfs-pool-size

2017-03-16 Thread Deepak Naidu
Greetings,

I am trying to build a CephFS system. Currently I have created my crush map, 
which uses only certain OSDs, and I have pools created from them. But when I 
mount CephFS, the mount size is my entire Ceph cluster size. How is that?


Ceph cluster & pools

[ceph-admin@storageAdmin ~]$ ceph df
GLOBAL:
    SIZE   AVAIL  RAW USED  %RAW USED
    4722G  4721G  928M      0.02
POOLS:
    NAME              ID  USED  %USED  MAX AVAIL  OBJECTS
    ecpool_disk1      22  0     0      1199G      0
    rcpool_disk2      24  0     0      1499G      0
    rcpool_cepfsMeta  25  4420  0      76682M     20


CephFS volume & pool

Here data0 is the volume/filesystem name
rcpool_cepfsMeta - is the meta-data pool
rcpool_disk2 - is the data pool

[ceph-admin@storageAdmin ~]$ ceph fs ls
name: data0, metadata pool: rcpool_cepfsMeta, data pools: [rcpool_disk2 ]


Command to mount CephFS
sudo mount -t ceph mon1:6789:/ /mnt/cephfs/ -o name=admin,secretfile=admin.secret


Client host df -h output
192.168.1.101:6789:/ 4.7T  928M  4.7T   1% /mnt/cephfs



--
Deepak





---
This email message is for the sole use of the intended recipient(s) and may 
contain
confidential information.  Any unauthorized review, use, disclosure or 
distribution
is prohibited.  If you are not the intended recipient, please contact the 
sender by
reply email and destroy all copies of the original message.
---
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD mirror error requesting lock: (30) Read-only file system

2017-03-16 Thread Jason Dillaman
Any chance you have two or more instances of the rbd-mirror daemon running
against the same cluster (zone2 in this instance)? The error message
is stating that there is another process that owns the exclusive lock
to the image and it is refusing to release it. The fact that the
status ping-pongs back-and-forth between OK and ERROR/WARNING also
hints that you have two or more rbd-mirror daemons fighting each
other. In the Jewel and Kraken releases, we unfortunately only support
a single rbd-mirror daemon process per cluster. In the forthcoming
Luminous release, we are hoping to add active/active support (it
already safely supports self-promoting active/passive if more than one
rbd-mirror daemon process is started).

On Thu, Mar 16, 2017 at 5:48 PM, daniel parkes  wrote:
> Hi!,
>
> I'm having a problem with a new ceph deployment using rbd mirroring and it's
> just in case someone can help me out or point me in the right direction.
>
> I have a ceph jewel install, with 2 clusters(zone1,zone2), rbd is working
> fine, but the rbd mirroring between sites is not working correctly.
>
> I have configured  pool replication in the default rbd pool, I have setup
> the peers and created 2 test images:
>
> [root@mon3 ceph]# rbd --user zone1 --cluster zone1 mirror pool info
> Mode: pool
> Peers:
>   UUID NAME  CLIENT
>   397b37ef-8300-4dd3-a637-2a03c3b9289c zone2 client.zone2
> [root@mon3 ceph]# rbd --user zone2 --cluster zone2 mirror pool info
> Mode: pool
> Peers:
>   UUID NAME  CLIENT
>   2c11f1dc-67a4-43f1-be33-b785f1f6b366 zone1 client.zone1
>
> Primary is ok:
>
> [root@mon3 ceph]# rbd --user zone1 --cluster zone1 mirror pool status
> --verbose
> health: OK
> images: 2 total
> 2 stopped
>
> test-2:
>   global_id:   511e3aa4-0e24-42b4-9c2e-8d84fc9f48f4
>   state:   up+stopped
>   description: remote image is non-primary or local image is primary
>   last_update: 2017-03-16 17:38:08
>
> And secondary is always in this state:
>
> [root@mon3 ceph]# rbd --user zone2 --cluster zone2 mirror pool status
> --verbose
> health: WARN
> images: 2 total
> 1 syncing
>
> test-2:
>   global_id:   511e3aa4-0e24-42b4-9c2e-8d84fc9f48f4
>   state:   up+syncing
>   description: bootstrapping, OPEN_LOCAL_IMAGE
>   last_update: 2017-03-16 17:41:02
>
> Sometimes for a couple of seconds it goes into replay state and health ok,
> but then back to bootstrapping, OPEN_LOCAL_IMAGE. what does this state
> mean?.
>
> In the log files I have this error:
>
> 2017-03-16 17:43:02.404372 7ff6262e7700 -1 librbd::ImageWatcher:
> 0x7ff654003190 error requesting lock: (30) Read-only file system
> 2017-03-16 17:43:03.411327 7ff6262e7700 -1 librbd::ImageWatcher:
> 0x7ff654003190 error requesting lock: (30) Read-only file system
> 2017-03-16 17:43:04.420074 7ff6262e7700 -1 librbd::ImageWatcher:
> 0x7ff654003190 error requesting lock: (30) Read-only file system
> 2017-03-16 17:43:05.422253 7ff6262e7700 -1 librbd::ImageWatcher:
> 0x7ff654003190 error requesting lock: (30) Read-only file system
> 2017-03-16 17:43:06.428447 7ff6262e7700 -1 librbd::ImageWatcher:
> 0x7ff654003190 error requesting lock: (30) Read-only file system
>
> Not sure to what file it refers that is RO, I have tried to strace it, but
> couldn't find it.
>
> I have disable selinux just in case but the result is the same the OS is
> rhel 7.2 by the way.
>
> If a do a demote/promote of the image, I get the same state and errors on
> the other cluster.
>
> If someone could help it would be great.
>
> Thnx in advance.
>
> Regards
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mirroring data between pools on the same cluster

2017-03-16 Thread Jason Dillaman
On Thu, Mar 16, 2017 at 3:24 PM, Adam Carheden  wrote:
> On Thu, Mar 16, 2017 at 11:55 AM, Jason Dillaman  wrote:
>> On Thu, Mar 16, 2017 at 1:02 PM, Adam Carheden  wrote:
>>> Ceph can mirror data between clusters
>>> (http://docs.ceph.com/docs/master/rbd/rbd-mirroring/), but can it
>>> mirror data between pools in the same cluster?
>>
>> Unfortunately, that's a negative. The rbd-mirror daemon currently
>> assumes that the local and remote pool names are the same. Therefore,
>> you cannot mirror images between a pool named "X" and a pool named
>> "Y".
> I figured as much from the command syntax. Am I going about this all
> wrong? There have got to be lots of orgs with two rooms that back each
> other up. How do others solve that problem?

Not sure. This is definitely the first time I've heard this as an
example for RBD mirroring. However, it's a relatively new feature and
we expect the more people that use it, the more interesting scenarios
we will learn about.

> How about a single 10Gb fiber link (which is, unfortunately, used for
> everything, not just CEPH)? Any advice on estimating if/when latency
> over a single link will become a problem?

A quick end-to-end performance test would probably quickly answer that
question from a TCP/IP perspective. Ceph IO latencies will be a
combination of the network latency (client to primary PG and primary
PG to secondary PGs replication), disk IO latency, and Ceph software
latency.
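
As a back-of-envelope illustration of those components (the numbers below are 
assumptions for discussion, not measurements from this setup), a simple additive 
model for one replicated write could look like this:

// Rough additive latency model (an assumption, not a Ceph formula).
#include <cstdio>

struct WriteCost {
  double client_rtt_ms;    // client <-> primary OSD round trip
  double replica_rtt_ms;   // primary <-> slowest replica round trip
  double disk_commit_ms;   // slowest journal/data commit
  double software_ms;      // messenger, PG processing, queueing
};

double estimate_write_latency_ms(const WriteCost& c) {
  return c.client_rtt_ms + c.replica_rtt_ms + c.disk_commit_ms + c.software_ms;
}

int main() {
  // Assumed numbers: replicas in the same room vs. a replica reached over the
  // shared inter-room 10Gb link mentioned earlier in the thread.
  WriteCost same_room{0.2, 0.2, 1.0, 1.0};
  WriteCost cross_room{0.2, 1.5, 1.0, 1.0};
  std::printf("same room : ~%.1f ms per write\n", estimate_write_latency_ms(same_room));
  std::printf("cross room: ~%.1f ms per write\n", estimate_write_latency_ms(cross_room));
  return 0;
}

Measuring the real per-hop RTT (as suggested above) and plugging it into a model 
like this gives a first idea of when the single link starts to dominate write 
latency.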

>> At the current time, I think three separate clusters would be the only
>> thing that could satisfy all use-case requirements. While I have never
>> attempted this, I would think that you should be able to run two
>> clusters on the same node (e.g. the HA cluster gets one OSD per node
>> in both rooms and the roomX cluster gets the remainder of OSDs in each
>> node in its respective room).
>
> Great idea. I guess that could be done either by munging some port
> numbers and non-default config file locations or by running CEPH OSDs
> and monitors on VMs. Any compelling reason for one way over the other?

Containerized Ceph is another alternative and is gaining interest. If
you use VMs, you will take a slight performance hit from the
virtualization but the existing deployment tools will work w/o
modification. As an alternative, use the existing deployment tools to
deploy the two "room" clusters and then just manually create the few
extra OSDs and MONs for the HA cluster.

> --
> Adam Carheden
> Systems Administrator
> UCAR/NCAR/RAL
> x2753

-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RBD mirror error requesting lock: (30) Read-only file system

2017-03-16 Thread daniel parkes
Hi!,

I'm having a problem with a new ceph deployment using rbd mirroring and
it's just in case someone can help me out or point me in the right
direction.

I have a ceph jewel install, with 2 clusters(zone1,zone2), rbd is working
fine, but the rbd mirroring between sites is not working correctly.

I have configured  pool replication in the default rbd pool, I have setup
the peers and created 2 test images:

[root@mon3 ceph]# rbd --user zone1 --cluster zone1 mirror pool info
Mode: pool
Peers:
  UUID NAME  CLIENT
  397b37ef-8300-4dd3-a637-2a03c3b9289c zone2 client.zone2
[root@mon3 ceph]# rbd --user zone2 --cluster zone2 mirror pool info
Mode: pool
Peers:
  UUID NAME  CLIENT
  2c11f1dc-67a4-43f1-be33-b785f1f6b366 zone1 client.zone1

Primary is ok:

[root@mon3 ceph]# rbd --user zone1 --cluster zone1 mirror pool status
--verbose
health: OK
images: 2 total
2 stopped

test-2:
  global_id:   511e3aa4-0e24-42b4-9c2e-8d84fc9f48f4
  state:   up+stopped
  description: remote image is non-primary or local image is primary
  last_update: 2017-03-16 17:38:08

And secondary is always in this state:

[root@mon3 ceph]# rbd --user zone2 --cluster zone2 mirror pool status
--verbose
health: WARN
images: 2 total
1 syncing

test-2:
  global_id:   511e3aa4-0e24-42b4-9c2e-8d84fc9f48f4
  state:   up+syncing
  description: bootstrapping, OPEN_LOCAL_IMAGE
  last_update: 2017-03-16 17:41:02

Sometimes for a couple of seconds it goes into replay state and health ok,
but then back to bootstrapping, OPEN_LOCAL_IMAGE. what does this state
mean?.

In the log files I have this error:

2017-03-16 17:43:02.404372 7ff6262e7700 -1 librbd::ImageWatcher:
0x7ff654003190 error requesting lock: (30) Read-only file system
2017-03-16 17:43:03.411327 7ff6262e7700 -1 librbd::ImageWatcher:
0x7ff654003190 error requesting lock: (30) Read-only file system
2017-03-16 17:43:04.420074 7ff6262e7700 -1 librbd::ImageWatcher:
0x7ff654003190 error requesting lock: (30) Read-only file system
2017-03-16 17:43:05.422253 7ff6262e7700 -1 librbd::ImageWatcher:
0x7ff654003190 error requesting lock: (30) Read-only file system
2017-03-16 17:43:06.428447 7ff6262e7700 -1 librbd::ImageWatcher:
0x7ff654003190 error requesting lock: (30) Read-only file system

Not sure to what file it refers that is RO, I have tried to strace it, but
couldn't find it.

I have disable selinux just in case but the result is the same the OS is
rhel 7.2 by the way.

If a do a demote/promote of the image, I get the same state and errors on
the other cluster.

If someone could help it would be great.

Thnx in advance.

Regards
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] radosgw global quotas

2017-03-16 Thread Graham Allan
This might be a dumb question, but I'm not at all sure what the "global 
quotas" in the radosgw region map actually do.


Is it like a default quota which is applied to all users or buckets, 
without having to set them individually, or is it a blanket/aggregate 
quota applied across all users and buckets in the region/zonegroup?


Graham
--
Graham Allan
Minnesota Supercomputing Institute - g...@umn.edu
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mirroring data between pools on the same cluster

2017-03-16 Thread Adam Carheden
On Thu, Mar 16, 2017 at 11:55 AM, Jason Dillaman  wrote:
> On Thu, Mar 16, 2017 at 1:02 PM, Adam Carheden  wrote:
>> Ceph can mirror data between clusters
>> (http://docs.ceph.com/docs/master/rbd/rbd-mirroring/), but can it
>> mirror data between pools in the same cluster?
>
> Unfortunately, that's a negative. The rbd-mirror daemon currently
> assumes that the local and remote pool names are the same. Therefore,
> you cannot mirror images between a pool named "X" and a pool named
> "Y".
I figured as much from the command syntax. Am I going about this all
wrong? There have got to be lots of orgs with two rooms that back each
other up. How do others solve that problem?

How about a single 10Gb fiber link (which is, unfortunately, used for
everything, not just CEPH)? Any advice on estimating if/when latency
over a single link will become a problem?

> At the current time, I think three separate clusters would be the only
> thing that could satisfy all use-case requirements. While I have never
> attempted this, I would think that you should be able to run two
> clusters on the same node (e.g. the HA cluster gets one OSD per node
> in both rooms and the roomX cluster gets the remainder of OSDs in each
> node in its respective room).

Great idea. I guess that could be done either by munging some port
numbers and non-default config file locations or by running CEPH OSDs
and monitors on VMs. Any compelling reason for one way over the other?

-- 
Adam Carheden
Systems Administrator
UCAR/NCAR/RAL
x2753
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mds fails to start after upgrading to 10.2.6

2017-03-16 Thread John Spray
On Thu, Mar 16, 2017 at 5:50 PM, Chad William Seys
 wrote:
> Hi All,
>   After upgrading to 10.2.6 on Debian Jessie, the MDS server fails to start.
> Below is what is written to the log file from attempted start to failure:
>   Any ideas?  I'll probably try rolling back to 10.2.5 in the meantime.

Looks similar to this ticket:
http://tracker.ceph.com/issues/16842

John

> Thanks!
> C.
>
> On 03/16/2017 12:48 PM, r...@mds01.hep.wisc.edu wrote:
>>
>> 2017-03-16 12:46:38.063709 7f605e746180  0 set uid:gid to 64045:64045
>> (ceph:ceph)
>> 2017-03-16 12:46:38.063825 7f605e746180  0 ceph version 10.2.6
>> (656b5b63ed7c43bd014bcafd81b001959d5f089f), process ceph-mds, pid 10858
>> 2017-03-16 12:46:39.755982 7f6057b62700  1 mds.mds01.hep.wisc.edu
>> handle_mds_map standby
>> 2017-03-16 12:46:39.898430 7f6057b62700  1 mds.0.4072 handle_mds_map i am
>> now mds.0.4072
>> 2017-03-16 12:46:39.898437 7f6057b62700  1 mds.0.4072 handle_mds_map state
>> change up:boot --> up:replay
>> 2017-03-16 12:46:39.898459 7f6057b62700  1 mds.0.4072 replay_start
>> 2017-03-16 12:46:39.898466 7f6057b62700  1 mds.0.4072  recovery set is
>> 2017-03-16 12:46:39.898475 7f6057b62700  1 mds.0.4072  waiting for osdmap
>> 253396 (which blacklists prior instance)
>> 2017-03-16 12:46:40.227204 7f6052956700  0 mds.0.cache creating system
>> inode with ino:100
>> 2017-03-16 12:46:40.227569 7f6052956700  0 mds.0.cache creating system
>> inode with ino:1
>> 2017-03-16 12:46:40.954494 7f6050d48700  1 mds.0.4072 replay_done
>> 2017-03-16 12:46:40.954526 7f6050d48700  1 mds.0.4072 making mds journal
>> writeable
>> 2017-03-16 12:46:42.211070 7f6057b62700  1 mds.0.4072 handle_mds_map i am
>> now mds.0.4072
>> 2017-03-16 12:46:42.211074 7f6057b62700  1 mds.0.4072 handle_mds_map state
>> change up:replay --> up:reconnect
>> 2017-03-16 12:46:42.211094 7f6057b62700  1 mds.0.4072 reconnect_start
>> 2017-03-16 12:46:42.211098 7f6057b62700  1 mds.0.4072 reopen_log
>> 2017-03-16 12:46:42.211105 7f6057b62700  1 mds.0.server reconnect_clients
>> -- 5 sessions
>> 2017-03-16 12:47:28.502417 7f605535d700  1 mds.0.server reconnect gave up
>> on client.14384220 10.128.198.55:0/2012593454
>> 2017-03-16 12:47:28.505126 7f605535d700 -1 ./include/interval_set.h: In
>> function 'void interval_set::insert(T, T, T*, T*) [with T = inodeno_t]'
>> thread 7f605535d700 time 2017-03-16 12:47:28.502496
>> ./include/interval_set.h: 355: FAILED assert(0)
>>
>>  ceph version 10.2.6 (656b5b63ed7c43bd014bcafd81b001959d5f089f)
>>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> const*)+0x82) [0x7f605e248ed2]
>>  2: (()+0x1ea5fe) [0x7f605de245fe]
>>  3: (InoTable::project_release_ids(interval_set&)+0x917)
>> [0x7f605e065ad7]
>>  4: (Server::journal_close_session(Session*, int, Context*)+0x18e)
>> [0x7f605de89f1e]
>>  5: (Server::kill_session(Session*, Context*)+0x133) [0x7f605de8bf23]
>>  6: (Server::reconnect_tick()+0x148) [0x7f605de8d378]
>>  7: (MDSRankDispatcher::tick()+0x389) [0x7f605de524d9]
>>  8: (Context::complete(int)+0x9) [0x7f605de3fcd9]
>>  9: (SafeTimer::timer_thread()+0x104) [0x7f605e239e84]
>>  10: (SafeTimerThread::entry()+0xd) [0x7f605e23ad2d]
>>  11: (()+0x8064) [0x7f605d53d064]
>>  12: (clone()+0x6d) [0x7f605ba8262d]
>>  NOTE: a copy of the executable, or `objdump -rdS ` is needed
>> to interpret this.
>>
>> --- begin dump of recent events ---
>>   -263> 2017-03-16 12:46:38.056353 7f605e746180  5 asok(0x7f6068a2a000)
>> register_command perfcounters_dump hook 0x7f6068a06030
>>   -262> 2017-03-16 12:46:38.056425 7f605e746180  5 asok(0x7f6068a2a000)
>> register_command 1 hook 0x7f6068a06030
>>   -261> 2017-03-16 12:46:38.056431 7f605e746180  5 asok(0x7f6068a2a000)
>> register_command perf dump hook 0x7f6068a06030
>>   -260> 2017-03-16 12:46:38.056434 7f605e746180  5 asok(0x7f6068a2a000)
>> register_command perfcounters_schema hook 0x7f6068a06030
>>   -259> 2017-03-16 12:46:38.056437 7f605e746180  5 asok(0x7f6068a2a000)
>> register_command 2 hook 0x7f6068a06030
>>   -258> 2017-03-16 12:46:38.056440 7f605e746180  5 asok(0x7f6068a2a000)
>> register_command perf schema hook 0x7f6068a06030
>>   -257> 2017-03-16 12:46:38.056444 7f605e746180  5 asok(0x7f6068a2a000)
>> register_command perf reset hook 0x7f6068a06030
>>   -256> 2017-03-16 12:46:38.056448 7f605e746180  5 asok(0x7f6068a2a000)
>> register_command config show hook 0x7f6068a06030
>>   -255> 2017-03-16 12:46:38.056457 7f605e746180  5 asok(0x7f6068a2a000)
>> register_command config set hook 0x7f6068a06030
>>   -254> 2017-03-16 12:46:38.056461 7f605e746180  5 asok(0x7f6068a2a000)
>> register_command config get hook 0x7f6068a06030
>>   -253> 2017-03-16 12:46:38.056464 7f605e746180  5 asok(0x7f6068a2a000)
>> register_command config diff hook 0x7f6068a06030
>>   -252> 2017-03-16 12:46:38.056466 7f605e746180  5 asok(0x7f6068a2a000)
>> register_command log flush hook 0x7f6068a06030
>>   -251> 2017-03-16 12:46:38.056469 7f605e746180  5 asok(0x7f6068a2a000)
>> register_command log 

Re: [ceph-users] Re: Re: Re: Pipe "deadlock" in Hammer, 0.94.5

2017-03-16 Thread Gregory Farnum
On Thu, Mar 16, 2017 at 3:36 AM 许雪寒  wrote:

> Hi, Gregory, is it possible to unlock Connection::lock in
> Pipe::read_message before tcp_read_nonblocking is called? I checked the
> code again, it seems that the code in tcp_read_nonblocking doesn't need to
> be locked by Connection::lock.


Unfortunately it does. You'll note the memory buffers it's grabbing via the
Connection? Those need to be protected from changing (either being
canceled, or being set up) while the read is being processed.
Now, you could probably do something more complicated around the buffer
update mechanism, or if you know your applications don't make use of it you
could just rip them out entirely. But while that mechanism exists it needs
to be synchronized.
-Greg



>
> -----Original Message-----
> From: Gregory Farnum [mailto:gfar...@redhat.com]
> Sent: January 17, 2017 7:14
> To: 许雪寒
> Cc: jiajia zhong; ceph-users@lists.ceph.com
> Subject: Re: Re: [ceph-users] Re: Re: Pipe "deadlock" in Hammer, 0.94.5
>
> On Sat, Jan 14, 2017 at 7:54 PM, 许雪寒  wrote:
> > Thanks for your help:-)
> >
> > I checked the source code again, and in read_message, it does hold the
> Connection::lock:
>
> You're correct of course; I wasn't looking and forgot about this bit.
> This was added to deal with client-allocated buffers and/or op
> cancellation in librados, IIRC, and unfortunately definitely does need to
> be synchronized — I'm not sure about with pipe lookups, but probably even
> that. :/
>
> Unfortunately it looks like you're running a version that didn't come from
> upstream (I see hash 81d4ad40d0c2a4b73529ff0db3c8f22acd15c398 in another
> email, which I can't find), so there's not much we can do to help with the
> specifics of this case — it's fiddly and my guess would be the same as
> Sage's, which you say is not the case.
> -Greg
>
> >
> >   while (left > 0) {
> >     // wait for data
> >     if (tcp_read_wait() < 0)
> >       goto out_dethrottle;
> >
> >     // get a buffer
> >     connection_state->lock.Lock();
> >     map<ceph_tid_t,pair<bufferlist,int> >::iterator p =
> >       connection_state->rx_buffers.find(header.tid);
> >     if (p != connection_state->rx_buffers.end()) {
> >       if (rxbuf.length() == 0 || p->second.second != rxbuf_version) {
> >         ldout(msgr->cct,10) << "reader seleting rx buffer v "
> >                             << p->second.second << " at offset " << offset
> >                             << " len " << p->second.first.length() << dendl;
> >         rxbuf = p->second.first;
> >         rxbuf_version = p->second.second;
> >         // make sure it's big enough
> >         if (rxbuf.length() < data_len)
> >           rxbuf.push_back(buffer::create(data_len - rxbuf.length()));
> >         blp = p->second.first.begin();
> >         blp.advance(offset);
> >       }
> >     } else {
> >       if (!newbuf.length()) {
> >         ldout(msgr->cct,20) << "reader allocating new rx buffer at offset "
> >                             << offset << dendl;
> >         alloc_aligned_buffer(newbuf, data_len, data_off);
> >         blp = newbuf.begin();
> >         blp.advance(offset);
> >       }
> >     }
> >     bufferptr bp = blp.get_current_ptr();
> >     int read = MIN(bp.length(), left);
> >     ldout(msgr->cct,20) << "reader reading nonblocking into "
> >                         << (void*) bp.c_str() << " len " << bp.length() << dendl;
> >     int got = tcp_read_nonblocking(bp.c_str(), read);
> >     ldout(msgr->cct,30) << "reader read " << got << " of " << read << dendl;
> >     connection_state->lock.Unlock();
> >     if (got < 0)
> >       goto out_dethrottle;
> >     if (got > 0) {
> >       blp.advance(got);
> >       data.append(bp, 0, got);
> >   

Re: [ceph-users] Mirroring data between pools on the same cluster

2017-03-16 Thread Jason Dillaman
On Thu, Mar 16, 2017 at 1:02 PM, Adam Carheden  wrote:
> Ceph can mirror data between clusters
> (http://docs.ceph.com/docs/master/rbd/rbd-mirroring/), but can it
> mirror data between pools in the same cluster?

Unfortunately, that's a negative. The rbd-mirror daemon currently
assumes that the local and remote pool names are the same. Therefore,
you cannot mirror images between a pool named "X" and a pool named
"Y".

> My use case is DR in the event of a room failure. I have a single CEPH
> cluster that spans multiple rooms. The two rooms have separate power
> and cooling, but have a single 10Gbe link between them (actually 2 w/
> active-passive failover). I can configure pools and crushmaps to keep
> data local to each room so my single link doesn't become a bottleneck.
> However, I'd like to be able to recover quickly if a room UPS fails.
>
> Ideally I'd like something like this:
>
> HA pool - spans rooms but we limit how much we put on it to avoid
> latency or saturation issues with our single 10Gbe link.
> room1 pool - Writes only to OSDs in room 1
> room2 pool - Writes only to OSDs in room 2
> room1-backup pool - Asynchronous mirror of room1 pool that writes only
> to OSDs in room 2
> room2-backup pool - Asynchronous mirror of room2 pool that writes only
> to OSDs in room 1
>
> In the event of a room failure, my very important stuff migrates or
> reboots immediately in the other room without any manual steps. For
> everything else, I manually spin up new VMs (scripted, of course) that
> run from the mirrored backups.
>
> Is this possible?
>
> If I made it two separate CEPH clusters, how would I do the automated
> HA failover? I could have 3 clusters (HA, room1, room2, mirroring
> between room1 and room2), but then each cluster would be so small (2
> nodes, 3 nodes) that node failure becomes more of a risk than room
> failure.

At the current time, I think three separate clusters would be the only
thing that could satisfy all use-case requirements. While I have never
attempted this, I would think that you should be able to run two
clusters on the same node (e.g. the HA cluster gets one OSD per node
in both rooms and the roomX cluster gets the remainder of OSDs in each
node in its respective room).
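
Purely as a sketch of how that could be laid out (the cluster name "ha" is an
assumption, and each cluster would need its own fsid, monitors and ports):

  /etc/ceph/ceph.conf   # the per-room cluster, using most of the OSDs on the host
  /etc/ceph/ha.conf     # the small stretched "HA" cluster, one OSD per host

  ceph --cluster ceph -s
  ceph --cluster ha -s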

>
> (And yes, I do have a 3rd small room with monitors running so if one
> of the primary rooms goes down monitors in the remaining room + 3rd
> room have a quorum)
>
> Thanks
> --
> Adam Carheden
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] mds fails to start after upgrading to 10.2.6

2017-03-16 Thread Chad William Seys

Hi All,
  After upgrading to 10.2.6 on Debian Jessie, the MDS server fails to 
start.  Below is what is written to the log file from attempted start to 
failure:

  Any ideas?  I'll probably try rolling back to 10.2.5 in the meantime.

Thanks!
C.

On 03/16/2017 12:48 PM, r...@mds01.hep.wisc.edu wrote:

2017-03-16 12:46:38.063709 7f605e746180  0 set uid:gid to 64045:64045 
(ceph:ceph)
2017-03-16 12:46:38.063825 7f605e746180  0 ceph version 10.2.6 
(656b5b63ed7c43bd014bcafd81b001959d5f089f), process ceph-mds, pid 10858
2017-03-16 12:46:39.755982 7f6057b62700  1 mds.mds01.hep.wisc.edu 
handle_mds_map standby
2017-03-16 12:46:39.898430 7f6057b62700  1 mds.0.4072 handle_mds_map i am now 
mds.0.4072
2017-03-16 12:46:39.898437 7f6057b62700  1 mds.0.4072 handle_mds_map state change 
up:boot --> up:replay
2017-03-16 12:46:39.898459 7f6057b62700  1 mds.0.4072 replay_start
2017-03-16 12:46:39.898466 7f6057b62700  1 mds.0.4072  recovery set is
2017-03-16 12:46:39.898475 7f6057b62700  1 mds.0.4072  waiting for osdmap 
253396 (which blacklists prior instance)
2017-03-16 12:46:40.227204 7f6052956700  0 mds.0.cache creating system inode 
with ino:100
2017-03-16 12:46:40.227569 7f6052956700  0 mds.0.cache creating system inode 
with ino:1
2017-03-16 12:46:40.954494 7f6050d48700  1 mds.0.4072 replay_done
2017-03-16 12:46:40.954526 7f6050d48700  1 mds.0.4072 making mds journal 
writeable
2017-03-16 12:46:42.211070 7f6057b62700  1 mds.0.4072 handle_mds_map i am now 
mds.0.4072
2017-03-16 12:46:42.211074 7f6057b62700  1 mds.0.4072 handle_mds_map state change 
up:replay --> up:reconnect
2017-03-16 12:46:42.211094 7f6057b62700  1 mds.0.4072 reconnect_start
2017-03-16 12:46:42.211098 7f6057b62700  1 mds.0.4072 reopen_log
2017-03-16 12:46:42.211105 7f6057b62700  1 mds.0.server reconnect_clients -- 5 
sessions
2017-03-16 12:47:28.502417 7f605535d700  1 mds.0.server reconnect gave up on 
client.14384220 10.128.198.55:0/2012593454
2017-03-16 12:47:28.505126 7f605535d700 -1 ./include/interval_set.h: In function 
'void interval_set::insert(T, T, T*, T*) [with T = inodeno_t]' thread 
7f605535d700 time 2017-03-16 12:47:28.502496
./include/interval_set.h: 355: FAILED assert(0)

 ceph version 10.2.6 (656b5b63ed7c43bd014bcafd81b001959d5f089f)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x82) 
[0x7f605e248ed2]
 2: (()+0x1ea5fe) [0x7f605de245fe]
 3: (InoTable::project_release_ids(interval_set&)+0x917) 
[0x7f605e065ad7]
 4: (Server::journal_close_session(Session*, int, Context*)+0x18e) 
[0x7f605de89f1e]
 5: (Server::kill_session(Session*, Context*)+0x133) [0x7f605de8bf23]
 6: (Server::reconnect_tick()+0x148) [0x7f605de8d378]
 7: (MDSRankDispatcher::tick()+0x389) [0x7f605de524d9]
 8: (Context::complete(int)+0x9) [0x7f605de3fcd9]
 9: (SafeTimer::timer_thread()+0x104) [0x7f605e239e84]
 10: (SafeTimerThread::entry()+0xd) [0x7f605e23ad2d]
 11: (()+0x8064) [0x7f605d53d064]
 12: (clone()+0x6d) [0x7f605ba8262d]
 NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
interpret this.

--- begin dump of recent events ---
  -263> 2017-03-16 12:46:38.056353 7f605e746180  5 asok(0x7f6068a2a000) 
register_command perfcounters_dump hook 0x7f6068a06030
  -262> 2017-03-16 12:46:38.056425 7f605e746180  5 asok(0x7f6068a2a000) 
register_command 1 hook 0x7f6068a06030
  -261> 2017-03-16 12:46:38.056431 7f605e746180  5 asok(0x7f6068a2a000) 
register_command perf dump hook 0x7f6068a06030
  -260> 2017-03-16 12:46:38.056434 7f605e746180  5 asok(0x7f6068a2a000) 
register_command perfcounters_schema hook 0x7f6068a06030
  -259> 2017-03-16 12:46:38.056437 7f605e746180  5 asok(0x7f6068a2a000) 
register_command 2 hook 0x7f6068a06030
  -258> 2017-03-16 12:46:38.056440 7f605e746180  5 asok(0x7f6068a2a000) 
register_command perf schema hook 0x7f6068a06030
  -257> 2017-03-16 12:46:38.056444 7f605e746180  5 asok(0x7f6068a2a000) 
register_command perf reset hook 0x7f6068a06030
  -256> 2017-03-16 12:46:38.056448 7f605e746180  5 asok(0x7f6068a2a000) 
register_command config show hook 0x7f6068a06030
  -255> 2017-03-16 12:46:38.056457 7f605e746180  5 asok(0x7f6068a2a000) 
register_command config set hook 0x7f6068a06030
  -254> 2017-03-16 12:46:38.056461 7f605e746180  5 asok(0x7f6068a2a000) 
register_command config get hook 0x7f6068a06030
  -253> 2017-03-16 12:46:38.056464 7f605e746180  5 asok(0x7f6068a2a000) 
register_command config diff hook 0x7f6068a06030
  -252> 2017-03-16 12:46:38.056466 7f605e746180  5 asok(0x7f6068a2a000) 
register_command log flush hook 0x7f6068a06030
  -251> 2017-03-16 12:46:38.056469 7f605e746180  5 asok(0x7f6068a2a000) 
register_command log dump hook 0x7f6068a06030
  -250> 2017-03-16 12:46:38.056472 7f605e746180  5 asok(0x7f6068a2a000) 
register_command log reopen hook 0x7f6068a06030
  -249> 2017-03-16 12:46:38.063709 7f605e746180  0 set uid:gid to 64045:64045 
(ceph:ceph)
  -248> 2017-03-16 12:46:38.063825 7f605e746180  0 ceph version 10.2.6 
(656b5b63ed7c43bd014bcafd81b001959d5f089f), process ceph-mds, pid 10858
  

Re: [ceph-users] Directly addressing files on individual OSD

2017-03-16 Thread Ronny Aasen

On 16.03.2017 08:26, Youssef Eldakar wrote:

Thanks for the reply, Anthony, and I am sorry my question did not give 
sufficient background.

This is the cluster behind archive.bibalex.org. Storage nodes keep archived 
webpages as multi-member GZIP files on the disks, which are formatted using XFS 
as standalone file systems. The access system consults an index that says where 
a URL is stored, which is then fetched over HTTP from the individual storage 
node that has the URL somewhere on one of the disks. So far, we have pretty 
much been managing the storage using homegrown scripts to have each GZIP file 
stored on 2 separate nodes. This obviously has been requiring a good deal of 
manual work and as such has not been very effective.

Given that description, do you feel Ceph could be an appropriate choice?


if you adapt your scripts to something like...

"Storage nodes archives webpages as gzip files, hashes the url to use as 
an object name and saves the gzipfiles as an object in ceph via the S3 
interface.  The access system gets a request for an url, it hashes an 
url into a object name and fetch the gzip (object) using regular S3 get 
syntax"


Ceph would deal with replication; you would only put objects in and
fetch them out (a rough sketch of that flow is at the end of this message).


You could, if you need it, still keep the list of URLs and hashes, but only as
an inventory of what you have stored.
This is just an example, though; you could also use cephfs, mounted on the
nodes, and serve files as you do today.


Ceph is just a storage tool; it could work very nicely for your needs,
but accessing the files on the OSDs directly will only bring pain.
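
As a very rough illustration of that put/get flow (the bucket name, the choice
of sha256 and the use of s3cmd are all just assumptions; any S3 client pointed
at radosgw would do):

  key=$(printf '%s' "$url" | sha256sum | awk '{print $1}')
  s3cmd put /0/items/example.warc.gz s3://webarchive/"$key"
  s3cmd get s3://webarchive/"$key" /tmp/"$key".warc.gz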



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Mirroring data between pools on the same cluster

2017-03-16 Thread Adam Carheden
Ceph can mirror data between clusters
(http://docs.ceph.com/docs/master/rbd/rbd-mirroring/), but can it
mirror data between pools in the same cluster?

My use case is DR in the event of a room failure. I have a single CEPH
cluster that spans multiple rooms. The two rooms have separate power
and cooling, but have a single 10Gbe link between them (actually 2 w/
active-passive failover). I can configure pools and crushmaps to keep
data local to each room so my single link doesn't become a bottleneck.
However, I'd like to be able to recover quickly if a room UPS fails.

Ideally I'd like something like this:

HA pool - spans rooms but we limit how much we put on it to avoid
latency or saturation issues with our single 10Gbe link.
room1 pool - Writes only to OSDs in room 1
room2 pool - Writes only to OSDs in room 2
room1-backup pool - Asynchronous mirror of room1 pool that writes only
to OSDs in room 2
room2-backup pool - Asynchronous mirror of room2 pool that writes only
to OSDs in room 1

In the event of a room failure, my very important stuff migrates or
reboots immediately in the other room without any manual steps. For
everything else, I manually spin up new VMs (scripted, of course) that
run from the mirrored backups.

Is this possible?

If I made it two separate CEPH clusters, how would I do the automated
HA failover? I could have 3 clusters (HA, room1, room2, mirroring
between room1 and room2), but then each cluster would be so small (2
nodes, 3 nodes) that node failure becomes more of a risk than room
failure.


(And yes, I do have a 3rd small room with monitors running so if one
of the primary rooms goes down monitors in the remaining room + 3rd
room have a quorum)

Thanks
-- 
Adam Carheden
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Directly addressing files on individual OSD

2017-03-16 Thread Youssef Eldakar
I found this

http://ceph.com/geen-categorie/ceph-osd-where-is-my-data/

which leads me to think I can perhaps directly process the files on the OSD by 
going to the /var/lib/ceph/osd directory.

Would that make sense?

Youssef Eldakar

Bibliotheca Alexandrina

From: Youssef Eldakar
Sent: Thursday, March 16, 2017 09:26
To: Anthony D'Atri; ceph-users@lists.ceph.com
Subject: RE: [ceph-users] Directly addressing files on individual OSD

Thanks for the reply, Anthony, and I am sorry my question did not give 
sufficient background.

This is the cluster behind archive.bibalex.org. Storage nodes keep archived 
webpages as multi-member GZIP files on the disks, which are formatted using XFS 
as standalone file systems. The access system consults an index that says where 
a URL is stored, which is then fetched over HTTP from the individual storage 
node that has the URL somewhere on one of the disks. So far, we have pretty 
much been managing the storage using homegrown scripts to have each GZIP file 
stored on 2 separate nodes. This obviously has been requiring a good deal of 
manual work and as such has not been very effective.

Given that description, do you feel Ceph could be an appropriate choice?

Thanks once again for the reply.

Youssef Eldakar
Bibliotheca Alexandrina

From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Anthony 
D'Atri [a...@dreamsnake.net]
Sent: Thursday, March 16, 2017 01:37
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Directly addressing files on individual OSD

As I parse Youssef’s message, I believe there are some misconceptions.  It 
might help if you could give a bit more info on what your existing ‘cluster’ is 
running.  NFS? CIFS/SMB?  Something else?

1) Ceph regularly runs scrubs to ensure that all copies of data are consistent. 
 The checksumming that you describe would be both infeasible and redundant.

2) It sounds as though your current back-end stores user files as-is and is 
either a traditional file server setup or perhaps a virtual filesystem 
aggregating multiple filesystems.  Ceph is not a file storage solution in this 
sense.  The below sounds as though you want user files to not be sharded across 
multiple servers.  This is antithetical to how Ceph works and is counter to 
data durability and availability, unless there is some replication that you 
haven’t described.  Reference this diagram:

http://docs.ceph.com/docs/master/_images/stack.png

Beneath the hood Ceph operates internally on ‘objects’ that are not exposed to 
clients as such. There are several different client interfaces that are built 
on top of this block service:

- RBD volumes — think in terms of a virtual disk drive attached to a VM
- RGW — like Amazon S3 or Swift
- CephFS — provides a mountable filesystem interface, somewhat like NFS or even 
SMB but with important distinctions in behavior and use-case

I had not heard of iRODS before but just looked it up.  It is a very different 
thing than any of the common interfaces to Ceph.

If your users need to mount the storage as a share / volume, in the sense of 
SMB or NFS, then Ceph may not be your best option.  If they can cope with an S3 
/ Swift type REST object interface, a cluster with RGW interfaces might do the 
job, or perhaps Swift or Gluster.   It’s hard to say for sure based on 
assumptions of what you need.

— Anthony


We currently run a commodity cluster that supports a few petabytes of data. 
Each node in the cluster has 4 drives, currently mounted as /0 through /3. We 
have been researching alternatives for managing the storage, Ceph being one 
possibility, iRODS being another. For preservation purposes, we would like each 
file to exist as one whole piece per drive (as opposed to being striped across 
multiple drives). It appears this is the default in Ceph.

Now, it has always been convenient for us to run distributed jobs over SSH to, 
for instance, compile a list of checksums of all files in the cluster:

dsh -Mca 'find /{0..3}/items -name \*.warc.gz | xargs md5sum 
>/tmp/$HOSTNAME.md5sum'

And that nicely allows each node to process its own files using the local CPU.

Would this scenario still be possible where Ceph is managing the storage?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph-deploy and git.ceph.com

2017-03-16 Thread Shain Miley
It looks like things are working a bit better today…however now I am getting 
the following error:

[hqosd6][DEBUG ] detect platform information from remote host
[hqosd6][DEBUG ] detect machine type
[ceph_deploy.install][INFO  ] Distro info: Ubuntu 14.04 trusty
[hqosd6][INFO  ] installing ceph on hqosd6
[hqosd6][INFO  ] Running command: env DEBIAN_FRONTEND=noninteractive apt-get -q 
install --assume-yes ca-certificates
[hqosd6][DEBUG ] Reading package lists...
[hqosd6][DEBUG ] Building dependency tree...
[hqosd6][DEBUG ] Reading state information...
[hqosd6][DEBUG ] ca-certificates is already the newest version.
[hqosd6][DEBUG ] 0 upgraded, 0 newly installed, 0 to remove and 3 not upgraded.
[hqosd6][INFO  ] Running command: wget -O release.asc 
https://ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/release.asc
[hqosd6][WARNIN] --2017-03-16 10:25:17--  
https://ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/release.asc
[hqosd6][WARNIN] Resolving ceph.com (ceph.com)... 158.69.68.141
[hqosd6][WARNIN] Connecting to ceph.com (ceph.com)|158.69.68.141|:443... 
connected.
[hqosd6][WARNIN] HTTP request sent, awaiting response... 301 Moved Permanently
[hqosd6][WARNIN] Location: 
https://git.ceph.com/?p=ceph.git;a=blob_plain;f=keys/release.asc [following]
[hqosd6][WARNIN] --2017-03-16 10:25:17--  
https://git.ceph.com/?p=ceph.git;a=blob_plain;f=keys/release.asc
[hqosd6][WARNIN] Resolving git.ceph.com (git.ceph.com)... 8.43.84.132
[hqosd6][WARNIN] Connecting to git.ceph.com (git.ceph.com)|8.43.84.132|:443... 
connected.
[hqosd6][WARNIN] HTTP request sent, awaiting response... 200 OK
[hqosd6][WARNIN] Length: 1645 (1.6K) [text/plain]
[hqosd6][WARNIN] Saving to: ‘release.asc’
[hqosd6][WARNIN] 
[hqosd6][WARNIN]  0K . 
100%  219M=0s
[hqosd6][WARNIN] 
[hqosd6][WARNIN] 2017-03-16 10:25:17 (219 MB/s) - ‘release.asc’ saved 
[1645/1645]
[hqosd6][WARNIN] 
[hqosd6][INFO  ] Running command: apt-key add release.asc
[hqosd6][DEBUG ] OK
[hqosd6][DEBUG ] add deb repo to sources.list
[hqosd6][INFO  ] Running command: apt-get -q update
[hqosd6][DEBUG ] Ign http://us.archive.ubuntu.com trusty InRelease
[hqosd6][DEBUG ] Hit http://us.archive.ubuntu.com trusty-updates InRelease
[hqosd6][DEBUG ] Hit http://us.archive.ubuntu.com trusty-backports InRelease
[hqosd6][DEBUG ] Get:1 http://us.archive.ubuntu.com trusty Release.gpg [933 B]
[hqosd6][DEBUG ] Hit http://security.ubuntu.com trusty-security InRelease
[hqosd6][DEBUG ] Hit http://us.archive.ubuntu.com trusty-updates/main Sources
[hqosd6][DEBUG ] Hit http://us.archive.ubuntu.com trusty-updates/restricted 
Sources
[hqosd6][DEBUG ] Get:2 http://us.archive.ubuntu.com trusty-updates/universe 
Sources [175 kB]
[hqosd6][DEBUG ] Hit http://security.ubuntu.com trusty-security/main Sources
[hqosd6][DEBUG ] Get:3 http://ceph.com trusty InRelease
[hqosd6][WARNIN] Splitting up 
/var/lib/apt/lists/partial/ceph.com_debian-hammer_dists_trusty_InRelease into 
data and signature failedE: GPG error: http://ceph.com trusty InRelease: 
Clearsigned file isn't valid, got 'NODATA' (does the network require 
authentication?)
[hqosd6][DEBUG ] Ign http://ceph.com trusty InRelease
[hqosd6][ERROR ] RuntimeError: command returned non-zero exit status: 100
[ceph_deploy][ERROR ] RuntimeError: Failed to execute command: apt-get -q update

Does anyone know if there is still an ongoing issue… or is this something 
that should be working at this point?

Thanks again,
Shain



> On Mar 15, 2017, at 2:08 PM, Shain Miley  wrote:
> 
> 
> Thanks for all the help so far.
> 
> Just to be clear…if I am planning on upgrading the cluster from Hammer in say 
> the next 3 months…what is the suggested upgrade path?
> 
> Thanks again,
> Shain 
> 
>> On Mar 15, 2017, at 2:05 PM, Abhishek Lekshmanan wrote:
>> 
>> 
>> 
>> On 15/03/17 18:32, Shinobu Kinjo wrote:
>>> So description of Jewel is wrong?
>>> 
>>> http://docs.ceph.com/docs/master/releases/ 
>>> 
>> Yeah, we missed updating the Jewel dates as well when updating the Hammer 
>> ones. Jewel is an LTS and would get more upgrades. Once Luminous is released, 
>> however, we'll eventually shift focus to bugs that would hinder upgrades to 
>> Luminous itself.
>> 
>> Abhishek
>>> On Thu, Mar 16, 2017 at 2:27 AM, John Spray wrote:
>>>> On Wed, Mar 15, 2017 at 5:04 PM, Shinobu Kinjo wrote:
>>>>> It may be a bit of a challenge, but please consider Kraken (or
>>>>> later) because Jewel will be retired:
>>>>>
>>>>> http://docs.ceph.com/docs/master/releases/
>>>>>
>>>> Nope, Jewel is LTS, Kraken 

Re: [ceph-users] Log message --> "bdev(/var/lib/ceph/osd/ceph-x/block) aio_submit retries"

2017-03-16 Thread Sage Weil
On Thu, 16 Mar 2017, Brad Hubbard wrote:
> On Thu, Mar 16, 2017 at 4:33 PM, nokia ceph  wrote:
> > Hello Brad,
> >
> > I meant for this parameter bdev_aio_max_queue_depth , Sage suggested try
> > diff values, 128,1024 , 4096 . So my doubt how this calculation happens? Is
> > this  related to memory?
> 
> The bdev_aio_max_queue_depth parameter represents the nr_events
> argument to the libaio io_setup function.
> 
> int io_setup(unsigned nr_events, aio_context_t *ctx_idp);
> 
> From the man page for io_setup:
> 
> "The io_setup() system call creates an asynchronous I/O context
> suitable for concurrently processing nr_events operations."
> 
> The current theory we are working with is that io_submit is returning
> EAGAIN because nr_events is too small at the default of 32. Therefore
> we have suggested raising this value. There is no real calculation
> involved in the values Sage is suggesting other than they are
> *larger*. It's a matter of playing with the value to see if, and when,
> the error messages go away. If we know a larger value reduces or
> eradicates the error we can then turn our focus more to *why*. Longer
> term this can assist us in setting a more reasonable default.

One nuance is that small values are equivalent because the kernel 
apparently rounds up to a page size-aligned buffer full of struct iocb's 
(or whatever the kernel equivalent is).  My guess is that we want a 
default that equates to 2 or 4 pages instead of 1 page.

We do lots of testing of the same kernel in the lab; I'm adding an item to 
my list to look for these messages in our logs too.  FWIW, as long as the 
retry is succeeding this is pretty harmless (we're basically just limiting the 
depth of the io queue at the kernel and device to some probably-reasonable 
value).  My curiosity here is somewhat academic.  :)

Thanks!
sage


> 
> >
> > Thanks
> >
> >
> >
> >
> > On Thu, Mar 16, 2017 at 11:53 AM, Brad Hubbard  wrote:
> >>
> >> On Thu, Mar 16, 2017 at 4:15 PM, nokia ceph 
> >> wrote:
> >> > Hello,
> >> >
> >> > We are running latest kernel - 3.10.0-514.2.2.el7.x86_64 { RHEL 7.3 }
> >> >
> >> > Sure I will try to alter this directive - bdev_aio_max_queue_depth and
> >> > will
> >> > share our results.
> >> >
> >> > Could you please explain how this calculation happens?
> >>
> >> What calculation are you referring to?
> >>
> >> > Thanks
> >> >
> >> >
> >> > On Wed, Mar 15, 2017 at 7:54 PM, Sage Weil  wrote:
> >> >>
> >> >> On Wed, 15 Mar 2017, Brad Hubbard wrote:
> >> >> > +ceph-devel
> >> >> >
> >> >> > On Wed, Mar 15, 2017 at 5:25 PM, nokia ceph
> >> >> > 
> >> >> > wrote:
> >> >> > > Hello,
> >> >> > >
> >> >> > > We suspect these messages not only at the time of OSD creation. But
> >> >> > > in
> >> >> > > idle
> >> >> > > conditions also. May I know what is the impact of these error? Can
> >> >> > > we
> >> >> > > safely
> >> >> > > ignore this? Or is there any way/config to fix this problem
> >> >> > >
> >> >> > > Few occurrence for these events as follows:---
> >> >> > >
> >> >> > > 
> >> >> > > 2017-03-14 17:16:09.500370 7fedeba61700  4 rocksdb: (Original Log
> >> >> > > Time
> >> >> > > 2017/03/14-17:16:09.453130) [default] Level-0 commit table #60
> >> >> > > started
> >> >> > > 2017-03-14 17:16:09.500374 7fedeba61700  4 rocksdb: (Original Log
> >> >> > > Time
> >> >> > > 2017/03/14-17:16:09.500273) [default] Level-0 commit table #60:
> >> >> > > memtable #1
> >> >> > > done
> >> >> > > 2017-03-14 17:16:09.500376 7fedeba61700  4 rocksdb: (Original Log
> >> >> > > Time
> >> >> > > 2017/03/14-17:16:09.500297) EVENT_LOG_v1 {"time_micros":
> >> >> > > 1489511769500289,
> >> >> > > "job": 17, "event": "flush_finished", "lsm_state": [2, 4, 6, 0, 0,
> >> >> > > 0,
> >> >> > > 0],
> >> >> > > "immutable_memtables": 0}
> >> >> > > 2017-03-14 17:16:09.500382 7fedeba61700  4 rocksdb: (Original Log
> >> >> > > Time
> >> >> > > 2017/03/14-17:16:09.500330) [default] Level summary: base level 1
> >> >> > > max
> >> >> > > bytes
> >> >> > > base 268435456 files[2 4 6 0 0 0 0] max score 0.76
> >> >> > >
> >> >> > > 2017-03-14 17:16:09.500390 7fedeba61700  4 rocksdb: [JOB 17] Try to
> >> >> > > delete
> >> >> > > WAL files size 244090350, prev total WAL file size 247331500,
> >> >> > > number
> >> >> > > of live
> >> >> > > WAL files 2.
> >> >> > >
> >> >> > > 2017-03-14 17:34:11.610513 7fedf3a71700 -1
> >> >> > > bdev(/var/lib/ceph/osd/ceph-73/block) aio_submit retries 6
> >> >> >
> >> >> > These errors come from here.
> >> >> >
> >> >> > void KernelDevice::aio_submit(IOContext *ioc)
> >> >> > {
> >> >> > ...
> >> >> > int r = aio_queue.submit(*cur, );
> >> >> > if (retries)
> >> >> >   derr << __func__ << " retries " << retries << dendl;
> >> >> >
> >> >> > The submit function is this one which calls libaio's io_submit
> >> >> > function directly and increments retries if it receives EAGAIN.
> >> 

[ceph-users] OSD separation for obj. and block storage

2017-03-16 Thread M Ranga Swami Reddy
Hello,
We use a Ceph cluster with 10 nodes/servers with 15 OSDs per node.
Here, I want to use 10 OSDs for block storage (i.e. the volumes pool) and 5
OSDs for object storage (i.e. the rgw pool), and plan to use the "replica" type
for both the block and object pools.

Please advise whether the above is a good use of the hardware, or whether there
are any bottlenecks in performance or otherwise.

Alternatively, we can use all 15 OSDs for both pools. Any input is appreciated.

Thanks
Swami
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Odd latency numbers

2017-03-16 Thread Rhian Resnick
Regarding OpenNebula, it is working, though we find the network functionality less 
than flexible. We would prefer the orchestration layer allow each primary group 
to create a network infrastructure internally to meet their needs and then 
automatically provide NAT from one or more public IP addresses (think AWS and 
Azure). This doesn't seem to be implemented at this time and will likely 
require manual intervention per group of users to resolve. Otherwise we like 
the software and find it much more lightweight than OpenStack. We need a tool 
that can be managed by a very small team, and OpenNebula meets that goal.




Thanks for checking out this data for our test cluster. It isn't production, so 
yes, we are throwing spaghetti at the wall to make sure we are able to handle 
issues as they come up.


We already planned to increase the pg count and have done so. (thanks)


Here is our osd tree. As this is a test cluster, we are currently sharing the OSD 
disks between the cache tier (replica 3) and data (erasure coded); some more 
hardware is on the way so we can test using SSDs.


We have been reviewing atop, iostat, sar, and our SNMP monitoring (not granular 
enough) and have confirmed the disks on this particular node are under a higher 
load than the others. We will likely take the time to deploy graphite since it 
will help with another project also. One speculation discussed this morning is a 
bad cache battery on the PERC card in ceph-mon1, which could explain the +10 ms 
latency we see on all four drives. (That wouldn't be ceph at all in this case.)


ID WEIGHT  TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 3.12685 root default
-2 1.08875 host ceph-mon1
 0 0.27219 osd.0   up  1.0  1.0
 1 0.27219 osd.1   up  1.0  1.0
 2 0.27219 osd.2   up  1.0  1.0
 4 0.27219 osd.4   up  1.0  1.0
-3 0.94936 host ceph-mon2
 3 0.27219 osd.3   up  1.0  1.0
 5 0.27219 osd.5   up  1.0  1.0
 7 0.27219 osd.7   up  1.0  1.0
 9 0.13280 osd.9   up  1.0  1.0
-4 1.08875 host ceph-mon3
 6 0.27219 osd.6   up  1.0  1.0
 8 0.27219 osd.8   up  1.0  1.0
10 0.27219 osd.10  up  1.0  1.0
11 0.27219 osd.11  up  1.0  1.0




Rhian Resnick

Assistant Director Middleware and HPC

Office of Information Technology


Florida Atlantic University

777 Glades Road, CM22, Rm 173B

Boca Raton, FL 33431

Phone 561.297.2647

Fax 561.297.0222




From: Christian Balzer 
Sent: Wednesday, March 15, 2017 8:31 PM
To: ceph-users@lists.ceph.com
Cc: Rhian Resnick
Subject: Re: [ceph-users] Odd latency numbers


Hello,

On Wed, 15 Mar 2017 16:49:00 + Rhian Resnick wrote:

> Morning all,
>
>
> We are starting to apply load to our test cephfs system and are noticing some odd 
> latency numbers. We are using erasure coding for the cold data pools and 
> replication for our cache tiers (not on ssd yet). We noticed the 
> following high latency on one node and it seems to be slowing down writes and 
> reads on the cluster.
>
The pg dump below was massive overkill at this point in time, whereas a
"ceph osd tree" would have probably shown us the topology (where is your
tier, where your EC pool(s)?).
Same for a "ceph osd pool ls detail".

So if we were to assume that node is your cache tier (replica 1?), then the
latencies would make sense.
But that's guesswork, so describe your cluster in more detail.

And yes, a single slow OSD (stealthily failing drive, etc) can bring a
cluster to its knees.
This is why many people here tend to get every last bit of info with
collectd and feed it into carbon and graphite/grafana, etc.
This will immediately indicate culprits and allow you to correlate this
with other data, like actual disk/network/cpu load, etc.

For the time being run atop on that node and see if you can reduce the
issue to something like "all disk are busy all the time" or "CPU meltdown".

>
> Our next step is break out mds, mgr, and mons to different machines but we 
> wanted to start the discussion here.
>

If your nodes (not a single iota of HW/NW info from you) are powerful
enough, breaking out stuff isn't likely to help, nor is it a necessity.

More below.

>
> Here is a bunch of information you may find useful.
>
>
> ceph.conf
>
> [global]
> fsid = X
> mon_initial_members = ceph-mon1, ceph-mon2, ceph-mon3
> mon_host = 10.141.167.238,10.141.160.251,10.141.161.249
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
>
> cluster network = 10.85.8.0/22
> public network = 10.141.0.0/16
>
> # we tested this with 

Re: [ceph-users] CephFS pg ratio of metadata/data pool

2017-03-16 Thread John Spray
On Thu, Mar 16, 2017 at 11:12 AM, TYLin  wrote:
> Hi all,
>
> We have a CephFS whose metadata pool and data pool share the same set of 
> OSDs. According to the PG calculation:
>
> (100*num_osds) / num_replica

That guideline tells you roughly how many PGs you want in total -- when
you have multiple pools you need to share it out between them.

As you suggest, it is probably sensible to use a smaller number of PGs
for the metadata pool than for the data pool.  I would be tempted to
try something like an 80:20 ratio perhaps, that's just off the top of
my head.

Remember that there is no special prioritisation for metadata traffic
over data traffic on the OSDs, so if you're mixing them together on
the same drives, then you may see MDS slowdown if your clients
saturate the system with data writes.  The alternative is to dedicate
some SSD OSDs for metadata.
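
For example, with 56 OSDs and 3x replication that guideline gives roughly
100*56/3 ≈ 1870 PGs in total, so an 80:20-ish split could look something like
the following (pool names and exact numbers are illustrative only):

  ceph osd pool create cephfs_data 1536 1536
  ceph osd pool create cephfs_metadata 512 512
  ceph fs new cephfs cephfs_metadata cephfs_data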

John


>
> If we have 56 OSDs, we should set 5120 PGs for each pool to make the data 
> evenly distributed to all the OSDs. However, if we set both the metadata pool 
> and the data pool to 5120 there will be a warning about “too many pgs”. We 
> currently set 2048 for both the metadata pool and the data pool, but it seems 
> data may not be evenly distributed to the OSDs due to insufficient PGs. Can we 
> set a smaller PG count for the metadata pool and a larger one for the data 
> pool? E.g. 1024 PGs for metadata and 4096 for the data pool. Is there a 
> recommended ratio? Will this result in any performance issue?
>
> Thanks,
> Tim
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CephFS pg ratio of metadata/data pool

2017-03-16 Thread TYLin
Hi all,

We have a CephFS whose metadata pool and data pool share the same set of OSDs. 
According to the PG calculation:

(100*num_osds) / num_replica

If we have 56 OSDs, we should set 5120 PGs for each pool to make the data evenly 
distributed to all the OSDs. However, if we set both the metadata pool and the 
data pool to 5120 there will be a warning about “too many pgs”. We currently set 
2048 for both the metadata pool and the data pool, but it seems data may not be 
evenly distributed to the OSDs due to insufficient PGs. Can we set a smaller PG 
count for the metadata pool and a larger one for the data pool? E.g. 1024 PGs for 
metadata and 4096 for the data pool. Is there a recommended ratio? Will this 
result in any performance issue?

Thanks,
Tim
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Re: Re: Re: Re: Pipe "deadlock" in Hammer, 0.94.5

2017-03-16 Thread 许雪寒
Hi, Gregory, is it possible to unlock Connection::lock in Pipe::read_message 
before tcp_read_nonblocking is called? I checked the code again, it seems that 
the code in tcp_read_nonblocking doesn't need to be locked by Connection::lock.

-----Original Message-----
From: Gregory Farnum [mailto:gfar...@redhat.com] 
Sent: January 17, 2017 7:14
To: 许雪寒
Cc: jiajia zhong; ceph-users@lists.ceph.com
Subject: Re: Re: [ceph-users] Re: Re: Pipe "deadlock" in Hammer, 0.94.5

On Sat, Jan 14, 2017 at 7:54 PM, 许雪寒  wrote:
> Thanks for your help:-)
>
> I checked the source code again, and in read_message, it does hold the 
> Connection::lock:

You're correct of course; I wasn't looking and forgot about this bit.
This was added to deal with client-allocated buffers and/or op cancellation in 
librados, IIRC, and unfortunately definitely does need to be synchronized — I'm 
not sure about with pipe lookups, but probably even that. :/

Unfortunately it looks like you're running a version that didn't come from 
upstream (I see hash 81d4ad40d0c2a4b73529ff0db3c8f22acd15c398 in another email, 
which I can't find), so there's not much we can do to help with the specifics 
of this case — it's fiddly and my guess would be the same as Sage's, which you 
say is not the case.
-Greg

>
> while (left > 0) {
>   // wait for data
>   if (tcp_read_wait() < 0)
>     goto out_dethrottle;
>
>   // get a buffer
>   connection_state->lock.Lock();
>   map >::iterator p = connection_state->rx_buffers.find(header.tid);
>   if (p != connection_state->rx_buffers.end()) {
>     if (rxbuf.length() == 0 || p->second.second != rxbuf_version) {
>       ldout(msgr->cct,10) << "reader seleting rx buffer v "
>                           << p->second.second << " at offset "
>                           << offset << " len "
>                           << p->second.first.length() << dendl;
>       rxbuf = p->second.first;
>       rxbuf_version = p->second.second;
>       // make sure it's big enough
>       if (rxbuf.length() < data_len)
>         rxbuf.push_back(buffer::create(data_len - rxbuf.length()));
>       blp = p->second.first.begin();
>       blp.advance(offset);
>     }
>   } else {
>     if (!newbuf.length()) {
>       ldout(msgr->cct,20) << "reader allocating new rx buffer at offset "
>                           << offset << dendl;
>       alloc_aligned_buffer(newbuf, data_len, data_off);
>       blp = newbuf.begin();
>       blp.advance(offset);
>     }
>   }
>   bufferptr bp = blp.get_current_ptr();
>   int read = MIN(bp.length(), left);
>   ldout(msgr->cct,20) << "reader reading nonblocking into "
>                       << (void*) bp.c_str() << " len " << bp.length()
>                       << dendl;
>   int got = tcp_read_nonblocking(bp.c_str(), read);
>   ldout(msgr->cct,30) << "reader read " << got << " of " << read << dendl;
>   connection_state->lock.Unlock();
>   if (got < 0)
>     goto out_dethrottle;
>   if (got > 0) {
>     blp.advance(got);
>     data.append(bp, 0, got);
>     offset += got;
>     left -= got;
>   } // else we got a signal or something; just loop.
> }
>
> As shown in the above code, in the reading loop, it first lock 
> 

Re: [ceph-users] CephFS fuse client users stuck

2017-03-16 Thread Dan van der Ster
On Tue, Mar 14, 2017 at 5:55 PM, John Spray  wrote:
> On Tue, Mar 14, 2017 at 2:10 PM, Andras Pataki
>  wrote:
>> Hi John,
>>
>> I've checked the MDS session list, and the fuse client does appear on that
>> with 'state' as 'open'.  So both the fuse client and the MDS agree on an
>> open connection.
>>
>> Attached is the log of the ceph fuse client at debug level 20.  The MDS got
>> restarted at 9:44:20, and it went through its startup, and was in an
>> 'active' state in ceph -s by 9:45:20.  As for the IP addresses in the logs,
>> 10.128.128.110 is the MDS IP, the 10.128.128.1xy addresses are OSDs,
>> 10.128.129.63 is the IP of the client the log is from.
>
> So it looks like the client is getting stuck waiting for some
> capabilities (the 7fff9c3f7700 thread in that log, which eventually
> completes a ll_write on inode 100024ebea8 after the MDS restart).
> Hard to say whether the MDS failed to send it the proper messages, or
> if the client somehow missed it.
>
> It would be useful to have equally verbose logs from the MDS side from
> earlier on, at the point that the client started trying to do the
> write.  I wonder if you could see if your MDS+client can handle both
> being run at "debug mds = 20", "debug client = 20" respectively for a
> while, then when a client gets stuck, do the MDS restart, and follow
> back in the client log to work out which inode it was stuck on, then
> find log areas on the MDS side relating to that inode number.
>

Yesterday we had a network outage and afterwards had similarly stuck
ceph-fuse's (v10.2.6), which were fixed by an mds flip.

Here is debug_client=20 when we try to ls /cephfs (it's hanging forever):

2017-03-16 09:15:36.164051 7f9e7cf61700 20 client.9341013 trim_cache
size 33 max 16384
2017-03-16 09:15:37.164169 7f9e7cf61700 20 client.9341013 trim_cache
size 33 max 16384
2017-03-16 09:15:38.164258 7f9e7cf61700 20 client.9341013 trim_cache
size 33 max 16384
2017-03-16 09:15:38.77 7f9e515d8700  3 client.9341013 ll_getattr 1.head
2017-03-16 09:15:38.744491 7f9e515d8700 10 client.9341013 _getattr
mask pAsLsXsFs issued=0
2017-03-16 09:15:38.744533 7f9e515d8700 15 inode.get on 0x7f9e8ea7c300
1.head now 56
2017-03-16 09:15:38.744558 7f9e515d8700 20 client.9341013
choose_target_mds starting with req->inode 1.head(faked_ino=0 re
f=56 ll_ref=143935987 cap_refs={} open={} mode=40755 size=0/0
mtime=2017-03-15 05:50:02.430699 caps=-(0=pAsLsXsFs) has_dir
_layout 0x7f9e8ea7c300)
2017-03-16 09:15:38.744584 7f9e515d8700 20 client.9341013
choose_target_mds 1.head(faked_ino=0 ref=56 ll_ref=143935987 cap
_refs={} open={} mode=40755 size=0/0 mtime=2017-03-15 05:50:02.430699
caps=-(0=pAsLsXsFs) has_dir_layout 0x7f9e8ea7c300) i
s_hash=0 hash=0
2017-03-16 09:15:38.744592 7f9e515d8700 10 client.9341013
choose_target_mds from caps on inode 1.head(faked_ino=0 ref=56 l
l_ref=143935987 cap_refs={} open={} mode=40755 size=0/0
mtime=2017-03-15 05:50:02.430699 caps=-(0=pAsLsXsFs) has_dir_layou
t 0x7f9e8ea7c300)
2017-03-16 09:15:38.744601 7f9e515d8700 20 client.9341013 mds is 0
2017-03-16 09:15:38.744608 7f9e515d8700 10 client.9341013 send_request
rebuilding request 18992614 for mds.0
2017-03-16 09:15:38.744624 7f9e515d8700 20 client.9341013
encode_cap_releases enter (req: 0x7f9e8ef0a280, mds: 0)
2017-03-16 09:15:38.744627 7f9e515d8700 20 client.9341013 send_request
set sent_stamp to 2017-03-16 09:15:38.744626
2017-03-16 09:15:38.744632 7f9e515d8700 10 client.9341013 send_request
client_request(unknown.0:18992614 getattr pAsLsXsFs
 #1 2017-03-16 09:15:38.744538) v3 to mds.0
2017-03-16 09:15:38.744691 7f9e515d8700 20 client.9341013 awaiting
reply|forward|kick on 0x7f9e515d6fa0
2017-03-16 09:15:39.164365 7f9e7cf61700 20 client.9341013 trim_cache
size 33 max 16384
2017-03-16 09:15:40.164470 7f9e7cf61700 20 client.9341013 trim_cache
size 33 max 16384


And the full log when we failover the mds at 2017-03-16
09:20:47.799250 is here:
https://cernbox.cern.ch/index.php/s/sCYdvb9furqS64y

I also have the ceph-fuse core dump if it would be useful.

-- Dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Cluster Failures

2017-03-16 Thread Christian Balzer

Hello,

On Thu, 16 Mar 2017 02:44:29 + Robin H. Johnson wrote:

> On Thu, Mar 16, 2017 at 02:22:08AM +, Rich Rocque wrote:
> > Has anyone else run into this or have any suggestions on how to remedy it?  
> We need a LOT more info.
>
Indeed.
 
> > After a couple months of almost no issues, our Ceph cluster has
> > started to have frequent failures. Just this week it's failed about
> > three times.
> >
> > The issue appears to be than an MDS or Monitor will fail and then all
> > clients hang. After that, all clients need to be forcibly restarted.  
> - Can you define monitor 'failing' in this case? 
> - What do the logs contain? 
> - Is it running out of memory?
> - Can you turn up the debug level?
> - Has your cluster experienced continual growth and now might be
>   undersized in some regard?
> 
A single MON failure should not cause any problems to boot.

"ceph -s" , "ceph osd tree"  and "ceph osd pool ls detail" as well.

> > The architecture for our setup is:  
> Are these virtual machines? The overall specs seem rather like VM
> instances rather than hardware.
>
There are small servers like that, but a valid question indeed.
In particular, if it is dedicated HW, FULL specs.
 
> > 3 ea MON, MDS instances (co-located) on 2cpu, 4GB RAM servers  
> What sort of SSD are the monitor datastores on? ('mon data' in the
> config)
> 
He doesn't mention SSDs in the MON/MDS context, so we could be looking at
something even slower. FULL SPECS. 

4GB RAM would be fine for a single MON, but combined with MDS it may
be a bit tight.

> > 12 ea OSDs (ssd), on 1cpu, 1GB RAM servers  
> 12 SSDs to a single server, with 1cpu/1GB RAM? That's absurdly low-spec.
> How many OSD servers, what SSDs?
> 
I think he means 12 individual servers. Again, there are micro servers
like that around, like:
https://www.supermicro.com.tw/products/system/2U/2015/SYS-2015TA-HTRF.cfm

IF the SSDs are decent, CPU may be tight but 1GB RAM for a combination of
OS _and_ OSD is way too little for my taste and experience.

Christian

> What is the network setup & connectivity between them (hopefully
> 10Gbit).
> 


-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Directly addressing files on individual OSD

2017-03-16 Thread Youssef Eldakar
Thanks for the reply, Anthony, and I am sorry my question did not give 
sufficient background.

This is the cluster behind archive.bibalex.org. Storage nodes keep archived 
webpages as multi-member GZIP files on the disks, which are formatted using XFS 
as standalone file systems. The access system consults an index that says where 
a URL is stored, which is then fetched over HTTP from the individual storage 
node that has the URL somewhere on one of the disks. So far, we have pretty 
much been managing the storage using homegrown scripts to have each GZIP file 
stored on 2 separate nodes. This obviously has been requiring a good deal of 
manual work and as such has not been very effective.

Given that description, do you feel Ceph could be an appropriate choice?

Thanks once again for the reply.

Youssef Eldakar
Bibliotheca Alexandrina

From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Anthony 
D'Atri [a...@dreamsnake.net]
Sent: Thursday, March 16, 2017 01:37
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Directly addressing files on individual OSD

As I parse Youssef’s message, I believe there are some misconceptions.  It 
might help if you could give a bit more info on what your existing ‘cluster’ is 
running.  NFS? CIFS/SMB?  Something else?

1) Ceph regularly runs scrubs to ensure that all copies of data are consistent. 
 The checksumming that you describe would be both infeasible and redundant.

2) It sounds as though your current back-end stores user files as-is and is 
either a traditional file server setup or perhaps a virtual filesystem 
aggregating multiple filesystems.  Ceph is not a file storage solution in this 
sense.  The below sounds as though you want user files to not be sharded across 
multiple servers.  This is antithetical to how Ceph works and is counter to 
data durability and availability, unless there is some replication that you 
haven’t described.  Reference this diagram:

http://docs.ceph.com/docs/master/_images/stack.png

Beneath the hood Ceph operates internally on ‘objects’ that are not exposed to 
clients as such. There are several different client interfaces that are built 
on top of this block service:

- RBD volumes — think in terms of a virtual disk drive attached to a VM
- RGW — like Amazon S3 or Swift
- CephFS — provides a mountable filesystem interface, somewhat like NFS or even 
SMB but with important distinctions in behavior and use-case

I had not heard of iRODS before but just looked it up.  It is a very different 
thing than any of the common interfaces to Ceph.

If your users need to mount the storage as a share / volume, in the sense of 
SMB or NFS, then Ceph may not be your best option.  If they can cope with an S3 
/ Swift type REST object interface, a cluster with RGW interfaces might do the 
job, or perhaps Swift or Gluster.   It’s hard to say for sure based on 
assumptions of what you need.

— Anthony


We currently run a commodity cluster that supports a few petabytes of data. 
Each node in the cluster has 4 drives, currently mounted as /0 through /3. We 
have been researching alternatives for managing the storage, Ceph being one 
possibility, iRODS being another. For preservation purposes, we would like each 
file to exist as one whole piece per drive (as opposed to being striped across 
multiple drives). It appears this is the default in Ceph.

Now, it has always been convenient for us to run distributed jobs over SSH to, 
for instance, compile a list of checksums of all files in the cluster:

dsh -Mca 'find /{0..3}/items -name \*.warc.gz | xargs md5sum 
>/tmp/$HOSTNAME.md5sum'

And that nicely allows each node to process its own files using the local CPU.

Would this scenario still be possible where Ceph is managing the storage?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pgs stuck inactive

2017-03-16 Thread Laszlo Budai

My mistake, I've run it on a wrong system ...

I've attached the terminal output.

I've run this on a test system where I was getting the same segfault when 
trying import-rados.

Kind regards,
Laszlo

On 16.03.2017 07:41, Laszlo Budai wrote:


[root@storage2 ~]# gdb -ex 'r' -ex 't a a bt full' -ex 'q' --args 
ceph-objectstore-tool import-rados volumes pg.3.367.export.OSD.35
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-94.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
...
Reading symbols from /usr/bin/ceph-objectstore-tool...Reading symbols from 
/usr/lib/debug/usr/bin/ceph-objectstore-tool.debug...done.
done.
Starting program: /usr/bin/ceph-objectstore-tool import-rados volumes 
pg.3.367.export.OSD.35
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
open: No such file or directory
[Inferior 1 (process 23735) exited with code 01]
[root@storage2 ~]#




Just checked:
[root@storage2 lib64]# ls -l /lib64/libthread_db*
-rwxr-xr-x. 1 root root 38352 May 12  2016 /lib64/libthread_db-1.0.so
lrwxrwxrwx. 1 root root19 Jun  7  2016 /lib64/libthread_db.so.1 -> 
libthread_db-1.0.so
[root@storage2 lib64]#


Kind regards,
Laszlo


On 16.03.2017 05:26, Brad Hubbard wrote:

Can you install the debuginfo for ceph (how this works depends on your
distro) and run the following?

# gdb -ex 'r' -ex 't a a bt full' -ex 'q' --args ceph-objectstore-tool
import-rados volumes pg.3.367.export.OSD.35

On Thu, Mar 16, 2017 at 12:02 AM, Laszlo Budai  wrote:

Hello,


the ceph-objectstore-tool import-rados volumes pg.3.367.export.OSD.35
command crashes.

~# ceph-objectstore-tool import-rados volumes pg.3.367.export.OSD.35
*** Caught signal (Segmentation fault) **
 in thread 7f85b60e28c0
 ceph version 0.94.10 (b1e0532418e4631af01acbc0cedd426f1905f4af)
 1: ceph-objectstore-tool() [0xaeeaba]
 2: (()+0x10330) [0x7f85b4dca330]
 3: (()+0xa2324) [0x7f85b1cd7324]
 4: (()+0x7d23e) [0x7f85b1cb223e]
 5: (()+0x7d478) [0x7f85b1cb2478]
 6: (rados_ioctx_create()+0x32) [0x7f85b1c89f92]
 7: (librados::Rados::ioctx_create(char const*, librados::IoCtx&)+0x15)
[0x7f85b1c8a0e5]
 8: (do_import_rados(std::string, bool)+0xb7c) [0x68199c]
 9: (main()+0x1294) [0x651134]
 10: (__libc_start_main()+0xf5) [0x7f85b0c69f45]
 11: ceph-objectstore-tool() [0x66f8b7]
2017-03-15 14:57:05.567987 7f85b60e28c0 -1 *** Caught signal (Segmentation
fault) **
 in thread 7f85b60e28c0

 ceph version 0.94.10 (b1e0532418e4631af01acbc0cedd426f1905f4af)
 1: ceph-objectstore-tool() [0xaeeaba]
 2: (()+0x10330) [0x7f85b4dca330]
 3: (()+0xa2324) [0x7f85b1cd7324]
 4: (()+0x7d23e) [0x7f85b1cb223e]
 5: (()+0x7d478) [0x7f85b1cb2478]
 6: (rados_ioctx_create()+0x32) [0x7f85b1c89f92]
 7: (librados::Rados::ioctx_create(char const*, librados::IoCtx&)+0x15)
[0x7f85b1c8a0e5]
 8: (do_import_rados(std::string, bool)+0xb7c) [0x68199c]
 9: (main()+0x1294) [0x651134]
 10: (__libc_start_main()+0xf5) [0x7f85b0c69f45]
 11: ceph-objectstore-tool() [0x66f8b7]
 NOTE: a copy of the executable, or `objdump -rdS ` is needed to
interpret this.

--- begin dump of recent events ---
   -14> 2017-03-15 14:57:05.557743 7f85b60e28c0  5 asok(0x5632000)
register_command perfcounters_dump hook 0x55e6130
   -13> 2017-03-15 14:57:05.557807 7f85b60e28c0  5 asok(0x5632000)
register_command 1 hook 0x55e6130
   -12> 2017-03-15 14:57:05.557818 7f85b60e28c0  5 asok(0x5632000)
register_command perf dump hook 0x55e6130
   -11> 2017-03-15 14:57:05.557828 7f85b60e28c0  5 asok(0x5632000)
register_command perfcounters_schema hook 0x55e6130
   -10> 2017-03-15 14:57:05.557836 7f85b60e28c0  5 asok(0x5632000)
register_command 2 hook 0x55e6130
-9> 2017-03-15 14:57:05.557841 7f85b60e28c0  5 asok(0x5632000)
register_command perf schema hook 0x55e6130
-8> 2017-03-15 14:57:05.557851 7f85b60e28c0  5 asok(0x5632000)
register_command perf reset hook 0x55e6130
-7> 2017-03-15 14:57:05.557855 7f85b60e28c0  5 asok(0x5632000)
register_command config show hook 0x55e6130
-6> 2017-03-15 14:57:05.557864 7f85b60e28c0  5 asok(0x5632000)
register_command config set hook 0x55e6130
-5> 2017-03-15 14:57:05.557868 7f85b60e28c0  5 asok(0x5632000)
register_command config get hook 0x55e6130
-4> 2017-03-15 14:57:05.557877 7f85b60e28c0  5 asok(0x5632000)
register_command config diff hook 0x55e6130
-3> 2017-03-15 14:57:05.557880 7f85b60e28c0  5 asok(0x5632000)
register_command log flush hook 0x55e6130
-2> 2017-03-15 14:57:05.557888 7f85b60e28c0  5 asok(0x5632000)
register_command log dump hook 0x55e6130
-1> 2017-03-15 14:57:05.557892 7f85b60e28c0  5 

Re: [ceph-users] Log message --> "bdev(/var/lib/ceph/osd/ceph-x/block) aio_submit retries"

2017-03-16 Thread nokia ceph
Sounds good :). Brad, many thanks for the explanation.

On Thu, Mar 16, 2017 at 12:42 PM, Brad Hubbard  wrote:

> On Thu, Mar 16, 2017 at 4:33 PM, nokia ceph 
> wrote:
> > Hello Brad,
> >
> > I meant for this parameter bdev_aio_max_queue_depth , Sage suggested try
> > diff values, 128,1024 , 4096 . So my doubt how this calculation happens?
> Is
> > this  related to memory?
>
> The bdev_aio_max_queue_depth parameter represents the nr_events
> argument to the libaio io_setup function.
>
> int io_setup(unsigned nr_events, aio_context_t *ctx_idp);
>
> From the man page for io_setup:
>
> "The io_setup() system call creates an asynchronous I/O context
> suitable for concurrently processing nr_events operations."
>
> The current theory we are working with is that io_submit is returning
> EAGAIN because nr_events is too small at the default of 32. Therefore
> we have suggested raising this value. There is no real calculation
> involved in the values Sage is suggesting other than they are
> *larger*. It's a matter of playing with the value to see if, and when,
> the error messages go away. If we know a larger value reduces or
> eradicates the error we can then turn our focus more to *why*. Longer
> term this can assist us in setting a more reasonable default.
>
> >
> > Thanks
> >
> >
> >
> >
> > On Thu, Mar 16, 2017 at 11:53 AM, Brad Hubbard 
> wrote:
> >>
> >> On Thu, Mar 16, 2017 at 4:15 PM, nokia ceph 
> >> wrote:
> >> > Hello,
> >> >
> >> > We are running latest kernel - 3.10.0-514.2.2.el7.x86_64 { RHEL 7.3 }
> >> >
> >> > Sure I will try to alter this directive - bdev_aio_max_queue_depth and
> >> > will
> >> > share our results.
> >> >
> >> > Could you please explain how this calculation happens?
> >>
> >> What calculation are you referring to?
> >>
> >> > Thanks
> >> >
> >> >
> >> > On Wed, Mar 15, 2017 at 7:54 PM, Sage Weil  wrote:
> >> >>
> >> >> On Wed, 15 Mar 2017, Brad Hubbard wrote:
> >> >> > +ceph-devel
> >> >> >
> >> >> > On Wed, Mar 15, 2017 at 5:25 PM, nokia ceph
> >> >> > 
> >> >> > wrote:
> >> >> > > Hello,
> >> >> > >
> >> >> > > We suspect these messages not only at the time of OSD creation.
> But
> >> >> > > in
> >> >> > > idle
> >> >> > > conditions also. May I know what is the impact of these error?
> Can
> >> >> > > we
> >> >> > > safely
> >> >> > > ignore this? Or is there any way/config to fix this problem
> >> >> > >
> >> >> > > Few occurrence for these events as follows:---
> >> >> > >
> >> >> > > 
> >> >> > > 2017-03-14 17:16:09.500370 7fedeba61700  4 rocksdb: (Original Log
> >> >> > > Time
> >> >> > > 2017/03/14-17:16:09.453130) [default] Level-0 commit table #60
> >> >> > > started
> >> >> > > 2017-03-14 17:16:09.500374 7fedeba61700  4 rocksdb: (Original Log
> >> >> > > Time
> >> >> > > 2017/03/14-17:16:09.500273) [default] Level-0 commit table #60:
> >> >> > > memtable #1
> >> >> > > done
> >> >> > > 2017-03-14 17:16:09.500376 7fedeba61700  4 rocksdb: (Original Log
> >> >> > > Time
> >> >> > > 2017/03/14-17:16:09.500297) EVENT_LOG_v1 {"time_micros":
> >> >> > > 1489511769500289,
> >> >> > > "job": 17, "event": "flush_finished", "lsm_state": [2, 4, 6, 0,
> 0,
> >> >> > > 0,
> >> >> > > 0],
> >> >> > > "immutable_memtables": 0}
> >> >> > > 2017-03-14 17:16:09.500382 7fedeba61700  4 rocksdb: (Original Log
> >> >> > > Time
> >> >> > > 2017/03/14-17:16:09.500330) [default] Level summary: base level 1
> >> >> > > max
> >> >> > > bytes
> >> >> > > base 268435456 files[2 4 6 0 0 0 0] max score 0.76
> >> >> > >
> >> >> > > 2017-03-14 17:16:09.500390 7fedeba61700  4 rocksdb: [JOB 17] Try
> to
> >> >> > > delete
> >> >> > > WAL files size 244090350, prev total WAL file size 247331500,
> >> >> > > number
> >> >> > > of live
> >> >> > > WAL files 2.
> >> >> > >
> >> >> > > 2017-03-14 17:34:11.610513 7fedf3a71700 -1
> >> >> > > bdev(/var/lib/ceph/osd/ceph-73/block) aio_submit retries 6
> >> >> >
> >> >> > These errors come from here.
> >> >> >
> >> >> > void KernelDevice::aio_submit(IOContext *ioc)
> >> >> > {
> >> >> > ...
> >> >> > int r = aio_queue.submit(*cur, );
> >> >> > if (retries)
> >> >> >   derr << __func__ << " retries " << retries << dendl;
> >> >> >
> >> >> > The submit function is this one which calls libaio's io_submit
> >> >> > function directly and increments retries if it receives EAGAIN.
> >> >> >
> >> >> > #if defined(HAVE_LIBAIO)
> >> >> > int FS::aio_queue_t::submit(aio_t , int *retries)
> >> >> > {
> >> >> >   // 2^16 * 125us = ~8 seconds, so max sleep is ~16 seconds
> >> >> >   int attempts = 16;
> >> >> >   int delay = 125;
> >> >> >   iocb *piocb = 
> >> >> >   while (true) {
> >> >> > int r = io_submit(ctx, 1, ); <-NOTE
> >> >> > if (r < 0) {
> >> >> >   if (r == -EAGAIN && attempts-- > 0) { <-NOTE
> >> >> > usleep(delay);
> >> >> > delay *= 2;
> >> >> > 

Re: [ceph-users] Log message --> "bdev(/var/lib/ceph/osd/ceph-x/block) aio_submit retries"

2017-03-16 Thread Brad Hubbard
On Thu, Mar 16, 2017 at 4:33 PM, nokia ceph  wrote:
> Hello Brad,
>
> I meant for this parameter bdev_aio_max_queue_depth , Sage suggested try
> diff values, 128,1024 , 4096 . So my doubt how this calculation happens? Is
> this  related to memory?

The bdev_aio_max_queue_depth parameter represents the nr_events
argument to the libaio io_setup function.

int io_setup(unsigned nr_events, aio_context_t *ctx_idp);

From the man page for io_setup:

"The io_setup() system call creates an asynchronous I/O context
suitable for concurrently processing nr_events operations."

The current theory we are working with is that io_submit is returning
EAGAIN because nr_events is too small at the default of 32. Therefore
we have suggested raising this value. There is no real calculation
involved in the values Sage is suggesting other than they are
*larger*. It's a matter of playing with the value to see if, and when,
the error messages go away. If we know a larger value reduces or
eradicates the error we can then turn our focus more to *why*. Longer
term this can assist us in setting a more reasonable default.
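
To make this concrete, here is a rough, self-contained libaio sketch. It is
illustrative only -- it is not Ceph code, it just mirrors the pattern of the
FS::aio_queue_t::submit() function quoted below, and the fd/iocb preparation
is left as a comment:

/* build with: gcc aio_sketch.c -laio */
#include <libaio.h>
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/* Submit one prepared iocb, backing off and retrying when io_submit()
 * reports EAGAIN because the context's nr_events slots are exhausted. */
int submit_with_backoff(io_context_t ctx, struct iocb *cb, int *retries)
{
    struct iocb *piocb = cb;
    int attempts = 16;
    useconds_t delay = 125;
    for (;;) {
        int r = io_submit(ctx, 1, &piocb);
        if (r == 1)
            return 0;                 /* the single iocb was queued */
        if (r == -EAGAIN && attempts-- > 0) {
            usleep(delay);            /* queue full: wait, then retry */
            delay *= 2;
            (*retries)++;
            continue;
        }
        return r;                     /* hard error, or retries exhausted */
    }
}

int main(void)
{
    io_context_t ctx = 0;
    /* nr_events plays the role of bdev_aio_max_queue_depth here: it bounds
     * how many iocbs can be in flight in this context at any one time. */
    int r = io_setup(1024, &ctx);
    if (r < 0) {
        fprintf(stderr, "io_setup: %d\n", r);
        return 1;
    }
    /* ... io_prep_pread()/io_prep_pwrite() an iocb against an open fd, then
     * call submit_with_backoff(ctx, &cb, &retries) for each I/O ... */
    io_destroy(ctx);
    return 0;
}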

>
> Thanks
>
>
>
>
> On Thu, Mar 16, 2017 at 11:53 AM, Brad Hubbard  wrote:
>>
>> On Thu, Mar 16, 2017 at 4:15 PM, nokia ceph 
>> wrote:
>> > Hello,
>> >
>> > We are running latest kernel - 3.10.0-514.2.2.el7.x86_64 { RHEL 7.3 }
>> >
>> > Sure I will try to alter this directive - bdev_aio_max_queue_depth and
>> > will
>> > share our results.
>> >
>> > Could you please explain how this calculation happens?
>>
>> What calculation are you referring to?
>>
>> > Thanks
>> >
>> >
>> > On Wed, Mar 15, 2017 at 7:54 PM, Sage Weil  wrote:
>> >>
>> >> On Wed, 15 Mar 2017, Brad Hubbard wrote:
>> >> > +ceph-devel
>> >> >
>> >> > On Wed, Mar 15, 2017 at 5:25 PM, nokia ceph
>> >> > 
>> >> > wrote:
>> >> > > Hello,
>> >> > >
>> >> > > We suspect these messages not only at the time of OSD creation. But
>> >> > > in
>> >> > > idle
>> >> > > conditions also. May I know what is the impact of these error? Can
>> >> > > we
>> >> > > safely
>> >> > > ignore this? Or is there any way/config to fix this problem
>> >> > >
>> >> > > Few occurrence for these events as follows:---
>> >> > >
>> >> > > 
>> >> > > 2017-03-14 17:16:09.500370 7fedeba61700  4 rocksdb: (Original Log
>> >> > > Time
>> >> > > 2017/03/14-17:16:09.453130) [default] Level-0 commit table #60
>> >> > > started
>> >> > > 2017-03-14 17:16:09.500374 7fedeba61700  4 rocksdb: (Original Log
>> >> > > Time
>> >> > > 2017/03/14-17:16:09.500273) [default] Level-0 commit table #60:
>> >> > > memtable #1
>> >> > > done
>> >> > > 2017-03-14 17:16:09.500376 7fedeba61700  4 rocksdb: (Original Log
>> >> > > Time
>> >> > > 2017/03/14-17:16:09.500297) EVENT_LOG_v1 {"time_micros":
>> >> > > 1489511769500289,
>> >> > > "job": 17, "event": "flush_finished", "lsm_state": [2, 4, 6, 0, 0,
>> >> > > 0,
>> >> > > 0],
>> >> > > "immutable_memtables": 0}
>> >> > > 2017-03-14 17:16:09.500382 7fedeba61700  4 rocksdb: (Original Log
>> >> > > Time
>> >> > > 2017/03/14-17:16:09.500330) [default] Level summary: base level 1
>> >> > > max
>> >> > > bytes
>> >> > > base 268435456 files[2 4 6 0 0 0 0] max score 0.76
>> >> > >
>> >> > > 2017-03-14 17:16:09.500390 7fedeba61700  4 rocksdb: [JOB 17] Try to
>> >> > > delete
>> >> > > WAL files size 244090350, prev total WAL file size 247331500,
>> >> > > number
>> >> > > of live
>> >> > > WAL files 2.
>> >> > >
>> >> > > 2017-03-14 17:34:11.610513 7fedf3a71700 -1
>> >> > > bdev(/var/lib/ceph/osd/ceph-73/block) aio_submit retries 6
>> >> >
>> >> > These errors come from here.
>> >> >
>> >> > void KernelDevice::aio_submit(IOContext *ioc)
>> >> > {
>> >> > ...
>> >> > int r = aio_queue.submit(*cur, &retries);
>> >> > if (retries)
>> >> >   derr << __func__ << " retries " << retries << dendl;
>> >> >
>> >> > The submit function is this one which calls libaio's io_submit
>> >> > function directly and increments retries if it receives EAGAIN.
>> >> >
>> >> > #if defined(HAVE_LIBAIO)
>> >> > int FS::aio_queue_t::submit(aio_t &aio, int *retries)
>> >> > {
>> >> >   // 2^16 * 125us = ~8 seconds, so max sleep is ~16 seconds
>> >> >   int attempts = 16;
>> >> >   int delay = 125;
>> >> >   iocb *piocb = &aio.iocb;
>> >> >   while (true) {
>> >> > int r = io_submit(ctx, 1, &piocb); <-NOTE
>> >> > if (r < 0) {
>> >> >   if (r == -EAGAIN && attempts-- > 0) { <-NOTE
>> >> > usleep(delay);
>> >> > delay *= 2;
>> >> > (*retries)++;
>> >> > continue;
>> >> >   }
>> >> >   return r;
>> >> > }
>> >> > assert(r == 1);
>> >> > break;
>> >> >   }
>> >> >   return 0;
>> >> > }
>> >> >
>> >> >
>> >> > From the man page.
>> >> >
>> >> > IO_SUBMIT(2)   Linux Programmer's
>> >> > Manual  IO_SUBMIT(2)
>> >> >
>> >> > NAME
>> >> >   

Re: [ceph-users] Log message --> "bdev(/var/lib/ceph/osd/ceph-x/block) aio_submit retries"

2017-03-16 Thread nokia ceph
Hello Brad,

I meant this parameter, bdev_aio_max_queue_depth; Sage suggested trying
different values (128, 1024, 4096). My doubt is how this calculation
happens. Is it related to memory?

Thanks




On Thu, Mar 16, 2017 at 11:53 AM, Brad Hubbard  wrote:

> On Thu, Mar 16, 2017 at 4:15 PM, nokia ceph 
> wrote:
> > Hello,
> >
> > We are running latest kernel - 3.10.0-514.2.2.el7.x86_64 { RHEL 7.3 }
> >
> > Sure I will try to alter this directive - bdev_aio_max_queue_depth and
> will
> > share our results.
> >
> > Could you please explain how this calculation happens?
>
> What calculation are you referring to?
>
> > Thanks
> >
> >
> > On Wed, Mar 15, 2017 at 7:54 PM, Sage Weil  wrote:
> >>
> >> On Wed, 15 Mar 2017, Brad Hubbard wrote:
> >> > +ceph-devel
> >> >
> >> > On Wed, Mar 15, 2017 at 5:25 PM, nokia ceph  >
> >> > wrote:
> >> > > Hello,
> >> > >
> >> > > We suspect these messages not only at the time of OSD creation. But
> in
> >> > > idle
> >> > > conditions also. May I know what is the impact of these error? Can
> we
> >> > > safely
> >> > > ignore this? Or is there any way/config to fix this problem
> >> > >
> >> > > Few occurrence for these events as follows:---
> >> > >
> >> > > 
> >> > > 2017-03-14 17:16:09.500370 7fedeba61700  4 rocksdb: (Original Log
> Time
> >> > > 2017/03/14-17:16:09.453130) [default] Level-0 commit table #60
> started
> >> > > 2017-03-14 17:16:09.500374 7fedeba61700  4 rocksdb: (Original Log
> Time
> >> > > 2017/03/14-17:16:09.500273) [default] Level-0 commit table #60:
> >> > > memtable #1
> >> > > done
> >> > > 2017-03-14 17:16:09.500376 7fedeba61700  4 rocksdb: (Original Log
> Time
> >> > > 2017/03/14-17:16:09.500297) EVENT_LOG_v1 {"time_micros":
> >> > > 1489511769500289,
> >> > > "job": 17, "event": "flush_finished", "lsm_state": [2, 4, 6, 0, 0,
> 0,
> >> > > 0],
> >> > > "immutable_memtables": 0}
> >> > > 2017-03-14 17:16:09.500382 7fedeba61700  4 rocksdb: (Original Log
> Time
> >> > > 2017/03/14-17:16:09.500330) [default] Level summary: base level 1
> max
> >> > > bytes
> >> > > base 268435456 files[2 4 6 0 0 0 0] max score 0.76
> >> > >
> >> > > 2017-03-14 17:16:09.500390 7fedeba61700  4 rocksdb: [JOB 17] Try to
> >> > > delete
> >> > > WAL files size 244090350, prev total WAL file size 247331500, number
> >> > > of live
> >> > > WAL files 2.
> >> > >
> >> > > 2017-03-14 17:34:11.610513 7fedf3a71700 -1
> >> > > bdev(/var/lib/ceph/osd/ceph-73/block) aio_submit retries 6
> >> >
> >> > These errors come from here.
> >> >
> >> > void KernelDevice::aio_submit(IOContext *ioc)
> >> > {
> >> > ...
> >> > int r = aio_queue.submit(*cur, &retries);
> >> > if (retries)
> >> >   derr << __func__ << " retries " << retries << dendl;
> >> >
> >> > The submit function is this one which calls libaio's io_submit
> >> > function directly and increments retries if it receives EAGAIN.
> >> >
> >> > #if defined(HAVE_LIBAIO)
> >> > int FS::aio_queue_t::submit(aio_t &aio, int *retries)
> >> > {
> >> >   // 2^16 * 125us = ~8 seconds, so max sleep is ~16 seconds
> >> >   int attempts = 16;
> >> >   int delay = 125;
> >> >   iocb *piocb = &aio.iocb;
> >> >   while (true) {
> >> > int r = io_submit(ctx, 1, &piocb); <-NOTE
> >> > if (r < 0) {
> >> >   if (r == -EAGAIN && attempts-- > 0) { <-NOTE
> >> > usleep(delay);
> >> > delay *= 2;
> >> > (*retries)++;
> >> > continue;
> >> >   }
> >> >   return r;
> >> > }
> >> > assert(r == 1);
> >> > break;
> >> >   }
> >> >   return 0;
> >> > }
> >> >
> >> >
> >> > From the man page.
> >> >
> >> > IO_SUBMIT(2)   Linux Programmer's
> >> > Manual  IO_SUBMIT(2)
> >> >
> >> > NAME
> >> >io_submit - submit asynchronous I/O blocks for processing
> >> > ...
> >> > RETURN VALUE
> >> >On success, io_submit() returns the number of iocbs submitted
> >> > (which may be 0 if nr is zero).  For the  failure
> >> >return, see NOTES.
> >> >
> >> > ERRORS
> >> >EAGAIN Insufficient resources are available to queue any iocbs.
> >> >
> >> > I suspect increasing bdev_aio_max_queue_depth may help here but some
> >> > of the other devs may have more/better ideas.
> >>
> >> Yes--try increasing bdev_aio_max_queue_depth.  It defaults to 32; try
> >> changing it to 128, 1024, or 4096 and see if these errors go away.
> >>
> >> I've never been able to trigger this on my test boxes, but I put in the
> >> warning to help ensure we pick a good default.
> >>
> >> What kernel version are you running?
> >>
> >> Thanks!
> >> sage
> >
> >
>
>
>
> --
> Cheers,
> Brad
>


Re: [ceph-users] Log message --> "bdev(/var/lib/ceph/osd/ceph-x/block) aio_submit retries"

2017-03-16 Thread Brad Hubbard
On Thu, Mar 16, 2017 at 4:15 PM, nokia ceph  wrote:
> Hello,
>
> We are running latest kernel - 3.10.0-514.2.2.el7.x86_64 { RHEL 7.3 }
>
> Sure I will try to alter this directive - bdev_aio_max_queue_depth and will
> share our results.
>
> Could you please explain how this calculation happens?

What calculation are you referring to?

> Thanks
>
>
> On Wed, Mar 15, 2017 at 7:54 PM, Sage Weil  wrote:
>>
>> On Wed, 15 Mar 2017, Brad Hubbard wrote:
>> > +ceph-devel
>> >
>> > On Wed, Mar 15, 2017 at 5:25 PM, nokia ceph 
>> > wrote:
>> > > Hello,
>> > >
>> > > We suspect these messages not only at the time of OSD creation. But in
>> > > idle
>> > > conditions also. May I know what is the impact of these error? Can we
>> > > safely
>> > > ignore this? Or is there any way/config to fix this problem
>> > >
>> > > Few occurrence for these events as follows:---
>> > >
>> > > 
>> > > 2017-03-14 17:16:09.500370 7fedeba61700  4 rocksdb: (Original Log Time
>> > > 2017/03/14-17:16:09.453130) [default] Level-0 commit table #60 started
>> > > 2017-03-14 17:16:09.500374 7fedeba61700  4 rocksdb: (Original Log Time
>> > > 2017/03/14-17:16:09.500273) [default] Level-0 commit table #60:
>> > > memtable #1
>> > > done
>> > > 2017-03-14 17:16:09.500376 7fedeba61700  4 rocksdb: (Original Log Time
>> > > 2017/03/14-17:16:09.500297) EVENT_LOG_v1 {"time_micros":
>> > > 1489511769500289,
>> > > "job": 17, "event": "flush_finished", "lsm_state": [2, 4, 6, 0, 0, 0,
>> > > 0],
>> > > "immutable_memtables": 0}
>> > > 2017-03-14 17:16:09.500382 7fedeba61700  4 rocksdb: (Original Log Time
>> > > 2017/03/14-17:16:09.500330) [default] Level summary: base level 1 max
>> > > bytes
>> > > base 268435456 files[2 4 6 0 0 0 0] max score 0.76
>> > >
>> > > 2017-03-14 17:16:09.500390 7fedeba61700  4 rocksdb: [JOB 17] Try to
>> > > delete
>> > > WAL files size 244090350, prev total WAL file size 247331500, number
>> > > of live
>> > > WAL files 2.
>> > >
>> > > 2017-03-14 17:34:11.610513 7fedf3a71700 -1
>> > > bdev(/var/lib/ceph/osd/ceph-73/block) aio_submit retries 6
>> >
>> > These errors come from here.
>> >
>> > void KernelDevice::aio_submit(IOContext *ioc)
>> > {
>> > ...
>> > int r = aio_queue.submit(*cur, &retries);
>> > if (retries)
>> >   derr << __func__ << " retries " << retries << dendl;
>> >
>> > The submit function is this one which calls libaio's io_submit
>> > function directly and increments retries if it receives EAGAIN.
>> >
>> > #if defined(HAVE_LIBAIO)
>> > int FS::aio_queue_t::submit(aio_t &aio, int *retries)
>> > {
>> >   // 2^16 * 125us = ~8 seconds, so max sleep is ~16 seconds
>> >   int attempts = 16;
>> >   int delay = 125;
>> >   iocb *piocb = &aio.iocb;
>> >   while (true) {
>> > int r = io_submit(ctx, 1, &piocb); <-NOTE
>> > if (r < 0) {
>> >   if (r == -EAGAIN && attempts-- > 0) { <-NOTE
>> > usleep(delay);
>> > delay *= 2;
>> > (*retries)++;
>> > continue;
>> >   }
>> >   return r;
>> > }
>> > assert(r == 1);
>> > break;
>> >   }
>> >   return 0;
>> > }
>> >
>> >
>> > From the man page.
>> >
>> > IO_SUBMIT(2)   Linux Programmer's
>> > Manual  IO_SUBMIT(2)
>> >
>> > NAME
>> >io_submit - submit asynchronous I/O blocks for processing
>> > ...
>> > RETURN VALUE
>> >On success, io_submit() returns the number of iocbs submitted
>> > (which may be 0 if nr is zero).  For the  failure
>> >return, see NOTES.
>> >
>> > ERRORS
>> >EAGAIN Insufficient resources are available to queue any iocbs.
>> >
>> > I suspect increasing bdev_aio_max_queue_depth may help here but some
>> > of the other devs may have more/better ideas.
>>
>> Yes--try increasing bdev_aio_max_queue_depth.  It defaults to 32; try
>> changing it to 128, 1024, or 4096 and see if these errors go away.
>>
>> I've never been able to trigger this on my test boxes, but I put in the
>> warning to help ensure we pick a good default.
>>
>> What kernel version are you running?
>>
>> Thanks!
>> sage
>
>



-- 
Cheers,
Brad


Re: [ceph-users] Log message --> "bdev(/var/lib/ceph/osd/ceph-x/block) aio_submit retries"

2017-03-16 Thread nokia ceph
Hello,

We are running latest kernel - 3.10.0-514.2.2.el7.x86_64 { RHEL 7.3 }

Sure I will try to alter this directive - bdev_aio_max_queue_depth and will
share our results.
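
For example (a minimal sketch, assuming the option name is unchanged in our
build and that the OSDs are restarted afterwards so the larger depth is used
when the AIO context is created), something like this in ceph.conf on the
OSD nodes:

[osd]
# default is 32; Sage suggested trying 128, 1024 or 4096
bdev_aio_max_queue_depth = 1024

We can then watch whether the "aio_submit retries" messages still appear.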

Could you please explain how this calculation happens?

Thanks


On Wed, Mar 15, 2017 at 7:54 PM, Sage Weil  wrote:

> On Wed, 15 Mar 2017, Brad Hubbard wrote:
> > +ceph-devel
> >
> > On Wed, Mar 15, 2017 at 5:25 PM, nokia ceph 
> wrote:
> > > Hello,
> > >
> > > We suspect these messages not only at the time of OSD creation. But in
> idle
> > > conditions also. May I know what is the impact of these error? Can we
> safely
> > > ignore this? Or is there any way/config to fix this problem
> > >
> > > Few occurrence for these events as follows:---
> > >
> > > 
> > > 2017-03-14 17:16:09.500370 7fedeba61700  4 rocksdb: (Original Log Time
> > > 2017/03/14-17:16:09.453130) [default] Level-0 commit table #60 started
> > > 2017-03-14 17:16:09.500374 7fedeba61700  4 rocksdb: (Original Log Time
> > > 2017/03/14-17:16:09.500273) [default] Level-0 commit table #60:
> memtable #1
> > > done
> > > 2017-03-14 17:16:09.500376 7fedeba61700  4 rocksdb: (Original Log Time
> > > 2017/03/14-17:16:09.500297) EVENT_LOG_v1 {"time_micros":
> 1489511769500289,
> > > "job": 17, "event": "flush_finished", "lsm_state": [2, 4, 6, 0, 0, 0,
> 0],
> > > "immutable_memtables": 0}
> > > 2017-03-14 17:16:09.500382 7fedeba61700  4 rocksdb: (Original Log Time
> > > 2017/03/14-17:16:09.500330) [default] Level summary: base level 1 max
> bytes
> > > base 268435456 files[2 4 6 0 0 0 0] max score 0.76
> > >
> > > 2017-03-14 17:16:09.500390 7fedeba61700  4 rocksdb: [JOB 17] Try to
> delete
> > > WAL files size 244090350, prev total WAL file size 247331500, number
> of live
> > > WAL files 2.
> > >
> > > 2017-03-14 17:34:11.610513 7fedf3a71700 -1
> > > bdev(/var/lib/ceph/osd/ceph-73/block) aio_submit retries 6
> >
> > These errors come from here.
> >
> > void KernelDevice::aio_submit(IOContext *ioc)
> > {
> > ...
> > int r = aio_queue.submit(*cur, &retries);
> > if (retries)
> >   derr << __func__ << " retries " << retries << dendl;
> >
> > The submit function is this one which calls libaio's io_submit
> > function directly and increments retries if it receives EAGAIN.
> >
> > #if defined(HAVE_LIBAIO)
> > int FS::aio_queue_t::submit(aio_t &aio, int *retries)
> > {
> >   // 2^16 * 125us = ~8 seconds, so max sleep is ~16 seconds
> >   int attempts = 16;
> >   int delay = 125;
> >   iocb *piocb = &aio.iocb;
> >   while (true) {
> > int r = io_submit(ctx, 1, &piocb); <-NOTE
> > if (r < 0) {
> >   if (r == -EAGAIN && attempts-- > 0) { <-NOTE
> > usleep(delay);
> > delay *= 2;
> > (*retries)++;
> > continue;
> >   }
> >   return r;
> > }
> > assert(r == 1);
> > break;
> >   }
> >   return 0;
> > }
> >
> >
> > From the man page.
> >
> > IO_SUBMIT(2)   Linux Programmer's
> > Manual  IO_SUBMIT(2)
> >
> > NAME
> >io_submit - submit asynchronous I/O blocks for processing
> > ...
> > RETURN VALUE
> >On success, io_submit() returns the number of iocbs submitted
> > (which may be 0 if nr is zero).  For the  failure
> >return, see NOTES.
> >
> > ERRORS
> >EAGAIN Insufficient resources are available to queue any iocbs.
> >
> > I suspect increasing bdev_aio_max_queue_depth may help here but some
> > of the other devs may have more/better ideas.
>
> Yes--try increasing bdev_aio_max_queue_depth.  It defaults to 32; try
> changing it to 128, 1024, or 4096 and see if these errors go away.
>
> I've never been able to trigger this on my test boxes, but I put in the
> warning to help ensure we pick a good default.
>
> What kernel version are you running?
>
> Thanks!
> sage
>