Re: [ceph-users] libvirt + rbd questions
Hi Dajka:

Yes, you are right. Thanks for your reply. I did the same thing on a different machine and it worked. I compared the packages but found nothing different yet. There must be something I am still missing.

On Fri, Aug 25, 2017 at 3:25 PM, Dajka Tamás <vi...@vipernet.hu> wrote:
> Hi,
>
> is qemu-img working for you on the VM host machine?
>
> Did you create the vol? What does 'rbd ls' say?
>
> Did you feed the secret (and the key created with ceph auth) to virsh?
>
> Cheers,
>
> Tom
>
> p.s.: you may need another package for qemu rbd support - I did so on
> latest Debian (stretch)
>
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Z Will
> Sent: Friday, August 25, 2017 8:49 AM
> To: Ceph-User <ceph-us...@ceph.com>
> Subject: [ceph-users] libvirt + rbd questions
>
> Hi all:
>
> I have tried to install a VM using rbd as its disk, following the steps in
> the ceph docs, but ran into some problems. The package environment is:
>
> CentOS Linux release 7.2.1511 (Core)
> libvirt-2.0.0-10.el7_3.9.x86_64
> libvirt-python-2.0.0-2.el7.x86_64
> virt-manager-common-1.4.0-2.el7.noarch
> qemu-2.0.0-1.el7.6.x86_64
>
> The procedure is: create a pool
>
> [libvirt pool XML stripped by the list archive; surviving values: name
> rbd2, capacity 48283717632 bytes, available 46990565376 bytes, source
> type rbd, cephx username 'admin']
>
> create a vol t2, and run the following command:
>
> virt-install -n puppy2 --vcpu 1 -c /etc/ceph/tahr64-6.0.5.iso --disk
> vol=rbd2/t2 --memory 1024 --vnc --vnclisten=0.0.0.0 -v
>
> It failed; the error is:
>
> WARNING  No operating system detected, VM performance may suffer.
>          Specify an OS with --os-variant for optimal results.
> WARNING  Unable to connect to graphical console: virt-viewer not installed.
>          Please install the 'virt-viewer' package.
> WARNING  No console to launch for the guest, defaulting to --wait -1
>
> Starting install...
> ERROR    internal error: qemu unexpectedly closed the monitor:
> (process:663223): GLib-WARNING **: gmem.c:482: custom memory allocation
> vtable not supported
> 2017-08-25T06:33:59.392000Z qemu-system-x86_64: -drive
> file=rbd:rbd/t2:auth_supported=none:mon_host=10.202.127.11\:6789,format=raw,if=none,id=drive-ide0-0-0:
> could not open disk image
> rbd:rbd/t2:auth_supported=none:mon_host=10.202.127.11\:6789: Unknown protocol
>
> Domain installation does not appear to have been successful.
> If it was, you can restart your domain by running:
>   virsh --connect qemu:///system start puppy2
> otherwise, please restart your installation.
>
> After googling, I changed /usr/share/virt-manager/virtinst/guest.py like this:
>
>     def _build_xml(self):
>         install_xml = self._get_install_xml(install=True)
>         final_xml = self._get_install_xml(install=False)
>
>         auth_secret = '''
>             [auth/secret XML stripped by the list archive; it contained
>             uuid='f75248a0-ac7e-4c0e-a90a-2d8d00833132']
>         '''
>         import re
>         # [regex pattern partly stripped by the list archive]
>         rgx_auth = re.compile('(?<= )([^>]*?">).*?(?= *?)', re.S)
>
>         install_xml = rgx_auth.sub('\\1' + auth_secret, install_xml)
>         final_xml = rgx_auth.sub('\\1' + auth_secret, final_xml)
>         logging.debug("Generated install XML: %s",
>                       (install_xml and ("\n" + install_xml) or "None required"))
>         logging.debug("Generated boot XML: \n%s", final_xml)
>
>         return install_xml, final_xml
>
> and ran:
>
> virt-install -n puppy2 --vcpu 1 -c /etc/ceph/tahr64-6.0.5.iso --disk
> vol=rbd2/t2 --memory 1024 --vnc --vnclisten=0.0.0.0 -v --print-xml
>
> [generated domain XML stripped by the list archive; it contained the
> secret element with uuid='f75248a0-ac7e-4c0e-a90a-2d8d00833132']
>
> which looks OK. So I ran again:
>
> virt-install -n puppy2 --vcpu 1 -c /etc/ceph/tahr64-6.0.5.iso --disk
> vol=rbd2/t2 --memory 1024 --vnc --vnclisten=0.0.0.0 -v
>
> and still got an error:
>
> WARNING  No operating system detected, VM performance may suffer.
>          Specify an OS with --os-variant for optimal results.
> WARNING  Unable to connect to graphical console: virt-viewer not installed.
>          Please install the 'virt-viewer' package.
> WARNING  No console to launch for the guest, defaulting to --wait -1
>
> Starting install...
> ERROR    internal error: qemu unexpectedly closed the monitor:
> (process:663621): GLib-WARNING **: gmem.c:482: custom memory allocation
> vtable not supported
> 2017-08-25T06:42:09.429057Z qemu-system-x86_64: -drive
> file=rbd:rbd/puppy2:id=admin:key=AQCeuf5YUac6NxAAU3CCcwnaI5z7pgUww4Gyrg==:auth_supported=cephx\;none:mon_host=10.202.127.1
[ceph-users] libvirt + rbd questions
Hi all:

I have tried to install a VM using rbd as its disk, following the steps in the ceph docs, but ran into some problems. The package environment is:

CentOS Linux release 7.2.1511 (Core)
libvirt-2.0.0-10.el7_3.9.x86_64
libvirt-python-2.0.0-2.el7.x86_64
virt-manager-common-1.4.0-2.el7.noarch
qemu-2.0.0-1.el7.6.x86_64

The procedure is: create a pool

[libvirt pool XML stripped by the list archive; surviving values: name rbd2, capacity 48283717632 bytes, available 46990565376 bytes, source type rbd]

create a vol t2, and run the following command:

virt-install -n puppy2 --vcpu 1 -c /etc/ceph/tahr64-6.0.5.iso --disk vol=rbd2/t2 --memory 1024 --vnc --vnclisten=0.0.0.0 -v

It failed; the error is:

WARNING  No operating system detected, VM performance may suffer.
         Specify an OS with --os-variant for optimal results.
WARNING  Unable to connect to graphical console: virt-viewer not installed.
         Please install the 'virt-viewer' package.
WARNING  No console to launch for the guest, defaulting to --wait -1

Starting install...
ERROR    internal error: qemu unexpectedly closed the monitor:
(process:663223): GLib-WARNING **: gmem.c:482: custom memory allocation vtable not supported
2017-08-25T06:33:59.392000Z qemu-system-x86_64: -drive file=rbd:rbd/t2:auth_supported=none:mon_host=10.202.127.11\:6789,format=raw,if=none,id=drive-ide0-0-0: could not open disk image rbd:rbd/t2:auth_supported=none:mon_host=10.202.127.11\:6789: Unknown protocol

Domain installation does not appear to have been successful.
If it was, you can restart your domain by running:
  virsh --connect qemu:///system start puppy2
otherwise, please restart your installation.

After googling, I changed /usr/share/virt-manager/virtinst/guest.py like this:

    def _build_xml(self):
        install_xml = self._get_install_xml(install=True)
        final_xml = self._get_install_xml(install=False)

        auth_secret = '''
            [auth/secret XML stripped by the list archive]
        '''
        import re
        # [regex pattern partly stripped by the list archive]
        rgx_auth = re.compile('(?<=]*?">).*?(?= *?)', re.S)

        install_xml = rgx_auth.sub('\\1' + auth_secret, install_xml)
        final_xml = rgx_auth.sub('\\1' + auth_secret, final_xml)
        logging.debug("Generated install XML: %s",
                      (install_xml and ("\n" + install_xml) or "None required"))
        logging.debug("Generated boot XML: \n%s", final_xml)

        return install_xml, final_xml

and ran:

virt-install -n puppy2 --vcpu 1 -c /etc/ceph/tahr64-6.0.5.iso --disk vol=rbd2/t2 --memory 1024 --vnc --vnclisten=0.0.0.0 -v --print-xml

The generated XML looks OK. So I ran again:

virt-install -n puppy2 --vcpu 1 -c /etc/ceph/tahr64-6.0.5.iso --disk vol=rbd2/t2 --memory 1024 --vnc --vnclisten=0.0.0.0 -v

and still got an error:

WARNING  No operating system detected, VM performance may suffer.
         Specify an OS with --os-variant for optimal results.
WARNING  Unable to connect to graphical console: virt-viewer not installed.
         Please install the 'virt-viewer' package.
WARNING  No console to launch for the guest, defaulting to --wait -1

Starting install...
ERROR    internal error: qemu unexpectedly closed the monitor:
(process:663621): GLib-WARNING **: gmem.c:482: custom memory allocation vtable not supported
2017-08-25T06:42:09.429057Z qemu-system-x86_64: -drive file=rbd:rbd/puppy2:id=admin:key=AQCeuf5YUac6NxAAU3CCcwnaI5z7pgUww4Gyrg==:auth_supported=cephx\;none:mon_host=10.202.127.11\:6789,format=raw,if=none,id=drive-ide0-0-0: could not open disk image rbd:rbd/puppy2:id=admin:key=AQCeuf5YUac6NxAAU3CCcwnaI5z7pgUww4Gyrg==:auth_supported=cephx\;none:mon_host=10.202.127.11\:6789: Unknown protocol

Domain installation does not appear to have been successful.
If it was, you can restart your domain by running:
  virsh --connect qemu:///system start puppy
otherwise, please restart your installation.

I tried to just create an rbd image as a disk and attach it to an existing VM, and that worked, so I don't think it is a package version problem. Has anyone met this problem? Or where am I wrong in the above procedure?

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
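The guest.py patch above injects a cephx auth/secret element into the generated disk XML via a regex whose pattern the archive has mangled. As a rough, hypothetical sketch of the same idea (a toy disk definition and ElementTree instead of the original regex substitution; only the secret UUID comes from the post, everything else is illustrative):

```python
import xml.etree.ElementTree as ET

# Toy libvirt RBD disk XML; element names follow libvirt's domain XML
# format, but the concrete values here are illustrative.
disk_xml = """<disk type='network' device='disk'>
  <driver name='qemu' type='raw'/>
  <source protocol='rbd' name='rbd2/t2'>
    <host name='10.202.127.11' port='6789'/>
  </source>
  <target dev='vda' bus='virtio'/>
</disk>"""

def inject_auth(xml_text, username, secret_uuid):
    """Add an <auth> child (cephx user + libvirt secret UUID) to a disk element."""
    disk = ET.fromstring(xml_text)
    auth = ET.SubElement(disk, 'auth', username=username)
    ET.SubElement(auth, 'secret', type='ceph', uuid=secret_uuid)
    return ET.tostring(disk, encoding='unicode')

patched = inject_auth(disk_xml, 'admin',
                      'f75248a0-ac7e-4c0e-a90a-2d8d00833132')
print(patched)
```

Doing this on parsed XML rather than with a regex avoids the brittleness that makes the guest.py pattern hard to get right.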
Re: [ceph-users] Speeding up garbage collection in RGW
I think if you want to delete through gc, increase this:

    OPTION(rgw_gc_processor_max_time, OPT_INT, 3600)  // total run time for a single gc processor work

and decrease this:

    OPTION(rgw_gc_processor_period, OPT_INT, 3600)  // gc processor cycle time

Or perhaps there is some option to bypass the gc entirely.

On Tue, Jul 25, 2017 at 5:05 AM, Bryan Stillwell wrote:
> Wouldn't doing it that way cause problems, since references to the objects
> wouldn't be getting removed from .rgw.buckets.index?
>
> Bryan
>
> From: Roger Brown
> Date: Monday, July 24, 2017 at 2:43 PM
> To: Bryan Stillwell, "ceph-users@lists.ceph.com"
> Subject: Re: [ceph-users] Speeding up garbage collection in RGW
>
> I hope someone else can answer your question better, but in my case I found
> something like this helpful to delete objects faster than I could through
> the gateway:
>
> rados -p default.rgw.buckets.data ls | grep 'replace this with pattern
> matching files you want to delete' | xargs -d '\n' -n 200 rados -p
> default.rgw.buckets.data rm
>
> On Mon, Jul 24, 2017 at 2:02 PM Bryan Stillwell wrote:
> I'm in the process of cleaning up a test that an internal customer did on
> our production cluster that produced over a billion objects spread across
> 6000 buckets. So far I've been removing the buckets like this:
>
> printf %s\\n bucket{1..6000} | xargs -I{} -n 1 -P 32 radosgw-admin bucket rm --bucket={} --purge-objects
>
> However, the disk usage doesn't seem to be getting reduced at the same rate
> the objects are being removed. From what I can tell, a large number of the
> objects are waiting for garbage collection.
>
> When I first read the docs it sounded like the garbage collector would only
> remove 32 objects every hour, but after looking through the logs I'm seeing
> about 55,000 objects removed every hour. That's about 1.3 million a day, so
> at this rate it'll take a couple of years to clean up the rest!
> For comparison, the purge-objects command above is removing (but not
> GC'ing) about 30 million objects a day, so a much more manageable 33 days
> to finish.
>
> I've done some digging and it appears I should be changing these
> configuration options:
>
> rgw gc max objs (default: 32)
> rgw gc obj min wait (default: 7200)
> rgw gc processor max time (default: 3600)
> rgw gc processor period (default: 3600)
>
> A few questions I have though are:
>
> Should 'rgw gc processor max time' and 'rgw gc processor period' always be
> set to the same value?
>
> Which would be better: increasing 'rgw gc max objs' to something like 1024,
> or reducing the 'rgw gc processor' times to something like 60 seconds?
>
> Any other guidance on the best way to adjust these values?
>
> Thanks,
> Bryan
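For reference, Bryan's rate estimates can be reproduced with quick arithmetic (all figures taken from the thread):

```python
# Figures from the thread: ~1 billion object backlog, observed GC rate
# of ~55,000 objects/hour, and a purge-objects rate of ~30M objects/day.
backlog = 1_000_000_000

gc_per_day = 55_000 * 24            # objects the GC removes per day
gc_days = backlog / gc_per_day      # days to drain the backlog via GC

purge_per_day = 30_000_000
purge_days = backlog / purge_per_day  # days at the purge-objects rate

print(gc_per_day, round(gc_days), round(purge_days))
# 1,320,000/day -> roughly 758 days (a couple of years) via GC,
# versus roughly 33 days at the purge rate, matching the thread.
```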
Re: [ceph-users] Monitor as local VM on top of the server pool cluster?
For a large cluster there will be a lot of change at any time, which means the pressure on the mon will be high at times, because every change goes through the leader. So for this, local storage for the mon should be good enough; I think this may be a consideration.

On Tue, Jul 11, 2017 at 11:29 AM, Brad Hubbard wrote:
> On Tue, Jul 11, 2017 at 3:44 AM, David Turner wrote:
>> Mons are a paxos quorum and as such want to be in odd numbers. 5 is
>> generally what people go with. I think I've heard of a few people using 7
>> mons, but you do not want to have an even number of mons or an ever-growing
>
> Unless your cluster is very large, three should be sufficient.
>
>> number of mons. The reason you do not want mons running on the same
>> hardware as OSDs is resource contention during recovery. As long as the Xen
>> servers you are putting the mons on are not going to cause any source of
>> resource limitation/contention, then virtualizing them should be fine for
>> you. Make sure that you aren't configuring the mon to run using an RBD for
>> its storage; that would be very bad.
>>
>> The mon quorum elects a leader and that leader will be in charge of the
>> quorum. Having local mons doesn't do anything, as the clients will still be
>> talking to the mons as a quorum and won't necessarily talk to the mon
>> running on them. The vast majority of communication to the cluster that
>> your Xen servers will be doing is to the OSDs anyway; very little
>> communication goes to the mons.
>>
>> On Mon, Jul 10, 2017 at 1:21 PM Massimiliano Cuttini wrote:
>>>
>>> Hi everybody,
>>>
>>> I would like to separate MON from OSD as recommended.
>>> In order to do so without new hardware, I'm planning to create all the
>>> monitors as virtual machines on top of my hypervisors (Xen).
>>> I'm testing a pool of 8 Xen nodes.
>>>
>>> I'm thinking about creating 8 monitors and pinning one monitor to each
>>> Xen node. So, I'm guessing, every Ceph monitor will be local to each
>>> client node. This should speed up the system by connecting to monitors
>>> locally, with a little overhead for the monitor sync between nodes.
>>>
>>> Is it a good idea to have a local monitor virtualized on top of each
>>> hypervisor node?
>>> Do you see any underestimation or wrong design in this?
>>>
>>> Thanks for every helpful piece of info.
>>>
>>> Regards,
>>> Max
>
> --
> Cheers,
> Brad
Re: [ceph-users] ceph-mon leader election problem, should it be improved ?
Hi Joao:

> Basically, this would be something similar to heartbeats. If a monitor can't
> reach all monitors in an existing quorum, then just don't do anything.

Based on your solution, I make a little change:

- send a probe to all monitors
- if it sees a quorum, it will join the current quorum through a join_quorum
  message; when the leader receives this, it will change the quorum and claim
  victory again. On timeout, it means it can't reach the leader, so do
  nothing and try again later from bootstrap.
- if it gets > 1/2 acks, do as before: call an election.

With this, sometimes the leader does not have the smallest rank number; I think that is fine. In the quorum message there will be one more byte to point out the leader's rank number. I think this will perform the same as before and can tolerate some network partition errors, and it only needs a small code change. Any suggestions for this? Am I missing any considerations?

On Wed, Jul 5, 2017 at 6:26 PM, Joao Eduardo Luis <j...@suse.de> wrote:
> On 07/05/2017 08:01 AM, Z Will wrote:
>>
>> Hi Joao:
>> I think this is all because we choose the monitor with the
>> smallest rank number to be leader. For this kind of network error, no
>> matter which mon has lost connection with the mon who has the
>> smallest rank num, it will be constantly calling an election; that is to
>> say, it will constantly affect the cluster until it is stopped by a
>> human. So do you think it makes sense if I try to figure out a way to
>> choose the monitor who can see the most monitors, or with the smallest
>> rank num if the view num is the same, to be leader?
>> In the probing phase:
>>    they will know their own view, so they can set a view num.
>> In the election phase:
>>    they send the view num and rank num.
>>    When receiving the election message, a mon compares the view num
>> (higher is leader) and rank num (lower is leader).
>
> As I understand it, our elector trades off reliability in case of network
> failure for expediency in forming a quorum. This by itself is not a problem
> since we don't see many real-world cases where this behaviour happens, and
> we are a lot more interested in making sure we have a quorum - given that
> without a quorum your cluster is effectively unusable.
>
> Currently, we form a quorum with a minimal number of messages passed.
> From my poor recollection, I think the Elector works something like:
>
> - 1 probe message to each monitor in the monmap
> - receives defer from a monitor, or defers to a monitor
> - declares victory if the number of defers is an absolute majority
> (including one's own defer).
>
> An election cycle takes about 4-5 messages to complete, with roughly two
> round-trips (in the best-case scenario).
>
> Figuring out which monitor is able to contact the highest number of
> monitors, and having said monitor be elected the leader, will necessarily
> increase the number of messages transferred.
>
> A rough idea would be:
>
> - all monitors send probes to all other monitors in the monmap;
> - all monitors need to ack the others' probes;
> - each monitor counts the number of monitors it can reach, and then sends
> a message proposing itself as the leader to the other monitors, with the
> list of monitors it sees;
> - each monitor proposes itself as the leader, or defers to some other
> monitor.
>
> This is closer to 3 round-trips.
>
> Additionally, we'd have to account for the fact that some monitors may be
> able to reach all other monitors, while some may only be able to reach a
> portion. How do we handle this scenario?
>
> - What do we do with monitors that do not reach all other monitors?
> - Do we ignore them for electoral purposes?
> - Are they part of the final quorum?
> - What if we need those monitors to form a quorum?
>
> Personally, I think the easiest solution to this problem would be
> blacklisting a problematic monitor (for a given amount of time, or until a
> new election is needed due to loss of quorum, or by human intervention).
>
> For example, if a monitor believes it should be the leader, and all other
> monitors are deferring to someone else that is not reachable, the monitor
> could then enter a special-case branch:
>
> - send a probe to all monitors
> - receive acks
> - share that with other monitors
> - if that list is missing monitors, then blacklist the monitor for a
> period, and send a message to that monitor with that decision
> - the monitor would blacklist itself and retry in a given amount of time.
>
> Basically, this would be something similar to heartbeats. If a monitor
> can't reach all monitors in an existing quorum, then just don't do
> anything.
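The modified bootstrap step proposed above can be sketched as a toy decision function (a model for discussion, not the actual ceph-mon Elector code; the join_quorum name comes from the proposal):

```python
# Toy model of the proposed bootstrap decision (not ceph-mon code).
def bootstrap_decision(acks, monmap_size, leader_reachable, quorum_exists):
    """acks: set of monitor ranks that answered our probe (self included)."""
    majority = monmap_size // 2 + 1
    if quorum_exists:
        # An existing quorum is visible: join it via join_quorum instead of
        # forcing a new election; on timeout, stand by and retry bootstrap.
        return 'join_quorum' if leader_reachable else 'stand_by'
    if len(acks) >= majority:
        # No live quorum, but we can reach a majority: call an election.
        return 'call_election'
    # Cannot reach a majority: do nothing and retry later.
    return 'stand_by'

# mon.1 cut off from leader mon.0 but still seeing mon.2 (quorum member):
# it stands by instead of repeatedly calling elections.
print(bootstrap_decision({1, 2}, 3, leader_reachable=False, quorum_exists=True))
```

The point of the sketch is the ordering: an existing quorum is honoured before the majority check, which is what stops a partitioned monitor from destabilizing a working cluster.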
Re: [ceph-users] ceph-mon leader election problem, should it be improved ?
Hi Sage:

After these days of consideration, and reading some related papers, I think we can make a very small change to solve the problems above and let the monitors tolerate most network partitions. Most of the logic stays the same as before, except one point:

- send a probe to each monitor in the monmap
- receive acks, and remember them, if it gets > 1/2 acks, or sees a quorum
- if it sees a quorum, it will try to reach the current leader to join the
  current quorum, not try to call a new election. If joining times out, it
  should stand by and try later, and it will sync state from the other mons
  when needed to keep performance up.

All other logic is the same as before. What do you think of it?

On Thu, Jul 6, 2017 at 10:31 PM, Sage Weil <s...@newdream.net> wrote:
> On Thu, 6 Jul 2017, Z Will wrote:
>> Hi Joao:
>>
>> Thanks for the thorough analysis. My initial concern is that I think
>> in some cases a network failure will leave a low-rank monitor seeing few
>> siblings (not enough to form a quorum), while some high-rank monitor
>> can see more siblings, so I want to try to choose the one who can see
>> the most to be leader, to tolerate network errors to the biggest
>> extent, not just to solve the corner case. Yes, you are right:
>> this kind of complex network failure is rare. Trying to find
>> out who can contact the highest number of monitors can only cover
>> some of the situations, and will introduce other complexity and
>> hurt efficiency. This is not good. Blacklisting a problematic
>> monitor is a simple and good idea. The implementation in the monitor now
>> is like this: no matter which high-rank monitor loses its connection
>> with the leader, that monitor will constantly try to call a leader
>> election, affect its siblings, and then affect the whole cluster.
>> Because the leader election procedure is fast, it will be OK for a
>> short time, but soon the leader election starts again, and the cluster
>> becomes unstable. I think the probability of this kind of network error
>> is high, yes? So based on your idea, make a little change:
>>
>> - send a probe to all monitors
>> - receive acks
>> - After receiving acks, it will know the current quorum and how many
>> monitors it can reach.
>>    If it can reach the current leader, then it will try to join the
>> current quorum.
>>    If it cannot reach the current leader, then it will decide whether
>> to stand by for a while and try later, or start a leader election,
>> based on the information got from the probing phase.
>>
>> Do you think this will be OK?
>
> I'm worried that even if we can form an initial quorum, we are currently
> very casual about the "call new election" logic. If a mon is not part of
> the quorum it will currently trigger a new election... and with this
> change it will then not be included in it because it can't reach all mons.
> The logic there will also have to change so that it confirms that it can
> reach a majority of mon peers before requesting a new election.
>
> sage
>
>> On Wed, Jul 5, 2017 at 6:26 PM, Joao Eduardo Luis <j...@suse.de> wrote:
>> > On 07/05/2017 08:01 AM, Z Will wrote:
>> >>
>> >> Hi Joao:
>> >> I think this is all because we choose the monitor with the
>> >> smallest rank number to be leader. For this kind of network error, no
>> >> matter which mon has lost connection with the mon who has the
>> >> smallest rank num, it will be constantly calling an election; that is
>> >> to say, it will constantly affect the cluster until it is stopped by
>> >> a human. So do you think it makes sense if I try to figure out a way
>> >> to choose the monitor who can see the most monitors, or with the
>> >> smallest rank num if the view num is the same, to be leader?
>> >> In the probing phase:
>> >>    they will know their own view, so they can set a view num.
>> >> In the election phase:
>> >>    they send the view num and rank num.
>> >>    When receiving the election message, a mon compares the view num
>> >> (higher is leader) and rank num (lower is leader).
>> >
>> > As I understand it, our elector trades off reliability in case of
>> > network failure for expediency in forming a quorum. This by itself is
>> > not a problem since we don't see many real-world cases where this
>> > behaviour happens, and we are a lot more interested in making sure we
>> > have a quorum - given that without a quorum your cluster is
>> > effectively unusable.
Re: [ceph-users] ceph-mon leader election problem, should it be improved ?
Hi Joao:

Thanks for the thorough analysis. My initial concern is that I think in some cases a network failure will leave a low-rank monitor seeing few siblings (not enough to form a quorum), while some high-rank monitor can see more siblings, so I want to try to choose the one who can see the most to be leader, to tolerate network errors to the biggest extent, not just to solve the corner case. Yes, you are right: this kind of complex network failure is rare. Trying to find out who can contact the highest number of monitors can only cover some of the situations, and will introduce other complexity and hurt efficiency. This is not good. Blacklisting a problematic monitor is a simple and good idea. The implementation in the monitor now is like this: no matter which high-rank monitor loses its connection with the leader, that monitor will constantly try to call a leader election, affect its siblings, and then affect the whole cluster. Because the leader election procedure is fast, it will be OK for a short time, but soon the leader election starts again, and the cluster becomes unstable. I think the probability of this kind of network error is high, yes? So based on your idea, make a little change:

- send a probe to all monitors
- receive acks
- After receiving acks, it will know the current quorum and how many
  monitors it can reach.
  If it can reach the current leader, then it will try to join the current
  quorum.
  If it cannot reach the current leader, then it will decide whether to
  stand by for a while and try later, or start a leader election, based on
  the information got from the probing phase.

Do you think this will be OK?

On Wed, Jul 5, 2017 at 6:26 PM, Joao Eduardo Luis <j...@suse.de> wrote:
> On 07/05/2017 08:01 AM, Z Will wrote:
>>
>> Hi Joao:
>> I think this is all because we choose the monitor with the
>> smallest rank number to be leader. For this kind of network error, no
>> matter which mon has lost connection with the mon who has the
>> smallest rank num, it will be constantly calling an election; that is to
>> say, it will constantly affect the cluster until it is stopped by a
>> human. So do you think it makes sense if I try to figure out a way to
>> choose the monitor who can see the most monitors, or with the smallest
>> rank num if the view num is the same, to be leader?
>> In the probing phase:
>>    they will know their own view, so they can set a view num.
>> In the election phase:
>>    they send the view num and rank num.
>>    When receiving the election message, a mon compares the view num
>> (higher is leader) and rank num (lower is leader).
>
> As I understand it, our elector trades off reliability in case of network
> failure for expediency in forming a quorum. This by itself is not a problem
> since we don't see many real-world cases where this behaviour happens, and
> we are a lot more interested in making sure we have a quorum - given that
> without a quorum your cluster is effectively unusable.
>
> Currently, we form a quorum with a minimal number of messages passed.
> From my poor recollection, I think the Elector works something like:
>
> - 1 probe message to each monitor in the monmap
> - receives defer from a monitor, or defers to a monitor
> - declares victory if the number of defers is an absolute majority
> (including one's own defer).
>
> An election cycle takes about 4-5 messages to complete, with roughly two
> round-trips (in the best-case scenario).
>
> Figuring out which monitor is able to contact the highest number of
> monitors, and having said monitor be elected the leader, will necessarily
> increase the number of messages transferred.
>
> A rough idea would be:
>
> - all monitors send probes to all other monitors in the monmap;
> - all monitors need to ack the others' probes;
> - each monitor counts the number of monitors it can reach, and then sends
> a message proposing itself as the leader to the other monitors, with the
> list of monitors it sees;
> - each monitor proposes itself as the leader, or defers to some other
> monitor.
>
> This is closer to 3 round-trips.
>
> Additionally, we'd have to account for the fact that some monitors may be
> able to reach all other monitors, while some may only be able to reach a
> portion. How do we handle this scenario?
>
> - What do we do with monitors that do not reach all other monitors?
> - Do we ignore them for electoral purposes?
> - Are they part of the final quorum?
> - What if we need those monitors to form a quorum?
>
> Personally, I think the easiest solution to this problem would be
> blacklisting a problematic monitor (for a given amount of time, or until a
> new election is needed due to loss of quorum, or by human intervention).
Re: [ceph-users] ceph-mon leader election problem, should it be improved ?
Hi Joao:

I think this is all because we choose the monitor with the smallest rank number to be leader. For this kind of network error, no matter which mon has lost connection with the mon who has the smallest rank num, it will be constantly calling an election; that is to say, it will constantly affect the cluster until it is stopped by a human. So do you think it makes sense if I try to figure out a way to choose the monitor who can see the most monitors, or with the smallest rank num if the view num is the same, to be leader?

In the probing phase:
    they will know their own view, so they can set a view num.
In the election phase:
    they send the view num and rank num.
    When receiving the election message, a mon compares the view num (higher is leader) and rank num (lower is leader).

On Tue, Jul 4, 2017 at 9:25 PM, Joao Eduardo Luis <j...@suse.de> wrote:
> On 07/04/2017 06:57 AM, Z Will wrote:
>>
>> Hi:
>>    I am testing ceph-mon brain split. I have read the code. If I
>> understand it right, I know it won't split-brain. But I think there is
>> still another problem. My ceph version is 0.94.10. Here is my test
>> detail:
>>
>> 3 ceph-mons, their ranks are 0, 1, 2 respectively. I stop the rank-1
>> mon, and use iptables to block the communication between mon 0 and
>> mon 1. When the cluster is stable, I start mon.1. I found the 3
>> monitors all stop working well. They are all trying to call a new
>> leader election. This means the cluster can't work anymore.
>>
>> Here is my analysis. Because a mon will always respond to a leader
>> election message, and in my test communication between mon.0 and
>> mon.1 is blocked, mon.1 will always try to be leader, because it
>> will always see mon.2, and it should win over mon.2. Mon.0 should
>> always win over mon.2. But mon.2 will always respond to the election
>> message issued by mon.1, so this loop will never end. Am I right?
>>
>> Should this be considered a problem? Or was it just designed like this,
>> and should be handled by a human?
>
> This is a known behaviour, quite annoying, but easily identifiable by
> having the same monitor constantly calling an election and usually timing
> out because the peon did not defer to it.
>
> In a way, the elector algorithm does what it is intended to. Solving this
> corner case would be nice, but I don't think there's a good way to solve
> it. We may be able to presume a monitor is in trouble during the probe
> phase, to disqualify a given monitor from the election, but in the end
> this is a network issue that may be transient or unpredictable and there's
> only so much we can account for.
>
> Dealing with it automatically would be nice, but I think, thus far, the
> easiest way to address this particular issue is human intervention.
>
> -Joao
Re: [ceph-users] ceph-mon leader election problem, should it be improved ?
Hi Alvaro: From the code , I see unsigned need = monmap->size() / 2 + 1; So for 2 mons , the quorum must be 2 so that it can start election. That's why I use 3 mons. I know if I stop mon.0 or mon.1 , everything will work fine. And if this failure happens, it must be handled by human ? Is there any way to handle it automaticly from design as you know ? On Tue, Jul 4, 2017 at 2:25 PM, Alvaro Soto <alsot...@gmail.com> wrote: > Z, > You are forcing a byzantine failure, the paxos implemented to form the > consensus ring of the mon daemons does not support this kind of failures, > that is why you get and erratic behaviour, I believe is the common paxos > algorithm implemented in mon daemon code. > > If you just gracefully shutdown a mon daemon everything will work fine, but > with this you can not prove a split brain situation, because you will force > the election of the leader by quorum. > > Maybe with 2 mon daemons and closing the communication between each of them > every mon daemon will believe that can be a leader because every daemon will > have the que quorum of 1 with no other vote. > > Just saying :) > > > On Jul 4, 2017 12:57 AM, "Z Will" <zhao6...@gmail.com> wrote: >> >> Hi: >>I am testing ceph-mon brain split . I have read the code . If I >> understand it right , I know it won't be brain split. But I think >> there is still another problem. My ceph version is 0.94.10. And here >> is my test detail : >> >> 3 ceph-mons , there ranks are 0, 1, 2 respectively.I stop the rank 1 >> mon , and use iptables to block the communication between mon 0 and >> mon 1. When the cluster is stable, start mon.1 . I found the 3 >> monitors will all can not work well. They are all trying to call new >> leader election . This means the cluster can't work anymore. >> >> Here is my analysis. 
Because a mon will always respond to a leader >> election message, and in my test the communication between mon.0 and >> mon.1 is blocked, mon.1 will always try to be leader, because it >> will always see mon.2, and it should win over mon.2. Mon.0 should >> always win over mon.2. But mon.2 will always respond to the election >> message issued by mon.1, so this loop will never end. Am I right? >> >> Should this be considered a problem? Or was it just designed like this, and >> should be handled by a human?
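The three-mon livelock described in this thread can be sketched in a few lines. This is a toy model of the rank-based elector, not Ceph code — it assumes only that the lowest rank wins among mons that can see each other, and that a mon which never hears the winner (or outranks it) keeps restarting the election:

```python
def stable_leader(n, blocked):
    """Rank of a stable leader, or None if elections livelock forever."""
    cut = {frozenset(e) for e in blocked}
    def sees(a, b):
        return a != b and frozenset((a, b)) not in cut
    quorum = n // 2 + 1                 # unsigned need = monmap->size()/2 + 1
    for cand in range(n):               # lower rank wins among visible mons
        supporters = [p for p in range(n) if p == cand or sees(cand, p)]
        if len(supporters) < quorum:
            continue
        # A mon that outranks cand, or cannot see cand at all, keeps proposing
        # itself; if it can reach any of cand's supporters, that supporter
        # keeps answering its election messages and the election restarts.
        disrupted = any(
            (c < cand or not sees(c, cand)) and
            any(sees(c, s) for s in supporters if s != c)
            for c in range(n) if c != cand)
        if not disrupted:
            return cand
    return None

print(stable_leader(3, blocked=set()))             # 0: healthy, mon.0 leads
print(stable_leader(3, blocked={(0, 1)}))          # None: the livelock above
print(stable_leader(3, blocked={(0, 1), (0, 2)}))  # 1: mon.0 fully isolated
```

Note the asymmetry: a fully isolated mon cannot disrupt anyone, but a mon that can still reach one member of the would-be quorum keeps the loop going — exactly the mon.1/mon.2 interaction analysed above.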
[ceph-users] ceph-mon leader election problem, should it be improved ?
Hi: I am testing ceph-mon brain split. I have read the code. If I understand it right, I know it won't split-brain. But I think there is still another problem. My ceph version is 0.94.10. And here is my test detail: 3 ceph-mons, their ranks are 0, 1, 2 respectively. I stop the rank 1 mon, and use iptables to block the communication between mon 0 and mon 1. When the cluster is stable, I start mon.1. I found the 3 monitors all fail to work properly. They are all trying to call a new leader election. This means the cluster can't work anymore. Here is my analysis. Because a mon will always respond to a leader election message, and in my test the communication between mon.0 and mon.1 is blocked, mon.1 will always try to be leader, because it will always see mon.2, and it should win over mon.2. Mon.0 should always win over mon.2. But mon.2 will always respond to the election message issued by mon.1, so this loop will never end. Am I right? Should this be considered a problem? Or was it just designed like this, and should be handled by a human?
Re: [ceph-users] Object storage performance tools
You can try this binary cosbench release package: https://github.com/intel-cloud/cosbench/releases/download/v0.4.2.c4/0.4.2.c4.zip . On Fri, Jun 16, 2017 at 3:08 PM, fridifree <fridif...@gmail.com> wrote: > Thank you for your comment. > > I feel that this tool is very buggy. > Do you have an up-to-date guide for this tool (I have problems bringing it up)? > > Thank you very much > > On Jun 16, 2017 10:00 AM, "Piotr Nowosielski" > <piotr.nowosiel...@allegrogroup.com> wrote: > > We are using https://github.com/intel-cloud/cosbench. In my opinion it is the > only true S3 performance testing tool. > > -- > > Piotr Nowosielski > > Senior Systems Engineer > Zespół Infrastruktury 5 > Grupa Allegro sp. z o.o. > Tel: +48 512 08 55 92 > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of > fridifree > Sent: Thursday, June 15, 2017 4:13 PM > To: Ceph Users <ceph-users@lists.ceph.com> > Subject: [ceph-users] Object storage performance tools > > Hi cephers, > > I want to benchmark my rgws and I cannot find any tool to benchmark the Amazon > S3 api (for Swift I am using getput, which is really great). > I really don't want to use Cosbench. > > Any ideas? > > Thanks
Re: [ceph-users] Help build a drive reliability service!
Hi Patrick: I want to ask a very tiny question. How many 9s do you claim for your storage durability? And how is it calculated? Based on the data you provided, have you found a failure model to refine the storage durability? On Thu, Jun 15, 2017 at 12:09 AM, David Turner wrote: > I understand concern over annoying drive manufacturers, but if you have data > to back it up you aren't slandering a drive manufacturer. If they don't > like the numbers that are found, then they should up their game or at least > request that you put in how your tests negatively affected their drive > endurance. For instance, WD Red drives are out of warranty just by being > placed in a chassis with more than 4 disks because they aren't rated for the > increased vibration from that many disks in a chassis. > > OTOH, if you are testing the drives within the bounds of the drive's > warranty, and not doing anything against the recommendation of the > manufacturer in the test use case (both physical and software), then there > is no slander when you say that drive A outperformed drive B. I know that > the drives I run at home are not nearly as resilient as the drives that I > use at the office, but I don't put my home cluster through a fraction of the > strain that I do at the office. The manufacturer knows that their cheaper > drive isn't as resilient as the more robust enterprise drives. Anyway, I'm > sure you guys have thought about all of that and are _generally_ pretty > smart. ;) > > In an early warning system that detects a drive that is close to failing, > you could implement a command to migrate off of the disk and then run > non-stop IO on it to finish off the disk to satisfy warranties. Potentially > this could be implemented with the osd daemon via a burn-in start-up option, > where it can be an OSD in the cluster that does not check in as up, but with > a different status so you can still monitor the health of the failing drive > from a ceph status.
This could also be useful for people that would like to > burn-in their drives, but don't want to dedicate infrastructure to > burning-in new disks before deploying them. Making this as easy as possible > on the end user/ceph admin, there could even be a ceph.conf option for OSDs > that are added to the cluster and have never been marked in to run > through a burn-in of X seconds (changeable in the config and defaulting to 0 > so as not to change the default behavior). I don't know if this is > over-thinking it or adding complexity where it shouldn't be, but it could be > used to get a drive to fail to use for an RMA. OTOH, for large deployments > we would RMA drives in batches and were never asked to prove that the drive > failed. We would RMA drives off of medium errors for HDDs and smart info > for SSDs, and of course for full failures. > > On Wed, Jun 14, 2017 at 11:38 AM Dan van der Ster > wrote: >> >> Hi Patrick, >> >> We've just discussed this internally and I wanted to share some notes. >> >> First, there are at least three separate efforts in our IT dept to >> collect and analyse SMART data -- it's clearly a popular idea and >> simple to implement, but this leads to repetition and begs for a >> common, good solution. >> >> One (perhaps trivial) issue is that it is hard to define exactly when >> a drive has failed -- it varies depending on the storage system. For >> Ceph I would define failure as EIO, which normally correlates with a >> drive medium error, but there were other ideas here. So if this should >> be a general purpose service, the sensor should have a pluggable >> failure indicator. >> >> There was also debate about what exactly we could do with a failure >> prediction model. Suppose the predictor told us a drive should fail in >> one week. We could proactively drain that disk, but then would it >> still fail? Will the vendor replace that drive under warranty only if >> it was *about to fail*?
>> >> Lastly, and more importantly, there is a general hesitation to publish >> this kind of data openly, given how negatively it could impact a >> manufacturer. Our lab certainly couldn't publish a report saying "here >> are the most and least reliable drives". I don't know if anonymising >> the data sources would help here, but anyway I'm curious what are your >> thoughts on that point. Maybe what can come out of this are the >> _components_ of a drive reliability service, which could then be >> deployed privately or publicly as appropriate. >> >> Thanks! >> >> Dan >> >> >> >> >> On Wed, May 24, 2017 at 8:57 PM, Patrick McGarry >> wrote: >> > Hey cephers, >> > >> > Just wanted to share the genesis of a new community project that could >> > use a few helping hands (and any amount of feedback/discussion that >> > you might like to offer). >> > >> > As a bit of backstory, around 2013 the Backblaze folks started >> > publishing statistics about hard drive reliability from within their
[ceph-users] ceph durability calculation and test method
Hi all: I have some questions about the durability of Ceph. I am trying to measure the durability of Ceph. I know it should be related to the host and disk failure probabilities, the failure detection time, when the recovery is triggered, and the recovery time. I use it with multiple replication, say k replicas. If I have N hosts, R racks, and O osds per host, ignoring the switch, how should I define the failure probability of a disk and of a host? I think they should be independent, and should be time-dependent. I googled it, but found little about it. I see AWS says it delivers 99.9% durability. How is this claimed? And can I design some test method to prove the durability? Or should I just let it run long enough and gather the statistics?
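For what it's worth, the usual back-of-the-envelope model multiplies the rate of initial disk failures by the probability that the other k-1 replicas fail inside the detection-plus-recovery window. A crude sketch — every input here (AFR, window, PG count) is an assumed illustrative number, and the model deliberately ignores exactly the correlated host/rack failures the question asks about, which is why CRUSH failure domains matter in practice:

```python
import math

def durability_nines(n_disks, k, afr, recovery_hours, pgs_per_disk=100):
    """Very crude annual durability estimate for k-way replication.

    Assumes independent disk failures at annualized failure rate `afr`,
    and that a PG is lost only if the other k-1 replicas all fail within
    the detection + recovery window. Host/rack correlation is ignored.
    """
    hours_per_year = 24 * 365
    p_in_window = afr * recovery_hours / hours_per_year
    failures_per_year = n_disks * afr            # recoveries started per year
    p_pg_lost = pgs_per_disk * p_in_window ** (k - 1)
    p_loss_per_year = failures_per_year * p_pg_lost
    return -math.log10(p_loss_per_year)          # "number of nines"

# 100 disks, 3 replicas, 4% AFR, 24 h to re-replicate a failed disk:
print(round(durability_nines(100, 3, 0.04, 24), 1))   # 5.3 nines
```

The model at least shows which knobs matter: shortening the recovery window or adding a replica buys nines multiplicatively, adding disks costs them only linearly.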
[ceph-users] should I use rocksdb ?
Hello gurus: My name is Will. I have just started studying Ceph and have a lot of interest in it. We are using ceph 0.94.10, and I am trying to tune the performance of Ceph to satisfy our requirements. We are using it as an object store now. I have tried some different configurations, but I still have some questions. How much impact does the performance of the omap kv database have on the cluster? Is there some way to measure it? I know rocksdb performs better than leveldb on fast devices like SSDs. So I wonder: if I put the bucket index on SSD and change the omap kvdb to rocksdb, will the performance be improved? Has anyone tried this? Or, if I keep using filestore but use EnhanceIO + SSD as a filesystem cache tier, and change the omap kv store of all OSDs to rocksdb, will the cluster performance be improved? Has anyone tried this?
[ceph-users] radosgw 100-continue problem
Hi: I used nginx + fastcgi + radosgw, with radosgw configured with "rgw print continue = true". RFC 2616 says: An origin server that sends a 100 (Continue) response MUST ultimately send a final status code, once the request body is received and processed, unless it terminates the transport connection prematurely. But in RADOSGW, for PUT requests, when radosgw receives Expect: 100-continue, it responds 100-continue; the code is like this: int RGWFCGX::send_status(int status, const char *status_name) { status_num = status; return print("Status: %d %s\r\n", status, status_name); } int RGWFCGX::send_100_continue() { int r = send_status(100, "Continue"); if (r >= 0) { flush(); } return r; } There is just one \r\n, which means the header section is not terminated. So when execution reaches send_response(), it continues to send the remaining headers. I wonder if this is the cause; nginx doesn't work due to this. I think this should be split into two responses.
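For reference, what the client should see on the wire is two complete responses: the interim 100 terminated by its own blank line, then the final status. A minimal sketch in plain HTTP terms — this illustrates only the framing, not the FastCGI record encoding radosgw actually emits; note also that nginx's fastcgi module is commonly reported not to pass 100-continue through, which is why "rgw print continue = false" is the usual suggestion behind non-Apache frontends (worth verifying for your versions):

```python
# A PUT with "Expect: 100-continue" should see two complete responses:
interim = b"HTTP/1.1 100 Continue\r\n\r\n"   # the blank line terminates it
final = (b"HTTP/1.1 200 OK\r\n"
         b"Content-Length: 0\r\n"
         b"\r\n")
wire = interim + final

# The suspected bug above: "Status: 100 Continue\r\n" ends with a single
# \r\n, so the interim response is never terminated and the final status
# line is parsed as just another header of the 100 response.
broken = b"Status: 100 Continue\r\n" + b"Status: 200 OK\r\n\r\n"

print(wire.count(b"\r\n\r\n"))    # 2: two properly delimited responses
print(broken.count(b"\r\n\r\n"))  # 1: the two responses have fused into one
```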
[ceph-users] radosgw fastcgi problem
Hi: Very sorry for the last email, it was an accident. Recently, I tried to configure radosgw (0.94.7) with nginx as the frontend and benchmark it with cosbench. And I found a strange thing. My OS-related settings are: net.core.netdev_max_backlog = 1000 net.core.somaxconn = 1024 In rgw_main.cc, I see #define SOCKET_BACKLOG 1024. So here is the problem. 1. When I configure cosbench with 1000 workers, everything is OK. But when it is 2000 workers, there will be a lot of CLOSE_WAIT state connections on the rgw side, and rgw will no longer respond to any request. Using the netstat command, I can see there are 1024 CLOSE_WAIT connections. 2. I change net.core.somaxconn to 128 and run cosbench with 2000 workers again; very soon there is no response from radosgw, and using the netstat command I can see there are 128 CLOSE_WAIT connections. I tried a lot of things and googled it, but the problem still exists. I wonder if there is a bug in the fastcgi lib. Has anyone met this problem?
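The counts landing exactly at 1024 and then 128 — i.e. min(SOCKET_BACKLOG, net.core.somaxconn) — suggest the stuck sockets track the effective listen backlog. A small helper (an illustrative sketch, not part of any Ceph tooling) to tally socket states from /proc/net/tcp; CLOSE_WAIT (state code 08) means the peer already sent FIN but the local process never closed its end of the connection:

```python
# Map of the /proc/net/tcp `st` field to the states relevant here.
TCP_STATES = {"01": "ESTABLISHED", "06": "TIME_WAIT",
              "08": "CLOSE_WAIT", "0A": "LISTEN"}

def count_states(proc_net_tcp_text):
    """Tally sockets per TCP state from the text of /proc/net/tcp."""
    counts = {}
    for line in proc_net_tcp_text.splitlines()[1:]:   # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        state = TCP_STATES.get(fields[3], fields[3])  # field 3 is `st`
        counts[state] = counts.get(state, 0) + 1
    return counts

# e.g. count_states(open("/proc/net/tcp").read()) on the rgw host; in the
# failure above this should report exactly 1024 (or 128) CLOSE_WAIT entries.
```

If the CLOSE_WAIT count saturates at the backlog, the accepting side (the fastcgi library, in this theory) is accepting connections it then abandons without closing.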
[ceph-users] radosgw fastcgi problem
Hi: Recently, I tried to configure radosgw with nginx as frontend, benchmark it with cosbench. And I found a strange thing. My OS-related configuration is: net.core.netdev_max_backlog = 1000
[ceph-users] any nginx + rgw best practice ?
Hi: I have tried to use nginx + fastcgi + radosgw, and benchmarked it with cosbench. I tuned the nginx and CentOS configuration, but couldn't get the desired performance. I met the following problems. 1. I tried to tune the CentOS net.ipv4... parameters, and I got a lot of connections in the CLOSE_WAIT state on the radosgw side. 2. With the default CentOS configuration and cosbench configured with 800 workers, cosbench got a lot of timed-out requests, and it would get stuck on that workload. But there is no problem if I use Apache httpd. I am new to nginx. I wonder what the best practice is for nginx + rgw. What is the best configuration for the nginx and CentOS combination?
[ceph-users] Ceph-fuse single read limitation?
Hi Guys, We currently have a very small cluster with 3 OSDs but a 40Gb NIC. We use ceph-fuse as the cephfs client with readahead enabled, but a single-stream read of a large file from cephfs via fio, dd or cp can only achieve ~70 MB/s, even if fio or dd's block size is set to 1MB or 4MB. From the ceph client log, we found each read request's size (ll_read) is limited to 128KB, which should be the limitation of kernel fuse's FUSE_MAX_PAGES_PER_REQ = 32. We may try to increase this value to see the performance difference, but are there other options we can try to increase single-read performance? Thanks. Zhi Zhang (David)
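With only one 128 KB request outstanding at a time, throughput is simply request size over per-request round-trip latency — and that arithmetic reproduces the ~70 MB/s if each read takes roughly 1.8 ms. The latency figure below is an assumption for illustration, not a measurement from this thread:

```python
# Serial-read throughput bound: one outstanding request at a time, so
# throughput = request size / per-request round-trip latency.
def max_throughput_mb_s(request_kb, latency_ms):
    return (request_kb / 1024.0) / (latency_ms / 1000.0)

# Assumed ~1.8 ms per 128 KB round trip (network + OSD read):
print(round(max_throughput_mb_s(128, 1.8), 1))    # 69.4, near the observed 70
# Raising FUSE_MAX_PAGES_PER_REQ so requests are 1 MB lifts the bound:
print(round(max_throughput_mb_s(1024, 1.8), 1))   # 555.6 for the same latency
```

The same bound explains why a 40Gb NIC does not help: the pipe is never the constraint, the request-at-a-time round trip is.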
[ceph-users] Write performance issue under rocksdb kvstore
Hi Guys, I am trying the latest ceph-9.1.0 with rocksdb 4.1 and ceph-9.0.3 with rocksdb 3.11 as the OSD backend. I use rbd to test performance and the following is my cluster info. [ceph@xxx ~]$ ceph -s cluster b74f3944-d77f-4401-a531-fa5282995808 health HEALTH_OK monmap e1: 1 mons at {xxx=xxx.xxx.xxx.xxx:6789/0} election epoch 1, quorum 0 xxx osdmap e338: 44 osds: 44 up, 44 in flags sortbitwise pgmap v1476: 2048 pgs, 1 pools, 158 MB data, 59 objects 1940 MB used, 81930 GB / 81932 GB avail 2048 active+clean All the disks are spinning ones with the write cache turned on. Rocksdb's WAL and sst files are on the same disk as every OSD. Using fio to generate the following write load: fio -direct=1 -rw=randwrite -ioengine=sync -size=10M -bs=4K -group_reporting -directory /mnt/rbd_test/ -name xxx.1 -numjobs=1 Test result: WAL enabled + sync: false + disk write cache: on will get ~700 IOPS. WAL enabled + sync: true (default) + disk write cache: on|off will get only ~25 IOPS. I tuned some other rocksdb options, but with no luck. I tracked down the rocksdb code and found each writer's Sync operation takes ~30ms to finish. And as shown above, it is strange that performance shows little difference whether the disk write cache is on or off. Did you guys encounter a similar issue? Or am I missing something that causes rocksdb's poor write performance? Thanks. Zhi Zhang (David)
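A ~30 ms Sync on a spindle caps a synchronous WAL at roughly 1000/30 ≈ 33 committed writes per second, which lines up with the ~25 IOPS above. A quick stand-alone harness — a sketch that measures raw write+fsync cost in a directory, which is what rocksdb's sync: true pays per commit, not rocksdb itself:

```python
import os
import tempfile
import time

def fsync_latency_ms(path_dir, writes=50, size=4096):
    """Average latency of write+fsync pairs on the given directory (ms)."""
    fd, path = tempfile.mkstemp(dir=path_dir)
    try:
        buf = b"\0" * size
        start = time.monotonic()
        for _ in range(writes):
            os.write(fd, buf)
            os.fsync(fd)        # what a sync=true WAL commit does per write
        elapsed = time.monotonic() - start
        return elapsed * 1000.0 / writes
    finally:
        os.close(fd)
        os.unlink(path)
```

For example, fsync_latency_ms("/var/lib/ceph/osd/ceph-0") run on the WAL's disk: a ~30 ms result means at most ~33 synced commits/s, regardless of the drive's write cache, since fsync is expected to flush through it.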
Re: [ceph-users] Write performance issue under rocksdb kvstore
Haomai, you're right. I added such a sync option as configurable for our test purposes. Thanks. Zhi Zhang (David) > Date: Tue, 20 Oct 2015 21:24:49 +0800 > From: haomaiw...@gmail.com > To: zhangz.da...@outlook.com > CC: ceph-users@lists.ceph.com; ceph-de...@vger.kernel.org > Subject: Re: [ceph-users] Write performance issue under rocksdb kvstore > > Actually keyvaluestore would submit transactions with the sync flag > too (relying on the keyvaluedb impl journal/logfile). > > Yes, if we disable the sync flag, keyvaluestore's performance will > increase a lot. But we don't provide this option now. > > On Tue, Oct 20, 2015 at 9:22 PM, Z Zhang <zhangz.da...@outlook.com> wrote: > > Thanks, Sage, for pointing out the PR and ceph branch. I will take a closer > > look. Yes, I am trying the KVStore backend. The reason we are trying it is that > > a few users don't have such a high requirement against occasional data loss. It > > seems the KVStore backend without a synchronized WAL could achieve better > > performance than filestore. And only data still in the page cache would get lost > > on a machine crash, not a process crash, if we use the WAL but no > > synchronization. What do you think? Thanks. Zhi Zhang (David) Date: Tue, > > 20 Oct 2015 05:47:44 -0700 From: s...@newdream.net To: > > zhangz.da...@outlook.com CC: ceph-users@lists.ceph.com; > > ceph-de...@vger.kernel.org Subject: Re: [ceph-users] Write performance issue > > under rocksdb kvstore On Tue, 20 Oct 2015, Z Zhang wrote: > Hi Guys, > > I > > am trying latest ceph-9.1.0 with rocksdb 4.1 and ceph-9.0.3 with > rocksdb > > 3.11 as OSD backend. I use rbd to test performance and following > is my > > cluster info. > > [ceph@xxx ~]$ ceph -s > cluster > > b74f3944-d77f-4401-a531-fa5282995808 > health HEALTH_OK > monmap > > e1: 1 mons at {xxx=xxx.xxx.xxx.xxx:6789/0} > election epoch 1, > > quorum 0 xxx > osdmap e338: 44 osds: 44 up, 44 in > flags > > sortbitwise >
pgmap v1476: 2048 pgs, 1 pools, 158 MB data, 59 objects > > 1940 MB used, 81930 GB / 81932 GB avail > > 2048 > > active+clean > > All the disks are spinning ones with write cache turned > > on. Rocksdb's > WAL and sst files are on the same disk as every OSD. Are you > > using the KeyValueStore backend? > Using fio to generate following write > > load: > fio -direct=1 -rw=randwrite -ioengine=sync -size=10M -bs=4K > > -group_reporting -directory /mnt/rbd_test/ -name xxx.1 -numjobs=1 > > Test > > result: > WAL enabled + sync: false + disk write cache: on will get ~700 > > IOPS. > WAL enabled + sync: true (default) + disk write cache: on|off will > > get only ~25 IOPS. > > I tuned some other rocksdb options, but with no luck. > > The wip-newstore-frags branch sets some defaults for rocksdb that I think > > look pretty reasonable (at least given how newstore is using rocksdb). > I > > tracked down the rocksdb code and found each writer's Sync operation > would > > take ~30ms to finish. And as shown above, it is strange that > performance > > has not much difference no matter whether the disk write cache is on or > off. > > Do > > you guys encounter a similar issue? Or do I miss something to > cause > > rocksdb's poor write performance? Yes, I saw the same thing. This PR > > addresses the problem and is nearing merge upstream: > > https://github.com/facebook/rocksdb/pull/746 There is also an XFS > > performance bug that is contributing to the problem, but it looks like Dave > > Chinner just put together a fix for that. But... we likely won't be using > > KeyValueStore in its current form over rocksdb (or any other kv backend). It > > stripes object data over key/value pairs, which IMO is not the best > > approach.
sage ___ ceph-users > > mailing list ceph-users@lists.ceph.com > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > > ___ > > ceph-users mailing list > > ceph-users@lists.ceph.com > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > > > > -- > Best Regards, > > Wheat > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Write performance issue under rocksdb kvstore
Got your point. It is not only about the object data itself, but also ceph internal metadata. The best option seems to be your PR and the wip-newstore-frags branch. :-) Thanks. Zhi Zhang (David) > Date: Tue, 20 Oct 2015 06:25:43 -0700 > From: s...@newdream.net > To: zhangz.da...@outlook.com > CC: ceph-users@lists.ceph.com; ceph-de...@vger.kernel.org > Subject: Re: [ceph-users] Write performance issue under rocksdb kvstore > > On Tue, 20 Oct 2015, Z Zhang wrote: > > Thanks, Sage, for pointing out the PR and ceph branch. I will take a > > closer look. > > > > Yes, I am trying the KVStore backend. The reason we are trying it is that > > a few users don't have such a high requirement against occasional data loss. > > It seems the KVStore backend without a synchronized WAL could achieve better > > performance than filestore. And only data still in the page cache would get > > lost on a machine crash, not a process crash, if we use the WAL but no > > synchronization. What do you think? > > That sounds dangerous. The OSDs are recording internal metadata about the > cluster (peering, replication, etc.)... even if you don't care so much > about recent user data writes you probably don't want to risk breaking > RADOS itself. If the kv backend is giving you a stale point-in-time > consistent copy it's not so bad, but in a power-loss event it could give > you problems... > > sage > > > > > Thanks. Zhi Zhang (David) > > > > Date: Tue, 20 Oct 2015 05:47:44 -0700 > > From: s...@newdream.net > > To: zhangz.da...@outlook.com > > CC: ceph-users@lists.ceph.com; ceph-de...@vger.kernel.org > > Subject: Re: [ceph-users] Write performance issue under rocksdb kvstore > > > > On Tue, 20 Oct 2015, Z Zhang wrote: > > > Hi Guys, > > > > > > I am trying latest ceph-9.1.0 with rocksdb 4.1 and ceph-9.0.3 with > > > rocksdb 3.11 as OSD backend. I use rbd to test performance and following > > > is my cluster info.
> > > > > > [ceph@xxx ~]$ ceph -s > > > cluster b74f3944-d77f-4401-a531-fa5282995808 > > > health HEALTH_OK > > > monmap e1: 1 mons at {xxx=xxx.xxx.xxx.xxx:6789/0} > > > election epoch 1, quorum 0 xxx > > > osdmap e338: 44 osds: 44 up, 44 in > > > flags sortbitwise > > > pgmap v1476: 2048 pgs, 1 pools, 158 MB data, 59 objects > > > 1940 MB used, 81930 GB / 81932 GB avail > > > 2048 active+clean > > > > > > All the disks are spinning ones with write cache turning on. Rocksdb's > > > WAL and sst files are on the same disk as every OSD. > > > > Are you using the KeyValueStore backend? > > > > > Using fio to generate following write load: > > > fio -direct=1 -rw=randwrite -ioengine=sync -size=10M -bs=4K > > > -group_reporting -directory /mnt/rbd_test/ -name xxx.1 -numjobs=1 > > > > > > Test result: > > > WAL enabled + sync: false + disk write cache: on will get ~700 IOPS. > > > WAL enabled + sync: true (default) + disk write cache: on|off will get > > > only ~25 IOPS. > > > > > > I tuned some other rocksdb options, but with no lock. > > > > The wip-newstore-frags branch sets some defaults for rocksdb that I think > > look pretty reasonable (at least given how newstore is using rocksdb). > > > > > I tracked down the rocksdb code and found each writer's Sync operation > > > would take ~30ms to finish. And as shown above, it is strange that > > > performance has no much difference no matters disk write cache is on or > > > off. > > > > > > Do your guys encounter the similar issue? Or do I miss something to > > > cause rocksdb's poor write performance? > > > > Yes, I saw the same thing. This PR addresses the problem and is nearing > > merge upstream: > > > > https://github.com/facebook/rocksdb/pull/746 > > > > There is also an XFS performance bug that is contributing to the problem, > > but it looks like Dave Chinner just put together a fix for that. > > > > But... we likely won't be using KeyValueStore in its current form over > > rocksdb (or any other kv backend). 
It stripes object data over key/value > > pairs, which IMO is not the best approach. > > > > sage > > > > ___ > > ceph-users mailing list > > ceph-users@lists.ceph.com > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Write performance issue under rocksdb kvstore
Thanks, Sage, for pointing out the PR and ceph branch. I will take a closer look. Yes, I am trying the KVStore backend. The reason we are trying it is that a few users don't have such a high requirement against occasional data loss. It seems the KVStore backend without a synchronized WAL could achieve better performance than filestore. And only data still in the page cache would get lost on a machine crash, not a process crash, if we use the WAL but no synchronization. What do you think? Thanks. Zhi Zhang (David) Date: Tue, 20 Oct 2015 05:47:44 -0700 From: s...@newdream.net To: zhangz.da...@outlook.com CC: ceph-users@lists.ceph.com; ceph-de...@vger.kernel.org Subject: Re: [ceph-users] Write performance issue under rocksdb kvstore On Tue, 20 Oct 2015, Z Zhang wrote: > Hi Guys, > > I am trying latest ceph-9.1.0 with rocksdb 4.1 and ceph-9.0.3 with > rocksdb 3.11 as OSD backend. I use rbd to test performance and following > is my cluster info. > > [ceph@xxx ~]$ ceph -s > cluster b74f3944-d77f-4401-a531-fa5282995808 > health HEALTH_OK > monmap e1: 1 mons at {xxx=xxx.xxx.xxx.xxx:6789/0} > election epoch 1, quorum 0 xxx > osdmap e338: 44 osds: 44 up, 44 in > flags sortbitwise > pgmap v1476: 2048 pgs, 1 pools, 158 MB data, 59 objects > 1940 MB used, 81930 GB / 81932 GB avail > 2048 active+clean > > All the disks are spinning ones with write cache turned on. Rocksdb's > WAL and sst files are on the same disk as every OSD. Are you using the KeyValueStore backend? > Using fio to generate following write load: > fio -direct=1 -rw=randwrite -ioengine=sync -size=10M -bs=4K -group_reporting -directory /mnt/rbd_test/ -name xxx.1 -numjobs=1 > > Test result: > WAL enabled + sync: false + disk write cache: on will get ~700 IOPS. > WAL enabled + sync: true (default) + disk write cache: on|off will get only ~25 IOPS. > > I tuned some other rocksdb options, but with no luck.
The wip-newstore-frags branch sets some defaults for rocksdb that I think look pretty reasonable (at least given how newstore is using rocksdb). > I tracked down the rocksdb code and found each writer's Sync operation > would take ~30ms to finish. And as shown above, it is strange that > performance has not much difference no matter whether the disk write cache is on or > off. > > Do you guys encounter a similar issue? Or do I miss something to > cause rocksdb's poor write performance? Yes, I saw the same thing. This PR addresses the problem and is nearing merge upstream: https://github.com/facebook/rocksdb/pull/746 There is also an XFS performance bug that is contributing to the problem, but it looks like Dave Chinner just put together a fix for that. But... we likely won't be using KeyValueStore in its current form over rocksdb (or any other kv backend). It stripes object data over key/value pairs, which IMO is not the best approach. sage
[ceph-users] FW: Long tail latency due to journal aio io_submit takes long time to return
FW to ceph-users. Thanks. Zhi Zhang (David) From: zhangz.da...@outlook.com To: ceph-de...@vger.kernel.org Subject: Long tail latency due to journal aio io_submit takes long time to return Date: Tue, 25 Aug 2015 18:46:34 +0800 Hi Ceph-devel, Recently we found some (though not many) long-tail latencies caused by the journal's aio io_submit taking a long time to return. We added some logs in FileJournal.cc and got the following output as an example. AIO's io_submit function takes ~400ms to return, occasionally even up to 900ms. 2015-08-25 15:24:39.535206 7f22699c2700 20 journal write_thread_entry woke up 2015-08-25 15:24:39.535212 7f22699c2700 10 journal room 10460594175 max_size 10484711424 pos 4786679808 header.start 4762566656 top 4096 2015-08-25 15:24:39.535215 7f22699c2700 10 journal check_for_full at 4786679808 : 532480 10460594175 2015-08-25 15:24:39.535216 7f22699c2700 15 journal prepare_single_write 1 will write 4786679808 : seq 13075224 len 526537 - 532480 (head 40 pre_pad 4046 ebl 526537 post_pad 1817 tail 40) (ebl alignment 4086) 2015-08-25 15:24:39.535322 7f22699c2700 20 journal prepare_multi_write queue_pos now 4787212288 2015-08-25 15:24:39.535324 7f22699c2700 15 journal do_aio_write writing 4786679808~532480 2015-08-25 15:24:39.535434 7f22699c2700 10 journal align_bl total memcopy: 532480 2015-08-25 15:24:39.535439 7f22699c2700 20 journal write_aio_bl 4786679808~532480 seq 13075224 2015-08-25 15:24:39.535442 7f22699c2700 20 journal write_aio_bl .. 4786679808~532480 in 1 2015-08-25 15:24:39.535444 7f22699c2700 5 journal io_submit starting... 2015-08-25 15:24:39.990372 7f22699c2700 5 journal io_submit return value: 1 2015-08-25 15:24:39.990396 7f22699c2700 5 journal put_throttle finished 1 ops and 526537 bytes, now 12 ops and 6318444 bytes 2015-08-25 15:24:39.990401 7f22691c1700 20 journal write_finish_thread_entry waiting for aio(s) 2015-08-25 15:24:39.990406 7f22699c2700 20 journal write_thread_entry aio throttle: aio num 1 bytes 532480 ... exp 2 min_new 4 ...
pending 6318444 2015-08-25 15:24:39.990410 7f22699c2700 10 journal room 10460061695 max_size 10484711424 pos 4787212288 header.start 4762566656 top 4096 2015-08-25 15:24:39.990413 7f22699c2700 10 journal check_for_full at 4787212288 : 532480 10460061695 2015-08-25 15:24:39.990415 7f22699c2700 15 journal prepare_single_write 1 will write 4787212288 : seq 13075225 len 526537 - 532480 (head 40 pre_pad 4046 ebl 526537 post_pad 1817 tail 40) (ebl alignment 4086) And write_aio_bl will hold aio_lock until it returns, so write_finish_thread_entry can't process completed aios while io_submit is blocking. We reduced aio_lock's locking scope in write_aio_bl, specifically not holding it across io_submit. It doesn't seem necessary to hold the lock across io_submit in the journal, because there is only one write_thread_entry and one write_finish_thread_entry. This might help completed aios get processed as soon as possible. But the one blocked in io_submit and the ones in the queue still take more time. Any idea about this? Thanks, Zhi Zhang (David)
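The narrowed-lock idea reads like this in toy threading terms — a sketch, not the FileJournal code: bookkeeping stays under aio_lock, but the blocking io_submit (a sleep stand-in here) runs outside it, so the finish thread is never stalled behind a 400 ms submit:

```python
import collections
import threading
import time

aio_lock = threading.Lock()
completed = collections.deque()        # completed aios, guarded by aio_lock
processed = []

def slow_io_submit():
    time.sleep(0.2)                    # stands in for a blocking io_submit()

def write_thread():
    with aio_lock:
        pass                           # pending-aio queue bookkeeping here
    slow_io_submit()                   # lock released: finisher keeps running
    with aio_lock:
        completed.append("aio-1")      # record the completion

def write_finish_thread(stop):
    while not stop.is_set():
        with aio_lock:                 # never waits out the submit duration
            while completed:
                processed.append(completed.popleft())
        time.sleep(0.01)

stop = threading.Event()
finisher = threading.Thread(target=write_finish_thread, args=(stop,))
writer = threading.Thread(target=write_thread)
finisher.start()
writer.start()
writer.join()
time.sleep(0.05)
stop.set()
finisher.join()
with aio_lock:                         # drain anything left, just in case
    while completed:
        processed.append(completed.popleft())
print(processed)                       # ['aio-1']
```

This is safe with exactly one submitter and one finisher, as the mail argues; with multiple submitters, calling io_submit outside the lock could reorder submissions against the queue bookkeeping, which is the trade-off to check.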
Re: [ceph-users] which kernel version can help avoid kernel client deadlock
Hi Ilya, We just tried the 3.10.83 kernel with more rbd fixes back-ported from higher kernel versions. This time we again tried to run rbd and 3 OSD daemons on the same node, but rbd IO will still hang and the OSD filestore thread will time out and suicide when memory becomes very low under high load. When this happened, enabling the rbd log could even make the system unresponsive, so we haven't collected logs yet. Only the osd log says the filestore thread was doing filestore::_write before the time-out; we will look into it further. I know this is not an appropriate usage of Ceph and rbd, but I still want to ask if there is a way or workaround to do this? Is there any successful case in the community? Thanks. David Zhang From: zhangz.da...@outlook.com To: idryo...@gmail.com Date: Fri, 31 Jul 2015 09:21:40 +0800 CC: ceph-users@lists.ceph.com; chaofa...@owtware.com Subject: Re: [ceph-users] which kernel version can help avoid kernel client deadlock Date: Thu, 30 Jul 2015 13:11:11 +0300 Subject: Re: [ceph-users] which kernel version can help avoid kernel client deadlock From: idryo...@gmail.com To: zhangz.da...@outlook.com CC: chaofa...@owtware.com; ceph-users@lists.ceph.com On Thu, Jul 30, 2015 at 12:46 PM, Z Zhang zhangz.da...@outlook.com wrote: Date: Thu, 30 Jul 2015 11:37:37 +0300 Subject: Re: [ceph-users] which kernel version can help avoid kernel client deadlock From: idryo...@gmail.com To: zhangz.da...@outlook.com CC: chaofa...@owtware.com; ceph-users@lists.ceph.com On Thu, Jul 30, 2015 at 10:29 AM, Z Zhang zhangz.da...@outlook.com wrote: Subject: Re: [ceph-users] which kernel version can help avoid kernel client deadlock From: chaofa...@owtware.com Date: Thu, 30 Jul 2015 13:16:16 +0800 CC: idryo...@gmail.com; ceph-users@lists.ceph.com To: zhangz.da...@outlook.com On Jul 30, 2015, at 12:48 PM, Z Zhang zhangz.da...@outlook.com wrote: We also hit a similar issue from time to time on centos with a 3.10.x kernel.
By iostat, we can see the kernel rbd client's util is 100% but no r/w IO, and we can't umount/unmap this rbd client. After restarting the OSDs, it becomes normal.

3.10.x is rather vague, what is the exact version you saw this on? Can you provide syslog logs (I'm interested in dmesg)?

The kernel version should be 3.10.0. I don't have syslogs at hand. It is not easily reproduced, and it happened in a very low memory situation. We are running DB instances over rbd as storage. DB instances use a lot of memory when running highly concurrent reads/writes, and after running for a long time rbd might hit this problem, but not always. Enabling the rbd log makes our system behave strangely during our tests. I back-ported one of your fixes:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drivers/block/rbd.c?id=5a60e87603c4c533492c515b7f62578189b03c9c

So far the test looks fine after a few days, but it is still under observation. So I want to know if there are some other fixes?

I'd suggest following the 3.10 stable series (currently at 3.10.84). The fix you backported is crucial in low memory situations, so I wouldn't be surprised if it alone fixed your problem. (It is not in 3.10.84; I assume it'll show up in 3.10.85 - for now just apply your backport.)

Cool, looking forward to 3.10.85 to see what else it brings in. Thanks.

Thanks,
Ilya

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] which kernel version can help avoid kernel client deadlock
We also hit a similar issue from time to time on CentOS with the 3.10.x kernel. By iostat, we can see the kernel rbd client's util is 100% but no r/w IO, and we can't umount/unmap this rbd client. After restarting the OSDs, it becomes normal.

@Ilya, could you please point us to the possible fixes in 3.18.19 for this issue? Then we can try to back-port them to our old kernel, because we can't jump to a major kernel version. Thanks.

David Zhang

From: chaofa...@owtware.com
Date: Thu, 30 Jul 2015 10:30:12 +0800
To: idryo...@gmail.com
CC: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] which kernel version can help avoid kernel client deadlock

On Jul 29, 2015, at 12:40 AM, Ilya Dryomov <idryo...@gmail.com> wrote:

On Tue, Jul 28, 2015 at 7:20 PM, van <chaofa...@owtware.com> wrote:

On Jul 28, 2015, at 7:57 PM, Ilya Dryomov <idryo...@gmail.com> wrote:

On Tue, Jul 28, 2015 at 2:46 PM, van <chaofa...@owtware.com> wrote:

Hi, Ilya,

In the dmesg there are also a lot of libceph socket errors, which I think may be caused by my stopping the ceph service without unmapping rbd.

Well, sure enough, if you kill all OSDs, the filesystem mounted on top of the rbd device will get stuck.

Sure, it will get stuck if OSDs are stopped. And since rados requests have a retry policy, the stuck requests recover after I start the daemons again. But in my case the OSDs are running in a normal state and the librbd API can read/write normally, while a heavy fio test against the filesystem mounted on top of the rbd device gets stuck. I wonder if this is triggered by running the rbd kernel client on machines that also host ceph daemons, i.e. the annoying loopback mount deadlock issue.

In my opinion, if it's due to the loopback mount deadlock, the OSDs would become unresponsive no matter whether the requests come from user space (like the API) or from the kernel client. Am I right?

Not necessarily.

If so, my case seems to be triggered by another bug. Anyway, it seems that I should at least separate clients and daemons.

Try 3.18.19 if you can.
I'd be interested in your results.

It's strange: after I dropped the page cache and restarted my OSDs, the same heavy IO tests on the rbd folder now work fine. The deadlock seems not that easy to trigger; maybe I need longer tests. I'll try 3.18.19 LTS, thanks.

Thanks,
Ilya

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
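When a mapped rbd device wedges like this (100% util in iostat but no completions), the kernel client's in-flight OSD requests can be inspected through debugfs before restarting anything. A minimal sketch, assuming debugfs is mounted at the usual location and the script runs as root:

```shell
#!/bin/sh
# List in-flight OSD requests for every kernel ceph/rbd client on this host.
# A non-empty osdc file while iostat shows 100% util and no r/w completions
# points at stuck requests rather than a genuinely busy device.
dbg=/sys/kernel/debug/ceph
if [ -d "$dbg" ]; then
    for c in "$dbg"/*/; do
        [ -e "${c}osdc" ] || continue
        echo "== client ${c} =="
        cat "${c}osdc"      # one line per outstanding OSD request
    done
else
    echo "no kernel ceph clients found (or debugfs not mounted)"
fi
```

The osdc file is per-client, so on a host mapping several images it shows which cluster session is stuck.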
[ceph-users] Timeout mechanism in ceph client tick
Hi Guys,

Reading through the ceph client code, there is a timeout mechanism in tick() that is used while mounting. Recently, during massive tests against CephFS, we saw some client requests to the MDS take a long time to get a reply. If we want a CephFS user to see a timeout instead of waiting indefinitely for the reply, can we make the following change to leverage that timeout mechanism? Is there any other concern that explains why the client request timeout is not enabled in tick() after the mount succeeds? If this works, we can either reuse client_mount_timeout or introduce another parameter for the normal client request timeout value.

diff --git a/src/client/Client.cc b/src/client/Client.cc
index 63a9faa..91a8cc9 100644
--- a/src/client/Client.cc
+++ b/src/client/Client.cc
@@ -4850,7 +4850,7 @@ void Client::tick()
   utime_t now = ceph_clock_now(cct);
-  if (!mounted && !mds_requests.empty()) {
+  if (!mds_requests.empty()) {
     MetaRequest *req = mds_requests.begin()->second;
     if (req->op_stamp + cct->_conf->client_mount_timeout < now) {
       req->aborted = true;

Thanks.
David Zhang

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
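For context, the knob the patch reuses is an ordinary ceph.conf option; a sketch of how it would be set (the value shown is believed to be the shipped default, not a tuning recommendation):

```ini
[client]
# Today this bounds only the mount itself; the patch above would make it
# also bound every in-flight MDS request after mount. Value in seconds.
client_mount_timeout = 300
```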
Re: [ceph-users] krbd splitting large IO's into smaller IO's
Hi Ilya,

Thanks for your explanation. This makes sense. Will you make max_segments configurable? Could you please point me at the fix you have made? We might help to test it. Thanks.

David Zhang

Date: Fri, 26 Jun 2015 18:21:55 +0300
Subject: Re: [ceph-users] krbd splitting large IO's into smaller IO's
From: idryo...@gmail.com
To: zhangz.da...@outlook.com
CC: ceph-users@lists.ceph.com

On Fri, Jun 26, 2015 at 3:17 PM, Z Zhang <zhangz.da...@outlook.com> wrote:

Hi Ilya,

I saw your recent email about krbd splitting large IO's into smaller IO's, see the link below.

https://www.mail-archive.com/ceph-users@lists.ceph.com/msg20587.html

I just tried it on my ceph cluster using kernel 3.10.0-1. I adjusted both max_sectors_kb and max_hw_sectors_kb of the rbd device to 4096.

fio with 4M block size, read:

Device:  rrqm/s  wrqm/s    r/s    w/s  rMB/s  wMB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
rbd3      81.00    0.00 135.00   0.00 108.00   0.00  1638.40     2.72  20.15   20.15    0.00   7.41 100.00

fio with 1M or 2M block size, read:

rbd3       0.00    0.00 213.00   0.00 106.50   0.00  1024.00     2.56  12.02   12.02    0.00   4.69 100.00

fio with 4M block size, write:

rbd3       0.00   40.00   0.00  40.00   0.00  40.00  2048.00     2.87  70.90    0.00   70.90  24.90  99.60

fio with 1M or 2M block size, write:

rbd3       0.00    0.00   0.00  80.00   0.00  40.00  1024.00     3.55  48.20    0.00   48.20  12.50 100.00

So why is the IO size (avgrq-sz) here far less than the equivalent of 4096 KB? (With the default value of 512, all the IO sizes are 1024 sectors.) Are there other parameters to adjust, or is it about this kernel version?

It's about this kernel version.
Assuming you are doing direct I/Os with fio, setting max_sectors_kb to 4096 is really the only thing you can do, and that's enough to *sometimes* see 8192-sector (i.e. 4M) I/Os. The problem is the max_segments value, which in 3.10 is 128 and which you cannot adjust via sysfs. It all comes down to the memory allocator: to get a 4M I/O, the total number of segments (physically contiguous chunks of memory) in the 8 bios (8*512k = 4M) that need to be merged has to be <= 128. When you are allocated such nice and contiguous bios, you get 4M I/Os; in other cases you don't.

This will be fixed in 4.2, along with a bunch of other things. This particular max_segments fix is a one-liner, so we will probably backport it to older kernels, including 3.10.

Thanks,
Ilya

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
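The tunables discussed above live under the device's queue directory in sysfs; a minimal sketch of checking and raising them (the device name rbd3 is taken from the iostat output in this thread and is an assumption about your mapping):

```shell
#!/bin/sh
# Show and raise the per-request size cap for an rbd block device.
dev=rbd3
q=/sys/block/$dev/queue
if [ -d "$q" ]; then
    echo "max_sectors_kb:    $(cat "$q/max_sectors_kb")"
    echo "max_hw_sectors_kb: $(cat "$q/max_hw_sectors_kb")"
    echo "max_segments:      $(cat "$q/max_segments")"  # read-only on 3.10
    # 4096 KB = 8192 sectors = one full 4M rbd object per I/O (needs root)
    echo 4096 > "$q/max_sectors_kb"
else
    echo "$dev is not mapped on this host"
fi
```

Even with max_sectors_kb raised, the max_segments=128 ceiling described above still decides whether the 512k bios actually merge into a single 4M request.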
[ceph-users] CephFS client issue
On Monday, June 15, 2015 3:05 AM, ceph-users-requ...@lists.ceph.com wrote:

Message: 1
Date: Sat, 13 Jun 2015 21:08:25 +0200
From: Paweł Sadowski <c...@sadziu.pl>
To: Gregory Farnum <g...@gregs42.com>
Cc: ceph-users <ceph-us...@ceph.com>
Subject: Re: [ceph-users] Erasure coded pools and bit-rot protection

Thanks for taking care of this so fast. Yes, I'm getting a broken object. I haven't checked this on other versions, but is this bug present only in Hammer or in all versions?

On 12.06.2015 at 21:43, Gregory Farnum wrote:

Okay, Sam thinks he knows what's going on; here's a ticket: http://tracker.ceph.com/issues/12000

On Fri, Jun 12, 2015 at 12:32 PM, Gregory Farnum <g...@gregs42.com> wrote:

On Fri, Jun 12, 2015 at 1:07 AM, Paweł Sadowski <c...@sadziu.pl> wrote:

Hi All,

I'm testing erasure coded pools. Is there any protection from bit-rot errors on object read? If I modify one bit in an object part (directly on the OSD) I'm getting a *broken* object:

Sorry, are you saying that you're getting a broken object if you flip a bit in an EC pool?
That should detect the chunk as invalid and reconstruct on read...
-Greg

mon-01:~ # rados --pool ecpool get `hostname -f`_16 - | md5sum
bb2d82bbb95be6b9a039d135cc7a5d0d  -

# modify one bit directly on the OSD

mon-01:~ # rados --pool ecpool get `hostname -f`_16 - | md5sum
02f04f590010b4b0e6af4741c4097b4f  -

# restore the bit to its original value

mon-01:~ # rados --pool ecpool get `hostname -f`_16 - | md5sum
bb2d82bbb95be6b9a039d135cc7a5d0d  -

If I run deep-scrub on the modified bit I'm getting an inconsistent PG, which is correct in this case. After restoring the bit and running deep-scrub again, all PGs are clean.

[ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)]

--
PS

Message: 2
Date: Sun, 14 Jun 2015 15:26:54 +
From: Matteo Dacrema <mdacr...@enter.it>
To: ceph-users <ceph-us...@ceph.com>
Subject: [ceph-users] CephFS client issue

Hi all,

I'm using CephFS on Hammer, and sometimes I need to reboot one or more clients because, as ceph -s tells me, they are failing to respond to capability release. After that, all clients stop responding: they can't access files or mount/umount cephfs.

I have 1.5 million files, 2 metadata servers in active/standby configuration with 8 GB of RAM, 20 clients with 2 GB of RAM each, and 2 OSD nodes with 4 80GB OSDs and 4 GB of RAM each.
Here is my configuration:

[global]
fsid = 2de7b17f-0a3e-4109-b878-c035dd2f7735
mon_initial_members = cephmds01
mon_host = 10.29.81.161
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
public network = 10.29.81.0/24
tcp nodelay = true
tcp rcvbuf = 0
ms tcp read timeout = 600

# Capacity
mon osd full ratio = .95
mon osd nearfull ratio = .85

[osd]
osd journal size = 1024
journal dio = true
journal aio = true
osd op threads = 2
osd op thread timeout = 60
osd disk threads = 2
osd recovery threads = 1
osd recovery max active = 1
osd max backfills = 2

# Pool
osd pool default size = 2

# XFS
osd mkfs type = xfs
osd mkfs options xfs = -f -i size=2048
osd mount options xfs = rw,noatime,inode64,logbsize=256k,delaylog

# FileStore settings
filestore xattr use omap = false
filestore max inline xattr size = 512
filestore max sync interval = 10
filestore merge threshold = 40
filestore split multiple = 8
filestore flusher = false
filestore queue max ops = 2000
filestore queue max bytes = 536870912
filestore queue committing max ops = 500
filestore queue committing max bytes =
Re: [ceph-users] rbd format v2 support
Hi Ilya,

Thanks for the reply. I knew that a v2 image can be mapped when using default striping parameters, i.e. without --stripe-unit or --stripe-count. It is just that the rbd performance (IOPS/bandwidth) we tested hasn't met our goal. At this point the OSDs don't seem to be the bottleneck, so we want to try fancy striping. Do you know if there is an approximate ETA for this feature? Or it would be great if you could share some info on tuning rbd performance. Anything will be appreciated. Thanks.

Zhi (David)

On Sunday, June 7, 2015 3:50 PM, Ilya Dryomov <idryo...@gmail.com> wrote:

On Fri, Jun 5, 2015 at 6:47 AM, David Z <david.z1...@yahoo.com> wrote:

Hi Ceph folks,

We want to use rbd format v2, but find it is not supported on kernel 3.10.0 of CentOS 7:

[ceph@ ~]$ sudo rbd map zhi_rbd_test_1
rbd: sysfs write failed
rbd: map failed: (22) Invalid argument
[ceph@ ~]$ dmesg | tail
[662453.664746] rbd: image zhi_rbd_test_1: unsupported stripe unit (got 8192 want 4194304)

As described in the ceph doc, it should be available from kernel 3.11. But I checked the code of kernels 3.12, 3.14 and even 4.1, and this piece of code is still there, see the links below. Am I missing some code or info?

What you are referring to is called fancy striping, and it is unsupported (work is underway, but it's been slow going). However, because v2 images with *default* striping parameters are, disk-format-wise, the same as v1 images, you can map a v2 image provided you didn't specify a custom --stripe-unit or --stripe-count on rbd create.

Thanks,
Ilya

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
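Put together, the workable path on these kernels is a format 2 image with default striping. A hedged sketch (pool and image names are placeholders, and it assumes a reachable cluster with the rbd CLI of that era installed):

```shell
#!/bin/sh
# Create and map a format 2 image the pre-4.x kernel client can handle:
# omit --stripe-unit/--stripe-count so default striping is used.
command -v rbd >/dev/null 2>&1 || { echo "rbd CLI not installed"; exit 0; }
pool=rbd
img=v2test
rbd create "$pool/$img" --size 1024 --image-format 2
sudo rbd map "$pool/$img"
# By contrast, custom striping would be refused by the kernel client, e.g.:
#   rbd create "$pool/fancy" --size 1024 --image-format 2 \
#       --stripe-unit 8192 --stripe-count 16
#   sudo rbd map "$pool/fancy"   # fails: unsupported stripe unit
```

The commented-out variant reproduces exactly the dmesg error quoted above.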
[ceph-users] rbd format v2 support
Hi Ceph folks,

We want to use rbd format v2, but find it is not supported on kernel 3.10.0 of CentOS 7:

[ceph@ ~]$ sudo rbd map zhi_rbd_test_1
rbd: sysfs write failed
rbd: map failed: (22) Invalid argument
[ceph@ ~]$ dmesg | tail
[662453.664746] rbd: image zhi_rbd_test_1: unsupported stripe unit (got 8192 want 4194304)

As described in the ceph doc, it should be available from kernel 3.11. But I checked the code of kernels 3.12, 3.14 and even 4.1, and this piece of code is still there, see the links below. Am I missing some code or info?

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/drivers/block/rbd.c?id=refs/tags/v4.1-rc6#n4352
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/drivers/block/rbd.c?id=refs/tags/v4.1-rc6#n4359

I also found an email where Sage mentioned that a commit 764684ef34af685cd8d46830a73826443f9129df should resolve such a problem, but I didn't find the commit's details. Does anyone know about it?

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-June/031897.html

Thanks a lot!

Regards,
Zhi (David)

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] The strategy of auto-restarting crashed OSD
Hi Guys,

We have been experiencing some OSD crashes recently: messenger crashes, some strange crashes (still being investigated), etc. These crashes seem not to reproduce after restarting the OSD. So we are considering a strategy of auto-restarting a crashed OSD 1 or 2 times, then leaving it down if restarting doesn't work. This strategy might reduce the impact of PG peering and recovery on online traffic to some extent, since we won't mark an OSD out automatically even if it is down, unless we are sure it is a disk failure. However, we are also aware that this strategy may bring some problems. Since you guys have more experience with Ceph, we would like to hear some suggestions from you. Thanks.

David Zhang

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
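On systemd-managed hosts, a "restart once or twice, then leave it down" policy can be expressed declaratively instead of with a watchdog script. A sketch of a drop-in, assuming the stock ceph-osd@.service unit (the limits shown are illustrative, and directive names vary slightly across systemd versions):

```ini
# /etc/systemd/system/ceph-osd@.service.d/restart.conf (hypothetical drop-in)
[Service]
Restart=on-failure
RestartSec=10
# At most 2 restarts within 10 minutes; after that systemd gives up and
# the OSD stays down until an operator investigates. On older systemd
# versions these appear as StartLimitInterval/StartLimitBurst.
StartLimitIntervalSec=600
StartLimitBurst=2
```

This keeps the "don't mark out automatically" decision with the monitor settings, while the restart budget is handled per-daemon by the init system.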
Re: [ceph-users] CephFS - limit space available to individual users.
Thanks Wido,

I have to admit it's slightly disappointing (but completely understandable), since it basically means it's not safe for us to use CephFS :(

Without user quotas, it would be sufficient to have multiple CephFS filesystems and to be able to set the size of each one. Is it part of the core design that there can only be one filesystem in the cluster? This seems like a 'single point of failure'.

The talks about quotas were indeed about user quotas, but nothing about enforcing them. The first step is to do accounting, and maybe in a later stage soft and hard enforcement can be added. I don't think it's on the roadmap currently.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] CephFS - limit space available to individual users.
I have been testing CephFS on our computational cluster of about 30 computers. I've got 4 machines, 4 disks, 4 OSDs, 4 mons and 1 MDS at the moment for testing. The testing has been going very well, apart from one problem that needs to be resolved before we can use Ceph in place of our existing 'system' of NFS exports.

Our users run simulations that are easily capable of writing out data at a rate limited only by the storage device, and these jobs often run for days or weeks unattended. This unfortunately means that with CephFS, if a user doesn't set up their simulation carefully enough, or if their code has some bug, they can fill the entire filesystem (shared by around 10 other users) in about a day, leaving no room for any other users and potentially crashing the entire cluster.

I've read the FAQ entry about quotas but I'm not sure what to make of it. Is it correct that you can only have one CephFS per cluster? I guess I was imagining creating a separate filesystem of known size for each user.

Any help would be greatly appreciated, and since this is my first message, thanks to Sage and everyone else involved for creating this excellent project!

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com