Re: [ceph-users] libvirt + rbd questions
Hi Dajka:

Yes, you are right. Thanks for your reply. I did the same thing on a different machine and it worked. I compared the packages but found nothing different yet. There must be something I am still missing.

On Fri, Aug 25, 2017 at 3:25 PM, Dajka Tamás <vi...@vipernet.hu> wrote:
> Hi,
>
> is qemu-img working for you on the VM host machine?
>
> Did you create the vol? What does 'rbd ls' say?
>
> Did you feed the secret (and the key created with ceph auth) to virsh?
>
> Cheers,
>
> Tom
>
> p.s.: you may need another package for qemu rbd support - I did so on
> latest Debian (stretch)
>
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Z Will
> Sent: Friday, August 25, 2017 8:49 AM
> To: Ceph-User <ceph-us...@ceph.com>
> Subject: [ceph-users] libvirt + rbd questions
>
> Hi all:
>
> I have tried to install a VM using rbd as its disk, following the steps in
> the ceph docs, but ran into some problems. The package environment is:
>
> CentOS Linux release 7.2.1511 (Core)
> libvirt-2.0.0-10.el7_3.9.x86_64
> libvirt-python-2.0.0-2.el7.x86_64
> virt-manager-common-1.4.0-2.el7.noarch
> qemu-2.0.0-1.el7.6.x86_64
>
> The procedure is: create a pool
>
> [libvirt pool XML stripped by the list archive; surviving values: name
> rbd2, capacity 48283717632 bytes, available 46990565376 bytes, source
> type rbd, cephx username 'admin']
>
> create a vol t2, and run the following command:
>
> virt-install -n puppy2 --vcpu 1 -c /etc/ceph/tahr64-6.0.5.iso --disk
> vol=rbd2/t2 --memory 1024 --vnc --vnclisten=0.0.0.0 -v
>
> It failed; the error is:
>
> WARNING  No operating system detected, VM performance may suffer.
>          Specify an OS with --os-variant for optimal results.
> WARNING  Unable to connect to graphical console: virt-viewer not installed.
>          Please install the 'virt-viewer' package.
> WARNING  No console to launch for the guest, defaulting to --wait -1
>
> Starting install...
> ERROR    internal error: qemu unexpectedly closed the monitor:
> (process:663223): GLib-WARNING **: gmem.c:482: custom memory allocation
> vtable not supported
> 2017-08-25T06:33:59.392000Z qemu-system-x86_64: -drive
> file=rbd:rbd/t2:auth_supported=none:mon_host=10.202.127.11\:6789,format=raw,if=none,id=drive-ide0-0-0:
> could not open disk image
> rbd:rbd/t2:auth_supported=none:mon_host=10.202.127.11\:6789: Unknown protocol
>
> Domain installation does not appear to have been successful.
> If it was, you can restart your domain by running:
>   virsh --connect qemu:///system start puppy2
> otherwise, please restart your installation.
>
> After googling, I changed /usr/share/virt-manager/virtinst/guest.py like this:
>
>     def _build_xml(self):
>         install_xml = self._get_install_xml(install=True)
>         final_xml = self._get_install_xml(install=False)
>
>         auth_secret = '''
>             [auth/secret XML stripped by the list archive; it contained
>             uuid='f75248a0-ac7e-4c0e-a90a-2d8d00833132']
>         '''
>         import re
>         # [regex pattern partly stripped by the list archive]
>         rgx_auth = re.compile('(?<= )([^>]*?">).*?(?= *?)', re.S)
>
>         install_xml = rgx_auth.sub('\\1' + auth_secret, install_xml)
>         final_xml = rgx_auth.sub('\\1' + auth_secret, final_xml)
>         logging.debug("Generated install XML: %s",
>                       (install_xml and ("\n" + install_xml) or "None required"))
>         logging.debug("Generated boot XML: \n%s", final_xml)
>
>         return install_xml, final_xml
>
> and ran:
>
> virt-install -n puppy2 --vcpu 1 -c /etc/ceph/tahr64-6.0.5.iso --disk
> vol=rbd2/t2 --memory 1024 --vnc --vnclisten=0.0.0.0 -v --print-xml
>
> [generated domain XML stripped by the list archive; it contained the
> secret element with uuid='f75248a0-ac7e-4c0e-a90a-2d8d00833132']
>
> which looks OK. So I ran again:
>
> virt-install -n puppy2 --vcpu 1 -c /etc/ceph/tahr64-6.0.5.iso --disk
> vol=rbd2/t2 --memory 1024 --vnc --vnclisten=0.0.0.0 -v
>
> and still got an error:
>
> WARNING  No operating system detected, VM performance may suffer.
>          Specify an OS with --os-variant for optimal results.
> WARNING  Unable to connect to graphical console: virt-viewer not installed.
>          Please install the 'virt-viewer' package.
> WARNING  No console to launch for the guest, defaulting to --wait -1
>
> Starting install...
> ERROR    internal error: qemu unexpectedly closed the monitor:
> (process:663621): GLib-WARNING **: gmem.c:482: custom memory allocation
> vtable not supported
> 2017-08-25T06:42:09.429057Z qemu-system-x86_64: -drive
> file=rbd:rbd/puppy2:id=admin:key=AQCeuf5YUac6NxAAU3CCcwnaI5z7pgUww4Gyrg==:auth_supported=cephx\;none:mon_host=10.202.127.1
[ceph-users] libvirt + rbd questions
Hi all:

I have tried to install a VM using rbd as its disk, following the steps in the ceph docs, but ran into some problems. The package environment is:

CentOS Linux release 7.2.1511 (Core)
libvirt-2.0.0-10.el7_3.9.x86_64
libvirt-python-2.0.0-2.el7.x86_64
virt-manager-common-1.4.0-2.el7.noarch
qemu-2.0.0-1.el7.6.x86_64

The procedure is: create a pool

[libvirt pool XML stripped by the list archive; surviving values: name rbd2, capacity 48283717632 bytes, available 46990565376 bytes, source type rbd]

create a vol t2, and run the following command:

virt-install -n puppy2 --vcpu 1 -c /etc/ceph/tahr64-6.0.5.iso --disk vol=rbd2/t2 --memory 1024 --vnc --vnclisten=0.0.0.0 -v

It failed; the error is:

WARNING  No operating system detected, VM performance may suffer.
         Specify an OS with --os-variant for optimal results.
WARNING  Unable to connect to graphical console: virt-viewer not installed.
         Please install the 'virt-viewer' package.
WARNING  No console to launch for the guest, defaulting to --wait -1

Starting install...
ERROR    internal error: qemu unexpectedly closed the monitor:
(process:663223): GLib-WARNING **: gmem.c:482: custom memory allocation vtable not supported
2017-08-25T06:33:59.392000Z qemu-system-x86_64: -drive file=rbd:rbd/t2:auth_supported=none:mon_host=10.202.127.11\:6789,format=raw,if=none,id=drive-ide0-0-0: could not open disk image rbd:rbd/t2:auth_supported=none:mon_host=10.202.127.11\:6789: Unknown protocol

Domain installation does not appear to have been successful.
If it was, you can restart your domain by running:
  virsh --connect qemu:///system start puppy2
otherwise, please restart your installation.

After googling, I changed /usr/share/virt-manager/virtinst/guest.py like this:

    def _build_xml(self):
        install_xml = self._get_install_xml(install=True)
        final_xml = self._get_install_xml(install=False)

        auth_secret = '''
            [auth/secret XML stripped by the list archive]
        '''
        import re
        # [regex pattern partly stripped by the list archive]
        rgx_auth = re.compile('(?<=]*?">).*?(?= *?)', re.S)

        install_xml = rgx_auth.sub('\\1' + auth_secret, install_xml)
        final_xml = rgx_auth.sub('\\1' + auth_secret, final_xml)
        logging.debug("Generated install XML: %s",
                      (install_xml and ("\n" + install_xml) or "None required"))
        logging.debug("Generated boot XML: \n%s", final_xml)

        return install_xml, final_xml

and ran:

virt-install -n puppy2 --vcpu 1 -c /etc/ceph/tahr64-6.0.5.iso --disk vol=rbd2/t2 --memory 1024 --vnc --vnclisten=0.0.0.0 -v --print-xml

The generated XML looks OK. So I ran again:

virt-install -n puppy2 --vcpu 1 -c /etc/ceph/tahr64-6.0.5.iso --disk vol=rbd2/t2 --memory 1024 --vnc --vnclisten=0.0.0.0 -v

and still got an error:

WARNING  No operating system detected, VM performance may suffer.
         Specify an OS with --os-variant for optimal results.
WARNING  Unable to connect to graphical console: virt-viewer not installed.
         Please install the 'virt-viewer' package.
WARNING  No console to launch for the guest, defaulting to --wait -1

Starting install...
ERROR    internal error: qemu unexpectedly closed the monitor:
(process:663621): GLib-WARNING **: gmem.c:482: custom memory allocation vtable not supported
2017-08-25T06:42:09.429057Z qemu-system-x86_64: -drive file=rbd:rbd/puppy2:id=admin:key=AQCeuf5YUac6NxAAU3CCcwnaI5z7pgUww4Gyrg==:auth_supported=cephx\;none:mon_host=10.202.127.11\:6789,format=raw,if=none,id=drive-ide0-0-0: could not open disk image rbd:rbd/puppy2:id=admin:key=AQCeuf5YUac6NxAAU3CCcwnaI5z7pgUww4Gyrg==:auth_supported=cephx\;none:mon_host=10.202.127.11\:6789: Unknown protocol

Domain installation does not appear to have been successful.
If it was, you can restart your domain by running:
  virsh --connect qemu:///system start puppy
otherwise, please restart your installation.

I tried to just create an rbd image as a disk and attach it to an existing VM, and that worked, so I don't think it is a package version problem. Has anyone met this problem? Or where am I wrong in the above procedure?

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
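The guest.py patch above injects a cephx auth/secret element into the generated disk XML via a regex whose pattern the archive has mangled. As a rough, hypothetical sketch of the same idea (a toy disk definition and ElementTree instead of the original regex substitution; only the secret UUID comes from the post, everything else is illustrative):

```python
import xml.etree.ElementTree as ET

# Toy libvirt RBD disk XML; element names follow libvirt's domain XML
# format, but the concrete values here are illustrative.
disk_xml = """<disk type='network' device='disk'>
  <driver name='qemu' type='raw'/>
  <source protocol='rbd' name='rbd2/t2'>
    <host name='10.202.127.11' port='6789'/>
  </source>
  <target dev='vda' bus='virtio'/>
</disk>"""

def inject_auth(xml_text, username, secret_uuid):
    """Add an <auth> child (cephx user + libvirt secret UUID) to a disk element."""
    disk = ET.fromstring(xml_text)
    auth = ET.SubElement(disk, 'auth', username=username)
    ET.SubElement(auth, 'secret', type='ceph', uuid=secret_uuid)
    return ET.tostring(disk, encoding='unicode')

patched = inject_auth(disk_xml, 'admin',
                      'f75248a0-ac7e-4c0e-a90a-2d8d00833132')
print(patched)
```

Doing this on parsed XML rather than with a regex avoids the brittleness that makes the guest.py pattern hard to get right.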
Re: [ceph-users] Speeding up garbage collection in RGW
I think if you want to delete through gc, increase this:

    OPTION(rgw_gc_processor_max_time, OPT_INT, 3600)  // total run time for a single gc processor work

and decrease this:

    OPTION(rgw_gc_processor_period, OPT_INT, 3600)  // gc processor cycle time

Or perhaps there is some option to bypass the gc entirely.

On Tue, Jul 25, 2017 at 5:05 AM, Bryan Stillwell wrote:
> Wouldn't doing it that way cause problems, since references to the objects
> wouldn't be getting removed from .rgw.buckets.index?
>
> Bryan
>
> From: Roger Brown
> Date: Monday, July 24, 2017 at 2:43 PM
> To: Bryan Stillwell, "ceph-users@lists.ceph.com"
> Subject: Re: [ceph-users] Speeding up garbage collection in RGW
>
> I hope someone else can answer your question better, but in my case I found
> something like this helpful to delete objects faster than I could through
> the gateway:
>
> rados -p default.rgw.buckets.data ls | grep 'replace this with pattern
> matching files you want to delete' | xargs -d '\n' -n 200 rados -p
> default.rgw.buckets.data rm
>
> On Mon, Jul 24, 2017 at 2:02 PM Bryan Stillwell wrote:
> I'm in the process of cleaning up a test that an internal customer did on
> our production cluster that produced over a billion objects spread across
> 6000 buckets. So far I've been removing the buckets like this:
>
> printf %s\\n bucket{1..6000} | xargs -I{} -n 1 -P 32 radosgw-admin bucket rm --bucket={} --purge-objects
>
> However, the disk usage doesn't seem to be getting reduced at the same rate
> the objects are being removed. From what I can tell, a large number of the
> objects are waiting for garbage collection.
>
> When I first read the docs it sounded like the garbage collector would only
> remove 32 objects every hour, but after looking through the logs I'm seeing
> about 55,000 objects removed every hour. That's about 1.3 million a day, so
> at this rate it'll take a couple of years to clean up the rest!
> For comparison, the purge-objects command above is removing (but not
> GC'ing) about 30 million objects a day, so a much more manageable 33 days
> to finish.
>
> I've done some digging and it appears I should be changing these
> configuration options:
>
> rgw gc max objs (default: 32)
> rgw gc obj min wait (default: 7200)
> rgw gc processor max time (default: 3600)
> rgw gc processor period (default: 3600)
>
> A few questions I have though are:
>
> Should 'rgw gc processor max time' and 'rgw gc processor period' always be
> set to the same value?
>
> Which would be better: increasing 'rgw gc max objs' to something like 1024,
> or reducing the 'rgw gc processor' times to something like 60 seconds?
>
> Any other guidance on the best way to adjust these values?
>
> Thanks,
> Bryan
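For reference, Bryan's rate estimates can be reproduced with quick arithmetic (all figures taken from the thread):

```python
# Figures from the thread: ~1 billion object backlog, observed GC rate
# of ~55,000 objects/hour, and a purge-objects rate of ~30M objects/day.
backlog = 1_000_000_000

gc_per_day = 55_000 * 24            # objects the GC removes per day
gc_days = backlog / gc_per_day      # days to drain the backlog via GC

purge_per_day = 30_000_000
purge_days = backlog / purge_per_day  # days at the purge-objects rate

print(gc_per_day, round(gc_days), round(purge_days))
# 1,320,000/day -> roughly 758 days (a couple of years) via GC,
# versus roughly 33 days at the purge rate, matching the thread.
```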
Re: [ceph-users] Monitor as local VM on top of the server pool cluster?
For a large cluster there will be a lot of change at any time, which means the pressure on the mon will be high at times, because every change goes through the leader. So for this, local storage for the mon should be good enough; I think this may be a consideration.

On Tue, Jul 11, 2017 at 11:29 AM, Brad Hubbard wrote:
> On Tue, Jul 11, 2017 at 3:44 AM, David Turner wrote:
>> Mons are a paxos quorum and as such want to be in odd numbers. 5 is
>> generally what people go with. I think I've heard of a few people using 7
>> mons, but you do not want to have an even number of mons or an ever-growing
>
> Unless your cluster is very large, three should be sufficient.
>
>> number of mons. The reason you do not want mons running on the same
>> hardware as OSDs is resource contention during recovery. As long as the Xen
>> servers you are putting the mons on are not going to cause any source of
>> resource limitation/contention, then virtualizing them should be fine for
>> you. Make sure that you aren't configuring the mon to run using an RBD for
>> its storage; that would be very bad.
>>
>> The mon quorum elects a leader and that leader will be in charge of the
>> quorum. Having local mons doesn't do anything, as the clients will still be
>> talking to the mons as a quorum and won't necessarily talk to the mon
>> running on them. The vast majority of communication to the cluster that
>> your Xen servers will be doing is to the OSDs anyway; very little
>> communication goes to the mons.
>>
>> On Mon, Jul 10, 2017 at 1:21 PM Massimiliano Cuttini wrote:
>>>
>>> Hi everybody,
>>>
>>> I would like to separate MON from OSD as recommended.
>>> In order to do so without new hardware, I'm planning to create all the
>>> monitors as virtual machines on top of my hypervisors (Xen).
>>> I'm testing a pool of 8 Xen nodes.
>>>
>>> I'm thinking about creating 8 monitors and pinning one monitor to each
>>> Xen node. So, I'm guessing, every Ceph monitor will be local to each
>>> client node. This should speed up the system by connecting to monitors
>>> locally, with a little overhead for the monitor sync between nodes.
>>>
>>> Is it a good idea to have a local monitor virtualized on top of each
>>> hypervisor node?
>>> Do you see any underestimation or wrong design in this?
>>>
>>> Thanks for every helpful piece of info.
>>>
>>> Regards,
>>> Max
>
> --
> Cheers,
> Brad
Re: [ceph-users] ceph-mon leader election problem, should it be improved ?
Hi Joao:

> Basically, this would be something similar to heartbeats. If a monitor can't
> reach all monitors in an existing quorum, then just don't do anything.

Based on your solution, I make a little change:

- send a probe to all monitors
- if it sees a quorum, it will join the current quorum through a join_quorum
  message; when the leader receives this, it will change the quorum and claim
  victory again. On timeout, it means it can't reach the leader, so do
  nothing and try again later from bootstrap.
- if it gets > 1/2 acks, do as before: call an election.

With this, sometimes the leader does not have the smallest rank number; I think that is fine. In the quorum message there will be one more byte to point out the leader's rank number. I think this will perform the same as before and can tolerate some network partition errors, and it only needs a small code change. Any suggestions for this? Am I missing any considerations?

On Wed, Jul 5, 2017 at 6:26 PM, Joao Eduardo Luis <j...@suse.de> wrote:
> On 07/05/2017 08:01 AM, Z Will wrote:
>>
>> Hi Joao:
>> I think this is all because we choose the monitor with the
>> smallest rank number to be leader. For this kind of network error, no
>> matter which mon has lost connection with the mon who has the
>> smallest rank num, it will be constantly calling an election; that is to
>> say, it will constantly affect the cluster until it is stopped by a
>> human. So do you think it makes sense if I try to figure out a way to
>> choose the monitor who can see the most monitors, or with the smallest
>> rank num if the view num is the same, to be leader?
>> In the probing phase:
>>    they will know their own view, so they can set a view num.
>> In the election phase:
>>    they send the view num and rank num.
>>    When receiving the election message, a mon compares the view num
>> (higher is leader) and rank num (lower is leader).
>
> As I understand it, our elector trades off reliability in case of network
> failure for expediency in forming a quorum. This by itself is not a problem
> since we don't see many real-world cases where this behaviour happens, and
> we are a lot more interested in making sure we have a quorum - given that
> without a quorum your cluster is effectively unusable.
>
> Currently, we form a quorum with a minimal number of messages passed.
> From my poor recollection, I think the Elector works something like:
>
> - 1 probe message to each monitor in the monmap
> - receives defer from a monitor, or defers to a monitor
> - declares victory if the number of defers is an absolute majority
> (including one's own defer).
>
> An election cycle takes about 4-5 messages to complete, with roughly two
> round-trips (in the best-case scenario).
>
> Figuring out which monitor is able to contact the highest number of
> monitors, and having said monitor be elected the leader, will necessarily
> increase the number of messages transferred.
>
> A rough idea would be:
>
> - all monitors send probes to all other monitors in the monmap;
> - all monitors need to ack the others' probes;
> - each monitor counts the number of monitors it can reach, and then sends
> a message proposing itself as the leader to the other monitors, with the
> list of monitors it sees;
> - each monitor proposes itself as the leader, or defers to some other
> monitor.
>
> This is closer to 3 round-trips.
>
> Additionally, we'd have to account for the fact that some monitors may be
> able to reach all other monitors, while some may only be able to reach a
> portion. How do we handle this scenario?
>
> - What do we do with monitors that do not reach all other monitors?
> - Do we ignore them for electoral purposes?
> - Are they part of the final quorum?
> - What if we need those monitors to form a quorum?
>
> Personally, I think the easiest solution to this problem would be
> blacklisting a problematic monitor (for a given amount of time, or until a
> new election is needed due to loss of quorum, or by human intervention).
>
> For example, if a monitor believes it should be the leader, and all other
> monitors are deferring to someone else that is not reachable, the monitor
> could then enter a special-case branch:
>
> - send a probe to all monitors
> - receive acks
> - share that with other monitors
> - if that list is missing monitors, then blacklist the monitor for a
> period, and send a message to that monitor with that decision
> - the monitor would blacklist itself and retry in a given amount of time.
>
> Basically, this would be something similar to heartbeats. If a monitor
> can't reach all monitors in an existing quorum, then just don't do
> anything.
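The modified bootstrap step proposed above can be sketched as a toy decision function (a model for discussion, not the actual ceph-mon Elector code; the join_quorum name comes from the proposal):

```python
# Toy model of the proposed bootstrap decision (not ceph-mon code).
def bootstrap_decision(acks, monmap_size, leader_reachable, quorum_exists):
    """acks: set of monitor ranks that answered our probe (self included)."""
    majority = monmap_size // 2 + 1
    if quorum_exists:
        # An existing quorum is visible: join it via join_quorum instead of
        # forcing a new election; on timeout, stand by and retry bootstrap.
        return 'join_quorum' if leader_reachable else 'stand_by'
    if len(acks) >= majority:
        # No live quorum, but we can reach a majority: call an election.
        return 'call_election'
    # Cannot reach a majority: do nothing and retry later.
    return 'stand_by'

# mon.1 cut off from leader mon.0 but still seeing mon.2 (quorum member):
# it stands by instead of repeatedly calling elections.
print(bootstrap_decision({1, 2}, 3, leader_reachable=False, quorum_exists=True))
```

The point of the sketch is the ordering: an existing quorum is honoured before the majority check, which is what stops a partitioned monitor from destabilizing a working cluster.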
Re: [ceph-users] ceph-mon leader election problem, should it be improved ?
Hi Sage:

After these days of consideration, and reading some related papers, I think we can make a very small change to solve the problems above and let the monitors tolerate most network partitions. Most of the logic stays the same as before, except one point:

- send a probe to each monitor in the monmap
- receive acks, and remember them, if it gets > 1/2 acks, or sees a quorum
- if it sees a quorum, it will try to reach the current leader to join the
  current quorum, not try to call a new election. If joining times out, it
  should stand by and try later, and it will sync state from the other mons
  when needed to keep performance up.

All other logic is the same as before. What do you think of it?

On Thu, Jul 6, 2017 at 10:31 PM, Sage Weil <s...@newdream.net> wrote:
> On Thu, 6 Jul 2017, Z Will wrote:
>> Hi Joao:
>>
>> Thanks for the thorough analysis. My initial concern is that I think
>> in some cases a network failure will leave a low-rank monitor seeing few
>> siblings (not enough to form a quorum), while some high-rank monitor
>> can see more siblings, so I want to try to choose the one who can see
>> the most to be leader, to tolerate network errors to the biggest
>> extent, not just to solve the corner case. Yes, you are right:
>> this kind of complex network failure is rare. Trying to find
>> out who can contact the highest number of monitors can only cover
>> some of the situations, and will introduce other complexity and
>> hurt efficiency. This is not good. Blacklisting a problematic
>> monitor is a simple and good idea. The implementation in the monitor now
>> is like this: no matter which high-rank monitor loses its connection
>> with the leader, that monitor will constantly try to call a leader
>> election, affect its siblings, and then affect the whole cluster.
>> Because the leader election procedure is fast, it will be OK for a
>> short time, but soon the leader election starts again, and the cluster
>> becomes unstable. I think the probability of this kind of network error
>> is high, yes? So based on your idea, make a little change:
>>
>> - send a probe to all monitors
>> - receive acks
>> - After receiving acks, it will know the current quorum and how many
>> monitors it can reach.
>>    If it can reach the current leader, then it will try to join the
>> current quorum.
>>    If it cannot reach the current leader, then it will decide whether
>> to stand by for a while and try later, or start a leader election,
>> based on the information got from the probing phase.
>>
>> Do you think this will be OK?
>
> I'm worried that even if we can form an initial quorum, we are currently
> very casual about the "call new election" logic. If a mon is not part of
> the quorum it will currently trigger a new election... and with this
> change it will then not be included in it because it can't reach all mons.
> The logic there will also have to change so that it confirms that it can
> reach a majority of mon peers before requesting a new election.
>
> sage
>
>> On Wed, Jul 5, 2017 at 6:26 PM, Joao Eduardo Luis <j...@suse.de> wrote:
>> > On 07/05/2017 08:01 AM, Z Will wrote:
>> >>
>> >> Hi Joao:
>> >> I think this is all because we choose the monitor with the
>> >> smallest rank number to be leader. For this kind of network error, no
>> >> matter which mon has lost connection with the mon who has the
>> >> smallest rank num, it will be constantly calling an election; that is
>> >> to say, it will constantly affect the cluster until it is stopped by
>> >> a human. So do you think it makes sense if I try to figure out a way
>> >> to choose the monitor who can see the most monitors, or with the
>> >> smallest rank num if the view num is the same, to be leader?
>> >> In the probing phase:
>> >>    they will know their own view, so they can set a view num.
>> >> In the election phase:
>> >>    they send the view num and rank num.
>> >>    When receiving the election message, a mon compares the view num
>> >> (higher is leader) and rank num (lower is leader).
>> >
>> > As I understand it, our elector trades off reliability in case of
>> > network failure for expediency in forming a quorum. This by itself is
>> > not a problem since we don't see many real-world cases where this
>> > behaviour happens, and we are a lot more interested in making sure we
>> > have a quorum - given that without a quorum your cluster is
>> > effectively unusable.
Re: [ceph-users] ceph-mon leader election problem, should it be improved ?
Hi Joao:

Thanks for the thorough analysis. My initial concern is that I think in some cases a network failure will leave a low-rank monitor seeing few siblings (not enough to form a quorum), while some high-rank monitor can see more siblings, so I want to try to choose the one who can see the most to be leader, to tolerate network errors to the biggest extent, not just to solve the corner case. Yes, you are right: this kind of complex network failure is rare. Trying to find out who can contact the highest number of monitors can only cover some of the situations, and will introduce other complexity and hurt efficiency. This is not good. Blacklisting a problematic monitor is a simple and good idea. The implementation in the monitor now is like this: no matter which high-rank monitor loses its connection with the leader, that monitor will constantly try to call a leader election, affect its siblings, and then affect the whole cluster. Because the leader election procedure is fast, it will be OK for a short time, but soon the leader election starts again, and the cluster becomes unstable. I think the probability of this kind of network error is high, yes? So based on your idea, make a little change:

- send a probe to all monitors
- receive acks
- After receiving acks, it will know the current quorum and how many
  monitors it can reach.
  If it can reach the current leader, then it will try to join the current
  quorum.
  If it cannot reach the current leader, then it will decide whether to
  stand by for a while and try later, or start a leader election, based on
  the information got from the probing phase.

Do you think this will be OK?

On Wed, Jul 5, 2017 at 6:26 PM, Joao Eduardo Luis <j...@suse.de> wrote:
> On 07/05/2017 08:01 AM, Z Will wrote:
>>
>> Hi Joao:
>> I think this is all because we choose the monitor with the
>> smallest rank number to be leader. For this kind of network error, no
>> matter which mon has lost connection with the mon who has the
>> smallest rank num, it will be constantly calling an election; that is to
>> say, it will constantly affect the cluster until it is stopped by a
>> human. So do you think it makes sense if I try to figure out a way to
>> choose the monitor who can see the most monitors, or with the smallest
>> rank num if the view num is the same, to be leader?
>> In the probing phase:
>>    they will know their own view, so they can set a view num.
>> In the election phase:
>>    they send the view num and rank num.
>>    When receiving the election message, a mon compares the view num
>> (higher is leader) and rank num (lower is leader).
>
> As I understand it, our elector trades off reliability in case of network
> failure for expediency in forming a quorum. This by itself is not a problem
> since we don't see many real-world cases where this behaviour happens, and
> we are a lot more interested in making sure we have a quorum - given that
> without a quorum your cluster is effectively unusable.
>
> Currently, we form a quorum with a minimal number of messages passed.
> From my poor recollection, I think the Elector works something like:
>
> - 1 probe message to each monitor in the monmap
> - receives defer from a monitor, or defers to a monitor
> - declares victory if the number of defers is an absolute majority
> (including one's own defer).
>
> An election cycle takes about 4-5 messages to complete, with roughly two
> round-trips (in the best-case scenario).
>
> Figuring out which monitor is able to contact the highest number of
> monitors, and having said monitor be elected the leader, will necessarily
> increase the number of messages transferred.
>
> A rough idea would be:
>
> - all monitors send probes to all other monitors in the monmap;
> - all monitors need to ack the others' probes;
> - each monitor counts the number of monitors it can reach, and then sends
> a message proposing itself as the leader to the other monitors, with the
> list of monitors it sees;
> - each monitor proposes itself as the leader, or defers to some other
> monitor.
>
> This is closer to 3 round-trips.
>
> Additionally, we'd have to account for the fact that some monitors may be
> able to reach all other monitors, while some may only be able to reach a
> portion. How do we handle this scenario?
>
> - What do we do with monitors that do not reach all other monitors?
> - Do we ignore them for electoral purposes?
> - Are they part of the final quorum?
> - What if we need those monitors to form a quorum?
>
> Personally, I think the easiest solution to this problem would be
> blacklisting a problematic monitor (for a given amount of time, or until a
> new election is needed due to loss of quorum, or by human intervention).
Re: [ceph-users] ceph-mon leader election problem, should it be improved ?
Hi Joao:

I think this is all because we choose the monitor with the smallest rank number to be leader. For this kind of network error, no matter which mon has lost connection with the mon who has the smallest rank num, it will be constantly calling an election; that is to say, it will constantly affect the cluster until it is stopped by a human. So do you think it makes sense if I try to figure out a way to choose the monitor who can see the most monitors, or with the smallest rank num if the view num is the same, to be leader?

In the probing phase:
    they will know their own view, so they can set a view num.
In the election phase:
    they send the view num and rank num.
    When receiving the election message, a mon compares the view num (higher is leader) and rank num (lower is leader).

On Tue, Jul 4, 2017 at 9:25 PM, Joao Eduardo Luis <j...@suse.de> wrote:
> On 07/04/2017 06:57 AM, Z Will wrote:
>>
>> Hi:
>>    I am testing ceph-mon brain split. I have read the code. If I
>> understand it right, I know it won't split-brain. But I think there is
>> still another problem. My ceph version is 0.94.10. Here is my test
>> detail:
>>
>> 3 ceph-mons, their ranks are 0, 1, 2 respectively. I stop the rank-1
>> mon, and use iptables to block the communication between mon 0 and
>> mon 1. When the cluster is stable, I start mon.1. I found the 3
>> monitors all stop working well. They are all trying to call a new
>> leader election. This means the cluster can't work anymore.
>>
>> Here is my analysis. Because a mon will always respond to a leader
>> election message, and in my test communication between mon.0 and
>> mon.1 is blocked, mon.1 will always try to be leader, because it
>> will always see mon.2, and it should win over mon.2. Mon.0 should
>> always win over mon.2. But mon.2 will always respond to the election
>> message issued by mon.1, so this loop will never end. Am I right?
>>
>> Should this be considered a problem? Or was it just designed like this,
>> and should be handled by a human?
>
> This is a known behaviour, quite annoying, but easily identifiable by
> having the same monitor constantly calling an election and usually timing
> out because the peon did not defer to it.
>
> In a way, the elector algorithm does what it is intended to. Solving this
> corner case would be nice, but I don't think there's a good way to solve
> it. We may be able to presume a monitor is in trouble during the probe
> phase, to disqualify a given monitor from the election, but in the end
> this is a network issue that may be transient or unpredictable and there's
> only so much we can account for.
>
> Dealing with it automatically would be nice, but I think, thus far, the
> easiest way to address this particular issue is human intervention.
>
> -Joao
Re: [ceph-users] ceph-mon leader election problem, should it be improved ?
Hi Alvaro: From the code , I see unsigned need = monmap->size() / 2 + 1; So for 2 mons , the quorum must be 2 so that it can start election. That's why I use 3 mons. I know if I stop mon.0 or mon.1 , everything will work fine. And if this failure happens, it must be handled by human ? Is there any way to handle it automaticly from design as you know ? On Tue, Jul 4, 2017 at 2:25 PM, Alvaro Soto <alsot...@gmail.com> wrote: > Z, > You are forcing a byzantine failure, the paxos implemented to form the > consensus ring of the mon daemons does not support this kind of failures, > that is why you get and erratic behaviour, I believe is the common paxos > algorithm implemented in mon daemon code. > > If you just gracefully shutdown a mon daemon everything will work fine, but > with this you can not prove a split brain situation, because you will force > the election of the leader by quorum. > > Maybe with 2 mon daemons and closing the communication between each of them > every mon daemon will believe that can be a leader because every daemon will > have the que quorum of 1 with no other vote. > > Just saying :) > > > On Jul 4, 2017 12:57 AM, "Z Will" <zhao6...@gmail.com> wrote: >> >> Hi: >>I am testing ceph-mon brain split . I have read the code . If I >> understand it right , I know it won't be brain split. But I think >> there is still another problem. My ceph version is 0.94.10. And here >> is my test detail : >> >> 3 ceph-mons , there ranks are 0, 1, 2 respectively.I stop the rank 1 >> mon , and use iptables to block the communication between mon 0 and >> mon 1. When the cluster is stable, start mon.1 . I found the 3 >> monitors will all can not work well. They are all trying to call new >> leader election . This means the cluster can't work anymore. >> >> Here is my analysis. 
Because a mon will always respond to a leader >> election message, and in my test the communication between mon.0 and >> mon.1 is blocked, mon.1 will always try to be leader, because it >> will always see mon.2, and it should win over mon.2. Mon.0 should >> always win over mon.2. But mon.2 will always respond to the election >> message issued by mon.1, so this loop will never end. Am I right? >> >> Should this be considered a problem? Or was it just designed like this, and >> should be handled by a human?
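The three-mon livelock described in this thread can be sketched in a few lines. This is a toy model of the rank-based elector, not Ceph code — it assumes only that the lowest rank wins among mons that can see each other, and that a mon which never hears the winner (or outranks it) keeps restarting the election:

```python
def stable_leader(n, blocked):
    """Rank of a stable leader, or None if elections livelock forever."""
    cut = {frozenset(e) for e in blocked}
    def sees(a, b):
        return a != b and frozenset((a, b)) not in cut
    quorum = n // 2 + 1                 # unsigned need = monmap->size()/2 + 1
    for cand in range(n):               # lower rank wins among visible mons
        supporters = [p for p in range(n) if p == cand or sees(cand, p)]
        if len(supporters) < quorum:
            continue
        # A mon that outranks cand, or cannot see cand at all, keeps proposing
        # itself; if it can reach any of cand's supporters, that supporter
        # keeps answering its election messages and the election restarts.
        disrupted = any(
            (c < cand or not sees(c, cand)) and
            any(sees(c, s) for s in supporters if s != c)
            for c in range(n) if c != cand)
        if not disrupted:
            return cand
    return None

print(stable_leader(3, blocked=set()))             # 0: healthy, mon.0 leads
print(stable_leader(3, blocked={(0, 1)}))          # None: the livelock above
print(stable_leader(3, blocked={(0, 1), (0, 2)}))  # 1: mon.0 fully isolated
```

Note the asymmetry: a fully isolated mon cannot disrupt anyone, but a mon that can still reach one member of the would-be quorum keeps the loop going — exactly the mon.1/mon.2 interaction analysed above.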
[ceph-users] ceph-mon leader election problem, should it be improved ?
Hi: I am testing ceph-mon brain split. I have read the code. If I understand it right, I know it won't split-brain. But I think there is still another problem. My ceph version is 0.94.10. And here is my test detail: 3 ceph-mons, their ranks are 0, 1, 2 respectively. I stop the rank 1 mon, and use iptables to block the communication between mon 0 and mon 1. When the cluster is stable, I start mon.1. I found the 3 monitors all fail to work properly. They are all trying to call a new leader election. This means the cluster can't work anymore. Here is my analysis. Because a mon will always respond to a leader election message, and in my test the communication between mon.0 and mon.1 is blocked, mon.1 will always try to be leader, because it will always see mon.2, and it should win over mon.2. Mon.0 should always win over mon.2. But mon.2 will always respond to the election message issued by mon.1, so this loop will never end. Am I right? Should this be considered a problem? Or was it just designed like this, and should be handled by a human?
Re: [ceph-users] Object storage performance tools
You can try this binary cosbench release package: https://github.com/intel-cloud/cosbench/releases/download/v0.4.2.c4/0.4.2.c4.zip . On Fri, Jun 16, 2017 at 3:08 PM, fridifree <fridif...@gmail.com> wrote: > Thank you for your comment. > > I feel that this tool is very buggy. > Do you have an up-to-date guide for this tool (I have problems bringing it up)? > > Thank you very much > > On Jun 16, 2017 10:00 AM, "Piotr Nowosielski" > <piotr.nowosiel...@allegrogroup.com> wrote: > > We are using https://github.com/intel-cloud/cosbench. In my opinion it is the > only true S3 performance testing tool. > > -- > > Piotr Nowosielski > > Senior Systems Engineer > Zespół Infrastruktury 5 > Grupa Allegro sp. z o.o. > Tel: +48 512 08 55 92 > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of > fridifree > Sent: Thursday, June 15, 2017 4:13 PM > To: Ceph Users <ceph-users@lists.ceph.com> > Subject: [ceph-users] Object storage performance tools > > Hi cephers, > > I want to benchmark my rgws and I cannot find any tool to benchmark the Amazon > S3 api (for Swift I am using getput, which is really great). > I really don't want to use Cosbench. > > Any ideas? > > Thanks
Re: [ceph-users] Help build a drive reliability service!
Hi Patrick: I want to ask a very tiny question. How many 9s do you claim for your storage durability? And how is it calculated? Based on the data you provided, have you found a failure model to refine the storage durability? On Thu, Jun 15, 2017 at 12:09 AM, David Turner wrote: > I understand concern over annoying drive manufacturers, but if you have data > to back it up you aren't slandering a drive manufacturer. If they don't > like the numbers that are found, then they should up their game or at least > request that you put in how your tests negatively affected their drive > endurance. For instance, WD Red drives are out of warranty just by being > placed in a chassis with more than 4 disks because they aren't rated for the > increased vibration from that many disks in a chassis. > > OTOH, if you are testing the drives within the bounds of the drive's > warranty, and not doing anything against the recommendation of the > manufacturer in the test use case (both physical and software), then there > is no slander when you say that drive A outperformed drive B. I know that > the drives I run at home are not nearly as resilient as the drives that I > use at the office, but I don't put my home cluster through a fraction of the > strain that I do at the office. The manufacturer knows that their cheaper > drive isn't as resilient as the more robust enterprise drives. Anyway, I'm > sure you guys have thought about all of that and are _generally_ pretty > smart. ;) > > In an early warning system that detects a drive that is close to failing, > you could implement a command to migrate off of the disk and then run > non-stop IO on it to finish off the disk to satisfy warranties. Potentially > this could be implemented with the osd daemon via a burn-in start-up option, > where it can be an OSD in the cluster that does not check in as up, but with > a different status so you can still monitor the health of the failing drive > from a ceph status.
This could also be useful for people that would like to > burn-in their drives, but don't want to dedicate infrastructure to > burning-in new disks before deploying them. Making this as easy as possible > on the end user/ceph admin, there could even be a ceph.conf option for OSDs > that are added to the cluster and have never been marked in to run > through a burn-in of X seconds (changeable in the config and defaulting to 0 > so as not to change the default behavior). I don't know if this is > over-thinking it or adding complexity where it shouldn't be, but it could be > used to get a drive to fail to use for an RMA. OTOH, for large deployments > we would RMA drives in batches and were never asked to prove that the drive > failed. We would RMA drives off of medium errors for HDDs and smart info > for SSDs, and of course for full failures. > > On Wed, Jun 14, 2017 at 11:38 AM Dan van der Ster > wrote: >> >> Hi Patrick, >> >> We've just discussed this internally and I wanted to share some notes. >> >> First, there are at least three separate efforts in our IT dept to >> collect and analyse SMART data -- it's clearly a popular idea and >> simple to implement, but this leads to repetition and begs for a >> common, good solution. >> >> One (perhaps trivial) issue is that it is hard to define exactly when >> a drive has failed -- it varies depending on the storage system. For >> Ceph I would define failure as EIO, which normally correlates with a >> drive medium error, but there were other ideas here. So if this should >> be a general purpose service, the sensor should have a pluggable >> failure indicator. >> >> There was also debate about what exactly we could do with a failure >> prediction model. Suppose the predictor told us a drive should fail in >> one week. We could proactively drain that disk, but then would it >> still fail? Will the vendor replace that drive under warranty only if >> it was *about to fail*?
>> >> Lastly, and more importantly, there is a general hesitation to publish >> this kind of data openly, given how negatively it could impact a >> manufacturer. Our lab certainly couldn't publish a report saying "here >> are the most and least reliable drives". I don't know if anonymising >> the data sources would help here, but anyway I'm curious what are your >> thoughts on that point. Maybe what can come out of this are the >> _components_ of a drive reliability service, which could then be >> deployed privately or publicly as appropriate. >> >> Thanks! >> >> Dan >> >> >> >> >> On Wed, May 24, 2017 at 8:57 PM, Patrick McGarry >> wrote: >> > Hey cephers, >> > >> > Just wanted to share the genesis of a new community project that could >> > use a few helping hands (and any amount of feedback/discussion that >> > you might like to offer). >> > >> > As a bit of backstory, around 2013 the Backblaze folks started >> > publishing statistics about hard drive reliability from within their
[ceph-users] ceph durability calculation and test method
Hi all: I have some questions about the durability of Ceph. I am trying to measure the durability of Ceph. I know it should be related to the host and disk failure probabilities, the failure detection time, when the recovery is triggered, and the recovery time. I use it with multiple replication, say k replicas. If I have N hosts, R racks, and O osds per host, ignoring the switch, how should I define the failure probability of a disk and of a host? I think they should be independent, and should be time-dependent. I googled it, but found little about it. I see AWS says it delivers 99.9% durability. How is this claimed? And can I design some test method to prove the durability? Or should I just let it run long enough and gather the statistics?
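For what it's worth, the usual back-of-the-envelope model multiplies the rate of initial disk failures by the probability that the other k-1 replicas fail inside the detection-plus-recovery window. A crude sketch — every input here (AFR, window, PG count) is an assumed illustrative number, and the model deliberately ignores exactly the correlated host/rack failures the question asks about, which is why CRUSH failure domains matter in practice:

```python
import math

def durability_nines(n_disks, k, afr, recovery_hours, pgs_per_disk=100):
    """Very crude annual durability estimate for k-way replication.

    Assumes independent disk failures at annualized failure rate `afr`,
    and that a PG is lost only if the other k-1 replicas all fail within
    the detection + recovery window. Host/rack correlation is ignored.
    """
    hours_per_year = 24 * 365
    p_in_window = afr * recovery_hours / hours_per_year
    failures_per_year = n_disks * afr            # recoveries started per year
    p_pg_lost = pgs_per_disk * p_in_window ** (k - 1)
    p_loss_per_year = failures_per_year * p_pg_lost
    return -math.log10(p_loss_per_year)          # "number of nines"

# 100 disks, 3 replicas, 4% AFR, 24 h to re-replicate a failed disk:
print(round(durability_nines(100, 3, 0.04, 24), 1))   # 5.3 nines
```

The model at least shows which knobs matter: shortening the recovery window or adding a replica buys nines multiplicatively, adding disks costs them only linearly.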
[ceph-users] should I use rocksdb ?
Hello gurus: My name is Will. I have just started studying Ceph and have a lot of interest in it. We are using ceph 0.94.10, and I am trying to tune the performance of Ceph to satisfy our requirements. We are using it as an object store now. I have tried some different configurations, but I still have some questions. How much impact does the performance of the omap kv database have on the cluster? Is there some way to measure it? I know rocksdb performs better than leveldb on fast devices like SSDs. So I wonder: if I put the bucket index on SSD and change the omap kvdb to rocksdb, will the performance be improved? Has anyone tried this? Or, if I keep using filestore but use EnhanceIO + SSD as a filesystem cache tier, and change the omap kv store of all OSDs to rocksdb, will the cluster performance be improved? Has anyone tried this?
[ceph-users] radosgw 100-continue problem
Hi: I used nginx + fastcgi + radosgw, with radosgw configured with "rgw print continue = true". RFC 2616 says: An origin server that sends a 100 (Continue) response MUST ultimately send a final status code, once the request body is received and processed, unless it terminates the transport connection prematurely. But in RADOSGW, for PUT requests, when radosgw receives Expect: 100-continue, it responds 100-continue; the code is like this: int RGWFCGX::send_status(int status, const char *status_name) { status_num = status; return print("Status: %d %s\r\n", status, status_name); } int RGWFCGX::send_100_continue() { int r = send_status(100, "Continue"); if (r >= 0) { flush(); } return r; } There is just one \r\n, which means the header section is not terminated. So when execution reaches send_response(), it continues to send the remaining headers. I wonder if this is the cause; nginx doesn't work due to this. I think this should be split into two responses.
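For reference, what the client should see on the wire is two complete responses: the interim 100 terminated by its own blank line, then the final status. A minimal sketch in plain HTTP terms — this illustrates only the framing, not the FastCGI record encoding radosgw actually emits; note also that nginx's fastcgi module is commonly reported not to pass 100-continue through, which is why "rgw print continue = false" is the usual suggestion behind non-Apache frontends (worth verifying for your versions):

```python
# A PUT with "Expect: 100-continue" should see two complete responses:
interim = b"HTTP/1.1 100 Continue\r\n\r\n"   # the blank line terminates it
final = (b"HTTP/1.1 200 OK\r\n"
         b"Content-Length: 0\r\n"
         b"\r\n")
wire = interim + final

# The suspected bug above: "Status: 100 Continue\r\n" ends with a single
# \r\n, so the interim response is never terminated and the final status
# line is parsed as just another header of the 100 response.
broken = b"Status: 100 Continue\r\n" + b"Status: 200 OK\r\n\r\n"

print(wire.count(b"\r\n\r\n"))    # 2: two properly delimited responses
print(broken.count(b"\r\n\r\n"))  # 1: the two responses have fused into one
```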
[ceph-users] radosgw fastcgi problem
Hi: Very sorry for the last email, it was an accident. Recently, I tried to configure radosgw (0.94.7) with nginx as the frontend and benchmark it with cosbench. And I found a strange thing. My OS-related settings are: net.core.netdev_max_backlog = 1000 net.core.somaxconn = 1024 In rgw_main.cc, I see #define SOCKET_BACKLOG 1024. So here is the problem. 1. When I configure cosbench with 1000 workers, everything is OK. But when it is 2000 workers, there will be a lot of CLOSE_WAIT state connections on the rgw side, and rgw will no longer respond to any request. Using the netstat command, I can see there are 1024 CLOSE_WAIT connections. 2. I change net.core.somaxconn to 128 and run cosbench with 2000 workers again; very soon there is no response from radosgw, and using the netstat command I can see there are 128 CLOSE_WAIT connections. I tried a lot of things and googled it, but the problem still exists. I wonder if there is a bug in the fastcgi lib. Has anyone met this problem?
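The counts landing exactly at 1024 and then 128 — i.e. min(SOCKET_BACKLOG, net.core.somaxconn) — suggest the stuck sockets track the effective listen backlog. A small helper (an illustrative sketch, not part of any Ceph tooling) to tally socket states from /proc/net/tcp; CLOSE_WAIT (state code 08) means the peer already sent FIN but the local process never closed its end of the connection:

```python
# Map of the /proc/net/tcp `st` field to the states relevant here.
TCP_STATES = {"01": "ESTABLISHED", "06": "TIME_WAIT",
              "08": "CLOSE_WAIT", "0A": "LISTEN"}

def count_states(proc_net_tcp_text):
    """Tally sockets per TCP state from the text of /proc/net/tcp."""
    counts = {}
    for line in proc_net_tcp_text.splitlines()[1:]:   # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        state = TCP_STATES.get(fields[3], fields[3])  # field 3 is `st`
        counts[state] = counts.get(state, 0) + 1
    return counts

# e.g. count_states(open("/proc/net/tcp").read()) on the rgw host; in the
# failure above this should report exactly 1024 (or 128) CLOSE_WAIT entries.
```

If the CLOSE_WAIT count saturates at the backlog, the accepting side (the fastcgi library, in this theory) is accepting connections it then abandons without closing.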
[ceph-users] radosgw fastcgi problem
Hi: Recently, I tried to configure radosgw with nginx as frontend, benchmark it with cosbench. And I found a strange thing. My OS-related configuration is: net.core.netdev_max_backlog = 1000
[ceph-users] any nginx + rgw best practice ?
Hi: I have tried to use nginx + fastcgi + radosgw, and benchmarked it with cosbench. I tuned the nginx and CentOS configuration, but couldn't get the desired performance. I met the following problems. 1. I tried to tune the CentOS net.ipv4... parameters, and I got a lot of connections in the CLOSE_WAIT state on the radosgw side. 2. With the default CentOS configuration and cosbench configured with 800 workers, cosbench got a lot of timed-out requests, and it would get stuck on that workload. But there is no problem if I use Apache httpd. I am new to nginx. I wonder what the best practice is for nginx + rgw. What is the best configuration for the nginx and CentOS combination?
[ceph-users] Ceph-fuse single read limitation?
Hi Guys, We currently have a very small cluster with 3 OSDs but a 40Gb NIC. We use ceph-fuse as the cephfs client with readahead enabled, but a single-stream read of a large file from cephfs via fio, dd or cp can only achieve ~70 MB/s, even if fio or dd's block size is set to 1MB or 4MB. From the ceph client log, we found each read request's size (ll_read) is limited to 128KB, which should be the limitation of kernel fuse's FUSE_MAX_PAGES_PER_REQ = 32. We may try to increase this value to see the performance difference, but are there other options we can try to increase single-read performance? Thanks. Zhi Zhang (David)
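With only one 128 KB request outstanding at a time, throughput is simply request size over per-request round-trip latency — and that arithmetic reproduces the ~70 MB/s if each read takes roughly 1.8 ms. The latency figure below is an assumption for illustration, not a measurement from this thread:

```python
# Serial-read throughput bound: one outstanding request at a time, so
# throughput = request size / per-request round-trip latency.
def max_throughput_mb_s(request_kb, latency_ms):
    return (request_kb / 1024.0) / (latency_ms / 1000.0)

# Assumed ~1.8 ms per 128 KB round trip (network + OSD read):
print(round(max_throughput_mb_s(128, 1.8), 1))    # 69.4, near the observed 70
# Raising FUSE_MAX_PAGES_PER_REQ so requests are 1 MB lifts the bound:
print(round(max_throughput_mb_s(1024, 1.8), 1))   # 555.6 for the same latency
```

The same bound explains why a 40Gb NIC does not help: the pipe is never the constraint, the request-at-a-time round trip is.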
[ceph-users] Write performance issue under rocksdb kvstore
Hi Guys, I am trying the latest ceph-9.1.0 with rocksdb 4.1 and ceph-9.0.3 with rocksdb 3.11 as the OSD backend. I use rbd to test performance and the following is my cluster info. [ceph@xxx ~]$ ceph -s cluster b74f3944-d77f-4401-a531-fa5282995808 health HEALTH_OK monmap e1: 1 mons at {xxx=xxx.xxx.xxx.xxx:6789/0} election epoch 1, quorum 0 xxx osdmap e338: 44 osds: 44 up, 44 in flags sortbitwise pgmap v1476: 2048 pgs, 1 pools, 158 MB data, 59 objects 1940 MB used, 81930 GB / 81932 GB avail 2048 active+clean All the disks are spinning ones with the write cache turned on. Rocksdb's WAL and sst files are on the same disk as every OSD. Using fio to generate the following write load: fio -direct=1 -rw=randwrite -ioengine=sync -size=10M -bs=4K -group_reporting -directory /mnt/rbd_test/ -name xxx.1 -numjobs=1 Test result: WAL enabled + sync: false + disk write cache: on will get ~700 IOPS. WAL enabled + sync: true (default) + disk write cache: on|off will get only ~25 IOPS. I tuned some other rocksdb options, but with no luck. I tracked down the rocksdb code and found each writer's Sync operation takes ~30ms to finish. And as shown above, it is strange that performance shows little difference whether the disk write cache is on or off. Did you guys encounter a similar issue? Or am I missing something that causes rocksdb's poor write performance? Thanks. Zhi Zhang (David)
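A ~30 ms Sync on a spindle caps a synchronous WAL at roughly 1000/30 ≈ 33 committed writes per second, which lines up with the ~25 IOPS above. A quick stand-alone harness — a sketch that measures raw write+fsync cost in a directory, which is what rocksdb's sync: true pays per commit, not rocksdb itself:

```python
import os
import tempfile
import time

def fsync_latency_ms(path_dir, writes=50, size=4096):
    """Average latency of write+fsync pairs on the given directory (ms)."""
    fd, path = tempfile.mkstemp(dir=path_dir)
    try:
        buf = b"\0" * size
        start = time.monotonic()
        for _ in range(writes):
            os.write(fd, buf)
            os.fsync(fd)        # what a sync=true WAL commit does per write
        elapsed = time.monotonic() - start
        return elapsed * 1000.0 / writes
    finally:
        os.close(fd)
        os.unlink(path)
```

For example, fsync_latency_ms("/var/lib/ceph/osd/ceph-0") run on the WAL's disk: a ~30 ms result means at most ~33 synced commits/s, regardless of the drive's write cache, since fsync is expected to flush through it.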
Re: [ceph-users] Write performance issue under rocksdb kvstore
Haomai, you're right. I added such a sync option as configurable for our test purposes. Thanks. Zhi Zhang (David) > Date: Tue, 20 Oct 2015 21:24:49 +0800 > From: haomaiw...@gmail.com > To: zhangz.da...@outlook.com > CC: ceph-users@lists.ceph.com; ceph-de...@vger.kernel.org > Subject: Re: [ceph-users] Write performance issue under rocksdb kvstore > > Actually keyvaluestore would submit transactions with the sync flag > too (relying on the keyvaluedb impl journal/logfile). > > Yes, if we disable the sync flag, keyvaluestore's performance will > increase a lot. But we don't provide this option now. > > On Tue, Oct 20, 2015 at 9:22 PM, Z Zhang <zhangz.da...@outlook.com> wrote: > > Thanks, Sage, for pointing out the PR and ceph branch. I will take a closer > > look. Yes, I am trying the KVStore backend. The reason we are trying it is that > > a few users don't have such a high requirement against occasional data loss. It > > seems the KVStore backend without a synchronized WAL could achieve better > > performance than filestore. And only data still in the page cache would get lost > > on a machine crash, not a process crash, if we use the WAL but no > > synchronization. What do you think? Thanks. Zhi Zhang (David) Date: Tue, > > 20 Oct 2015 05:47:44 -0700 From: s...@newdream.net To: > > zhangz.da...@outlook.com CC: ceph-users@lists.ceph.com; > > ceph-de...@vger.kernel.org Subject: Re: [ceph-users] Write performance issue > > under rocksdb kvstore On Tue, 20 Oct 2015, Z Zhang wrote: > Hi Guys, > > I > > am trying latest ceph-9.1.0 with rocksdb 4.1 and ceph-9.0.3 with > rocksdb > > 3.11 as OSD backend. I use rbd to test performance and following > is my > > cluster info. > > [ceph@xxx ~]$ ceph -s > cluster > > b74f3944-d77f-4401-a531-fa5282995808 > health HEALTH_OK > monmap > > e1: 1 mons at {xxx=xxx.xxx.xxx.xxx:6789/0} > election epoch 1, > > quorum 0 xxx > osdmap e338: 44 osds: 44 up, 44 in > flags > > sortbitwise >
pgmap v1476: 2048 pgs, 1 pools, 158 MB data, 59 objects > > 1940 MB used, 81930 GB / 81932 GB avail > > 2048 > > active+clean > > All the disks are spinning ones with write cache turned > > on. Rocksdb's > WAL and sst files are on the same disk as every OSD. Are you > > using the KeyValueStore backend? > Using fio to generate following write > > load: > fio -direct=1 -rw=randwrite -ioengine=sync -size=10M -bs=4K > > -group_reporting -directory /mnt/rbd_test/ -name xxx.1 -numjobs=1 > > Test > > result: > WAL enabled + sync: false + disk write cache: on will get ~700 > > IOPS. > WAL enabled + sync: true (default) + disk write cache: on|off will > > get only ~25 IOPS. > > I tuned some other rocksdb options, but with no luck. > > The wip-newstore-frags branch sets some defaults for rocksdb that I think > > look pretty reasonable (at least given how newstore is using rocksdb). > I > > tracked down the rocksdb code and found each writer's Sync operation > would > > take ~30ms to finish. And as shown above, it is strange that > performance > > has not much difference no matter whether the disk write cache is on or > off. > > Do > > you guys encounter a similar issue? Or do I miss something to > cause > > rocksdb's poor write performance? Yes, I saw the same thing. This PR > > addresses the problem and is nearing merge upstream: > > https://github.com/facebook/rocksdb/pull/746 There is also an XFS > > performance bug that is contributing to the problem, but it looks like Dave > > Chinner just put together a fix for that. But... we likely won't be using > > KeyValueStore in its current form over rocksdb (or any other kv backend). It > > stripes object data over key/value pairs, which IMO is not the best > > approach.
sage ___ ceph-users > > mailing list ceph-users@lists.ceph.com > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > > ___ > > ceph-users mailing list > > ceph-users@lists.ceph.com > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > > > > -- > Best Regards, > > Wheat > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Write performance issue under rocksdb kvstore
Got your point. It is not only about the object data itself, but also ceph internal metadata. The best option seems to be your PR and the wip-newstore-frags branch. :-) Thanks. Zhi Zhang (David) > Date: Tue, 20 Oct 2015 06:25:43 -0700 > From: s...@newdream.net > To: zhangz.da...@outlook.com > CC: ceph-users@lists.ceph.com; ceph-de...@vger.kernel.org > Subject: Re: [ceph-users] Write performance issue under rocksdb kvstore > > On Tue, 20 Oct 2015, Z Zhang wrote: > > Thanks, Sage, for pointing out the PR and ceph branch. I will take a > > closer look. > > > > Yes, I am trying the KVStore backend. The reason we are trying it is that > > a few users don't have such a high requirement against occasional data loss. > > It seems the KVStore backend without a synchronized WAL could achieve better > > performance than filestore. And only data still in the page cache would get > > lost on a machine crash, not a process crash, if we use the WAL but no > > synchronization. What do you think? > > That sounds dangerous. The OSDs are recording internal metadata about the > cluster (peering, replication, etc.)... even if you don't care so much > about recent user data writes you probably don't want to risk breaking > RADOS itself. If the kv backend is giving you a stale point-in-time > consistent copy it's not so bad, but in a power-loss event it could give > you problems... > > sage > > > > > Thanks. Zhi Zhang (David) > > > > Date: Tue, 20 Oct 2015 05:47:44 -0700 > > From: s...@newdream.net > > To: zhangz.da...@outlook.com > > CC: ceph-users@lists.ceph.com; ceph-de...@vger.kernel.org > > Subject: Re: [ceph-users] Write performance issue under rocksdb kvstore > > > > On Tue, 20 Oct 2015, Z Zhang wrote: > > > Hi Guys, > > > > > > I am trying latest ceph-9.1.0 with rocksdb 4.1 and ceph-9.0.3 with > > > rocksdb 3.11 as OSD backend. I use rbd to test performance and following > > > is my cluster info.
> > > > > > [ceph@xxx ~]$ ceph -s > > > cluster b74f3944-d77f-4401-a531-fa5282995808 > > > health HEALTH_OK > > > monmap e1: 1 mons at {xxx=xxx.xxx.xxx.xxx:6789/0} > > > election epoch 1, quorum 0 xxx > > > osdmap e338: 44 osds: 44 up, 44 in > > > flags sortbitwise > > > pgmap v1476: 2048 pgs, 1 pools, 158 MB data, 59 objects > > > 1940 MB used, 81930 GB / 81932 GB avail > > > 2048 active+clean > > > > > > All the disks are spinning ones with write cache turning on. Rocksdb's > > > WAL and sst files are on the same disk as every OSD. > > > > Are you using the KeyValueStore backend? > > > > > Using fio to generate following write load: > > > fio -direct=1 -rw=randwrite -ioengine=sync -size=10M -bs=4K > > > -group_reporting -directory /mnt/rbd_test/ -name xxx.1 -numjobs=1 > > > > > > Test result: > > > WAL enabled + sync: false + disk write cache: on will get ~700 IOPS. > > > WAL enabled + sync: true (default) + disk write cache: on|off will get > > > only ~25 IOPS. > > > > > > I tuned some other rocksdb options, but with no lock. > > > > The wip-newstore-frags branch sets some defaults for rocksdb that I think > > look pretty reasonable (at least given how newstore is using rocksdb). > > > > > I tracked down the rocksdb code and found each writer's Sync operation > > > would take ~30ms to finish. And as shown above, it is strange that > > > performance has no much difference no matters disk write cache is on or > > > off. > > > > > > Do your guys encounter the similar issue? Or do I miss something to > > > cause rocksdb's poor write performance? > > > > Yes, I saw the same thing. This PR addresses the problem and is nearing > > merge upstream: > > > > https://github.com/facebook/rocksdb/pull/746 > > > > There is also an XFS performance bug that is contributing to the problem, > > but it looks like Dave Chinner just put together a fix for that. > > > > But... we likely won't be using KeyValueStore in its current form over > > rocksdb (or any other kv backend). 
It stripes object data over key/value > > pairs, which IMO is not the best approach. > > > > sage > > > > ___ > > ceph-users mailing list > > ceph-users@lists.ceph.com > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Write performance issue under rocksdb kvstore
Thanks, Sage, for pointing out the PR and ceph branch. I will take a closer look. Yes, I am trying the KVStore backend. The reason we are trying it is that a few users don't have such a high requirement against occasional data loss. It seems the KVStore backend without a synchronized WAL could achieve better performance than filestore. And only data still in the page cache would get lost on a machine crash, not a process crash, if we use the WAL but no synchronization. What do you think? Thanks. Zhi Zhang (David) Date: Tue, 20 Oct 2015 05:47:44 -0700 From: s...@newdream.net To: zhangz.da...@outlook.com CC: ceph-users@lists.ceph.com; ceph-de...@vger.kernel.org Subject: Re: [ceph-users] Write performance issue under rocksdb kvstore On Tue, 20 Oct 2015, Z Zhang wrote: > Hi Guys, > > I am trying latest ceph-9.1.0 with rocksdb 4.1 and ceph-9.0.3 with > rocksdb 3.11 as OSD backend. I use rbd to test performance and following > is my cluster info. > > [ceph@xxx ~]$ ceph -s > cluster b74f3944-d77f-4401-a531-fa5282995808 > health HEALTH_OK > monmap e1: 1 mons at {xxx=xxx.xxx.xxx.xxx:6789/0} > election epoch 1, quorum 0 xxx > osdmap e338: 44 osds: 44 up, 44 in > flags sortbitwise > pgmap v1476: 2048 pgs, 1 pools, 158 MB data, 59 objects > 1940 MB used, 81930 GB / 81932 GB avail > 2048 active+clean > > All the disks are spinning ones with write cache turned on. Rocksdb's > WAL and sst files are on the same disk as every OSD. Are you using the KeyValueStore backend? > Using fio to generate following write load: > fio -direct=1 -rw=randwrite -ioengine=sync -size=10M -bs=4K -group_reporting -directory /mnt/rbd_test/ -name xxx.1 -numjobs=1 > > Test result: > WAL enabled + sync: false + disk write cache: on will get ~700 IOPS. > WAL enabled + sync: true (default) + disk write cache: on|off will get only ~25 IOPS. > > I tuned some other rocksdb options, but with no luck.
The wip-newstore-frags branch sets some defaults for rocksdb that I think look pretty reasonable (at least given how newstore is using rocksdb). > I tracked down the rocksdb code and found each writer's Sync operation > would take ~30ms to finish. And as shown above, it is strange that > performance has not much difference no matter whether the disk write cache is on or > off. > > Do you guys encounter a similar issue? Or do I miss something to > cause rocksdb's poor write performance? Yes, I saw the same thing. This PR addresses the problem and is nearing merge upstream: https://github.com/facebook/rocksdb/pull/746 There is also an XFS performance bug that is contributing to the problem, but it looks like Dave Chinner just put together a fix for that. But... we likely won't be using KeyValueStore in its current form over rocksdb (or any other kv backend). It stripes object data over key/value pairs, which IMO is not the best approach. sage
[ceph-users] FW: Long tail latency due to journal aio io_submit takes long time to return
FW to ceph-users. Thanks. Zhi Zhang (David) From: zhangz.da...@outlook.com To: ceph-de...@vger.kernel.org Subject: Long tail latency due to journal aio io_submit takes long time to return Date: Tue, 25 Aug 2015 18:46:34 +0800 Hi Ceph-devel, Recently we found some (though not many) long-tail latencies caused by the journal's aio io_submit taking a long time to return. We added some logs in FileJournal.cc and got the following output as an example. AIO's io_submit function takes ~400ms to return, occasionally even up to 900ms. 2015-08-25 15:24:39.535206 7f22699c2700 20 journal write_thread_entry woke up 2015-08-25 15:24:39.535212 7f22699c2700 10 journal room 10460594175 max_size 10484711424 pos 4786679808 header.start 4762566656 top 4096 2015-08-25 15:24:39.535215 7f22699c2700 10 journal check_for_full at 4786679808 : 532480 10460594175 2015-08-25 15:24:39.535216 7f22699c2700 15 journal prepare_single_write 1 will write 4786679808 : seq 13075224 len 526537 - 532480 (head 40 pre_pad 4046 ebl 526537 post_pad 1817 tail 40) (ebl alignment 4086) 2015-08-25 15:24:39.535322 7f22699c2700 20 journal prepare_multi_write queue_pos now 4787212288 2015-08-25 15:24:39.535324 7f22699c2700 15 journal do_aio_write writing 4786679808~532480 2015-08-25 15:24:39.535434 7f22699c2700 10 journal align_bl total memcopy: 532480 2015-08-25 15:24:39.535439 7f22699c2700 20 journal write_aio_bl 4786679808~532480 seq 13075224 2015-08-25 15:24:39.535442 7f22699c2700 20 journal write_aio_bl .. 4786679808~532480 in 1 2015-08-25 15:24:39.535444 7f22699c2700 5 journal io_submit starting... 2015-08-25 15:24:39.990372 7f22699c2700 5 journal io_submit return value: 1 2015-08-25 15:24:39.990396 7f22699c2700 5 journal put_throttle finished 1 ops and 526537 bytes, now 12 ops and 6318444 bytes 2015-08-25 15:24:39.990401 7f22691c1700 20 journal write_finish_thread_entry waiting for aio(s) 2015-08-25 15:24:39.990406 7f22699c2700 20 journal write_thread_entry aio throttle: aio num 1 bytes 532480 ... exp 2 min_new 4 ...
pending 6318444 2015-08-25 15:24:39.990410 7f22699c2700 10 journal room 10460061695 max_size 10484711424 pos 4787212288 header.start 4762566656 top 4096 2015-08-25 15:24:39.990413 7f22699c2700 10 journal check_for_full at 4787212288 : 532480 10460061695 2015-08-25 15:24:39.990415 7f22699c2700 15 journal prepare_single_write 1 will write 4787212288 : seq 13075225 len 526537 - 532480 (head 40 pre_pad 4046 ebl 526537 post_pad 1817 tail 40) (ebl alignment 4086) And write_aio_bl will hold aio_lock until it returns, so write_finish_thread_entry can't process completed aios while io_submit is blocking. We reduced aio_lock's locking scope in write_aio_bl, specifically not holding it across io_submit. It doesn't seem necessary to hold the lock across io_submit in the journal, because there is only one write_thread_entry and one write_finish_thread_entry. This might help completed aios get processed as soon as possible. But the one blocked in io_submit and the ones in the queue still take more time. Any idea about this? Thanks, Zhi Zhang (David)
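The narrowed-lock idea reads like this in toy threading terms — a sketch, not the FileJournal code: bookkeeping stays under aio_lock, but the blocking io_submit (a sleep stand-in here) runs outside it, so the finish thread is never stalled behind a 400 ms submit:

```python
import collections
import threading
import time

aio_lock = threading.Lock()
completed = collections.deque()        # completed aios, guarded by aio_lock
processed = []

def slow_io_submit():
    time.sleep(0.2)                    # stands in for a blocking io_submit()

def write_thread():
    with aio_lock:
        pass                           # pending-aio queue bookkeeping here
    slow_io_submit()                   # lock released: finisher keeps running
    with aio_lock:
        completed.append("aio-1")      # record the completion

def write_finish_thread(stop):
    while not stop.is_set():
        with aio_lock:                 # never waits out the submit duration
            while completed:
                processed.append(completed.popleft())
        time.sleep(0.01)

stop = threading.Event()
finisher = threading.Thread(target=write_finish_thread, args=(stop,))
writer = threading.Thread(target=write_thread)
finisher.start()
writer.start()
writer.join()
time.sleep(0.05)
stop.set()
finisher.join()
with aio_lock:                         # drain anything left, just in case
    while completed:
        processed.append(completed.popleft())
print(processed)                       # ['aio-1']
```

This is safe with exactly one submitter and one finisher, as the mail argues; with multiple submitters, calling io_submit outside the lock could reorder submissions against the queue bookkeeping, which is the trade-off to check.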
Re: [ceph-users] which kernel version can help avoid kernel client deadlock
Hi Ilya, We just tried the 3.10.83 kernel with more rbd fixes back-ported from higher kernel versions. This time we again tried to run rbd and 3 OSD daemons on the same node, but rbd IO will still hang and the OSD filestore thread will time out and suicide when memory becomes very low under high load. When this happened, enabling the rbd log could even make the system unresponsive, so we haven't collected logs yet. Only the osd log says the filestore thread was doing filestore::_write before the time-out; we will look into it further. I know this is not an appropriate usage of Ceph and rbd, but I still want to ask if there is a way or workaround to do this? Is there any successful case in the community? Thanks. David Zhang From: zhangz.da...@outlook.com To: idryo...@gmail.com Date: Fri, 31 Jul 2015 09:21:40 +0800 CC: ceph-users@lists.ceph.com; chaofa...@owtware.com Subject: Re: [ceph-users] which kernel version can help avoid kernel client deadlock Date: Thu, 30 Jul 2015 13:11:11 +0300 Subject: Re: [ceph-users] which kernel version can help avoid kernel client deadlock From: idryo...@gmail.com To: zhangz.da...@outlook.com CC: chaofa...@owtware.com; ceph-users@lists.ceph.com On Thu, Jul 30, 2015 at 12:46 PM, Z Zhang zhangz.da...@outlook.com wrote: Date: Thu, 30 Jul 2015 11:37:37 +0300 Subject: Re: [ceph-users] which kernel version can help avoid kernel client deadlock From: idryo...@gmail.com To: zhangz.da...@outlook.com CC: chaofa...@owtware.com; ceph-users@lists.ceph.com On Thu, Jul 30, 2015 at 10:29 AM, Z Zhang zhangz.da...@outlook.com wrote: Subject: Re: [ceph-users] which kernel version can help avoid kernel client deadlock From: chaofa...@owtware.com Date: Thu, 30 Jul 2015 13:16:16 +0800 CC: idryo...@gmail.com; ceph-users@lists.ceph.com To: zhangz.da...@outlook.com On Jul 30, 2015, at 12:48 PM, Z Zhang zhangz.da...@outlook.com wrote: We also hit a similar issue from time to time on centos with a 3.10.x kernel.
By iostat, we can see the kernel rbd client's util is 100% but no r/w IO, and we can't umount/unmap this rbd client. After restarting the OSDs, it becomes normal.

3.10.x is rather vague, what is the exact version you saw this on? Can you provide syslog logs (I'm interested in dmesg)?

The kernel version should be 3.10.0. I don't have syslogs at hand. It is not easily reproduced, and it happened in a very low memory situation. We are running DB instances over rbd as storage. DB instances use a lot of memory when running highly concurrent reads/writes, and after running for a long time rbd might hit this problem, but not always. Enabling the rbd log makes our system behave strangely during our tests. I back-ported one of your fixes:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drivers/block/rbd.c?id=5a60e87603c4c533492c515b7f62578189b03c9c

So far the test looks fine after a few days, but it is still under observation. So I want to know if there are some other fixes?

I'd suggest following the 3.10 stable series (currently at 3.10.84). The fix you backported is crucial in low memory situations, so I wouldn't be surprised if it alone fixed your problem. (It is not in 3.10.84; I assume it'll show up in 3.10.85 - for now just apply your backport.)

Cool, looking forward to 3.10.85 to see what else it brings in. Thanks.

Thanks,
Ilya

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] which kernel version can help avoid kernel client deadlock
We also hit a similar issue from time to time on CentOS with the 3.10.x kernel. By iostat, we can see the kernel rbd client's util is 100% but no r/w IO, and we can't umount/unmap this rbd client. After restarting the OSDs, it becomes normal.

@Ilya, could you please point us to the possible fixes in 3.18.19 for this issue? Then we can try to back-port them to our old kernel, because we can't jump to a major kernel version. Thanks.

David Zhang

From: chaofa...@owtware.com
Date: Thu, 30 Jul 2015 10:30:12 +0800
To: idryo...@gmail.com
CC: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] which kernel version can help avoid kernel client deadlock

On Jul 29, 2015, at 12:40 AM, Ilya Dryomov <idryo...@gmail.com> wrote:

On Tue, Jul 28, 2015 at 7:20 PM, van <chaofa...@owtware.com> wrote:

On Jul 28, 2015, at 7:57 PM, Ilya Dryomov <idryo...@gmail.com> wrote:

On Tue, Jul 28, 2015 at 2:46 PM, van <chaofa...@owtware.com> wrote:

Hi, Ilya,

In the dmesg there are also a lot of libceph socket errors, which I think may be caused by my stopping the ceph service without unmapping rbd.

Well, sure enough, if you kill all OSDs, the filesystem mounted on top of the rbd device will get stuck.

Sure, it will get stuck if OSDs are stopped. And since rados requests have a retry policy, the stuck requests recover after I start the daemons again. But in my case the OSDs are running in a normal state and the librbd API can read/write normally, while a heavy fio test against the filesystem mounted on top of the rbd device gets stuck. I wonder if this is triggered by running the rbd kernel client on machines that also host ceph daemons, i.e. the annoying loopback mount deadlock issue.

In my opinion, if it's due to the loopback mount deadlock, the OSDs would become unresponsive no matter whether the requests come from user space (like the API) or from the kernel client. Am I right?

Not necessarily.

If so, my case seems to be triggered by another bug. Anyway, it seems that I should at least separate clients and daemons.

Try 3.18.19 if you can.
I'd be interested in your results.

It's strange: after I dropped the page cache and restarted my OSDs, the same heavy IO tests on the rbd folder now work fine. The deadlock seems not that easy to trigger; maybe I need longer tests. I'll try 3.18.19 LTS, thanks.

Thanks,
Ilya

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
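When a mapped rbd device wedges like this (100% util in iostat but no completions), the kernel client's in-flight OSD requests can be inspected through debugfs before restarting anything. A minimal sketch, assuming debugfs is mounted at the usual location and the script runs as root:

```shell
#!/bin/sh
# List in-flight OSD requests for every kernel ceph/rbd client on this host.
# A non-empty osdc file while iostat shows 100% util and no r/w completions
# points at stuck requests rather than a genuinely busy device.
dbg=/sys/kernel/debug/ceph
if [ -d "$dbg" ]; then
    for c in "$dbg"/*/; do
        [ -e "${c}osdc" ] || continue
        echo "== client ${c} =="
        cat "${c}osdc"      # one line per outstanding OSD request
    done
else
    echo "no kernel ceph clients found (or debugfs not mounted)"
fi
```

The osdc file is per-client, so on a host mapping several images it shows which cluster session is stuck.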
[ceph-users] Timeout mechanism in ceph client tick
Hi Guys,

Reading through the ceph client code, there is a timeout mechanism in tick() that is used while mounting. Recently, during massive tests against CephFS, we saw some client requests to the MDS take a long time to get a reply. If we want a CephFS user to see a timeout instead of waiting indefinitely for the reply, can we make the following change to leverage that timeout mechanism? Is there any other concern that explains why the client request timeout is not enabled in tick() after the mount succeeds? If this works, we can either reuse client_mount_timeout or introduce another parameter for the normal client request timeout value.

diff --git a/src/client/Client.cc b/src/client/Client.cc
index 63a9faa..91a8cc9 100644
--- a/src/client/Client.cc
+++ b/src/client/Client.cc
@@ -4850,7 +4850,7 @@ void Client::tick()
   utime_t now = ceph_clock_now(cct);
-  if (!mounted && !mds_requests.empty()) {
+  if (!mds_requests.empty()) {
     MetaRequest *req = mds_requests.begin()->second;
     if (req->op_stamp + cct->_conf->client_mount_timeout < now) {
       req->aborted = true;

Thanks.
David Zhang

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
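For context, the knob the patch reuses is an ordinary ceph.conf option; a sketch of how it would be set (the value shown is believed to be the shipped default, not a tuning recommendation):

```ini
[client]
# Today this bounds only the mount itself; the patch above would make it
# also bound every in-flight MDS request after mount. Value in seconds.
client_mount_timeout = 300
```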
Re: [ceph-users] krbd splitting large IO's into smaller IO's
Hi Ilya,

Thanks for your explanation. This makes sense. Will you make max_segments configurable? Could you please point me at the fix you have made? We might help to test it. Thanks.

David Zhang

Date: Fri, 26 Jun 2015 18:21:55 +0300
Subject: Re: [ceph-users] krbd splitting large IO's into smaller IO's
From: idryo...@gmail.com
To: zhangz.da...@outlook.com
CC: ceph-users@lists.ceph.com

On Fri, Jun 26, 2015 at 3:17 PM, Z Zhang <zhangz.da...@outlook.com> wrote:

Hi Ilya,

I saw your recent email about krbd splitting large IO's into smaller IO's, see the link below.

https://www.mail-archive.com/ceph-users@lists.ceph.com/msg20587.html

I just tried it on my ceph cluster using kernel 3.10.0-1. I adjusted both max_sectors_kb and max_hw_sectors_kb of the rbd device to 4096.

fio with 4M block size, read:

Device:  rrqm/s  wrqm/s    r/s    w/s  rMB/s  wMB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
rbd3      81.00    0.00 135.00   0.00 108.00   0.00  1638.40     2.72  20.15   20.15    0.00   7.41 100.00

fio with 1M or 2M block size, read:

rbd3       0.00    0.00 213.00   0.00 106.50   0.00  1024.00     2.56  12.02   12.02    0.00   4.69 100.00

fio with 4M block size, write:

rbd3       0.00   40.00   0.00  40.00   0.00  40.00  2048.00     2.87  70.90    0.00   70.90  24.90  99.60

fio with 1M or 2M block size, write:

rbd3       0.00    0.00   0.00  80.00   0.00  40.00  1024.00     3.55  48.20    0.00   48.20  12.50 100.00

So why is the IO size (avgrq-sz) here far less than the equivalent of 4096 KB? (With the default value of 512, all the IO sizes are 1024 sectors.) Are there other parameters to adjust, or is it about this kernel version?

It's about this kernel version.
Assuming you are doing direct I/Os with fio, setting max_sectors_kb to 4096 is really the only thing you can do, and that's enough to *sometimes* see 8192-sector (i.e. 4M) I/Os. The problem is the max_segments value, which in 3.10 is 128 and which you cannot adjust via sysfs. It all comes down to the memory allocator: to get a 4M I/O, the total number of segments (physically contiguous chunks of memory) in the 8 bios (8*512k = 4M) that need to be merged has to be <= 128. When you are allocated such nice and contiguous bios, you get 4M I/Os; in other cases you don't.

This will be fixed in 4.2, along with a bunch of other things. This particular max_segments fix is a one-liner, so we will probably backport it to older kernels, including 3.10.

Thanks,
Ilya

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
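The tunables discussed above live under the device's queue directory in sysfs; a minimal sketch of checking and raising them (the device name rbd3 is taken from the iostat output in this thread and is an assumption about your mapping):

```shell
#!/bin/sh
# Show and raise the per-request size cap for an rbd block device.
dev=rbd3
q=/sys/block/$dev/queue
if [ -d "$q" ]; then
    echo "max_sectors_kb:    $(cat "$q/max_sectors_kb")"
    echo "max_hw_sectors_kb: $(cat "$q/max_hw_sectors_kb")"
    echo "max_segments:      $(cat "$q/max_segments")"  # read-only on 3.10
    # 4096 KB = 8192 sectors = one full 4M rbd object per I/O (needs root)
    echo 4096 > "$q/max_sectors_kb"
else
    echo "$dev is not mapped on this host"
fi
```

Even with max_sectors_kb raised, the max_segments=128 ceiling described above still decides whether the 512k bios actually merge into a single 4M request.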
[ceph-users] CephFS client issue
On Monday, June 15, 2015 3:05 AM, ceph-users-requ...@lists.ceph.com wrote:

Message: 1
Date: Sat, 13 Jun 2015 21:08:25 +0200
From: Paweł Sadowski <c...@sadziu.pl>
To: Gregory Farnum <g...@gregs42.com>
Cc: ceph-users <ceph-us...@ceph.com>
Subject: Re: [ceph-users] Erasure coded pools and bit-rot protection

Thanks for taking care of this so fast. Yes, I'm getting a broken object. I haven't checked this on other versions, but is this bug present only in Hammer or in all versions?

On 12.06.2015 at 21:43, Gregory Farnum wrote:

Okay, Sam thinks he knows what's going on; here's a ticket: http://tracker.ceph.com/issues/12000

On Fri, Jun 12, 2015 at 12:32 PM, Gregory Farnum <g...@gregs42.com> wrote:

On Fri, Jun 12, 2015 at 1:07 AM, Paweł Sadowski <c...@sadziu.pl> wrote:

Hi All,

I'm testing erasure coded pools. Is there any protection from bit-rot errors on object read? If I modify one bit in an object part (directly on the OSD) I'm getting a *broken* object:

Sorry, are you saying that you're getting a broken object if you flip a bit in an EC pool?
That should detect the chunk as invalid and reconstruct on read...
-Greg

mon-01:~ # rados --pool ecpool get `hostname -f`_16 - | md5sum
bb2d82bbb95be6b9a039d135cc7a5d0d  -

# modify one bit directly on the OSD

mon-01:~ # rados --pool ecpool get `hostname -f`_16 - | md5sum
02f04f590010b4b0e6af4741c4097b4f  -

# restore the bit to its original value

mon-01:~ # rados --pool ecpool get `hostname -f`_16 - | md5sum
bb2d82bbb95be6b9a039d135cc7a5d0d  -

If I run deep-scrub on the modified bit I'm getting an inconsistent PG, which is correct in this case. After restoring the bit and running deep-scrub again, all PGs are clean.

[ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)]

--
PS

Message: 2
Date: Sun, 14 Jun 2015 15:26:54 +
From: Matteo Dacrema <mdacr...@enter.it>
To: ceph-users <ceph-us...@ceph.com>
Subject: [ceph-users] CephFS client issue

Hi all,

I'm using CephFS on Hammer, and sometimes I need to reboot one or more clients because, as ceph -s tells me, they are failing to respond to capability release. After that, all clients stop responding: they can't access files or mount/umount cephfs.

I have 1.5 million files, 2 metadata servers in active/standby configuration with 8 GB of RAM, 20 clients with 2 GB of RAM each, and 2 OSD nodes with 4 80GB OSDs and 4 GB of RAM each.
Here is my configuration:

[global]
fsid = 2de7b17f-0a3e-4109-b878-c035dd2f7735
mon_initial_members = cephmds01
mon_host = 10.29.81.161
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
public network = 10.29.81.0/24
tcp nodelay = true
tcp rcvbuf = 0
ms tcp read timeout = 600

# Capacity
mon osd full ratio = .95
mon osd nearfull ratio = .85

[osd]
osd journal size = 1024
journal dio = true
journal aio = true
osd op threads = 2
osd op thread timeout = 60
osd disk threads = 2
osd recovery threads = 1
osd recovery max active = 1
osd max backfills = 2

# Pool
osd pool default size = 2

# XFS
osd mkfs type = xfs
osd mkfs options xfs = -f -i size=2048
osd mount options xfs = rw,noatime,inode64,logbsize=256k,delaylog

# FileStore settings
filestore xattr use omap = false
filestore max inline xattr size = 512
filestore max sync interval = 10
filestore merge threshold = 40
filestore split multiple = 8
filestore flusher = false
filestore queue max ops = 2000
filestore queue max bytes = 536870912
filestore queue committing max ops = 500
filestore queue committing max bytes =
Re: [ceph-users] rbd format v2 support
Hi Ilya,

Thanks for the reply. I knew that a v2 image can be mapped when using default striping parameters, i.e. without --stripe-unit or --stripe-count. It is just that the rbd performance (IOPS/bandwidth) we tested hasn't met our goal. At this point the OSDs don't seem to be the bottleneck, so we want to try fancy striping. Do you know if there is an approximate ETA for this feature? Or it would be great if you could share some info on tuning rbd performance. Anything will be appreciated. Thanks.

Zhi (David)

On Sunday, June 7, 2015 3:50 PM, Ilya Dryomov <idryo...@gmail.com> wrote:

On Fri, Jun 5, 2015 at 6:47 AM, David Z <david.z1...@yahoo.com> wrote:

Hi Ceph folks,

We want to use rbd format v2, but find it is not supported on kernel 3.10.0 of CentOS 7:

[ceph@ ~]$ sudo rbd map zhi_rbd_test_1
rbd: sysfs write failed
rbd: map failed: (22) Invalid argument
[ceph@ ~]$ dmesg | tail
[662453.664746] rbd: image zhi_rbd_test_1: unsupported stripe unit (got 8192 want 4194304)

As described in the ceph doc, it should be available from kernel 3.11. But I checked the code of kernels 3.12, 3.14 and even 4.1, and this piece of code is still there, see the links below. Am I missing some code or info?

What you are referring to is called fancy striping, and it is unsupported (work is underway, but it's been slow going). However, because v2 images with *default* striping parameters are, disk-format-wise, the same as v1 images, you can map a v2 image provided you didn't specify a custom --stripe-unit or --stripe-count on rbd create.

Thanks,
Ilya

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
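Put together, the workable path on these kernels is a format 2 image with default striping. A hedged sketch (pool and image names are placeholders, and it assumes a reachable cluster with the rbd CLI of that era installed):

```shell
#!/bin/sh
# Create and map a format 2 image the pre-4.x kernel client can handle:
# omit --stripe-unit/--stripe-count so default striping is used.
command -v rbd >/dev/null 2>&1 || { echo "rbd CLI not installed"; exit 0; }
pool=rbd
img=v2test
rbd create "$pool/$img" --size 1024 --image-format 2
sudo rbd map "$pool/$img"
# By contrast, custom striping would be refused by the kernel client, e.g.:
#   rbd create "$pool/fancy" --size 1024 --image-format 2 \
#       --stripe-unit 8192 --stripe-count 16
#   sudo rbd map "$pool/fancy"   # fails: unsupported stripe unit
```

The commented-out variant reproduces exactly the dmesg error quoted above.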
[ceph-users] rbd format v2 support
Hi Ceph folks,

We want to use rbd format v2, but find it is not supported on kernel 3.10.0 of CentOS 7:

[ceph@ ~]$ sudo rbd map zhi_rbd_test_1
rbd: sysfs write failed
rbd: map failed: (22) Invalid argument
[ceph@ ~]$ dmesg | tail
[662453.664746] rbd: image zhi_rbd_test_1: unsupported stripe unit (got 8192 want 4194304)

As described in the ceph doc, it should be available from kernel 3.11. But I checked the code of kernels 3.12, 3.14 and even 4.1, and this piece of code is still there, see the links below. Am I missing some code or info?

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/drivers/block/rbd.c?id=refs/tags/v4.1-rc6#n4352
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/drivers/block/rbd.c?id=refs/tags/v4.1-rc6#n4359

I also found an email where Sage mentioned that a commit 764684ef34af685cd8d46830a73826443f9129df should resolve such a problem, but I didn't find the commit's details. Does anyone know about it?

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-June/031897.html

Thanks a lot!

Regards,
Zhi (David)

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] The strategy of auto-restarting crashed OSD
Hi Guys,

We have been experiencing some OSD crashes recently: messenger crashes, some strange crashes (still being investigated), etc. These crashes seem not to reproduce after restarting the OSD. So we are considering a strategy of auto-restarting a crashed OSD 1 or 2 times, then leaving it down if restarting doesn't work. This strategy might reduce the impact of PG peering and recovery on online traffic to some extent, since we won't mark an OSD out automatically even if it is down, unless we are sure it is a disk failure. However, we are also aware that this strategy may bring some problems. Since you guys have more experience with Ceph, we would like to hear some suggestions from you. Thanks.

David Zhang

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
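On systemd-managed hosts, a "restart once or twice, then leave it down" policy can be expressed declaratively instead of with a watchdog script. A sketch of a drop-in, assuming the stock ceph-osd@.service unit (the limits shown are illustrative, and directive names vary slightly across systemd versions):

```ini
# /etc/systemd/system/ceph-osd@.service.d/restart.conf (hypothetical drop-in)
[Service]
Restart=on-failure
RestartSec=10
# At most 2 restarts within 10 minutes; after that systemd gives up and
# the OSD stays down until an operator investigates. On older systemd
# versions these appear as StartLimitInterval/StartLimitBurst.
StartLimitIntervalSec=600
StartLimitBurst=2
```

This keeps the "don't mark out automatically" decision with the monitor settings, while the restart budget is handled per-daemon by the init system.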
Re: [ceph-users] CephFS - limit space available to individual users.
Thanks Wido,

I have to admit it's slightly disappointing (but completely understandable), since it basically means it's not safe for us to use CephFS :(

Without user quotas, it would be sufficient to have multiple CephFS filesystems and to be able to set the size of each one. Is it part of the core design that there can only be one filesystem in the cluster? This seems like a 'single point of failure'.

The talks about quotas were indeed about user quotas, but nothing about enforcing them. The first step is to do accounting, and maybe in a later stage soft and hard enforcement can be added. I don't think it's on the roadmap currently.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] CephFS - limit space available to individual users.
I have been testing CephFS on our computational cluster of about 30 computers. I've got 4 machines, 4 disks, 4 OSDs, 4 mons and 1 MDS at the moment for testing. The testing has been going very well, apart from one problem that needs to be resolved before we can use Ceph in place of our existing 'system' of NFS exports.

Our users run simulations that are easily capable of writing out data at a rate limited only by the storage device, and these jobs often run for days or weeks unattended. This unfortunately means that with CephFS, if a user doesn't set up their simulation carefully enough, or if their code has some bug, they can fill the entire filesystem (shared by around 10 other users) in about a day, leaving no room for any other users and potentially crashing the entire cluster.

I've read the FAQ entry about quotas but I'm not sure what to make of it. Is it correct that you can only have one CephFS per cluster? I guess I was imagining creating a separate filesystem of known size for each user.

Any help would be greatly appreciated, and since this is my first message, thanks to Sage and everyone else involved for creating this excellent project!

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com