question about futex calls in OSD
Hi, This is Xing, a graduate student at the University of Utah. I am playing with Ceph and have a few questions about futex calls in the OSD process. I am using Ceph v0.79. The data node machine has 7 disks: one for the OS and the other 6 for running 6 OSDs. I set replica size = 1 for all three pools, to improve sequential read bandwidth: since Ceph does replication synchronously, with replica size = 2 on two OSDs it lays out data in a fashion similar to RAID 10 with the near format. Each OSD stores exactly the same copy of the whole data, but during read-back only half of the blocks are read from each OSD: it reads one block, seeks over the next block, and then reads the third block. Ideally, to get full disk bandwidth, we should lay out data as in the RAID 10 far format. I do not know how to do that in Ceph and thus simply tried replica size = 1. I created an rbd block device, initialized it with an ext4 fs, stored a 10GB file into it, and then read it back sequentially. I set the read_ahead to 128M or 1G, to get the maximum bandwidth for a single rbd block device (I know this is not realistic). While these read requests were being served, I used strace to capture all system calls for an OSD thread that actually does reads. I observed lots of futex system calls. By specifying -T for strace, it reports the time spent in each system call; I summed these up to get the total time spent in each system call, and further broke down the two futex calls that contribute most of the overhead.

--
syscall                       runtime (s)   call num   average (s)
pread                           11.1028         420      0.026435
fgetxattr                        0.178381      1680      0.000106
futex                            8.83125       5217

total runtime: 21 s

futex(WAIT_PRIVATE)              4.97166       1415      0.003513
futex(WAIT_BITSET_PRIVATE)       3.79            51      0.0743
--

The overhead of locking seems to be quite high, and I imagine it could become worse as I increase the number of workloads. I was wondering why there are so many futex calls and took some time to look into it. It appears to me that there are three locks used during the read path in the OSD. These locks are Mutexes, and I believe the futex calls I observed in the strace output are the result of operations on these Mutexes:

a. snapset_contexts_lock, used in functions such as get_snapset_context() and put_snapset_context()
b. fdcache_lock, used in lfn_open() and such
c. ondisk_read_lock, used in execute_ctx()

The one that matters most is the snapset_contexts_lock: it appears to be a monolithic lock controlling access either to all object files in a snapset, or to all object files within the same placement group that belong to the same snapset (I am not sure what a 'snapset' is; it sounds equivalent to a 'snapshot'). To read a block from an object file, the OSD first needs to read two extended attributes (OI_ATTR and SS_ATTR) for that file. For each read of these attributes, the snapset_contexts_lock seems to be involved: the SS_ATTR attribute is read inside get_snapset_context(); the OI_ATTR attribute is read in get_object_context(), which can be called from find_object_context(). In find_object_context() I can also find a few calls to get_snapset_context(). Releasing a snapset context also takes the snapset_contexts_lock (in put_snapset_context()). Here are a few questions. 1. What is the snapset_contexts_lock used for? Is it used to control access to all files in a snapset, or to all files in the same placement group that belong to the same snapset? Such a big-lock design does not seem to scale.
Any comments? 2. Has anyone noticed the overhead of this locking? I tried to remove the lock in get_snapset_context() and put_snapset_context() by commenting out the Mutex instantiation statement, but that left the system not working: I could not list/mount rbd block devices. Thanks very much for reading this. Any comment is welcome. Thanks, Xing
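For anyone curious about the connection between Mutex and futex: below is a minimal standalone sketch (my own toy program, not Ceph code) showing how a contended pthread mutex, which Ceph's Mutex wraps, surfaces as futex(FUTEX_WAIT_PRIVATE) and futex(FUTEX_WAKE_PRIVATE) under strace. An uncontended lock/unlock is handled entirely in userspace and makes no system call at all, which is why the futex counts above are effectively a direct measure of contention.

---
/* futex_demo.c: compile with `gcc -std=gnu99 -O2 -pthread futex_demo.c`,
 * then run `strace -cf ./a.out` and look at the futex counts. */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long counter;

static void *worker(void *arg)
{
	for (int i = 0; i < 1000000; i++) {
		pthread_mutex_lock(&lock);   /* contended -> futex(FUTEX_WAIT_PRIVATE) */
		counter++;
		pthread_mutex_unlock(&lock); /* waiters present -> futex(FUTEX_WAKE_PRIVATE) */
	}
	return NULL;
}

int main(void)
{
	pthread_t t[4];
	for (int i = 0; i < 4; i++)
		pthread_create(&t[i], NULL, worker, NULL);
	for (int i = 0; i < 4; i++)
		pthread_join(t[i], NULL);
	printf("counter = %ld\n", counter);
	return 0;
}
---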
bandwidth with Ceph - v0.59 (Bobtail)
Hi, I also did a few other experiments, trying to find the maximum bandwidth we can get from each data disk. The result is not encouraging: from disks that can provide 150 MB/s of block-level sequential read bandwidth, we can only get about 90 MB/s each. Something particularly interesting is that the replica size also affects the bandwidth we can get from the cluster. There do not seem to be such observations/conversations in the Ceph community yet, so I think it may be helpful to share my findings. The experiment was run with two d820 machines in Emulab at the University of Utah. One is used as the data node and the other as the client. They are connected by 10 Gb/s Ethernet. The data node has 7 disks: one for the OS and the remaining 6 for OSDs. Of those 6 disks, each OSD uses one for its journal and one for data, so in total we have 3 OSDs. The network bandwidth is sufficient to support reading from 3 disks at full bandwidth. I varied the read-ahead size for the rbd block device (exp1), the number of osd op threads for each OSD (exp2), the replica size (exp3), and the object size (exp4). The most interesting is varying the replica size. As I varied the replica size from 1 to 2 to 3, the aggregate bandwidth dropped from 267 MB/s to 211 MB/s and then 180 MB/s. The reason for the drop, I believe, is that as we increase the number of replicas, we store more data on each OSD; when we then need to read it back, we have to read from a larger range (more seeks). The fundamental problem is likely that we do replication synchronously and thus lay out object files in a RAID 10 near format rather than the far format. For the difference between the near and far formats of RAID 10, have a look at the link below (a toy sketch of the two layouts also follows at the end of this message). http://lxr.free-electrons.com/source/Documentation/device-mapper/dm-raid.txt For results of the other experiments, you can download my slides at the link below. http://www.cs.utah.edu/~xinglin/slides/ceph-bandiwdth-exp.pptx I do not know why Ceph only gets about 60% of the disk bandwidth. As a comparison, I ran tar to read every rbd object file to create a tarball and measured how much bandwidth I could get from that workload. Interestingly, the tar workload actually gets higher bandwidth (80% of block-level bandwidth), even though it accesses the disk more randomly (tar reads each object file in a dir sequentially, while the object files were created in a different order). For more detail, please go to my blog and have a read. http://xinglin-system.blogspot.com/2014/04/ceph-lab-note-1-disk-read-bandwidth-in.html Here are a few questions. 1. What is the maximum bandwidth people can get from each disk? I found Jiangang from Intel also reported 57% efficiency for disk bandwidth. He suggested one reason: interference among many concurrent sequential read workloads. I agree, but even when I ran a single workload, I still did not get higher efficiency. 2. If the efficiency is about 60%, what causes it? Could it be the locks (the futex calls I mentioned in my previous email) or something else? Thanks very much for any feedback. Thanks, Xing
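To make the near-vs-far point concrete, here is a toy program (my own simplification, not md or Ceph code; it assumes 2 disks, 2 copies, and ignores the far layout's mirror half) that prints which disk and physical offset serves each logical chunk. In the near layout every disk holds every chunk, so a sequential read gives each disk a seek-skip pattern; in the far layout the primary copies form a RAID 0 stripe, so each disk streams contiguously:

---
#include <stdio.h>

int main(void)
{
	const int disks = 2, chunks = 8;

	puts("near2: copy of chunk c sits at offset c on every disk");
	for (int c = 0; c < chunks; c++)
		printf("  chunk %d <- disk %d, offset %d\n",
		       c, c % disks, c);          /* per disk: offsets 0,2,4,... (seeks) */

	puts("far2: primary copy of chunk c sits at offset c/disks on disk c%disks");
	for (int c = 0; c < chunks; c++)
		printf("  chunk %d <- disk %d, offset %d\n",
		       c, c % disks, c / disks);  /* per disk: offsets 0,1,2,... (streaming) */
	return 0;
}
---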
Re: bandwidth with Ceph - v0.59 (Bobtail)
Hi Gregory, Thanks very much for your quick reply. When I started to look into Ceph, Bobtail was the latest stable release; that is why I picked that version and started to make a few modifications. I have not ported my changes to 0.79 yet. The plan is that if v0.79 can provide higher disk bandwidth efficiency, I will switch to 0.79. Unfortunately, that does not seem to be the case. The futex trace was done with version 0.79, not 0.59. I did a profile on 0.59 too. There are some improvements, such as the introduction of the fd cache, but lots of futex calls are still there in v0.79. I also measured the maximum bandwidth we can get from each disk in version 0.79. It does not improve significantly: we still only get 90~100 MB/s from each disk. Thanks, Xing

On Apr 25, 2014, at 2:42 PM, Gregory Farnum g...@inktank.com wrote: Bobtail is really too old to draw any meaningful conclusions from; why did you choose it? That's not to say that performance on current code will be better (though it very much might be), but the internal architecture has changed in some ways that will be particularly important for the futex profiling you did, and are probably important for these throughput results as well. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com
Re: bandwidth with Ceph - v0.59 (Bobtail)
Hi Mark, That seems pretty good. What is the block-level sequential read bandwidth of your disks? What configuration did you use? What was the replica size and the read_ahead for your rbds, and how many workloads did you use? I used btrfs in my experiments as well. Thanks, Xing

On 04/25/2014 03:36 PM, Mark Nelson wrote: For what it's worth, I've been able to achieve up to around 120MB/s with btrfs before things fragment. Mark
Re: bandwidth with Ceph - v0.59 (Bobtail)
Hi Mark, Thanks for sharing this. I did read these blogs earlier. Looking at the aggregate bandwidth, 600-700 MB/s of reads from 6 disks is quite good. But considering it is shared among 256 concurrent read streams, each stream gets as little as 2-3 MB/s, which does not sound right. I would expect the read bandwidth of a disk to be close to its write bandwidth. But just to double-check: what sequential read bandwidth can your disks provide? I also read your follow-up blogs comparing bobtail and cuttlefish. One thing I cannot tell from those experiments is the per-disk limit, because they hit the network bottleneck well before being bottlenecked by the disks. Could you set up a smaller cluster (e.g. with 8 disks rather than 24), such that a 10 Gb/s link will not become the bottleneck, and then test how much disk bandwidth can be achieved, preferably with a recent release of Ceph? My other concern is that I am not sure how close RADOS bench results are to kernel RBD performance. I would appreciate it if you could do that. Thanks, Xing

On 04/25/2014 04:16 PM, Mark Nelson wrote: I don't have any recent results published, but you can see some of the older results from bobtail here: http://ceph.com/performance-2/argonaut-vs-bobtail-performance-preview/ Specifically, look at the 256 concurrent 4MB rados bench tests. In a 6 disk, 2 SSD configuration we could push about 800MB/s for writes (no replication) and around 600-700MB/s for reads with BTRFS. On this controller using a RAID0 configuration with WB cache helps quite a bit, but in other tests I've seen similar results with a 9207-8i that doesn't have WB cache when BTRFS filestores and SSD journals are used. Regarding the drives, they can do somewhere around 140-150MB/s large block writes with fio. Replication definitely adds additional latency so aggregate write throughput goes down, though it seems the penalty is worst after the first replica and doesn't hurt as much with subsequent ones. Mark
Re: bandwidth with Ceph - v0.59 (Bobtail)
One more thing to consider: as we add more sequential read workloads to a disk, the aggregate bandwidth starts to drop. For example, for the TOSHIBA MBF2600RC SCSI disk, we get 155 MB/s of sequential read bandwidth with a single workload. As we add more, the aggregate bandwidth drops:

number of SRs   aggregate bandwidth (KB/s)
 1              154145
 2              144296
 3              147994
 4              141063
 5              134698
 6              133874
 7              130915
 8              132366
 9               97068
10              111897
11              108508.5
12              106450.9
13              105521.9
14              102411.7
15              102618.2
16              102779.1
17              102745

As we can see, we only get about 100 MB/s per disk once we are running ~10 concurrent workloads. In your case, with 256 concurrent read streams against 6/8 disks, I would expect the aggregate bandwidth to be lower than 100 MB/s per disk. Any thoughts? Thanks, Xing
Re: [PATCH] test/libcephfs: free cmount after tests finishes
Hi Sage, Thanks for applying these two patches. I will try to accumulate more fixes and submit pull requests via github later. Thanks, Xing

On Nov 3, 2013, at 12:17 AM, Sage Weil s...@inktank.com wrote: Applied this one too! BTW, an easier workflow than sending patches to the list is to accumulate a batch of fixes in a branch and submit a pull request via github (at least if you're already a github user). Whichever works well for you. Thanks! s
Re: unable to compile
Hi all, It looks like it is because g++ does not support designated initializer lists (gcc supports them). http://stackoverflow.com/questions/18731707/why-does-c11-not-support-designated-initializer-list-as-c99 I am able to compile and run vstart.sh again after reverting the following commit by Noah: 6efc2b54d5ce85fcb4b66237b051bcbb5072e6a3 Noah, do you have any feedback? Thanks, Xing

On Nov 3, 2013, at 1:22 PM, Xing xing...@cs.utah.edu wrote: Hi all, I was able to compile and run vstart.sh yesterday, but after I merged the ceph master branch into my local master branch, I am not able to compile anymore. This issue seems to be related to boost, but I have installed the libboost1.46-dev package. Any suggestions? Thanks.

=== the compiler complains about the following code section ===
// file layouts
struct ceph_file_layout g_default_file_layout = {
 .fl_stripe_unit = init_le32(1 << 22),
 .fl_stripe_count = init_le32(1),
 .fl_object_size = init_le32(1 << 22),
 .fl_cas_hash = init_le32(0),
 .fl_object_stripe_unit = init_le32(0),
 .fl_unused = init_le32(-1),
 .fl_pg_pool = init_le32(-1),
};

===make errors===
common/config.cc:61:2: error: expected primary-expression before '.' token
common/config.cc:61:20: warning: extended initializer lists only available with -std=c++0x or -std=gnu++0x [enabled by default]
common/config.cc:62:2: error: expected primary-expression before '.' token
common/config.cc:62:21: warning: extended initializer lists only available with -std=c++0x or -std=gnu++0x [enabled by default]
common/config.cc:63:2: error: expected primary-expression before '.' token
common/config.cc:63:20: warning: extended initializer lists only available with -std=c++0x or -std=gnu++0x [enabled by default]
common/config.cc:64:2: error: expected primary-expression before '.' token
common/config.cc:64:17: warning: extended initializer lists only available with -std=c++0x or -std=gnu++0x [enabled by default]
common/config.cc:65:2: error: expected primary-expression before '.' token
common/config.cc:65:27: warning: extended initializer lists only available with -std=c++0x or -std=gnu++0x [enabled by default]
common/config.cc:66:2: error: expected primary-expression before '.' token
common/config.cc:66:15: warning: extended initializer lists only available with -std=c++0x or -std=gnu++0x [enabled by default]
common/config.cc:67:2: error: expected primary-expression before '.' token
common/config.cc:67:16: warning: extended initializer lists only available with -std=c++0x or -std=gnu++0x [enabled by default]

I am compiling ceph on a 64-bit Ubuntu 12.04 machine and the gcc version is 4.6.3.

===gcc version===
root@client:/mnt/ceph# gcc -v
Using built-in specs.
COLLECT_GCC=gcc COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/4.6/lto-wrapper Target: x86_64-linux-gnu Configured with: ../src/configure -v --with-pkgversion='Ubuntu/Linaro 4.6.3-1ubuntu5' --with-bugurl=file:///usr/share/doc/gcc-4.6/README.Bugs --enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.6 --enable-shared --enable-linker-build-id --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.6 --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-gnu-unique-object --enable-plugin --enable-objc-gc --disable-werror --with-arch-32=i686 --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu Thread model: posix gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5) Thanks, Xing
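In case it helps anyone hitting the same errors: the failing code uses C99 designated initializers, which gcc accepts in .c files but g++ 4.6 rejects in C++ (hence the "expected primary-expression before '.' token" errors on the .cc file). One form g++ does accept is plain positional aggregate initialization; this sketch assumes the struct members are declared in the same order they appear in the quoted initializer:

---
struct ceph_file_layout g_default_file_layout = {
	init_le32(1 << 22),   /* fl_stripe_unit */
	init_le32(1),         /* fl_stripe_count */
	init_le32(1 << 22),   /* fl_object_size */
	init_le32(0),         /* fl_cas_hash */
	init_le32(0),         /* fl_object_stripe_unit */
	init_le32(-1),        /* fl_unused */
	init_le32(-1),        /* fl_pg_pool */
};
---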
Re: unable to compile
Thanks, Noah! Xing

On 11/3/2013 3:17 PM, Noah Watkins wrote: Thanks for looking at this. Unless there is a good solution, I think reverting it is ok, as breaking the compile on a few platforms is not ok. I'll be looking at this tonight.
[PATCH] osd/erasurecode: Fix memory leak in jerasure_matrix_to_bitmatrix()
Free allocated memory before return because of NULL input

Signed-off-by: Xing Lin xing...@cs.utah.edu
---
 src/osd/ErasureCodePluginJerasure/jerasure.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/src/osd/ErasureCodePluginJerasure/jerasure.c b/src/osd/ErasureCodePluginJerasure/jerasure.c
index 9efae02..1bb4b1d 100755
--- a/src/osd/ErasureCodePluginJerasure/jerasure.c
+++ b/src/osd/ErasureCodePluginJerasure/jerasure.c
@@ -276,7 +276,10 @@ int *jerasure_matrix_to_bitmatrix(int k, int m, int w, int *matrix)
   int rowelts, rowindex, colindex, elt, i, j, l, x;

   bitmatrix = talloc(int, k*m*w*w);
-  if (matrix == NULL) { return NULL; }
+  if (matrix == NULL) {
+    free(bitmatrix);
+    return NULL;
+  }

   rowelts = k * w;
   rowindex = 0;
--
1.8.3.4 (Apple Git-47)
[PATCH] osd/erasurecode: correct one variable name in jerasure_matrix_to_bitmatrix()
When bitmatrix is NULL, this function returns NULL.

Signed-off-by: Xing Lin xing...@cs.utah.edu
---
 src/osd/ErasureCodePluginJerasure/jerasure.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/osd/ErasureCodePluginJerasure/jerasure.c b/src/osd/ErasureCodePluginJerasure/jerasure.c
index 9efae02..d5752a8 100755
--- a/src/osd/ErasureCodePluginJerasure/jerasure.c
+++ b/src/osd/ErasureCodePluginJerasure/jerasure.c
@@ -276,7 +276,7 @@ int *jerasure_matrix_to_bitmatrix(int k, int m, int w, int *matrix)
   int rowelts, rowindex, colindex, elt, i, j, l, x;

   bitmatrix = talloc(int, k*m*w*w);
-  if (matrix == NULL) { return NULL; }
+  if (bitmatrix == NULL) { return NULL; }

   rowelts = k * w;
   rowindex = 0;
--
1.8.3.4 (Apple Git-47)
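For readers comparing the two patches above: they address two different problems at the same line. The first plugs the leak when the input matrix is NULL; the second checks the allocation that was actually made. A sketch of how the function prologue would look with both concerns handled (my own combination, not an applied commit): checking the input before allocating means there is nothing to leak on the bad-input path.

---
int *jerasure_matrix_to_bitmatrix(int k, int m, int w, int *matrix)
{
  int *bitmatrix;
  int rowelts, rowindex, colindex, elt, i, j, l, x;

  if (matrix == NULL) return NULL;      /* bad input: nothing allocated yet */

  bitmatrix = talloc(int, k*m*w*w);
  if (bitmatrix == NULL) return NULL;   /* allocation failed */

  /* ... conversion loop as before ... */
}
---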
Re: coverity scan - a plea for help!
Hi Sage, I would like to help here as well. Thanks, Xing

On 10/31/2013 5:30 PM, Sage Weil wrote: Hi everyone, When I sent this out several months ago, Danny Al-Gaaf stepped up and submitted an amazing number of patches cleaning up the most concerning issues that Coverity had picked up. His attention has been directed elsewhere more recently, but there are still a number of outstanding issues in Coverity's tracker that are reasonably quick and easy to resolve and will make our ability to identify newly introduced defects much simpler. Coverity Scan makes it really easy to participate: just create an account and I can grant you access to the Ceph project. If you're interested in contributing here (and it's an easy way to quickly start working with the Ceph code), let me know! Thanks- sage

On Thu, 9 May 2013, Sage Weil wrote: We were added to coverity's awesome scan program a while back, which gives free access to their static analysis tool to open source projects. Currently it identifies 421 issues. We've already taken care of the ones that are highest impact, but the usefulness of periodic scans is limited until we can eliminate the noise from the remaining issues and easily see when new problems come up. If anybody is interested in helping out in the cleanup effort, let me know and I'll share the login info. This would provide significant value to our overall quality efforts and is a pretty easy way to make a meaningful contribution to the project without a huge investment in understanding the code and architecture! sage
Re: When ceph synchronizes journal to disk?
Thanks very much for all your explanations. It is much clearer to me now. Have a great day! Xing

On 03/05/2013 01:12 PM, Greg Farnum wrote: All the data goes to the disk in write-back mode, so it isn't safe until the flush is called. That's why it goes into the journal first, to be consistent at all times. If you buffered everything in the journal and flushed it at once, you would overload the disk for that time. Let's say you have 300MB in the journal after 10 seconds and you want to flush that at once. That would mean that specific disk is unable to do any operations other than writing at 60MB/sec for 5 seconds. It's better to always write to the disk in write-back mode and flush at a certain point. In the meantime the scheduler can do its job to balance between the reads and the writes. Wido

Yep, what Wido said. Specifically, we do force the data to the journal with an fsync or equivalent before responding to the client, but once it's stable on the journal we give it to the filesystem (without doing any sort of forced sync). This is necessary — all reads are served from the filesystem. -Greg
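For other readers of this thread, here is a minimal runnable sketch of the flow Wido and Greg describe: ack the client only after the journal entry is durable, hand the data to the filesystem as a buffered write, and periodically sync the filesystem and trim the journal. This is my own toy model (names and structure invented), not FileStore's actual code:

---
/* wal_flow.c: toy model of write-ahead journaling in an OSD. */
#include <stdio.h>

static unsigned long journal_seq, committed_seq;

static void journal_append_fsync(int op) {
	journal_seq++;                    /* sequential append + fsync: durable */
	printf("journal: op %d durable (seq %lu)\n", op, journal_seq);
}
static void ack_client(int op)        { printf("ack: op %d\n", op); }
static void fs_buffered_write(int op) { printf("fs: op %d into page cache\n", op); }
static void sync_and_trim(void) {
	committed_seq = journal_seq;      /* syncfs() the data fs, then trim */
	printf("syncfs: journal trimmed up to seq %lu\n", committed_seq);
}

int main(void) {
	for (int op = 1; op <= 3; op++) {
		journal_append_fsync(op); /* 1. write-ahead, forced to journal disk */
		ack_client(op);           /* 2. safe to ack: replayable from journal */
		fs_buffered_write(op);    /* 3. write-back into the data filesystem */
	}
	sync_and_trim();                  /* runs between min/max sync intervals */
	return 0;
}
---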
Re: When ceph synchronizes journal to disk?
No, I am using xfs. The same thing happens even when I specify the journal mode explicitly as follows:

filestore journal parallel = false
filestore journal writeahead = true

Xing

On 03/04/2013 09:32 AM, Sage Weil wrote: Are you using btrfs? In that case, the journaling is parallel to the fs workload because the btrfs snapshots provide us with a stable checkpoint we can replay from. In contrast, for non-btrfs file systems we need to do writeahead journaling.
Re: When ceph synchronizes journal to disk?
Hi Gregory, Thanks for your reply.

On 03/04/2013 09:55 AM, Gregory Farnum wrote: The journal [min|max] sync interval values specify how frequently the OSD's FileStore sends a sync to the disk. However, data is still written into the normal filesystem as it comes in, and the normal filesystem continues to schedule normal dirty data writeouts. This is good — it means that when we do send a sync down you don't need to wait for all (30 seconds * 100MB/s) 3GB or whatever of data to go to disk before it's completed.

I do not think I understand this well. When the writeahead journal mode is in use, would you please explain what happens to a single 4M write request? I assume an entry in the journal is created for the write request, and after this entry is flushed to the journal disk, Ceph returns success. There should be no I/O to the OSD's data disk; all I/O is supposed to go to the journal disk. At a later time, Ceph will start to apply these changes to the normal filesystem, reading from the first entry after the point where its previous synchronization stopped. Eventually it will read this entry and apply this write to the normal filesystem. Could you please point out what is wrong in my understanding? Thanks.

I am running 0.48.2. The related configuration is as follows.

If you're starting up a new cluster I recommend upgrading to the bobtail series (.56.3) instead of using Argonaut — it's got a number of enhancements you'll appreciate!

Yeah, I would like to use the bobtail series. However, I started to make small changes with Argonaut (0.48) and had already ported my changes once, to 0.48.2 when it was released. I think I am good to continue with it for the moment. I may consider porting my changes to the bobtail series at a later time. Thanks, Xing
Re: When ceph synchronizes journal to disk?
Maybe it is easier to put it this way: what we want is for newly written data to stay on the journal disk for as long as possible, so that write workloads do not compete for the disk heads with read workloads. Is there any way to achieve that in Ceph? Thanks, Xing

On 03/04/2013 09:55 AM, Gregory Farnum wrote: The journal [min|max] sync interval values specify how frequently the OSD's FileStore sends a sync to the disk. However, data is still written into the normal filesystem as it comes in, and the normal filesystem continues to schedule normal dirty data writeouts. This is good — it means that when we do send a sync down you don't need to wait for all (30 seconds * 100MB/s) 3GB or whatever of data to go to disk before it's completed.
Re: ssh passwords
If it is the command 'mkcephfs' that asked you for an ssh password, that is probably because the script needs to push some files (e.g., ceph.conf) to other hosts. If we open the script, we can see that it uses 'scp' to send some files. If I remember correctly, for every osd on another host, it will ask for the ssh password seven times. So we'd better set up public keys first. :) Xing

On 01/22/2013 11:24 AM, Gandalf Corvotempesta wrote: Hi all, i'm trying my very first ceph installation following the 5-minutes quickstart: http://ceph.com/docs/master/start/quick-start/#install-debian-ubuntu just a question: why is ceph asking me for an SSH password? Is ceph trying to connect to itself via SSH?
Re: ssh passwords
I like the current approach. I think it is more convenient to run commands once on one host to do all the setup work. The first time I deployed a ceph cluster with 4 hosts, I thought 'service ceph start' would start the whole cluster; as it turns out, it only starts the local osd and mon processes. So currently I am using polysh to run the same commands on all hosts (mostly to restart the ceph service before every measurement). Thanks. Xing

On 01/22/2013 12:35 PM, Neil Levine wrote: Out of interest, would people prefer that the Ceph deployment script didn't try to handle server-server file copy and just did the local setup only, or is it useful that it tries to be a mini-config management tool at the same time? Neil
Re: ssh passwords
I did not notice that such a parameter exists. Thanks, Dan! Xing

On 01/22/2013 02:11 PM, Dan Mick wrote: The '-a/--allhosts' parameter is to spread the command across the cluster... that is, 'service ceph -a start' will start across the cluster.
Re: Test infrastructure: 2 or more servers?
You can change the number of replicas at runtime with the following command:

$ ceph osd pool set {poolname} size {num-replicas}

Xing

On 01/15/2013 03:00 PM, Gandalf Corvotempesta wrote: Is it possible to change the number of replicas in realtime?
Re: Test infrastructure: 2 or more servers?
It seems so: Ceph will shuffle data to rebalance in situations such as when we change the replica number or when some nodes or disks go down. Xing

On 01/15/2013 03:26 PM, Gandalf Corvotempesta wrote: 2013/1/15 Xing Lin xing...@cs.utah.edu: You can change the number of replicas at runtime with the following command: $ ceph osd pool set {poolname} size {num-replicas} So it's absolutely safe to start with just 2 servers, run all the necessary tests and, when ready to go into production, increase the servers and replicas?
[debug help]: get dprintk() output in src/crush/mapper.c or net/ceph/crush/mapper.c
Hi, There are many dprintk() statements in the file mapper.c. How can we see the output from these statements? I would like to see it, to get a better understanding of how these functions work together, and then improve my added algorithm. Also, adding a printf() statement in crush_do_rule() or in my simple bucket_directmap_choose() does not give me any output. Thanks. Xing
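One way to get the output, if I read the code correctly: near the top of mapper.c, dprintk is defined to expand to nothing, with the real call left inside a comment, so every dprintk() compiles away. Something like the following change (sketched from memory of that file; verify against your tree) makes the output appear after a rebuild, on stdout for the userspace daemons and in dmesg for the kernel build:

---
/* default in mapper.c: dprintk expands to nothing */
#define dprintk(args...) /* printf(args) */

/* change it to actually print: */
#undef dprintk
#ifdef __KERNEL__
# define dprintk(args...) printk(KERN_DEBUG args)
#else
# define dprintk(args...) printf(args)
#endif
---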
Re: which Linux kernel version corresponds to 0.48argonaut?
Hi Mark, Thanks for your reply. I do not think I am running the packaged version. The output shows it is my version (0.48.2argonaut.fast at commit 000...).

root@client:/users/utos# rbd -v
ceph version 0.48.2argonaut.fast (commit:0)
root@client:/users/utos# /usr/bin/rbd -v
ceph version 0.48.2argonaut (commit:3e02b2fad88c2a95d9c0c86878f10d1beb780bfe)
root@client:/users/utos# /usr/local/bin/rbd -v
ceph version 0.48.2argonaut.fast (commit:0)

Xing

On 01/05/2013 08:00 PM, Mark Kirkwood wrote: I'd hazard a guess that you are still (accidentally) running the packaged binary - the packaged version installs in /usr/bin (etc) but your source build will probably be in /usr/local/bin. I've been through this myself and purged the packaged version before building and installing from source (just to be sure). Cheers Mark

On 06/01/13 14:55, Xing Lin wrote: After changing the client-side code, I can map/unmap rbd block devices at client machines. However, I am not able to list rbd block devices. At the client machine, I first installed the 0.48.2argonaut package for Ubuntu; then I compiled and installed my own version according to the instructions on this page (http://ceph.com/docs/master/install/building-ceph/). The client failed to recognize the fifth bucket algorithm I added. I searched for "unsupported bucket algorithm" in the ceph code base and that text only appears in src/crush/CrushWrapper.cc. I checked decode_crush_bucket() and it should be able to recognize the fifth algorithm. Even after I changed the error message (added a print of [XXX] and the values of the two bucket algorithm macros), it still prints the same error message. So it seems that my new version of CrushWrapper.cc is not used during compilation to create the final rbd binary. Would you please tell me where the problem is and how I can fix it? Thank you very much.
Re: which Linux kernel version corresponds to 0.48argonaut?
It works now. The old versions of the .so files in /usr/lib were being linked, instead of the new versions installed in /usr/local/lib. Thanks, Sage. Xing

On 01/05/2013 09:46 PM, Sage Weil wrote: Hi, The rbd binary is dynamically linking to librbd1.so and librados2.so (usually in /usr/lib). You need to make sure that the .so's you compiled are the ones it links to to get your code to run. sage
which Linux kernel version corresponds to 0.48argonaut?
Hi, I was trying to add a simple replica placement algorithm to Ceph. The algorithm simply returns the r-th item in a bucket for the r-th replica. I have made that change in the Ceph source code (in files such as crush.h, crush.c, mapper.c, ...) and I can run the Ceph monitor and osd daemons. However, I am not able to map rbd block devices at client machines: 'rbd map image0' reports an input/output error, and 'dmesg' at the client machine shows messages like "libceph: handle_map corrupt msg". I believe that is because I have not ported my changes to the Ceph client side, so it does not recognize the new placement algorithm. I probably need to recompile the rbd block device driver. When I was trying to replace the Ceph-related files in Linux with my own version, I noticed that the files in Linux-3.2.16 are different from those included in the Ceph source code. For example, the following is the diff of crush.h between Linux-3.2.16 and 0.48argonaut. So my question is: is there any version of Linux that contains exactly the Ceph files included in 0.48argonaut? Thanks.

---
$ diff -uNrp ceph-0.48argonaut/src/crush/crush.h linux-3.2.16/include/linux/crush/crush.h
--- ceph-0.48argonaut/src/crush/crush.h	2012-06-26 11:56:36.000000000 -0600
+++ linux-3.2.16/include/linux/crush/crush.h	2012-04-22 16:31:32.000000000 -0600
@@ -1,12 +1,7 @@
 #ifndef CEPH_CRUSH_CRUSH_H
 #define CEPH_CRUSH_CRUSH_H

-#if defined(__linux__)
 #include <linux/types.h>
-#elif defined(__FreeBSD__)
-#include <sys/types.h>
-#include "include/inttypes.h"
-#endif

 /*
  * CRUSH is a pseudo-random data distribution algorithm that
@@ -156,24 +151,25 @@ struct crush_map {
 	struct crush_bucket **buckets;
 	struct crush_rule **rules;

+	/*
+	 * Parent pointers to identify the parent bucket a device or
+	 * bucket in the hierarchy.  If an item appears more than
+	 * once, this is the _last_ time it appeared (where buckets
+	 * are processed in bucket id order, from -1 on down to
+	 * -max_buckets.
+	 */
+	__u32 *bucket_parents;
+	__u32 *device_parents;
+
 	__s32 max_buckets;
 	__u32 max_rules;
 	__s32 max_devices;
-
-	/* choose local retries before re-descent */
-	__u32 choose_local_tries;
-	/* choose local attempts using a fallback permutation before
-	 * re-descent */
-	__u32 choose_local_fallback_tries;
-	/* choose attempts before giving up */
-	__u32 choose_total_tries;
-
-	__u32 *choose_tries;
 };

 /* crush.c */
-extern int crush_get_bucket_item_weight(const struct crush_bucket *b, int pos);
+extern int crush_get_bucket_item_weight(struct crush_bucket *b, int pos);
+extern void crush_calc_parents(struct crush_map *map);
 extern void crush_destroy_bucket_uniform(struct crush_bucket_uniform *b);
 extern void crush_destroy_bucket_list(struct crush_bucket_list *b);
 extern void crush_destroy_bucket_tree(struct crush_bucket_tree *b);
@@ -181,9 +177,4 @@ extern void crush_destroy_bucket_straw(s
 extern void crush_destroy_bucket(struct crush_bucket *b);
 extern void crush_destroy(struct crush_map *map);

-static inline int crush_calc_tree_node(int i)
-{
-	return ((i+1) << 1) - 1;
-}
-
 #endif
---

Xing
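For context on what I changed, here is a hedged sketch of the placement algorithm described above (the identifiers of the new bucket type are my own naming, though the items/size fields do exist on struct crush_bucket). The choose step ignores the input x entirely and deterministically returns the r-th item for replica r:

---
/* Sketch of the 'directmap' bucket choose step: replica r -> item r. */
static int bucket_directmap_choose(const struct crush_bucket *bucket,
                                   int x, int r)
{
	/* unlike uniform/list/tree/straw, no hashing of x at all */
	return bucket->items[r % bucket->size];
}
---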
Re: which Linux kernel version corresponds to 0.48argonaut?
This may be useful for other Ceph newbies like me. I have ported my changes for 0.48argonaut over to the related Ceph files included in Linux, even though files with the same name are not exactly the same, and then recompiled and installed the kernel. After that, everything seems to be working again: Ceph is working with my new simple replica placement algorithm. :) So it seems that the Ceph files included in the Linux kernel are supposed to be different from those in 0.48argonaut. Presumably, the Linux kernel contains the client-side implementation while 0.48argonaut contains the server-side implementation. It would be appreciated if someone could confirm this. Thank you! Xing
questions about rbd sequential read throughput inside kvm/qemu VMs
Hi, I am interested in using rbd block devices inside kvm/qemu VMs. I set up a tiny ceph cluster using one server machine with 6 SCSI disks for storing data. At the client machine, the sequential read throughput seems reasonable (~60 MB/s) when I run fio against rbd block devices mapped outside of VMs. The read throughput does not seem reasonable when I use rbd block devices inside kvm/qemu VMs: it jumps as high as 200 MB/s. 'tcpdump' shows the read requests do reach the ceph server. What makes things confusing is that 'iotop' does not show any disk I/O for these sequential reads, while 'top' shows 'ceph-osd' utilizing 100% CPU. This is the section for the rbd disk in the VM's xml file:

<disk type='network' device='disk'>
  <driver name='qemu' type='raw'/>
  <source protocol='rbd' name='rbd/image3'>
    <host name='node-0' port='6789'/>
  </source>
  <target dev='vda' bus='virtio'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
</disk>

This is the fio job file I used to measure throughput:

--
[global]
rw=read
bs=4m
thread=0
time_based=1
runtime=300
invalidate=1
direct=1
sync=1
ioengine=sync

[sr-vda]
filename=${DEV}
--

Does anyone have suggestions or hints for me to try? Thank you very much! Xing
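One sanity check I can share for cases like this: a tiny O_DIRECT reader (my own helper, not a Ceph tool) that measures sequential read throughput from a device while bypassing the guest page cache, which is the same thing fio's direct=1 asks for. Comparing its number inside and outside the VM helps separate real device throughput from cache effects:

---
/* seqread.c: gcc -std=gnu99 -O2 -o seqread seqread.c (older glibc may need -lrt)
 * usage: ./seqread /dev/vdX  -- reads 1GB in 4MB O_DIRECT chunks. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	if (argc != 2) { fprintf(stderr, "usage: %s <device>\n", argv[0]); return 1; }

	int fd = open(argv[1], O_RDONLY | O_DIRECT);
	if (fd < 0) { perror("open"); return 1; }

	size_t bs = 4 << 20;
	void *buf;
	if (posix_memalign(&buf, 4096, bs)) return 1; /* O_DIRECT needs aligned buffers */

	struct timespec t0, t1;
	clock_gettime(CLOCK_MONOTONIC, &t0);
	long long total = 0;
	for (int i = 0; i < 256; i++) {               /* 256 * 4MB = 1GB */
		ssize_t n = read(fd, buf, bs);
		if (n <= 0) break;
		total += n;
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
	printf("%lld bytes in %.2f s = %.1f MB/s\n", total, secs, total / 1048576.0 / secs);
	free(buf);
	close(fd);
	return 0;
}
---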