Hi Gregory,

Thanks very much for your quick reply. When I started to look into Ceph, 
Bobtail was the latest stable release, which is why I picked that version 
and started to make a few modifications. I have not ported my changes to 0.79 
yet. The plan was that if v0.79 provided higher disk-bandwidth efficiency, I 
would switch to it. Unfortunately, that does not seem to be the case. 

The futex trace was done with version 0.79, not 0.59. I did profiling in 0.59 
too. There are some improvements in 0.79, such as the introduction of the FD 
cache, but lots of futex calls are still there. I also measured the maximum 
bandwidth we can get from each disk with version 0.79. It does not improve 
significantly: we still get only 90~100 MB/s from each disk. 
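
(For reference, a minimal sketch of one way such a futex summary can be
collected with strace; the OSD pid and the 30-second window are placeholders,
and this is not necessarily the exact tool I used:)

    #!/usr/bin/env python3
    # Sketch: summarize futex calls made by a running ceph-osd with strace.
    # -f follows all threads, -c prints a per-syscall count/time summary,
    # and `timeout` detaches strace after the sampling window.
    import subprocess
    import sys

    def futex_summary(pid, seconds=30):
        cmd = ["timeout", str(seconds), "strace", "-f", "-c",
               "-e", "trace=futex", "-p", str(pid)]
        # strace writes the -c summary table to stderr
        proc = subprocess.run(cmd, stderr=subprocess.PIPE, text=True)
        return proc.stderr

    if __name__ == "__main__":
        print(futex_summary(sys.argv[1]))  # e.g. the pid of one ceph-osd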

Thanks,
Xing


On Apr 25, 2014, at 2:42 PM, Gregory Farnum <g...@inktank.com> wrote:

> Bobtail is really too old to draw any meaningful conclusions from; why
> did you choose it?
> 
> That's not to say that performance on current code will be better
> (though it very much might be), but the internal architecture has
> changed in some ways that will be particularly important for the futex
> profiling you did, and are probably important for these throughput
> results as well.
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
> 
> 
> On Fri, Apr 25, 2014 at 1:38 PM, Xing <xing...@cs.utah.edu> wrote:
>> Hi,
>> 
>> I also ran a few other experiments, trying to find the maximum bandwidth 
>> we can get from each data disk. The results are not encouraging: for disks 
>> that can provide 150 MB/s of block-level sequential read bandwidth, we can 
>> only get about 90 MB/s from each disk. Something particularly interesting 
>> is that the replica size also affects the bandwidth we can get from the 
>> cluster. I have not seen this observation discussed in the Ceph community, 
>> so I think it may be helpful to share my findings.
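>>
>> (For reference, a minimal sketch of how a block-level sequential read figure
>> like the 150 MB/s above can be measured; the device path is a placeholder,
>> reading a raw device needs root, and the page cache should be dropped first
>> so the result reflects the disk rather than memory:)
>>
>>     #!/usr/bin/env python3
>>     # Sketch: sequential read bandwidth of a disk or large file.
>>     import os
>>     import sys
>>     import time
>>
>>     def seq_read_bw(path, total_bytes=1 << 30, chunk=4 << 20):
>>         fd = os.open(path, os.O_RDONLY)
>>         done = 0
>>         start = time.time()
>>         while done < total_bytes:
>>             buf = os.read(fd, chunk)
>>             if not buf:            # end of device/file
>>                 break
>>             done += len(buf)
>>         elapsed = time.time() - start
>>         os.close(fd)
>>         return done / elapsed / 1e6   # MB/s
>>
>>     if __name__ == "__main__":
>>         path = sys.argv[1] if len(sys.argv) > 1 else "/dev/sdb"  # placeholder
>>         print("%.1f MB/s" % seq_read_bw(path))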
>> 
>> The experiment was run with two d820 machines in Emulab at the University 
>> of Utah. One is used as the data node and the other as the client. They are 
>> connected by 10 Gb/s Ethernet. The data node has 7 disks: one for the OS 
>> and the remaining 6 for OSDs. The 6 disks are grouped into pairs, with one 
>> disk in each pair holding the journal and the other the data, so in total 
>> we have 3 OSDs. The network bandwidth is sufficient to read from all 3 data 
>> disks at full bandwidth.
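>>
>> (A quick back-of-the-envelope check that the network is not the bottleneck,
>> using the ~150 MB/s per-disk figure above and a 10 Gb/s link:)
>>
>>     # 3 data disks at full block-level speed vs. the 10 Gb/s link
>>     data_disks = 3
>>     per_disk_mb_s = 150            # block-level sequential read, from above
>>     link_gbit_s = 10
>>
>>     disk_total_mb_s = data_disks * per_disk_mb_s      # 450 MB/s
>>     link_mb_s = link_gbit_s * 1000 / 8                # ~1250 MB/s
>>     print(disk_total_mb_s, link_mb_s, disk_total_mb_s < link_mb_s)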
>> 
>> I varied the read-ahead size for the rbd block device (exp1), the number of 
>> osd op threads for each OSD (exp2), the replica size (exp3), and the object 
>> size (exp4). The most interesting result came from varying the replica 
>> size: as I increased it from 1 to 2 and then to 3, the aggregate bandwidth 
>> dropped from 267 MB/s to 211 MB/s and then to 180 MB/s. I believe the 
>> reason for the drop is that as we increase the number of replicas, we store 
>> more data on each OSD; when we read it back, we have to read from a larger 
>> range (more seeks). The fundamental problem is likely that we do 
>> replication synchronously and thus lay out object files in a RAID 10 "near" 
>> format rather than the "far" format. For the difference between the near 
>> and far formats of RAID 10, see the link below; a quick per-disk 
>> calculation follows it.
>> 
>> http://lxr.free-electrons.com/source/Documentation/device-mapper/dm-raid.txt
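>>
>> (Rough per-disk numbers behind the drop, assuming the aggregate bandwidth is
>> split evenly across the 3 data disks and that each OSD stores replica_size
>> times as much data for the same client image:)
>>
>>     # Per-data-disk read bandwidth implied by the aggregate numbers above
>>     aggregate_mb_s = {1: 267, 2: 211, 3: 180}   # replica size -> MB/s
>>     data_disks = 3
>>
>>     for replicas, agg in sorted(aggregate_mb_s.items()):
>>         per_disk = agg / data_disks               # 89, 70, 60 MB/s
>>         print("replicas=%d  per-disk=%.0f MB/s  data per OSD=%dx"
>>               % (replicas, per_disk, replicas))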
>> 
>> For the results of the other experiments, you can download my slides at the 
>> link below.
>> http://www.cs.utah.edu/~xinglin/slides/ceph-bandiwdth-exp.pptx
>> 
>> 
>> I do not know why Ceph gets only about 60% of the disk bandwidth. As a 
>> comparison, I ran tar to read every rbd object file into a tarball and 
>> measured how much bandwidth that workload gets. Interestingly, the tar 
>> workload actually achieves higher bandwidth (80% of the block-level 
>> bandwidth), even though it accesses the disk more randomly (tar reads the 
>> object files in a directory sequentially, while the files were created in a 
>> different order). For more detail, please see my blog post:
>> http://xinglin-system.blogspot.com/2014/04/ceph-lab-note-1-disk-read-bandwidth-in.html
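>>
>> (The efficiency comparison, taking the 150 MB/s block-level figure above as
>> the baseline:)
>>
>>     # Read efficiency relative to the 150 MB/s block-level baseline
>>     block_level_mb_s = 150.0
>>     ceph_mb_s = 90.0                     # what Ceph delivers per data disk
>>     tar_mb_s = 0.8 * block_level_mb_s    # tar gets ~80%, i.e. ~120 MB/s
>>
>>     print("ceph: %.0f%%" % (100 * ceph_mb_s / block_level_mb_s))  # ~60%
>>     print("tar:  %.0f%%" % (100 * tar_mb_s / block_level_mb_s))   # ~80%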
>> 
>> Here are a few questions.
>> 1. What is the maximum bandwidth people can get from each disk? I found 
>> that Jiangang from Intel also reported 57% disk-bandwidth efficiency. He 
>> suggested one reason: interference among many concurrent sequential read 
>> workloads. I agree, but when I ran a single workload, I still did not get 
>> higher efficiency. 
>> 2. If the efficiency is only about 60%, what causes it? Could it be the 
>> locks (the futex calls I mentioned in my previous email) or something else? 
>> 
>> Thanks very much for any feedback.
>> 
>> Thanks,
>> Xing
>> 
>> 
>> 
>> 

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
