question about futex calls in OSD

2014-04-25 Thread Xing
Hi,

This is Xing, a graduate student at the University of Utah. I am playing with
Ceph and have a few questions about futex calls in the OSD process. I am using
Ceph v0.79. The data node machine has 7 disks, one for the OS and the other 6
for running 6 OSDs. I set replica size = 1 for all three pools. (This is to
improve sequential read bandwidth. Since Ceph replicates synchronously, it lays
data out much like RAID 10 with the near layout: with replica size = 2 and two
OSDs, each OSD stores an identical copy of the whole data set, but on read-back
only half of the blocks are read from each OSD, i.e. it reads one block, seeks
over the next, and then reads the third. Ideally, to get full disk bandwidth, we
would lay data out as in the RAID 10 far layout. I do not know how to do that in
Ceph, so I simply tried replica size = 1.) I created an rbd block device,
initialized it with an ext4 fs, stored a 10 GB file into it, and then read it
back sequentially. I set the read_ahead to 128M or 1G to get the maximum
bandwidth from a single rbd block device (I know this is not realistic).

While these read requests were being served, I used strace to capture all
system calls of one OSD thread that actually does reads. I observed lots of
futex system calls. With -T, strace reports the time spent in each system call,
and I summed these to get the total time spent per system call. I also broke
down the two futex variants that contribute most of the overhead.
----------------------------------------------------------------
syscall                      runtime (s)   call num   average (s)
pread                        11.1028       420        0.026435
fgetxattr                    0.178381      1680       0.000106
futex                        8.83125       5217
total runtime: 21

                             runtime (s)   call num   average (s)
futex(WAIT_PRIVATE)          4.97166       1415       0.003513
futex(WAIT_BITSET_PRIVATE)   3.79          51         0.0743
----------------------------------------------------------------
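
For reference, this is roughly how the futex time was tallied. It is only a
sketch: the OSD worker thread id and the output file name are placeholders, and
the awk one-liner simply sums the <...> durations that strace -T appends to each
line.

$ strace -T -e trace=futex,pread64,fgetxattr -p <osd-worker-tid> -o osd-read.strace
$ awk '/^futex\(/ {gsub(/[<>]/,"",$NF); t+=$NF; n++} END {printf "futex: %.4f s over %d calls\n", t, n}' osd-read.strace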

The locking overhead seems quite high, and I imagine it could get worse as I
increase the number of workloads. I wondered why there are so many futex calls
and took some time to look into it. It appears that three locks are used on the
read path in the OSD. These locks are Mutexes, and I believe the futex calls I
observed in the strace output are the result of operations on these Mutexes.

a. snapset_contexts_lock, used in functions such as get_snapset_context() and 
put_snapset_context()

b. fdcache_lock, used in lfn_open() and such

c. ondisk_read_lock, used in execute_ctx(). 

The one that matters most is snapset_contexts_lock: it seems to be a monolithic
lock controlling access either to all object files in a snapset, or to all
object files within the same placement group that belong to the same snapset
(I am not sure what a 'snapset' is; it sounds equivalent to a 'snapshot'). To
read a block from an object file, the OSD first has to read two extended
attributes of that file (OI_ATTR and SS_ATTR). Each of these attribute reads
seems to involve snapset_contexts_lock: the SS_ATTR attribute is read inside
get_snapset_context(), and the OI_ATTR attribute is read in
get_object_context(), which can be called from find_object_context(). In
find_object_context() I can also find a few calls to get_snapset_context().
Releasing a snapset context involves snapset_contexts_lock as well (in
put_snapset_context()).
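
To check whether a small number of lock addresses account for most of the
waits, the same strace output can be grouped by futex word address. This is
again just a sketch over the osd-read.strace file from the earlier command; a
handful of addresses dominating the counts would be consistent with a few hot
Mutexes such as the ones above.

$ awk -F'(' '/^futex\(/ {split($2,a,","); n[a[1]]++} END {for (w in n) print n[w], w}' osd-read.strace | sort -rn | head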

Here are a few questions.
1. What is snapset_contexts_lock used for? Is it meant to control access to all
files in a snapset, or to all files in the same placement group that belong to
the same snapset? Such a big-lock design does not seem to scale. Any comments?
2. Has anyone noticed the locking overhead? I tried to comment out the lock in
get_snapset_context() and put_snapset_context() by commenting out the Mutex
instantiation, but that broke the system: I could no longer list or mount rbd
block devices.

Thanks very much for reading this. Any comment is welcome. 
Thanks,
Xing






bandwidth with Ceph - v0.59 (Bobtail)

2014-04-25 Thread Xing
Hi,

I also did a few other experiments, trying to determine the maximum bandwidth
we can get from each data disk. The results are not encouraging: from disks
that can provide 150 MB/s of block-level sequential read bandwidth, we only get
about 90 MB/s each. Something particularly interesting is that the replica size
also affects the bandwidth we can get from the cluster. I have not seen this
observation discussed in the Ceph community, so I think it may be helpful to
share my findings.

The experiment was run with two d820 machines in Emulab at the University of
Utah. One is used as the data node and the other as the client; they are
connected by 10 Gb/s Ethernet. The data node has 7 disks, one for the OS and
the remaining 6 for OSDs. Of those 6 disks, each OSD uses one for its journal
and one for data, so in total we have 3 OSDs. The network bandwidth is
sufficient to read from 3 disks at full bandwidth.

I varied the read-ahead size of the rbd block device (exp1), the number of osd
op threads per OSD (exp2), the replica size (exp3), and the object size (exp4).
The most interesting result comes from varying the replica size: as I increased
it from 1 to 2 to 3, the aggregate bandwidth dropped from 267 MB/s to 211 and
then 180 MB/s. I believe the reason for the drop is that as we increase the
number of replicas, we store more data on each OSD; when we read it back, we
have to read over a larger range (more seeks). The fundamental problem is
likely that we replicate synchronously and thus lay out object files in a
RAID 10 near format rather than the far format. For the difference between the
near and far formats of RAID 10, see the link below.

http://lxr.free-electrons.com/source/Documentation/device-mapper/dm-raid.txt
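
For comparison, this is how the two layouts look with Linux md (device names
are placeholders; this is only to illustrate the difference, not something Ceph
does today). With the near layout (n2) the two copies of each chunk sit at
roughly the same offset on both disks, so a single sequential read behaves much
like reading one disk; with the far layout (f2) a complete striped copy sits in
the first half of the disks and the second copy in the second half, so one
sequential stream can read from both disks at nearly full speed.

$ mdadm --create /dev/md0 --level=10 --layout=n2 --raid-devices=2 /dev/sdX /dev/sdY
$ mdadm --create /dev/md0 --level=10 --layout=f2 --raid-devices=2 /dev/sdX /dev/sdY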

For the results of the other experiments, you can download my slides at the
link below.
http://www.cs.utah.edu/~xinglin/slides/ceph-bandiwdth-exp.pptx
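
For what it is worth, these are the kinds of knobs I varied (paths and values
are only examples and need to be adapted to your setup):

$ echo 131072 > /sys/block/rbd0/queue/read_ahead_kb   # rbd read-ahead, e.g. 128 MB
$ ceph osd pool set rbd size 2                        # replica size of the pool backing the image
# osd op threads is set in the [osd] section of ceph.conf, e.g. "osd op threads = 8";
# the object size is chosen at image creation time, e.g. "rbd create --order 23 ..." for 8 MB objects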


I do not know why Ceph gets only about 60% of the disk bandwidth. As a
comparison, I ran tar to read every rbd object file into a tarball and measured
the bandwidth of that workload. Interestingly, the tar workload actually gets
higher bandwidth (80% of the block-level bandwidth), even though it accesses
the disk more randomly (tar reads each object file in a directory sequentially,
while the object files were created in a different order). For more detail,
please see my blog post:
http://xinglin-system.blogspot.com/2014/04/ceph-lab-note-1-disk-read-bandwidth-in.html
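
A rough sketch of the tar comparison, assuming the default filestore path
(adjust to your cluster; piping the archive avoids GNU tar's shortcut of not
reading files when the output is /dev/null):

$ cd /var/lib/ceph/osd/ceph-0/current
$ sync && echo 3 > /proc/sys/vm/drop_caches   # start with a cold page cache
$ time tar -cf - . | wc -c                    # stream every object file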

Here are a few questions.
1. What is the maximum bandwidth people can get from each disk? I found that
Jiangang from Intel also reported 57% disk-bandwidth efficiency. He suggested
one reason: interference among many concurrent sequential read workloads. I
agree, but even when I ran a single workload I did not get higher efficiency.
2. If the efficiency is about 60%, what causes it? Could it be the locks (the
futex calls I mentioned in my previous email) or something else?

Thanks very much for any feedback. 

Thanks,
Xing






Re: bandwidth with Ceph - v0.59 (Bobtail)

2014-04-25 Thread Xing
Hi Gregory,

Thanks very much for your quick reply. When I started to look into Ceph,
Bobtail was the latest stable release, which is why I picked that version and
started to make a few modifications. I have not ported my changes to 0.79 yet.
The plan was that if v0.79 provided higher disk bandwidth efficiency, I would
switch to it. Unfortunately, that does not seem to be the case.

The futex trace was done with version 0.79, not 0.59. I did a profile of 0.59
too. There are some improvements, such as the introduction of the fd cache, but
lots of futex calls are still there in v0.79. I also measured the maximum
per-disk bandwidth we can get with version 0.79. It does not improve
significantly: we still get only 90~100 MB/s from each disk.

Thanks,
Xing


On Apr 25, 2014, at 2:42 PM, Gregory Farnum g...@inktank.com wrote:

 Bobtail is really too old to draw any meaningful conclusions from; why
 did you choose it?
 
 That's not to say that performance on current code will be better
 (though it very much might be), but the internal architecture has
 changed in some ways that will be particularly important for the futex
 profiling you did, and are probably important for these throughput
 results as well.
 -Greg
 Software Engineer #42 @ http://inktank.com | http://ceph.com
 
 
 On Fri, Apr 25, 2014 at 1:38 PM, Xing xing...@cs.utah.edu wrote:
 Hi,
 
 I also did a few other experiments, trying to get what the maximum bandwidth 
 we can get from each data disk. The output is not encouraging: for disks 
 that can provide 150 MB/s block-level sequential read bandwidths, we can 
 only get about 90MB/s from each disk. Something that is particular 
 interesting is that the replica size also affects the bandwidth we could get 
 from the cluster. It seems that there is no such observation/conversations 
 in the Ceph community and I think it may be helpful to share my findings.
 
 The experiment was run with two d820 machines in Emulab at University of 
 Utah. One is used as the data node and the other is used as the client. They 
 are connected by a 10 GB/s Ethernet. The data node has 7 disks, one for OS 
 and the rest 6 for OSDs. For the rest 6 disks, we use one for journal and 
 the other for data. Thus in total we have 3 OSDs. The network bandwidth is 
 sufficient to support reading from 3 disks in full bandwidth.
 
 I varied the read-ahead size for the rbd block device (exp1), osd op threads 
 for each osd (exp2), varied the replica size (exp3), and object size (exp4). 
 The most interesting is varying the replica size. As I varied replica size 
 from 1, to 2 and to 3, the aggregated bandwidth dropped from 267 MB/s to 211 
 and 180. The reason for the drop I believe is as we increase the number of 
 replicas, we store more data into each OSD. then when we need to read it 
 back, we have to read from a larger range (more seeks). The fundamental 
 problem is likely because we are doing replication synchronously, and thus 
 layout object files in a raid 10 - near format, rather than the far format. 
 For the difference between the near format and far format for raid 10, you 
 could have a look at the link provided below.
 
 http://lxr.free-electrons.com/source/Documentation/device-mapper/dm-raid.txt
 
 For results about other experiments, you could download my slides at the 
 link provided below.
 http://www.cs.utah.edu/~xinglin/slides/ceph-bandiwdth-exp.pptx
 
 
 I do not know why Ceph only gets about 60% of the disk bandwidth. To do a 
 comparison, I ran tar to read every rbd object files to create a tarball and 
 see how much bandwidth I can get from this workload. Interestingly, the tar 
 workload actually gets a higher bandwidth (80% of block level bandwidth), 
 even though it is accessing the disk more randomly (tar reads each object 
 file in a dir sequentially while the object files were created in a 
 different order.). For more detail, please goto my blog to have a read.
 http://xinglin-system.blogspot.com/2014/04/ceph-lab-note-1-disk-read-bandwidth-in.html
 
 Here are a few questions.
 1. What are the maximum bandwidth people can get from each disk? I found 
 Jiangang from Intel also reported 57% efficiency for disk bandwidth. He 
 suggested one reason: interference among so many sequential read workloads. 
 I agree but when I tried to run with one single workload, I still do not get 
 a higher efficiency.
 2. If the efficiency is about 60%, what are the reasons that cause this? 
 Could it be because of the locks (futex as I mentioned in my previous email) 
 or anything else?
 
 Thanks very much for any feedback.
 
 Thanks,
 Xing
 
 
 
 

Re: bandwidth with Ceph - v0.59 (Bobtail)

2014-04-25 Thread Xing Lin

Hi Mark,

That seems pretty good. What is the block-level sequential read bandwidth of
your disks? What configuration did you use? What were the replica size and the
read_ahead of your rbds, and how many workloads did you run? I used btrfs in my
experiments as well.


Thanks,
Xing

On 04/25/2014 03:36 PM, Mark Nelson wrote:
For what it's worth, I've been able to achieve up to around 120MB/s 
with btrfs before things fragment.


Mark

On 04/25/2014 03:59 PM, Xing wrote:

Hi Gregory,

Thanks very much for your quick reply. When I started to look into 
Ceph, Bobtail was the latest stable release and that was why I picked 
that version and started to make a few modifications. I have not 
ported my changes to 0.79 yet. The plan is if v-0.79 can provide a 
higher disk bandwidth efficiency, I will switch to 0.79. 
Unfortunately, that does not seem to be the case.


The futex trace was done with version 0.79, not 0.59. I did a profile 
in 0.59 too. There are some improvements, such as the introduction of 
fd cache. But lots of futex calls are still there in v-0.79. I also 
measured the maximum bandwidth from each disk we can get in Version 
0.79. It does not improve significantly: we can still only get 90~100 
MB/s from each disk.


Thanks,
Xing


On Apr 25, 2014, at 2:42 PM, Gregory Farnum g...@inktank.com wrote:


Bobtail is really too old to draw any meaningful conclusions from; why
did you choose it?

That's not to say that performance on current code will be better
(though it very much might be), but the internal architecture has
changed in some ways that will be particularly important for the futex
profiling you did, and are probably important for these throughput
results as well.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Fri, Apr 25, 2014 at 1:38 PM, Xing xing...@cs.utah.edu wrote:

Hi,

I also did a few other experiments, trying to get what the maximum 
bandwidth we can get from each data disk. The output is not 
encouraging: for disks that can provide 150 MB/s block-level 
sequential read bandwidths, we can only get about 90MB/s from each 
disk. Something that is particular interesting is that the replica 
size also affects the bandwidth we could get from the cluster. It 
seems that there is no such observation/conversations in the Ceph 
community and I think it may be helpful to share my findings.


The experiment was run with two d820 machines in Emulab at 
University of Utah. One is used as the data node and the other is 
used as the client. They are connected by a 10 GB/s Ethernet. The 
data node has 7 disks, one for OS and the rest 6 for OSDs. For the 
rest 6 disks, we use one for journal and the other for data. Thus 
in total we have 3 OSDs. The network bandwidth is sufficient to 
support reading from 3 disks in full bandwidth.


I varied the read-ahead size for the rbd block device (exp1), osd 
op threads for each osd (exp2), varied the replica size (exp3), and 
object size (exp4). The most interesting is varying the replica 
size. As I varied replica size from 1, to 2 and to 3, the 
aggregated bandwidth dropped from 267 MB/s to 211 and 180. The 
reason for the drop I believe is as we increase the number of 
replicas, we store more data into each OSD. then when we need to 
read it back, we have to read from a larger range (more seeks). The 
fundamental problem is likely because we are doing replication 
synchronously, and thus layout object files in a raid 10 - near 
format, rather than the far format. For the difference between the 
near format and far format for raid 10, you could have a look at 
the link provided below.


http://lxr.free-electrons.com/source/Documentation/device-mapper/dm-raid.txt 



For results about other experiments, you could download my slides 
at the link provided below.

http://www.cs.utah.edu/~xinglin/slides/ceph-bandiwdth-exp.pptx


I do not know why Ceph only gets about 60% of the disk bandwidth. 
To do a comparison, I ran tar to read every rbd object files to 
create a tarball and see how much bandwidth I can get from this 
workload. Interestingly, the tar workload actually gets a higher 
bandwidth (80% of block level bandwidth), even though it is 
accessing the disk more randomly (tar reads each object file in a 
dir sequentially while the object files were created in a different 
order.). For more detail, please goto my blog to have a read.
http://xinglin-system.blogspot.com/2014/04/ceph-lab-note-1-disk-read-bandwidth-in.html 



Here are a few questions.
1. What are the maximum bandwidth people can get from each disk? I 
found Jiangang from Intel also reported 57% efficiency for disk 
bandwidth. He suggested one reason: interference among so many 
sequential read workloads. I agree but when I tried to run with one 
single workload, I still do not get a higher efficiency.
2. If the efficiency is about 60%, what are the reasons that cause 
this? Could it be because of the locks (futex as I mentioned in my 
previous email

Re: bandwidth with Ceph - v0.59 (Bobtail)

2014-04-25 Thread Xing Lin

Hi Mark,

Thanks for sharing this. I did read these blog posts earlier. If we look at the
aggregate bandwidth, 600-700 MB/s of reads from 6 disks is quite good. But
considering it is shared among 256 concurrent read streams, each stream gets as
little as 2-3 MB/s. That does not sound right.


I think the read bandwidth of a disk should be close to its write bandwidth,
but just to double-check: what sequential read bandwidth can your disks
provide?


I also read your follow-up posts comparing Bobtail and Cuttlefish. One thing I
do not get from those experiments is that they hit the network bottleneck well
before being bottlenecked by the disks. Could you set up a smaller cluster
(e.g., 8 disks rather than 24), so that a 10 Gb/s link does not become the
bottleneck, and then test how much disk bandwidth can be achieved, preferably
with a newer release of Ceph? My other concern is that I am not sure how close
RADOS bench results are to kernel RBD performance. I would appreciate it if you
could do that.
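
For reference, the kind of rados bench run I have in mind looks roughly like
this (pool name and runtimes are placeholders; newer releases need --no-cleanup
on the write pass so that the seq pass has objects to read back):

$ rados -p testpool bench 60 write -b 4194304 -t 256 --no-cleanup
$ rados -p testpool bench 60 seq -t 256

Thanks,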


Xing

On 04/25/2014 04:16 PM, Mark Nelson wrote:
I don't have any recent results published, but you can see some of the 
older results from bobtail here:


http://ceph.com/performance-2/argonaut-vs-bobtail-performance-preview/

Specifically, look at the 256 concurrent 4MB rados bench tests. In a 6 
disk, 2 SSD configuration we could push about 800MB/s for writes (no 
replication) and around 600-700MB/s for reads with BTRFS.  On this 
controller using a RAID0 configuration with WB cache helps quite a 
bit, but in other tests I've seen similar results with a 9207-8i that 
doesn't have WB cache when BTRFS filestores and SSD journals are used.


Regarding the drives, they can do somewhere around 140-150MB/s large 
block writes with fio.


Replication definitely adds additional latency so aggregate write 
throughput goes down, though it seems the penalty is worst after the 
first replica and doesn't hurt as much with subsequent ones.


Mark


On 04/25/2014 04:50 PM, Xing Lin wrote:

Hi Mark,

That seems pretty good. What is the block level sequential read
bandwidth of your disks? What configuration did you use? What was the
replica size, read_ahead for your rbds and what were the number of
workloads you used? I used btrfs in my experiments as well.

Thanks,
Xing

On 04/25/2014 03:36 PM, Mark Nelson wrote:

For what it's worth, I've been able to achieve up to around 120MB/s
with btrfs before things fragment.

Mark

On 04/25/2014 03:59 PM, Xing wrote:

Hi Gregory,

Thanks very much for your quick reply. When I started to look into
Ceph, Bobtail was the latest stable release and that was why I picked
that version and started to make a few modifications. I have not
ported my changes to 0.79 yet. The plan is if v-0.79 can provide a
higher disk bandwidth efficiency, I will switch to 0.79.
Unfortunately, that does not seem to be the case.

The futex trace was done with version 0.79, not 0.59. I did a profile
in 0.59 too. There are some improvements, such as the introduction of
fd cache. But lots of futex calls are still there in v-0.79. I also
measured the maximum bandwidth from each disk we can get in Version
0.79. It does not improve significantly: we can still only get 90~100
MB/s from each disk.

Thanks,
Xing


On Apr 25, 2014, at 2:42 PM, Gregory Farnum g...@inktank.com wrote:

Bobtail is really too old to draw any meaningful conclusions from; 
why

did you choose it?

That's not to say that performance on current code will be better
(though it very much might be), but the internal architecture has
changed in some ways that will be particularly important for the 
futex

profiling you did, and are probably important for these throughput
results as well.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Fri, Apr 25, 2014 at 1:38 PM, Xing xing...@cs.utah.edu wrote:

Hi,

I also did a few other experiments, trying to get what the maximum
bandwidth we can get from each data disk. The output is not
encouraging: for disks that can provide 150 MB/s block-level
sequential read bandwidths, we can only get about 90MB/s from each
disk. Something that is particular interesting is that the replica
size also affects the bandwidth we could get from the cluster. It
seems that there is no such observation/conversations in the Ceph
community and I think it may be helpful to share my findings.

The experiment was run with two d820 machines in Emulab at
University of Utah. One is used as the data node and the other is
used as the client. They are connected by a 10 GB/s Ethernet. The
data node has 7 disks, one for OS and the rest 6 for OSDs. For the
rest 6 disks, we use one for journal and the other for data. Thus
in total we have 3 OSDs. The network bandwidth is sufficient to
support reading from 3 disks in full bandwidth.

I varied the read-ahead size for the rbd block device (exp1), osd
op threads for each osd (exp2), varied the replica size (exp3), and
object size (exp4

Re: bandwidth with Ceph - v0.59 (Bobtail)

2014-04-25 Thread Xing
One more thing to consider is that as we add more sequential workloads to a
disk, the aggregate bandwidth starts to drop. For example, on a TOSHIBA
MBF2600RC SCSI disk we get 155 MB/s of sequential read bandwidth with a single
workload; as we add more workloads, the aggregate bandwidth drops.

number of SRs, aggregated bandwidth (KB/s)
1 154145
2 144296
3 147994
4 141063
5 134698
6 133874
7 130915
8 132366
9 97068
10 111897
11 108508.5
12 106450.9
13 105521.9
14 102411.7
15 102618.2
16 102779.1
17 102745

As we can see, we only get about 100 MB/s once we are running ~10 concurrent
workloads. In your case, with 256 concurrent read streams over 6-8 disks, I
would expect the aggregate bandwidth to be below 100 MB/s per disk. Any
thoughts?
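
Something like the following fio invocation reproduces the N-streams case
(N, the target directory, and the file size are placeholders; each job reads
its own large file sequentially):

$ fio --name=sr --rw=read --bs=4m --direct=1 --ioengine=sync \
      --directory=/mnt/testdisk --size=10g --numjobs=8 \
      --runtime=60 --time_based --group_reporting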

Thanks,
Xing

On Apr 25, 2014, at 5:47 PM, Xing Lin xing...@cs.utah.edu wrote:

 Hi Mark,
 
 Thanks for sharing this. I did read these blogs early. If we look at the 
 aggregated bandwidth, 600-700 MB/s for reads for 6 disks are quite good. But 
 consider it is shared among 256 concurrent read streams, each one gets as 
 little as 2-3 MB/s bandwidth. This does not sound that right.
 
 I think the read bandwidth of a disk will be close to its write bandwidth. 
 But just double-check: what the sequential read bandwidth your disks can 
 provide?
 
 I also read your follow-up blogs, comparing bobtail and cuttlefish. One thing 
 I do not get from your experiments is that it hit the network bottleneck much 
 earlier before being bottlenecked by disks. Could you setup a smaller cluster 
 (e.g with 8 disks, rather than 24) such as a 10 Gb/s link will not become the 
 bottleneck and then test how much disk bandwidth can be achieved, preferably 
 with new releases of Ceph. The other concern is I am not sure how close RADOS 
 bench results are when compared with kernel RBD performance. I would 
 appreciate it if you can do that. Thanks,
 
 Xing
 
 On 04/25/2014 04:16 PM, Mark Nelson wrote:
 I don't have any recent results published, but you can see some of the older 
 results from bobtail here:
 
 http://ceph.com/performance-2/argonaut-vs-bobtail-performance-preview/
 
 Specifically, look at the 256 concurrent 4MB rados bench tests. In a 6 disk, 
 2 SSD configuration we could push about 800MB/s for writes (no replication) 
 and around 600-700MB/s for reads with BTRFS.  On this controller using a 
 RAID0 configuration with WB cache helps quite a bit, but in other tests I've 
 seen similar results with a 9207-8i that doesn't have WB cache when BTRFS 
 filestores and SSD journals are used.
 
 Regarding the drives, they can do somewhere around 140-150MB/s large block 
 writes with fio.
 
 Replication definitely adds additional latency so aggregate write throughput 
 goes down, though it seems the penalty is worst after the first replica and 
 doesn't hurt as much with subsequent ones.
 
 Mark
 
 
 On 04/25/2014 04:50 PM, Xing Lin wrote:
 Hi Mark,
 
 That seems pretty good. What is the block level sequential read
 bandwidth of your disks? What configuration did you use? What was the
 replica size, read_ahead for your rbds and what were the number of
 workloads you used? I used btrfs in my experiments as well.
 
 Thanks,
 Xing
 
 On 04/25/2014 03:36 PM, Mark Nelson wrote:
 For what it's worth, I've been able to achieve up to around 120MB/s
 with btrfs before things fragment.
 
 Mark
 
 On 04/25/2014 03:59 PM, Xing wrote:
 Hi Gregory,
 
 Thanks very much for your quick reply. When I started to look into
 Ceph, Bobtail was the latest stable release and that was why I picked
 that version and started to make a few modifications. I have not
 ported my changes to 0.79 yet. The plan is if v-0.79 can provide a
 higher disk bandwidth efficiency, I will switch to 0.79.
 Unfortunately, that does not seem to be the case.
 
 The futex trace was done with version 0.79, not 0.59. I did a profile
 in 0.59 too. There are some improvements, such as the introduction of
 fd cache. But lots of futex calls are still there in v-0.79. I also
 measured the maximum bandwidth from each disk we can get in Version
 0.79. It does not improve significantly: we can still only get 90~100
 MB/s from each disk.
 
 Thanks,
 Xing
 
 
 On Apr 25, 2014, at 2:42 PM, Gregory Farnum g...@inktank.com wrote:
 
 Bobtail is really too old to draw any meaningful conclusions from; why
 did you choose it?
 
 That's not to say that performance on current code will be better
 (though it very much might be), but the internal architecture has
 changed in some ways that will be particularly important for the futex
 profiling you did, and are probably important for these throughput
 results as well.
 -Greg
 Software Engineer #42 @ http://inktank.com | http://ceph.com
 
 
 On Fri, Apr 25, 2014 at 1:38 PM, Xing xing...@cs.utah.edu wrote:
 Hi,
 
 I also did a few other experiments, trying to get what the maximum
 bandwidth we can get from each data disk. The output is not
 encouraging

Re: [PATCH] test/libcephfs: free cmount after tests finishes

2013-11-03 Thread Xing
Hi Sage,

Thanks for applying these two patches. I will try to accumulate more fixes and 
submit pull requests via github later. 

Thanks,
Xing

On Nov 3, 2013, at 12:17 AM, Sage Weil s...@inktank.com wrote:

 Applied this one too!
 
 BTW, an easier workflow than sending patches to the list is to accumulate 
 a batch of fixes in a branch and submit a pull request via github (at 
 least if you're already a github user).  Whichever works well for you.
 
 Thanks!
 s



Re: unable to compile

2013-11-03 Thread Xing
Hi all,

It looks like it is because g++ does not support designated initializer lists
(gcc does):
http://stackoverflow.com/questions/18731707/why-does-c11-not-support-designated-initializer-list-as-c99

I am able to compile and run vstart.sh again after I reverted the following 
commit by Noah.
6efc2b54d5ce85fcb4b66237b051bcbb5072e6a3

Noah, do you have any feedback?

Thanks,
Xing

On Nov 3, 2013, at 1:22 PM, Xing xing...@cs.utah.edu wrote:

 Hi all,
 
 I was able to compile and run vstart.sh yesterday. When I merged the ceph 
 master branch into my local master branch, I am not able to compile now. This 
 issue seems to be related with boost but I have installed the 
 libboost1.46-dev package. Any suggestions? thanks,
 
 ===
 the compiler complains the following code section.
 
 // file layouts
 struct ceph_file_layout g_default_file_layout = {
 .fl_stripe_unit = init_le32(122),
 .fl_stripe_count = init_le32(1),
 .fl_object_size = init_le32(122),
 .fl_cas_hash = init_le32(0),
 .fl_object_stripe_unit = init_le32(0),
 .fl_unused = init_le32(-1),
 .fl_pg_pool = init_le32(-1),
 };
 
 ===make errors===
 common/config.cc:61:2: error: expected primary-expression before '.' token
 common/config.cc:61:20: warning: extended initializer lists only available 
 with -std=c++0x or -std=gnu++0x [enabled by default]
 common/config.cc:62:2: error: expected primary-expression before '.' token
 common/config.cc:62:21: warning: extended initializer lists only available 
 with -std=c++0x or -std=gnu++0x [enabled by default]
 common/config.cc:63:2: error: expected primary-expression before '.' token
 common/config.cc:63:20: warning: extended initializer lists only available 
 with -std=c++0x or -std=gnu++0x [enabled by default]
 common/config.cc:64:2: error: expected primary-expression before '.' token
 common/config.cc:64:17: warning: extended initializer lists only available 
 with -std=c++0x or -std=gnu++0x [enabled by default]
 common/config.cc:65:2: error: expected primary-expression before '.' token
 common/config.cc:65:27: warning: extended initializer lists only available 
 with -std=c++0x or -std=gnu++0x [enabled by default]
 common/config.cc:66:2: error: expected primary-expression before '.' token
 common/config.cc:66:15: warning: extended initializer lists only available 
 with -std=c++0x or -std=gnu++0x [enabled by default]
 common/config.cc:67:2: error: expected primary-expression before '.' token
 common/config.cc:67:16: warning: extended initializer lists only available 
 with -std=c++0x or -std=gnu++0x [enabled by default]
 
 I am compiling ceph in a 64-bit ubuntu12-04 machine and gcc version is 4.6.3. 
 
 ===gcc version===
 root@client:/mnt/ceph# gcc -v
 Using built-in specs.
 COLLECT_GCC=gcc
 COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/4.6/lto-wrapper
 Target: x86_64-linux-gnu
 Configured with: ../src/configure -v --with-pkgversion='Ubuntu/Linaro 
 4.6.3-1ubuntu5' --with-bugurl=file:///usr/share/doc/gcc-4.6/README.Bugs 
 --enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr 
 --program-suffix=-4.6 --enable-shared --enable-linker-build-id 
 --with-system-zlib --libexecdir=/usr/lib --without-included-gettext 
 --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.6 
 --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu 
 --enable-libstdcxx-debug --enable-libstdcxx-time=yes 
 --enable-gnu-unique-object --enable-plugin --enable-objc-gc --disable-werror 
 --with-arch-32=i686 --with-tune=generic --enable-checking=release 
 --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
 Thread model: posix
 gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5)
 
 Thanks,
 Xing
 
 
 
 



Re: unable to compile

2013-11-03 Thread Xing Lin

Thanks, Noah!

Xing

On 11/3/2013 3:17 PM, Noah Watkins wrote:
Thanks for looking at this. Unless there is a good solution, I think
reverting it is OK, as breaking the compile on a few platforms is not OK.
I'll be looking at this tonight.




[PATCH] osd/erasurecode: Fix memory leak in jerasure_matrix_to_bitmatrix()

2013-11-02 Thread Xing Lin
Free allocated memory before return because of NULL input

Signed-off-by: Xing Lin xing...@cs.utah.edu
---
 src/osd/ErasureCodePluginJerasure/jerasure.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/src/osd/ErasureCodePluginJerasure/jerasure.c 
b/src/osd/ErasureCodePluginJerasure/jerasure.c
index 9efae02..1bb4b1d 100755
--- a/src/osd/ErasureCodePluginJerasure/jerasure.c
+++ b/src/osd/ErasureCodePluginJerasure/jerasure.c
@@ -276,7 +276,10 @@ int *jerasure_matrix_to_bitmatrix(int k, int m, int w, int 
*matrix)
   int rowelts, rowindex, colindex, elt, i, j, l, x;
 
   bitmatrix = talloc(int, k*m*w*w);
-  if (matrix == NULL) { return NULL; }
+  if (matrix == NULL) {
+free(bitmatrix);
+return NULL; 
+  }
 
   rowelts = k * w;
   rowindex = 0;
-- 
1.8.3.4 (Apple Git-47)



[PATCH] osd/erasurecode: correct one variable name in jerasure_matrix_to_bitmatrix()

2013-11-02 Thread Xing Lin
When bitmatrix is NULL, this function returns NULL.

Signed-off-by: Xing Lin xing...@cs.utah.edu
---
 src/osd/ErasureCodePluginJerasure/jerasure.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/osd/ErasureCodePluginJerasure/jerasure.c 
b/src/osd/ErasureCodePluginJerasure/jerasure.c
index 9efae02..d5752a8 100755
--- a/src/osd/ErasureCodePluginJerasure/jerasure.c
+++ b/src/osd/ErasureCodePluginJerasure/jerasure.c
@@ -276,7 +276,7 @@ int *jerasure_matrix_to_bitmatrix(int k, int m, int w, int 
*matrix)
   int rowelts, rowindex, colindex, elt, i, j, l, x;
 
   bitmatrix = talloc(int, k*m*w*w);
-  if (matrix == NULL) { return NULL; }
+  if (bitmatrix == NULL) { return NULL; }
 
   rowelts = k * w;
   rowindex = 0;
-- 
1.8.3.4 (Apple Git-47)



Re: coverity scan - a plea for help!

2013-10-31 Thread Xing Lin

Hi Sage,

I would like to help here as well.

Thanks,
Xing

On 10/31/2013 5:30 PM, Sage Weil wrote:

Hi everyone,

When I send this out several months ago, Danny Al-Gaaf stepped up and
submitted an amazing number of patches cleaning up the most concerning
issues that Coverity had picked up.  His attention has been directed
elsewhere more recently, but there are still a number of outstanding
issues in Coverity's tracker that are reasonably quick and easy to resolve
and will make our ability to identify newly introduced defects much
simpler.

Coverity Scan makes it really easy to participate: just create an account
and I can grant you access to the Ceph project.  If you're interested in
contributing here (and it's an easy way to quickly start working with the
Ceph code), let me know!

Thanks-
sage


On Thu, 9 May 2013, Sage Weil wrote:


We were added to coverity's awesome scan program a while back, which gives
free access to their static analysis tool to open source projects.

Currently it identifies 421 issues.  We've already taken care of the ones
that are highest impact, but the usefulness of periodic scans is limited
until we can eliminate the noise from the remaining issues and easily see
when new problems come up.

If anybody is interested in helping out in the cleanup effort, let me know
and I'll share the login info.  This would provide significant value to
our overall quality efforts and is a pretty easy way to make a meaningful
contribution to the project without a huge investment in understanding the
code and architecture!

sage




Re: When ceph synchronizes journal to disk?

2013-03-05 Thread Xing Lin
Thanks very much for all your explanations. Things are much clearer to me now.
Have a great day!


Xing

On 03/05/2013 01:12 PM, Greg Farnum wrote:

All the data goes to the disk in write-back mode so it isn't safe yet
until the flush is called. That's why it goes into the journal first, to
be consistent at all times.
  
If you would buffer everything in the journal and flush that at once you

would overload the disk for that time.
  
Let's say you have 300MB in the journal after 10 seconds and you want to

flush that at once. That would mean that specific disk is unable to do
any other operations then writing with 60MB/sec for 5 seconds.
  
It's better to always write in write-back mode to the disk and flush at

a certain point.
  
In the meantime the scheduler can do it's job to balance between the

reads and the writes.
  
Wido

Yep, what Wido said. Specifically, we do force the data to the journal with an 
fsync or equivalent before responding to the client, but once it's stable on 
the journal we give it to the filesystem (without doing any sort of forced 
sync). This is necessary — all reads are served from the filesystem.
-Greg




Re: When ceph synchronizes journal to disk?

2013-03-04 Thread Xing Lin
No, I am using xfs. The same thing happens even when I specify the journal
mode explicitly as follows.


filestore journal parallel = false
filestore journal writeahead = true

Xing

On 03/04/2013 09:32 AM, Sage Weil wrote:

Are you using btrfs?  In that case, the journaling is parallel to the fs
workload because the btrfs snapshots provide us with a stable checkpoint
we can replay from.  In contrast, for non-btrfs file systems we need to do
writeahead journaling.




Re: When ceph synchronizes journal to disk?

2013-03-04 Thread Xing Lin

Hi Gregory,

Thanks for your reply.

On 03/04/2013 09:55 AM, Gregory Farnum wrote:

The journal [min|max] sync interval values specify how frequently
the OSD's FileStore sends a sync to the disk. However, data is still
written into the normal filesystem as it comes in, and the normal
filesystem continues to schedule normal dirty data writeouts. This is
good — it means that when we do send a sync down you don't need to
wait for all (30 seconds * 100MB/s) 3GB or whatever of data to go to
disk before it's completed.


I do not think I understand this well. When the writeahead journal mode is in
use, would you please explain what happens to a single 4 MB write request? I
assume that a journal entry is created for the write, and once that entry is
flushed to the journal disk, Ceph returns success. There should be no I/O to
the OSD's data disk; all I/O is supposed to go to the journal disk. At a later
time, Ceph starts applying these changes to the normal filesystem, reading from
the first entry after the point where its previous synchronization stopped, and
eventually it reaches this entry and applies the write to the normal
filesystem. Could you please point out where my understanding is wrong? Thanks,



I am running 0.48.2. The related configuration is as follows.

If you're starting up a new cluster I recommend upgrading to the
bobtail series (.56.3) instead of using Argonaut — it's got a number
of enhancements you'll appreciate!


Yeah, I would like to use the Bobtail series. However, I started making small
changes on Argonaut (0.48) and already ported them once to 0.48.2 when it was
released. I think I am fine continuing with it for the moment; I may port my
changes to the Bobtail series later. Thanks,


Xing


Re: When ceph synchronizes journal to disk?

2013-03-04 Thread Xing Lin

Maybe it is easier to tell in this way.

What we want to see is that the newly written data to stay in the 
journal disk for as long as possible such that write workloads do not 
compete for disk headers for read workloads. Any way to achieve that in 
Ceph? Thanks,
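
As far as I can tell, the knobs Greg refers to are the filestore sync
intervals, set in the [osd] section of ceph.conf (the values here are only
illustrative, not recommendations):

filestore min sync interval = 10
filestore max sync interval = 30

But as the quoted reply below explains, dirty data is still handed to the
backing filesystem as it arrives, so this only delays the forced syncs; it does
not keep the data on the journal disk alone. Thanks,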


Xing

On 03/04/2013 09:55 AM, Gregory Farnum wrote:

The journal [min|max] sync interval values specify how frequently
the OSD's FileStore sends a sync to the disk. However, data is still
written into the normal filesystem as it comes in, and the normal
filesystem continues to schedule normal dirty data writeouts. This is
good — it means that when we do send a sync down you don't need to
wait for all (30 seconds * 100MB/s) 3GB or whatever of data to go to
disk before it's completed.




Re: ssh passwords

2013-01-22 Thread Xing Lin
If it is the 'mkcephfs' command that asked you for an ssh password, that is
probably because the script needs to push some files (e.g., ceph.conf) to the
other hosts. If we open the script, we can see that it uses 'scp' to send some
files. If I remember correctly, it will ask for the ssh password seven times
for every osd on other hosts. So we'd better set up public-key authentication
first. :)
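
A minimal sketch of that setup (host names and the key type are placeholders):

$ ssh-keygen -t rsa                                       # generate a key pair on the admin host
$ for h in osd1 osd2 osd3; do ssh-copy-id root@$h; done   # push the public key to every node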


Xing

On 01/22/2013 11:24 AM, Gandalf Corvotempesta wrote:

Hi all,
i'm trying my very first ceph installation following the 5-minutes quickstart:
http://ceph.com/docs/master/start/quick-start/#install-debian-ubuntu

just a question: why ceph is asking me  for SSH password? Is ceph
trying to connect to itself via SSH?


Re: ssh passwords

2013-01-22 Thread Xing Lin
I like the current approach. It is more convenient to run commands once on one
host to do all the setup work. The first time I deployed a ceph cluster with 4
hosts, I thought 'service ceph start' would start the whole cluster, but as it
turns out it only starts the local osd and mon processes. So currently I am
using polysh to run the same commands on all hosts (mostly to restart the ceph
service before every measurement). Thanks.


Xing

On 01/22/2013 12:35 PM, Neil Levine wrote:

Out of interest, would people prefer that the Ceph deployment script
didn't try to handle server-server file copy and just did the local
setup only, or is it useful that it tries to be a mini-config
management tool at the same time?

Neil




Re: ssh passwords

2013-01-22 Thread Xing Lin

I did not notice that there exists such a parameter. Thanks, Dan!

Xing

On 01/22/2013 02:11 PM, Dan Mick wrote:
The '-a/--allhosts' parameter is to spread the command across the 
cluster...that is, service ceph -a start will start across the cluster.






Re: Test infrastructure: 2 or more servers?

2013-01-15 Thread Xing Lin

You can change the number replicas in runtime with the following command:

$ ceph osd pool set {poolname} size {num-replicas}

Xing

On 01/15/2013 03:00 PM, Gandalf Corvotempesta wrote:

Is possible to change the number of replicas in realtime ?




Re: Test infrastructure: 2 or more servers?

2013-01-15 Thread Xing Lin
It seems to be safe: Ceph will shuffle data to rebalance in situations such as
when we change the replica number or when some nodes or disks go down.


Xing

On 01/15/2013 03:26 PM, Gandalf Corvotempesta wrote:

2013/1/15 Xing Lin xing...@cs.utah.edu:

You can change the number replicas in runtime with the following command:

$ ceph osd pool set {poolname} size {num-replicas}

So it's absolutely safe to start with just 2 server, make all the
necessary tests and when ready to go in production, increase the
servers and replicas?






[debug help]: get dprintk() outputs in src/crush/mapper.c or net/crush/mapper.c

2013-01-13 Thread Xing Lin

Hi,

There are many dprintk() statements in the file mapper.c. How can we see the
output of these statements? I would like to see it, to get a better
understanding of how these functions work together and then improve the
algorithm I added. Also, adding a printf() statement to crush_do_rule() or to
my simple bucket_directmap_choose() does not give me any output. Thanks.


Xing


Re: which Linux kernel version corresponds to 0.48argonaut?

2013-01-05 Thread Xing Lin

Hi Mark,

Thanks for your reply. I do not think I am running the packaged version. 
The output shows it is my version (0.48.2argonaut.fast at commit 000...).


root@client:/users/utos# rbd -v
ceph version 0.48.2argonaut.fast 
(commit:0)

root@client:/users/utos# /usr/bin/rbd -v
ceph version 0.48.2argonaut 
(commit:3e02b2fad88c2a95d9c0c86878f10d1beb780bfe)

root@client:/users/utos# /usr/local/bin/rbd -v
ceph version 0.48.2argonaut.fast 
(commit:0)


Xing

On 01/05/2013 08:00 PM, Mark Kirkwood wrote:
I'd hazard a guess that you are still (accidentally) running the 
packaged binary - the packaged version installs in /usr/bin (etc) but 
your source build will probably be in /usr/local/bin. I've been 
through this myself and purged the packaged version before building 
and installing from source (just to be sure).


Cheers

Mark

On 06/01/13 14:55, Xing Lin wrote:


After changing the client-side code, I can map/unmap rbd block devices
at client machines. However, I am not able to list rbd block devices. At
the client machine, I first installed 0.48.2argonaut package for Ubuntu
then I compiled and installed my own version according to instructions
on this page ( http://ceph.com/docs/master/install/building-ceph/). The
client failed to recognize the fifth bucket algorithm I added. I
searched unsupported bucket algorithm in the ceph code base and that
text only appeared in the src/crush/CrushWrapper.cc. I checked
decode_crush_bucket() and it should be able to recognize the fifth
algorithm. Even after I changed the error message (added print of
[XXX] and values for two bucket algorithm macros), it still prints the
same error message. So, it seems that my new version of CrushWrapper.cc
is not used during compilation to create the final rbd binary. Would you
please tell me where the problem is and how I can fix it? Thank you very
much.





Re: which Linux kernel version corresponds to 0.48argonaut?

2013-01-05 Thread Xing Lin
It works now. The old versions of the .so files in /usr/lib were being linked,
instead of the new versions installed in /usr/local/lib. Thanks, Sage.


Xing

On 01/05/2013 09:46 PM, Sage Weil wrote:

Hi,

The rbd binary is dynamically linking to librbd1.so and librados2.so
(usually in /usr/lib).  You need to make sure that the .so's you compiled
are the ones it links to to get your code to run.

sage


On Sat, 5 Jan 2013, Xing Lin wrote:


Hi Mark,

Thanks for your reply. I do not think I am running the packaged version. The
output shows it is my version (0.48.2argonaut.fast at commit 000...).

root@client:/users/utos# rbd -v
ceph version 0.48.2argonaut.fast
(commit:0)
root@client:/users/utos# /usr/bin/rbd -v
ceph version 0.48.2argonaut (commit:3e02b2fad88c2a95d9c0c86878f10d1beb780bfe)
root@client:/users/utos# /usr/local/bin/rbd -v
ceph version 0.48.2argonaut.fast
(commit:0)

Xing

On 01/05/2013 08:00 PM, Mark Kirkwood wrote:

I'd hazard a guess that you are still (accidentally) running the packaged
binary - the packaged version installs in /usr/bin (etc) but your source
build will probably be in /usr/local/bin. I've been through this myself and
purged the packaged version before building and installing from source (just
to be sure).

Cheers

Mark

On 06/01/13 14:55, Xing Lin wrote:


After changing the client-side code, I can map/unmap rbd block devices
at client machines. However, I am not able to list rbd block devices. At
the client machine, I first installed 0.48.2argonaut package for Ubuntu
then I compiled and installed my own version according to instructions
on this page ( http://ceph.com/docs/master/install/building-ceph/). The
client failed to recognize the fifth bucket algorithm I added. I
searched unsupported bucket algorithm in the ceph code base and that
text only appeared in the src/crush/CrushWrapper.cc. I checked
decode_crush_bucket() and it should be able to recognize the fifth
algorithm. Even after I changed the error message (added print of
[XXX] and values for two bucket algorithm macros), it still prints the
same error message. So, it seems that my new version of CrushWrapper.cc
is not used during compilation to create the final rbd binary. Would you
please tell me where the problem is and how I can fix it? Thank you very
much.




which Linux kernel version corresponds to 0.48argonaut?

2012-12-20 Thread Xing Lin

Hi,

I was trying to add a simple replica placement algorithm to Ceph. The algorithm
simply returns the r-th item in a bucket for the r-th replica. I have made that
change in the Ceph source code (in files such as crush.h, crush.c, mapper.c,
...) and I can run the Ceph monitor and osd daemons. However, I am not able to
map rbd block devices on client machines: 'rbd map image0' reports an
input/output error, and 'dmesg' on the client shows a message like libceph:
handle_map corrupt msg. I believe that is because I have not ported my changes
to the Ceph client side, so it does not recognize the new placement algorithm;
I probably need to recompile the rbd block device driver. When I tried to
replace the Ceph-related files in Linux with my own versions, I noticed that
the files in Linux 3.2.16 differ from those included in the Ceph source. For
example, the following is the diff of crush.h between Linux 3.2.16 and
0.48argonaut. So my question is: is there a Linux version that contains exactly
the same Ceph files as 0.48argonaut? Thanks.


---
 $ diff -uNrp ceph-0.48argonaut/src/crush/crush.h 
linux-3.2.16/include/linux/crush/crush.h
--- ceph-0.48argonaut/src/crush/crush.h2012-06-26 11:56:36.0 
-0600
+++ linux-3.2.16/include/linux/crush/crush.h2012-04-22 
16:31:32.0 -0600

@@ -1,12 +1,7 @@
 #ifndef CEPH_CRUSH_CRUSH_H
 #define CEPH_CRUSH_CRUSH_H

-#if defined(__linux__)
 #include <linux/types.h>
-#elif defined(__FreeBSD__)
-#include <sys/types.h>
-#include "include/inttypes.h"
-#endif

 /*
  * CRUSH is a pseudo-random data distribution algorithm that
@@ -156,24 +151,25 @@ struct crush_map {
 struct crush_bucket **buckets;
 struct crush_rule **rules;

+/*
+ * Parent pointers to identify the parent bucket a device or
+ * bucket in the hierarchy.  If an item appears more than
+ * once, this is the _last_ time it appeared (where buckets
+ * are processed in bucket id order, from -1 on down to
+ * -max_buckets.
+ */
+__u32 *bucket_parents;
+__u32 *device_parents;
+
 __s32 max_buckets;
 __u32 max_rules;
 __s32 max_devices;
-
-/* choose local retries before re-descent */
-__u32 choose_local_tries;
-/* choose local attempts using a fallback permutation before
- * re-descent */
-__u32 choose_local_fallback_tries;
-/* choose attempts before giving up */
-__u32 choose_total_tries;
-
-__u32 *choose_tries;
 };


 /* crush.c */
-extern int crush_get_bucket_item_weight(const struct crush_bucket *b, 
int pos);

+extern int crush_get_bucket_item_weight(struct crush_bucket *b, int pos);
+extern void crush_calc_parents(struct crush_map *map);
 extern void crush_destroy_bucket_uniform(struct crush_bucket_uniform *b);
 extern void crush_destroy_bucket_list(struct crush_bucket_list *b);
 extern void crush_destroy_bucket_tree(struct crush_bucket_tree *b);
@@ -181,9 +177,4 @@ extern void crush_destroy_bucket_straw(s
 extern void crush_destroy_bucket(struct crush_bucket *b);
 extern void crush_destroy(struct crush_map *map);

-static inline int crush_calc_tree_node(int i)
-{
-return ((i+1) << 1)-1;
-}
-
 #endif


Xing


Re: which Linux kernel version corresponds to 0.48argonaut?

2012-12-20 Thread Xing Lin

This may be useful for other Ceph newbies like me.

I have ported my changes for 0.48argonaut to the related Ceph files included in
Linux, even though files with the same names are not exactly the same, and then
recompiled and installed the kernel. After that, everything seems to be working
again: Ceph works with my new simple replica placement algorithm. :)
So it seems that the Ceph files included in the Linux kernel are expected to
differ from those in 0.48argonaut. Presumably the Linux kernel contains the
client-side implementation while 0.48argonaut contains the server-side
implementation. I would appreciate it if someone could confirm this. Thank you!


Xing

On 12/20/2012 11:54 AM, Xing Lin wrote:

Hi,

I was trying to add a simple replica placement algorithm in Ceph. This 
algorithm simply returns r_th item in a bucket for the r_th replica. I 
have made that change in Ceph source code (including files such as 
crush.h, crush.c, mapper.c, ...) and I can run Ceph monitor and osd 
daemons. However, I am not able to map rbd block devices at client 
machines. 'rbd map image0' reported input/output error and 'dmesg' 
at the client machine showed message like libceph: handle_map corrupt 
msg. I believe that is because I have not ported my changes to Ceph 
client side programs and it does not recognize the new placement 
algorithm. I probably need to recompile the rbd block device driver. 
When I was trying to replace Ceph related files in Linux with my own 
version, I noticed that files in Linux-3.2.16 are different from these 
included in Ceph source code. For example, the following is the diff 
of crush.h in Linux-3.2.16 and 0.48argonaut. So, my question is that 
is there any version of Linux that contains the exact Ceph files as 
included in 0.48argonaut? Thanks.


---
 $ diff -uNrp ceph-0.48argonaut/src/crush/crush.h 
linux-3.2.16/include/linux/crush/crush.h
--- ceph-0.48argonaut/src/crush/crush.h2012-06-26 
11:56:36.0 -0600
+++ linux-3.2.16/include/linux/crush/crush.h2012-04-22 
16:31:32.0 -0600

@@ -1,12 +1,7 @@
 #ifndef CEPH_CRUSH_CRUSH_H
 #define CEPH_CRUSH_CRUSH_H

-#if defined(__linux__)
 #include linux/types.h
-#elif defined(__FreeBSD__)
-#include sys/types.h
-#include include/inttypes.h
-#endif

 /*
  * CRUSH is a pseudo-random data distribution algorithm that
@@ -156,24 +151,25 @@ struct crush_map {
 struct crush_bucket **buckets;
 struct crush_rule **rules;

+/*
+ * Parent pointers to identify the parent bucket a device or
+ * bucket in the hierarchy.  If an item appears more than
+ * once, this is the _last_ time it appeared (where buckets
+ * are processed in bucket id order, from -1 on down to
+ * -max_buckets.
+ */
+__u32 *bucket_parents;
+__u32 *device_parents;
+
 __s32 max_buckets;
 __u32 max_rules;
 __s32 max_devices;
-
-/* choose local retries before re-descent */
-__u32 choose_local_tries;
-/* choose local attempts using a fallback permutation before
- * re-descent */
-__u32 choose_local_fallback_tries;
-/* choose attempts before giving up */
-__u32 choose_total_tries;
-
-__u32 *choose_tries;
 };


 /* crush.c */
-extern int crush_get_bucket_item_weight(const struct crush_bucket *b, 
int pos);
+extern int crush_get_bucket_item_weight(struct crush_bucket *b, int 
pos);

+extern void crush_calc_parents(struct crush_map *map);
 extern void crush_destroy_bucket_uniform(struct crush_bucket_uniform 
*b);

 extern void crush_destroy_bucket_list(struct crush_bucket_list *b);
 extern void crush_destroy_bucket_tree(struct crush_bucket_tree *b);
@@ -181,9 +177,4 @@ extern void crush_destroy_bucket_straw(s
 extern void crush_destroy_bucket(struct crush_bucket *b);
 extern void crush_destroy(struct crush_map *map);

-static inline int crush_calc_tree_node(int i)
-{
-return ((i+1)  1)-1;
-}
-
 #endif


Xing


questions with rbd sequential read throughputs inside kvm/qemu VMs

2012-11-27 Thread Xing Lin

Hi,

I am interested in using rbd block devices inside kvm/qemu VMs. I set up a tiny
ceph cluster on one server machine and used 6 SCSI disks for storing data. On
the client machine, the sequential read throughput looks reasonable (~60 MB/s)
when I run fio against rbd block devices mapped outside of VMs. The read
throughput does not look reasonable when I use rbd block devices inside
kvm/qemu VMs: it jumps as high as 200 MB/s. 'tcpdump' shows that the read
requests do reach the ceph server. What makes things confusing is that 'iotop'
does not show any I/O for the sequential reads, while 'top' shows 'ceph-osd'
using 100% CPU.


This is the section for the rbd disk in VM's xml file.

<disk type='network' device='disk'>
  <driver name='qemu' type='raw'/>
  <source protocol='rbd' name='rbd/image3'>
    <host name='node-0' port='6789'/>
  </source>
  <target dev='vda' bus='virtio'/>
  <address type='pci' domain='0x' bus='0x00' slot='0x05' function='0x0'/>
</disk>


This is the fio job file I used to measure throughputs.
--
[global]
rw=read
bs=4m
thread=0
time_based=1
runtime=300
invalidate=1

direct=1
sync=1
ioengine=sync

[sr-vda]
filename=${DEV}
-
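
In case it helps, the job file is driven through the DEV environment variable,
which fio expands in the filename line; I invoke it roughly like this (the job
file name is arbitrary):

$ DEV=/dev/vda fio seq-read.fio        # inside the VM
$ DEV=/dev/rbd0 fio seq-read.fio       # against the mapped rbd device outside the VM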

Does anyone have some suggestions or hints for me to try? Thank you very 
much!


Xing