hi noah,

Some results from the read tests are below.
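For reference, a minimal client-side ceph.conf sketch for the setting described below (using the full option names from the OPTION() lines quoted further down; the same values can also be passed through the "ceph.conf.options" route you mentioned) would be something like:

[client]
    # readahead for the libcephfs client that Hadoop uses
    client readahead min = 4193404
    # client readahead max bytes = 0    (left at the default: no max)
    # client readahead max periods = 4  (left at the default)

That's a sketch rather than a paste of my exact conf.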
I set client_readahead_min=4193404, which is also the default for Hadoop's
dfs.datanode.readahead.bytes. I ran the dfsio read test 6 times each for
HDFS, Ceph with the default readahead, and Ceph with readahead=4193404
(the TestDFSIO invocation is sketched in a P.S. at the bottom of this mail).
Setting the readahead in Ceph gave about a 10% overall improvement over the
default values. The HDFS average is only slightly better, but there was a
lot more run-to-run variation for HDFS -- perhaps some caching going on
there. Seems like a good readahead value for the Ceph Hadoop client to use
as a default!

I'll look at the DFS write tests later today -- any tuning suggestions you
can think of there? I was thinking of trying a larger journal and moving
the journaling to a separate disk. Anything else?

For the HDFS dfsio read test:
Average execution time: 258
Best execution time: 149
Worst execution time: 361

For Ceph with the default readahead setting:
Average execution time: 316
Best execution time: 296
Worst execution time: 358

For Ceph with readahead = 4193404:
Average execution time: 285
Best execution time: 277
Worst execution time: 294

I didn't set max bytes ... I guess the default is zero, which means no max?
I tried increasing readahead max periods to 8, but that didn't look like a
good change.

thanks!

On Wed, Jul 10, 2013 at 10:56 AM, Noah Watkins <[email protected]> wrote:
> Hey KC,
>
> I wanted to follow up on this, but ran out of time yesterday. To set
> the options in ceph.conf you can do something like
>
> [client]
> readahead min = blah
> readahead max bytes = blah
> readahead max periods = blah
>
> then, make just sure that your client is pointing to a ceph.conf with
> these settings.
>
>
> On Tue, Jul 9, 2013 at 4:32 PM, Noah Watkins <[email protected]> wrote:
> > Yes, the libcephfs client. You should be able to adjust the settings
> > without changing any code. The settings should be adjustable either by
> > setting the config options in ceph.conf, or using the
> > "ceph.conf.options" settings in Hadoop's core-site.xml.
> >
> > On Tue, Jul 9, 2013 at 4:26 PM, ker can <[email protected]> wrote:
> >> Makes sense. I can try playing around with these settings .... when
> >> you're saying client, would this be libcephfs.so ?
> >>
> >>
> >> On Tue, Jul 9, 2013 at 5:35 PM, Noah Watkins <[email protected]>
> >> wrote:
> >>>
> >>> Greg pointed out the read-ahead client options. I would suggest
> >>> fiddling with these settings. If things improve, we can put automatic
> >>> configuration of these settings into the Hadoop client itself. At the
> >>> very least, we should be able to see if it is the read-ahead that is
> >>> causing performance problems.
> >>>
> >>> OPTION(client_readahead_min, OPT_LONGLONG, 128*1024) // readahead at
> >>> _least_ this much.
> >>> OPTION(client_readahead_max_bytes, OPT_LONGLONG, 0) //8 * 1024*1024
> >>> OPTION(client_readahead_max_periods, OPT_LONGLONG, 4) // as multiple
> >>> of file layout period (object size * num stripes)
> >>>
> >>> -Noah
> >>>
> >>>
> >>> On Tue, Jul 9, 2013 at 3:27 PM, Noah Watkins <[email protected]>
> >>> wrote:
> >>> >> Is the JNI interface still an issue or have we moved past that ?
> >>> >
> >>> > We haven't done much performance tuning with Hadoop, but I suspect
> >>> > that the JNI interface is not a bottleneck.
> >>> >
> >>> > My very first thought about what might be causing slow read
> >>> > performance is the read-ahead settings we use vs Hadoop. Hadoop
> >>> > should be performing big, efficient, block-size reads and caching
> >>> > these in each map task.
> >>> > However, I think we are probably doing lots of small reads on
> >>> > demand. That would certainly hurt performance.
> >>> >
> >>> > In fact, in CephInputStream.java I see we are doing buffer-sized
> >>> > reads. Which, at least in my tree, turn out to be 4096 bytes :)
> >>> >
> >>> > So, there are two issues now. First, the C-Java barrier is being
> >>> > cross a lot (16K times for a 64MB block). That's probably not a
> >>> > huge overhead, but it might be something. The second is read-ahead.
> >>> > I'm not sure how much read-ahead the libcephfs client is performing,
> >>> > but the more round trips its doing the more overhead we would incur.
> >>> >
> >>> >
> >>> >>
> >>> >> thanks !
> >>> >>
> >>> >>
> >>> >> On Tue, Jul 9, 2013 at 3:01 PM, ker can <[email protected]> wrote:
> >>> >>>
> >>> >>> For this particular test I turned off replication for both hdfs
> >>> >>> and ceph. So there is just one copy of the data lying around.
> >>> >>>
> >>> >>> hadoop@vega7250:~$ ceph osd dump | grep rep
> >>> >>> pool 0 'data' rep size 1 min_size 1 crush_ruleset 0 object_hash
> >>> >>> rjenkins pg_num 960 pgp_num 960 last_change 26 owner 0
> >>> >>> crash_replay_interval 45
> >>> >>> pool 1 'metadata' rep size 2 min_size 1 crush_ruleset 1 object_hash
> >>> >>> rjenkins pg_num 960 pgp_num 960 last_change 1 owner 0
> >>> >>> pool 2 'rbd' rep size 2 min_size 1 crush_ruleset 2 object_hash
> >>> >>> rjenkins pg_num 960 pgp_num 960 last_change 1 owner 0
> >>> >>>
> >>> >>> From hdfs-site.xml:
> >>> >>>
> >>> >>> <property>
> >>> >>> <name>dfs.replication</name>
> >>> >>> <value>1</value>
> >>> >>> </property>
> >>> >>>
> >>> >>>
> >>> >>> On Tue, Jul 9, 2013 at 2:44 PM, Noah Watkins
> >>> >>> <[email protected]> wrote:
> >>> >>>>
> >>> >>>> On Tue, Jul 9, 2013 at 12:35 PM, ker can <[email protected]> wrote:
> >>> >>>> > hi Noah,
> >>> >>>> >
> >>> >>>> > while we're still on the hadoop topic ... I was also trying out
> >>> >>>> > the TestDFSIO tests ceph v/s hadoop. The Read tests on ceph
> >>> >>>> > takes about 1.5x the hdfs time. The write tests are worse
> >>> >>>> > about ... 2.5x the time on hdfs, but I guess we have additional
> >>> >>>> > journaling overheads for the writes on ceph. But there should
> >>> >>>> > be no such overheads for the read ?
> >>> >>>>
> >>> >>>> Out of the box Hadoop will keep 3 copies, and Ceph 2, so it could
> >>> >>>> be the case that reads are slower because there is less
> >>> >>>> opportunity for scheduling local reads. You can create a new pool
> >>> >>>> with replication=3 and test this out (documentation on how to do
> >>> >>>> this is on http://ceph.com/docs/wip-hadoop-doc/cephfs/hadoop/).
> >>> >>>>
> >>> >>>> As for writes, Hadoop will write 2 remote and 1 local blocks,
> >>> >>>> however Ceph will write all copies remotely, so there is some
> >>> >>>> overhead for the extra remote object write (compared to Hadoop),
> >>> >>>> but i wouldn't have expected 2.5x. It might be useful to run dd
> >>> >>>> or something like that on Ceph to see if the numbers make sense
> >>> >>>> to rule out Hadoop as the bottleneck.
> >>> >>>>
> >>> >>>> -Noah
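P.S. In case anyone wants to reproduce the numbers above: the runs use the
stock TestDFSIO driver from the Hadoop test jar. A sketch of the invocation
(the jar name depends on the Hadoop version, and the file count/size here
are placeholders rather than my exact parameters):

    hadoop jar hadoop-test-*.jar TestDFSIO -write -nrFiles 10 -fileSize 1000
    hadoop jar hadoop-test-*.jar TestDFSIO -read -nrFiles 10 -fileSize 1000
    hadoop jar hadoop-test-*.jar TestDFSIO -clean

The average/best/worst execution times above are taken across the repeated
-read runs.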
