Hi Noah,

Some results for the read tests:

I set client_readahead_min=4194304, which is also the default for Hadoop's
dfs.datanode.readahead.bytes. I ran the DFSIO test 6 times each for HDFS,
Ceph with the default readahead, and Ceph with readahead=4194304. Setting the
readahead in Ceph gave about a 10% overall improvement over the default
values. The HDFS average is only slightly better, but there was a lot more
run-to-run variation for HDFS - perhaps some caching going on there.

Seems like a good readahead value for the Ceph Hadoop client to use as a
default!
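
For anyone wiring this up on the Hadoop side, the same value could also be
passed through the "ceph.conf.options" property in core-site.xml that Noah
mentioned below; the key=value syntax here is my assumption about the
expected format:

```xml
  <property>
    <name>ceph.conf.options</name>
    <value>client_readahead_min=4194304</value>
  </property>
```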

I'll look at the DFS write tests later today. Any tuning suggestions you can
think of there? I was thinking of trying a larger journal size and moving the
journal out to a separate disk. Anything else?
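
A sketch of what I had in mind for the write-side tuning; the size and the
device path are just examples, not tested values:

```
[osd]
    osd journal size = 10240      ; journal size in MB, larger than default
    osd journal = /dev/sdb1       ; example device: journal on a separate disk
```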

For the HDFS DFSIO read test:

Average execution time: 258
Best execution time: 149
Worst execution time: 361

For Ceph with the default readahead setting:

Average execution time: 316
Best execution time: 296
Worst execution time: 358

For Ceph with readahead = 4194304:

Average execution time: 285
Best execution time: 277
Worst execution time: 294
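
Rough arithmetic on those averages, as a sanity check on the ~10% claim
(times assumed to be in seconds, as DFSIO reports them):

```python
# Relative improvement from the readahead change, using the DFSIO
# averages listed above.
default_avg = 316   # Ceph, default readahead
tuned_avg = 285     # Ceph, readahead = 4194304
hdfs_avg = 258      # HDFS

improvement = (default_avg - tuned_avg) / default_avg
print(f"readahead tuning improvement: {improvement:.1%}")   # -> 9.8%
print(f"tuned Ceph vs HDFS: {tuned_avg / hdfs_avg:.2f}x")   # -> 1.10x
```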

I didn't set max bytes ... I guess the default is zero, which means no max?
I tried increasing readahead max periods to 8, but it didn't look like a
good change.
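
So the client section I'm settling on for now, using the option names from
the OPTION() lines Noah pasted (max bytes and max periods left at their
defaults):

```
[client]
    client readahead min = 4194304
    ; client readahead max bytes left at default (0 = no cap)
    ; client readahead max periods left at default (4); 8 didn't help
```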

thanks !

On Wed, Jul 10, 2013 at 10:56 AM, Noah Watkins <[email protected]> wrote:

> Hey KC,
>
> I wanted to follow up on this, but ran out of time yesterday. To set
> the options in ceph.conf you can do something like
>
> [client]
>     readahead min = blah
>     readahead max bytes = blah
>     readahead max periods = blah
>
> then, just make sure that your client is pointing to a ceph.conf with
> these settings.
>
>
> On Tue, Jul 9, 2013 at 4:32 PM, Noah Watkins <[email protected]>
> wrote:
> > Yes, the libcephfs client. You should be able to adjust the settings
> > without changing any code. The settings should be adjustable either by
> > setting the config options in ceph.conf, or using the
> > "ceph.conf.options" settings in Hadoop's core-site.xml.
> >
> > On Tue, Jul 9, 2013 at 4:26 PM, ker can <[email protected]> wrote:
> >> Makes sense.  I can try playing around with these settings .... when
> >> you're saying client, would this be libcephfs.so ?
> >>
> >>
> >>
> >>
> >>
> >> On Tue, Jul 9, 2013 at 5:35 PM, Noah Watkins <[email protected]>
> >> wrote:
> >>>
> >>> Greg pointed out the read-ahead client options. I would suggest
> >>> fiddling with these settings. If things improve, we can put automatic
> >>> configuration of these settings into the Hadoop client itself. At the
> >>> very least, we should be able to see if it is the read-ahead that is
> >>> causing performance problems.
> >>>
> >>> OPTION(client_readahead_min, OPT_LONGLONG, 128*1024) // readahead at
> >>> _least_ this much.
> >>> OPTION(client_readahead_max_bytes, OPT_LONGLONG, 0) //8 * 1024*1024
> >>> OPTION(client_readahead_max_periods, OPT_LONGLONG, 4) // as multiple
> >>> of file layout period (object size * num stripes)
> >>>
> >>> -Noah
> >>>
> >>>
> >>> On Tue, Jul 9, 2013 at 3:27 PM, Noah Watkins <[email protected]>
> >>> wrote:
> >>> >> Is the JNI interface still an issue or have we moved past that ?
> >>> >
> >>> > We haven't done much performance tuning with Hadoop, but I suspect
> >>> > that the JNI interface is not a bottleneck.
> >>> >
> >>> > My very first thought about what might be causing slow read
> >>> > performance is the read-ahead settings we use vs Hadoop. Hadoop
> >>> > should be performing big, efficient, block-size reads and caching
> >>> > these in each map task. However, I think we are probably doing lots
> >>> > of small reads on demand. That would certainly hurt performance.
> >>> >
> >>> > In fact, in CephInputStream.java I see we are doing buffer-sized
> >>> > reads. Which, at least in my tree, turn out to be 4096 bytes :)
> >>> >
> >>> > So, there are two issues now. First, the C-Java barrier is being
> >>> > crossed a lot (16K times for a 64MB block). That's probably not a
> >>> > huge overhead, but it might be something. The second is read-ahead.
> >>> > I'm not sure how much read-ahead the libcephfs client is performing,
> >>> > but the more round trips it's doing the more overhead we would incur.
> >>> >
> >>> >
> >>> >>
> >>> >> thanks !
> >>> >>
> >>> >>
> >>> >>
> >>> >>
> >>> >> On Tue, Jul 9, 2013 at 3:01 PM, ker can <[email protected]> wrote:
> >>> >>>
> >>> >>> For this particular test I turned off replication for both hdfs and
> >>> >>> ceph.
> >>> >>> So there is just one copy of the data lying around.
> >>> >>>
> >>> >>> hadoop@vega7250:~$ ceph osd dump | grep rep
> >>> >>> pool 0 'data' rep size 1 min_size 1 crush_ruleset 0 object_hash
> >>> >>> rjenkins pg_num 960 pgp_num 960 last_change 26 owner 0
> >>> >>> crash_replay_interval 45
> >>> >>> pool 1 'metadata' rep size 2 min_size 1 crush_ruleset 1 object_hash
> >>> >>> rjenkins pg_num 960 pgp_num 960 last_change 1 owner 0
> >>> >>> pool 2 'rbd' rep size 2 min_size 1 crush_ruleset 2 object_hash
> >>> >>> rjenkins pg_num 960 pgp_num 960 last_change 1 owner 0
> >>> >>>
> >>> >>> From hdfs-site.xml:
> >>> >>>
> >>> >>>   <property>
> >>> >>>     <name>dfs.replication</name>
> >>> >>>     <value>1</value>
> >>> >>>   </property>
> >>> >>>
> >>> >>>
> >>> >>>
> >>> >>>
> >>> >>>
> >>> >>> On Tue, Jul 9, 2013 at 2:44 PM, Noah Watkins
> >>> >>> <[email protected]>
> >>> >>> wrote:
> >>> >>>>
> >>> >>>> On Tue, Jul 9, 2013 at 12:35 PM, ker can <[email protected]> wrote:
> >>> >>>> > hi Noah,
> >>> >>>> >
> >>> >>>> > while we're still on the hadoop topic ... I was also trying out
> >>> >>>> > the TestDFSIO tests, ceph v/s hadoop.  The read tests on ceph
> >>> >>>> > take about 1.5x the hdfs time.  The write tests are worse, about
> >>> >>>> > 2.5x the time on hdfs, but I guess we have additional journaling
> >>> >>>> > overheads for the writes on ceph.  But there should be no such
> >>> >>>> > overheads for the reads ?
> >>> >>>>
> >>> >>>> Out of the box Hadoop will keep 3 copies, and Ceph 2, so it could
> >>> >>>> be the case that reads are slower because there is less
> >>> >>>> opportunity for scheduling local reads. You can create a new pool
> >>> >>>> with replication=3 and test this out (documentation on how to do
> >>> >>>> this is at http://ceph.com/docs/wip-hadoop-doc/cephfs/hadoop/).
> >>> >>>>
> >>> >>>> As for writes, Hadoop will write 2 remote and 1 local blocks,
> >>> >>>> however Ceph will write all copies remotely, so there is some
> >>> >>>> overhead for the extra remote object write (compared to Hadoop),
> >>> >>>> but I wouldn't have expected 2.5x. It might be useful to run dd or
> >>> >>>> something like that on Ceph to see if the numbers make sense, to
> >>> >>>> rule out Hadoop as the bottleneck.
> >>> >>>>
> >>> >>>> -Noah
> >>> >>>
> >>> >>>
> >>> >>
> >>
> >>
>
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
