Thanks, Stack,

             Answers inline.

On Fri, Apr 9, 2010 at 12:23 AM, Stack <st...@duboce.net> wrote:

> It'll depend on your access patterns but in general we'll be doing
> lots of small accesses... many more.  A recently added clienttrace
> log (the client referred to here is the dfsclient) will log
> messages like the following:
>
> 2010-04-07 22:15:52,078 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
> /10.20.20.189:50010, dest: /10.20.20.189:56736, bytes: 2022080, op:
> HDFS_READ, cliID: DFSClient_-994492608, srvID:
> DS-1740361948-10.20.20.189-50010-1270703663528, blockid:
> blk_2797215769808904384_1015
>
> Lots of them, one per access.
>
        In this case there are lots of access records but much less data than
in typical Hadoop jobs. Can we say that an HBase HDFS access usually involves
many more (and smaller) blocks than a plain Hadoop HDFS access? That cannot be
efficient. I know there are sometimes small region store files, but if they
are small they would be merged into one by compaction, right?
       Is there any way we can lower the number of these small data accesses?
Maybe by setting a higher row-caching number, but that would be
application-dependent. Any other options we can use to lower this number?
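
To make the row-caching idea concrete, this is roughly what I have in mind.
It is just a sketch against the 0.20-era client API; the table name
"some_table" and the caching value of 500 are made up and would need tuning
per application:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class CachingScanSketch {
  public static void main(String[] args) throws Exception {
    HBaseConfiguration conf = new HBaseConfiguration(); // 0.20-style constructor
    HTable table = new HTable(conf, "some_table");      // table name is made up
    Scan scan = new Scan();
    // Fetch rows in batches of 500 per RPC instead of the small default,
    // so a scan makes fewer, larger requests to the regionserver.
    scan.setCaching(500);
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result r : scanner) {
        // process r ...
      }
    } finally {
      scanner.close();
    }
  }
}

But as I said, the right caching value depends on row size and access pattern,
and it only cuts down client/regionserver round trips, not the store-file
layout underneath.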

> You could turn them off explicitly in your log4j.  That should help.
>
> Don't run DEBUG level in datanode logs.
>
>
We are running the cluster at INFO level.
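
For the record, if we do decide to turn the clienttrace lines off, I assume a
per-logger override in the datanode's log4j.properties would silence just that
logger without dropping the rest of INFO, something like the line below (the
logger name is taken from the messages above; I have not verified this):

log4j.logger.org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace=WARN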


> Other answers inlined below.
>
> On Thu, Apr 8, 2010 at 2:51 AM, steven zhuang
> <steven.zhuang.1...@gmail.com> wrote:
> >...
> >        At present, my idea is to calculate the data IO quantity of both
> > HDFS and HBase for a given day, and with the result we can have a rough
> > estimate of the situation.
>
> Can you use the above noted clienttrace logs to do this?  Are clients
> on different hosts -- i.e. the hdfs clients and hbase clients?  If so
> that'd make it easy enough.  Otherwise, it'd be a little difficult.
> There is probably an easier way but one (awkward) means of calculating
> would be by writing a mapreduce job that took clienttrace messages and
> all blocks in the filesystem and then had it sort the clienttrace
> messages that belong to the ${HBASE_ROOTDIR} subdirectory.
>
Yeah, the HBase regionserver and datanode are on the same host, so I cannot
get the data read/written by HBase just from the datanode log.
The MapReduce approach may have a problem: we cannot get the historical block
info from the HDFS filesystem; I mean, lots of blocks get garbage-collected
when we import or delete data.
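
Still, for whatever blocks do survive, I guess something simpler than a full
MapReduce job could give a lower bound: save the output of
"hadoop fsck /user/ccenterq/hbase -files -blocks", collect the block ids under
the HBase root dir, then sum the clienttrace bytes for matching blocks. A
rough sketch (the input file names fsck.txt and dn.log are made up, and it
undercounts for the reason above):

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sum clienttrace bytes for blocks that currently belong to files under the
// HBase root dir.  fsck.txt is the saved output of
// `hadoop fsck /user/ccenterq/hbase -files -blocks`; dn.log is a datanode log.
public class ClientTraceSum {
  public static void main(String[] args) throws Exception {
    // Collect the block ids of the current HBase store files.
    Set<String> hbaseBlocks = new HashSet<String>();
    Pattern blkPat = Pattern.compile("blk_-?\\d+");
    BufferedReader fsck = new BufferedReader(new FileReader("fsck.txt"));
    String line;
    while ((line = fsck.readLine()) != null) {
      Matcher m = blkPat.matcher(line);
      while (m.find()) hbaseBlocks.add(m.group());
    }
    fsck.close();

    // Sum the clienttrace bytes for reads/writes against those blocks.
    long readBytes = 0, writeBytes = 0;
    Pattern tracePat = Pattern.compile(
        "bytes: (\\d+), op: (HDFS_READ|HDFS_WRITE).*blockid: (blk_-?\\d+)");
    BufferedReader dn = new BufferedReader(new FileReader("dn.log"));
    while ((line = dn.readLine()) != null) {
      Matcher m = tracePat.matcher(line);
      if (m.find() && hbaseBlocks.contains(m.group(3))) {
        if ("HDFS_READ".equals(m.group(2))) {
          readBytes += Long.parseLong(m.group(1));
        } else {
          writeBytes += Long.parseLong(m.group(1));
        }
      }
    }
    dn.close();
    System.out.println("HBase HDFS_READ bytes:  " + readBytes);
    System.out.println("HBase HDFS_WRITE bytes: " + writeBytes);
  }
}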

> >        One problem I have now is deciding, from the regionserver log, the
> > quantity of data read/written by HBase; should I count the lengths in the
> > following log records as the lengths of data read/written?:
> >
> > org.apache.hadoop.hbase.regionserver.Store: loaded
> > /user/ccenterq/hbase/hbt2table2/165204266/queries/1091785486701083780,
> > isReference=false,
> > sequence id=1526201715, length=*72426373*, majorCompaction=true
> > 2010-03-04 01:11:54,262 DEBUG
> > org.apache.hadoop.hbase.regionserver.HRegion:
> > Started memstore flush for region table_word_in_doc, resort
> > all-2010/01/01,1267629092479. Current region memstore size *40.5m*
> >
> >        Here I am not sure whether the 72426373 / 40.5m above is the length
> > (in bytes) of the data read by HBase.
>
> That's just file size.  Above we opened a storefile and we just logged its
> size.
>
> We don't log how much we've read/written anywhere in the HBase logs.
>
> St.Ack
>
