On Sat, Mar 20, 2010 at 8:32 PM, prasenjit mukherjee
<[email protected]>wrote:

> Any documentation on when/how hbase uses mapreduce ?
>
> Is it done for bulk-read/writes.  Any tuning parameters which users
> can specify ( like # of mappers/


The new (.mapreduce.) API does not allow you to specify the number of
mappers, and that sort of makes sense: the framework can infer the right
degree of parallelism ( from the number of regions, machines, etc. ) better
than the user specifying the same.


> reducers )


( As with any M-R job ) a sensible number of reducers is roughly
proportional to the number of unique keys written to the output by the
mappers. So - you can set the number of reducers accordingly, depending on
the job
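For example ( a sketch only: this assumes the 0.20 mapreduce API, and the
job name and reducer count below are made up for illustration ):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReducerCountSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Unlike the mapper count, the reducer count is a plain job setting
    // in the new API.
    Job job = new Job(conf, "my-hbase-job");  // job name is made up
    job.setNumReduceTasks(8);  // illustrative; size to unique keys / reduce slots
  }
}
```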



> to speed up these
> operations ?
>

Check TableMapReduceUtil.setScannerCaching( ). Set it to a reasonably
high number to speed up M-R. Higher caching values also increase the memory
usage at the region servers ( i.e. data nodes ), so that has to be
taken into account too.
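A minimal sketch ( assuming the 0.20 org.apache.hadoop.hbase.mapreduce
API; the job name and caching value are made up ):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.mapreduce.Job;

public class ScannerCachingSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new HBaseConfiguration();
    Job job = new Job(conf, "scan-job");  // job name is made up
    // Rows fetched per scanner RPC: a bigger value means fewer round
    // trips, but more rows buffered in region server memory.
    TableMapReduceUtil.setScannerCaching(job, 500);  // 500 is illustrative
  }
}
```

Tune the value against your row size: many small rows per RPC is cheap,
the same count of wide rows may not be.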



>
> -Prasen
>
> On Thu, Mar 18, 2010 at 11:07 PM, Stack <[email protected]> wrote:
> > On Thu, Mar 18, 2010 at 8:33 AM, Asif Jan <[email protected]> wrote:
> >> Hello
> >>
> >> Is there any information (post/wiki page) on how data locality works in
> >> hbase? From the documentation on the site I was able to spot the
> >> following paragraph in the "Old Road Maps" section at url
> >>  http://wiki.apache.org/hadoop/HBase/RoadMaps
> >>
> >> Data-Locality Awareness: The Hadoop map reduce framework does
> >> [...] in network I/O.
> >>
> >> I am looking for answer of the following:
> >>
> >> 1) when using hbase, do the jobs end up where the data is stored (I
> >> will guess so); if yes then how is it done (links to related
> >> packages/pointers).
> >>
> >
> > The default splitter used by TableInputFormat splits a table at region
> > boundaries.  The splitter passes back to mapreduce the region range as
> > the delimiter for the task.  It also passes the name of the machine that
> > is hosting the region at the time the splitter runs.
> > Informal observations have it that at least for smaller jobs --
> > hundreds of millions of rows over a smallish 10 node cluster -- the
> > mapreduce framework will schedule tasks on the tasktracker that is
> > running beside the hosting regionserver.  You can check for yourself
> > using the mapreduce dashboard.
> >
> > Grep TableSplit
> > ( http://hadoop.apache.org/hbase/docs/r0.20.3/api/org/apache/hadoop/hbase/mapreduce/TableSplit.html )
> > in the mapreduce or mapred packages and see the getSplits method for
> > how it works.
> >
> > The above describes how the mapreduce child task and regionserver relate.
> >
> > The other partner to the relationship is how data in HDFS is located
> > relative to the regionserver hosting it.
> >
> > A region is made up of many files (Regions are made of column
> > families.  Each column family has 0-N files).  When HBase writes
> > files, in general, one of the replicas will land on the local
> > datanode.  The DFSClient makes an effort at reading the closest data,
> > so in general the RegionServer is reading data that is local (note,
> > though it's local, we currently still go via the network to get it;
> > there is no special case where, when data is local, we use unix domain
> > sockets or bypass the network and go direct to the local disk).
> >
> > But, currently, region assignment in hbase pays no regard to where the
> > data is actually located in HDFS, so, on startup or following a split,
> > you could have a region being served by a RegionServer that does not
> > have the data hosted locally by the adjacent DataNode.  So, a restart
> > messes up developed locality.  We need to do a bit of work here where
> > we do some sort of aggregation of the block locations that make up a
> > region's content and make a call on which RegionServer is best
> > positioned to serve the region's data.
> >
> > A tendency that works to repair the above disruption is that as HBase
> > runs, it compacts column families that have too many storefiles.  This
> > compaction process in effect rewrites data.  The rewrite will pull
> > data local again (because the new rewritten file will have one of its
> > replicas placed on local datanode).
> >
> >> 2) Is it possible to find out where the data resides (the way one
> >> could do when using the hadoop file system directly).
> >>
> >
> > There are probably better ways, but here is where I'd start: the Namenode
> > has block locations, filenames, and what blocks make up a file.  There
> > is also ./bin/hadoop fsck, which can output filenames, blocks and
> > locations.  In the datanode, there is a new clienttrace log which outputs
> > source and dest of reads, etc.
> >
> >>
> >> Also with respect to the performance charts included in Raghu's keynote
> >> at LADIS 2009 (slides 84-87):
> >>
> >> http://www.cs.cornell.edu/projects/ladis2009/talks/ramakrishnan-keynote-ladis2009.pdf
> >>
> >> Do we have numbers for the latest releases (or are these numbers still
> >> valid for newer releases as well)? The latency numbers look pretty bad
> >> for hbase.
> >>
> >
> > Yeah, they look bad there.  Working w/ Adam over at Y! Research, our
> > numbers should be better now -- configuration and bug-fix made a
> > difference -- but they have a ways to go yet.  More on this soon
> > hopefully.
> >
> > St.Ack
> >
> >> thanks a lot
> >>
> >
