stack <[EMAIL PROTECTED]> wrote on 08/06/2008 05:32:09 PM:

> Jun Rao wrote:
> > In terms of performance, the biggest overhead comes from Hbase/Hadoop ipc.
> > For simple queries, a search through ipc takes 3-4 times as long as that
> > directly on HDFS. I guess a lot of the overhead is because of java
> > reflection in ipc proxy. Does Hbase have plans to make ipc more efficient?
> >
> We do. It's a priority. 0.3.0 hopefully.
>
> > HDFS adds another layer of overhead compared with local file system. A
> > search on HDFS (on a node that has a local copy of all data) can take 10
> > times as long as that on local file system. We suspect most overhead comes
> > from reopening sockets in HDFS client.
> >
> Are you on a recent hbase Jun? Hadoop RPC seems to be reusing
> connections in 0.17.1. Maybe that will help.
>
Our tests were done on Hadoop 0.17.1.

> St.Ack
>
> > Jun
> > IBM Almaden Research Center
> > K55/B1, 650 Harry Road, San Jose, CA 95120-6099
> >
> > [EMAIL PROTECTED]
> > (408)927-1886 (phone)
> > (408)927-3215 (fax)
> >
> >
> > From: stack <[EMAIL PROTECTED]>
> > To: [email protected]
> > Date: 08/06/2008 01:42 PM
> > Subject: Re: Multi get/put
> > Please respond to: [EMAIL PROTECTED].apache.org
> >
> >
> > Ning Li wrote:
> >
> >>> Do you have to do a rewrite of the lucene index at compaction time? Or
> >>> just call optimize? (I suppose it's the former if you need to clean up
> >>> 'References' as per below where you talk of splits)
> >>>
> >> What do you mean by "a rewrite of the lucene index"?
> >>
> >
> > In hbase, on split, daughters hold a reference to either the top or
> > bottom half of their parent region. References are undone by
> > compactions; as part of compaction, the part of the parent referenced by
> > the daughter gets written out to store files under the daughter.
> > Daughters try to undo references as promptly as possible because regions
> > with references are not splittable (references to references, and so on,
> > would soon become unmanageable).
> >
> > In your description, you mentioned that daughter regions reference their
> > parents' index. When I said, 'a rewrite of the lucene index', I was
> > asking, as per hbase regions, if you followed the model and wrote a new
> > lucene index comprised of daughter-only content at compaction time. Or
> > do you just 'optimize' and let the references build up so the daughter
> > of a daughter points all the way up to the parent?
> >
> > Just wondering.
> >
> >>> Regards your 'on the other hand' above, that's a good point. Have you
> >>> verified that if a regionserver is running on a datanode, that the lucene
> >>> index is written local? Would be interesting to know.
> >>>
> >> That's HDFS's policy. See HDFS's FSNamesystem.getAdditionalBlock.
> >>
> >
> > Sorry. Yeah, of course.
> >
> > So, why do you think it is so slow going via HDFS FileSystem when the data
> > is local? Is it the block-orientated access or is there just a high-tax
> > going via the HDFS FS interface?
> >
> > St.Ack
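
For anyone following the split discussion quoted above, here is a minimal sketch of the reference model stack describes: daughters reference the top or bottom half of the parent, a compaction writes the referenced data under the daughter, and a region that still holds a reference cannot be split. This is a toy model, not HBase code. The Region class, its fields and its method names are hypothetical, a plain dict stands in for store files, and "bottom" is assumed to mean keys below the split key.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class Region:
    """Toy region: a dict stands in for the region's own store files."""
    name: str
    rows: Dict[str, str] = field(default_factory=dict)
    parent: Optional["Region"] = None   # set while the daughter still holds a reference
    half: Optional[str] = None          # "bottom" or "top" half of the parent
    split_key: Optional[str] = None

    def has_reference(self) -> bool:
        return self.parent is not None

    def _in_half(self, key: str) -> bool:
        # Assumption: "bottom" means keys below the split key, "top" the rest.
        if self.half == "bottom":
            return key < self.split_key
        return key >= self.split_key

    def scan(self) -> Dict[str, str]:
        """Return own rows plus the referenced half of the parent, if any."""
        result: Dict[str, str] = {}
        if self.parent is not None:
            for key, value in self.parent.scan().items():
                if self._in_half(key):
                    result[key] = value
        result.update(self.rows)
        return result

    def compact(self) -> None:
        """Write the referenced data under this region and drop the reference."""
        self.rows = self.scan()
        self.parent = None
        self.half = None
        self.split_key = None

    def split(self, split_key: str) -> Tuple["Region", "Region"]:
        """Create two daughters that only reference this region."""
        if self.has_reference():
            raise ValueError(f"{self.name} still holds a reference; compact first")
        bottom = Region(self.name + "-bottom", parent=self, half="bottom", split_key=split_key)
        top = Region(self.name + "-top", parent=self, half="top", split_key=split_key)
        return bottom, top
```

In this model a daughter answers reads through its parent until compact() is called, after which it can itself be split. The alternative raised in the thread (only 'optimize' and let references build up) would correspond to removing the check in split(), so that scan() has to walk a chain of parents.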
