This is something I would like to implement as well: a connection pool of some sort, to improve open/close performance and to be able to hold a connection "open" during a session, or at least during a transaction (more than one put in a row), which I guess is supported in trunk?
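
To make the pooling idea concrete, here is a minimal sketch in Java. It is only an illustration under assumptions: `SimplePool`, `borrow`, `giveBack` and `available` are hypothetical names, not HBase or Hadoop APIs, and the pooled type is left generic because nothing in this thread says which client object would be pooled.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Supplier;

/** Hypothetical fixed-size pool; illustration only, not part of HBase or Hadoop. */
final class SimplePool<T> {
    private final BlockingQueue<T> idle;

    SimplePool(int size, Supplier<T> factory) {
        idle = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            idle.add(factory.get()); // pay the "open" cost once, up front
        }
    }

    /** Takes an idle connection, blocking until one is free. */
    T borrow() {
        try {
            return idle.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            throw new IllegalStateException("interrupted while waiting for a connection", e);
        }
    }

    /** Hands the connection back to the pool instead of closing it. */
    void giveBack(T conn) {
        idle.offer(conn);
    }

    /** Number of connections currently idle in the pool. */
    int available() {
        return idle.size();
    }
}
```

A caller would borrow once, issue several puts in a row on the same connection, and give it back at the end of the session or transaction, so the open/close cost is not paid per operation.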

//Marcus

On Thu, Aug 7, 2008 at 2:15 AM, Jun Rao <[EMAIL PROTECTED]> wrote:

> In terms of performance, the biggest overhead comes from HBase/Hadoop ipc.
> For simple queries, a search through ipc takes 3-4 times as long as that
> directly on HDFS. I guess a lot of the overhead is because of Java
> reflection in ipc proxy. Does HBase have plans to make ipc more efficient?
>
> HDFS adds another layer of overhead compared with local file system. A
> search on HDFS (on a node that has a local copy of all data) can take 10
> times as long as that on local file system. We suspect most overhead comes
> from reopening sockets in HDFS client.
>
> Jun
> IBM Almaden Research Center
> K55/B1, 650 Harry Road, San Jose, CA 95120-6099
>
> [EMAIL PROTECTED]
> (408)927-1886 (phone)
> (408)927-3215 (fax)
>
> From: stack <[EMAIL PROTECTED]>
> To: [email protected]
> Date: 08/06/2008 01:42 PM
> Subject: Re: Multi get/put
> Please respond to: [EMAIL PROTECTED].apache.org
>
> Ning Li wrote:
> >> Do you have to do a rewrite of the lucene index at compaction time? Or
> >> just call optimize? (I suppose it's the former if you need to clean up
> >> 'References' as per below where you talk of splits)
> >
> > What do you mean by "a rewrite of the lucene index"?
>
> In hbase, on split, daughters hold a reference to either the top or
> bottom half of their parent region. References are undone by
> compactions; as part of compaction, the part of the parent referenced by
> the daughter gets written out to store files under the daughter.
> Daughters try to undo references as promptly as possible because regions
> with references are not splittable (references to references, and so on,
> would soon become unmanageable).
>
> In your description, you mentioned that daughter regions reference their
> parents' index. When I said, 'a rewrite of the lucene index', I was
> asking, as per hbase regions, if you followed the model and wrote a new
> lucene index comprised of daughter-only content at compaction time. Or
> do you just 'optimize' and let the references build up so the daughter
> of a daughter points all the way up to the parent?
>
> Just wondering.
>
> >> Regards your 'on the other hand' above, that's a good point. Have you
> >> verified that if a regionserver is running on a datanode, that the lucene
> >> index is written local? Would be interesting to know.
> >
> > That's HDFS's policy. See HDFS's FSNamesystem.getAdditionalBlock.
>
> Sorry. Yeah, of course.
>
> So, why do you think it is so slow going via HDFS FileSystem when the data
> is local? Is it the block-oriented access or is there just a high tax
> going via the HDFS FS interface?
>
> St.Ack

--
Marcus Herou
CTO and co-founder
Tailsweep AB
+46702561312
[EMAIL PROTECTED]
http://www.tailsweep.com/
http://blogg.tailsweep.com/
