IMHO, anyone working on HBASE-32 should consider the multi-billion row case. 

One option is to count the entries in each MapFile index, multiply that 
count by whatever hbase.io.index.interval (or the INDEX_INTERVAL HTD 
attribute) is set to, consider all of the MapFiles for the columns in a 
table, and choose the largest value. Do this for every one of the table's 
regions. The result would be a reasonable estimate, but the whole process 
sounds expensive. Originally I was thinking that the regionservers could do 
this, since they have to read in the MapFile indexes anyway and also know 
the count of rows in memcache. But if regionservers limit the number of 
in-memory MapFile indexes to avoid OOME, as has been discussed, they won't 
have all of the information on hand.
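
To make the arithmetic concrete, here is a minimal sketch of that estimate. 
It is not an HBase API; the per-MapFile index entry counts and the region 
grouping are made-up inputs, and the interval is whatever 
hbase.io.index.interval is configured to.

import java.util.Arrays;
import java.util.List;

public class RowCountEstimate {

  // indexEntriesPerRegion: for each region, the index entry count of every
  // column MapFile in that region (hypothetical inputs).
  // indexInterval: the configured hbase.io.index.interval value.
  static long estimate(List<List<Long>> indexEntriesPerRegion, long indexInterval) {
    long total = 0;
    for (List<Long> region : indexEntriesPerRegion) {
      long largest = 0;
      for (long entries : region) {
        // Each index entry stands in for roughly indexInterval rows.
        largest = Math.max(largest, entries * indexInterval);
      }
      // Take the largest column's estimate so the same row isn't counted
      // once per column MapFile, then sum across regions.
      total += largest;
    }
    return total;
  }

  public static void main(String[] args) {
    List<List<Long>> entries = Arrays.asList(
        Arrays.asList(1200L, 900L),    // region 1: two column MapFiles
        Arrays.asList(1500L, 1480L));  // region 2
    System.out.println(estimate(entries, 32));  // (1200 + 1500) * 32 = 86400
  }
}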

Maybe a map of MapFile to estimated row count could be stored in the FS 
next to the MapFiles and updated appropriately during compactions. A client 
could then iterate over the regions of a table and ask each regionserver 
involved for a row count estimate; the regionserver would consult the 
estimation-map and return the largest count found there for the table plus 
the largest memcache count for the table, and the client would total the 
results across all regions.
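
If it goes that way, the client-side part is just a per-region 
query-and-total loop. A hypothetical sketch (RegionEstimator stands in for 
whatever RPC the regionservers would expose; it is not a real HBase 
interface):

import java.util.List;

interface RegionEstimator {
  // Largest per-column estimate from the region's estimation-map plus the
  // largest memcache row count for the table in that region (hypothetical).
  long estimatedRows(String regionName);
}

public class TableRowCount {
  static long estimateTable(List<String> regionNames, RegionEstimator server) {
    long total = 0;
    for (String region : regionNames) {
      total += server.estimatedRows(region);  // one request per region
    }
    return total;
  }
}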

   - Andy

> From: Jean-Daniel Cryans <[EMAIL PROTECTED]>
> Subject: Re: any chance to get the size of a table?
> To: [email protected]
> Date: Monday, July 21, 2008, 6:43 AM
>
> Zhao,
> 
> Yes, the only way is to use a scanner, but it will take a
> _long_ time. HBASE-32 is about adding a row count
> estimator. For those who want to know why it's so slow:
> a scanner that goes over each row of a table has to do a
> read request on disk for each one of them (except for the
> stuff in the memcache that is waiting to be flushed).
> If you have 6,500,000 rows, like I saw last week on the IRC
> channel, it may take well over 80 minutes (depending on the
> cpu/io/network load, hardware, etc.).
> 
> J-D
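
As a rough back-of-the-envelope on the numbers quoted above: 80 minutes is 
4,800 seconds, and 4,800 s / 6,500,000 rows works out to roughly 0.74 ms 
per row, i.e. the total is just the per-row read cost multiplied across 
the whole table.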
