Re: count of rows in table

Ted Yu Thu, 29 Jul 2010 21:08:04 -0700

I think OR is more reasonable.
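
To make the OR/AND difference concrete, here is a minimal sketch in plain Python. It is a toy in-memory model, not the HBase API and not the actual RowCounter code; the table contents, the column names, and the `count_rows` helper are all made up for illustration.

```python
# Toy in-memory table: row key -> set of column names present in that row.
# This only models the semantics under discussion; it does not touch HBase.
table = {
    "row1": {"d:a", "d:b"},
    "row2": {"d:a"},
    "row3": {"d:c"},
}


def count_rows(table, columns, mode="OR"):
    """Count rows that contain any (OR) or all (AND) of the requested columns."""
    wanted = set(columns)
    if not wanted:
        # No columns specified: every row counts.
        return len(table)
    if mode == "OR":
        # A row counts if it has at least one requested column.
        return sum(1 for cols in table.values() if cols & wanted)
    if mode == "AND":
        # A row counts only if it has every requested column.
        return sum(1 for cols in table.values() if wanted <= cols)
    raise ValueError(f"unknown mode: {mode}")


print(count_rows(table, ["d:a", "d:b"], mode="OR"))   # row1 and row2 qualify
print(count_rows(table, ["d:a", "d:b"], mode="AND"))  # only row1 qualifies
```

With OR, a row is counted as soon as it has any of the listed columns, so the result never drops when more columns are added to the list. With AND, every extra column can only shrink the count.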

On Thu, Jul 29, 2010 at 8:54 PM, Angus He <[email protected]> wrote:


> By the way
>
> If users input multiple columns, it seems that the current
> implementation of RowCounter employs the OR logical operation.
>
> Would AND be more reasonable?
>
>
>
> On Fri, Jul 30, 2010 at 11:13 AM, Ryan Rawson <[email protected]> wrote:
> > The RowCounter job counts rows. Its answer is approximately how many
> > distinct row keys were in the table within a given time range.
> >
> > Even if the implementation uses a first-KeyValue filter, nothing about what
> > I just said is false.
> >
> > A KeyValue counter would tell you how many cells and versions there were in
> > total, don't you think?
> >
> > On Jul 29, 2010 7:58 PM, "Angus He" <[email protected]> wrote:
> >> Column names are optional for the RowCounter job.
> >>
> >> To be more accurate, RowCounter is a KeyValueCounter.
> >> If no columns are specified, only the first KeyValue of each row is
> >> included, which yields the row count.
> >>
> >>
> >> On Fri, Jul 30, 2010 at 9:28 AM, Ted Yu <[email protected]> wrote:
> >>> If someone can share the command line for running RowCounter, that would
> >>> be great.
> >>>
> >>> Also, hbase shell count doesn't require a column name. Why does
> >>> RowCounter require it?
> >>>
> >>> Thanks
> >>>
> >>> On Thu, Jul 29, 2010 at 4:55 PM, Ryan Rawson <[email protected]> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> That table appears to be empty.  Eg:
> >>>>
> >>>> 10/07/29 22:38:43 INFO mapred.JobClient:     Map input records=0
> >>>>
> >>>>
> >>>> So back to the count issue... Counting in databases is a classic
> >>>> problem. Unless your DB system is keeping stats on how many
> >>>> inserts/deletes and thus how big it thinks the table is, you have to
> >>>> count all the rows by reading them.  HBase is no different, and a
> >>>> little harder, because we have a variable length data format, so we
> >>>> can't just estimate row sizes from file sizes.  Keeping distributed
> >>>> stats is not impossible, but certainly not on any priority list to be
> >>>> implemented - of course JIRAs/patches welcome etc.
> >>>>
> >>>> -ryan
> >>>>
> >>>>
> >>>> On Thu, Jul 29, 2010 at 3:48 PM, Ted Yu <[email protected]> wrote:
> >>>> > We use HBase 0.20.5
> >>>> >
> >>>> > Here is the snippet from RowCounter output:
> >>>> >
> >>>> > 10/07/29 22:38:42 DEBUG client.HTable$ClientScanner: Finished with scanning
> >>>> > at REGION => {NAME =>
> >>>> > '2__HB_NOINC_ORCL_SQLLDR_0728-THREEGPPSPEECHCALLS-1280408509541-0,DFF46493EB352D0E31CBFA4652E6EC06,1280412540858',
> >>>> > STARTKEY => 'DFF46493EB352D0E31CBFA4652E6EC06', ENDKEY => '', ENCODED =>
> >>>> > 1375318608, TABLE => {{NAME =>
> >>>> > '2__HB_NOINC_ORCL_SQLLDR_0728-THREEGPPSPEECHCALLS-1280408509541-0', FAMILIES
> >>>> > => [{NAME => 'd', COMPRESSION => 'GZ', VERSIONS => '2', TTL => '31536000',
> >>>> > BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'false'}, {NAME =>
> >>>> > 'i', COMPRESSION => 'GZ', VERSIONS => '2', TTL => '31536000', BLOCKSIZE =>
> >>>> > '65536', IN_MEMORY => 'false', BLOCKCACHE => 'false'}, {NAME => 'v',
> >>>> > COMPRESSION => 'GZ', VERSIONS => '2', TTL => '31536000', BLOCKSIZE =>
> >>>> > '65536', IN_MEMORY => 'false', BLOCKCACHE => 'false'}]}}
> >>>> > 10/07/29 22:38:42 INFO mapred.TaskRunner: Task:attempt_local_0001_m_000000_0
> >>>> > is done. And is in the process of commiting
> >>>> > 10/07/29 22:38:42 INFO mapred.LocalJobRunner:
> >>>> > 10/07/29 22:38:42 INFO mapred.TaskRunner: Task attempt_local_0001_m_000000_0
> >>>> > is allowed to commit now
> >>>> > 10/07/29 22:38:42 INFO mapred.FileOutputCommitter: Saved output of task
> >>>> > 'attempt_local_0001_m_000000_0' to
> >>>> > file:/usr/local/hadoop/trunk.80-275066/hbase-0.20.5/rc
> >>>> > 10/07/29 22:38:42 INFO mapred.LocalJobRunner:
> >>>> > 10/07/29 22:38:42 INFO mapred.TaskRunner: Task
> >>>> > 'attempt_local_0001_m_000000_0' done.
> >>>> > 10/07/29 22:38:43 INFO mapred.JobClient:  map 100% reduce 0%
> >>>> > 10/07/29 22:38:43 INFO mapred.JobClient: Job complete: job_local_0001
> >>>> > 10/07/29 22:38:43 INFO mapred.JobClient: Counters: 6
> >>>> > 10/07/29 22:38:43 INFO mapred.JobClient:   FileSystemCounters
> >>>> > 10/07/29 22:38:43 INFO mapred.JobClient:     FILE_BYTES_READ=1592883
> >>>> > 10/07/29 22:38:43 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=1624956
> >>>> > 10/07/29 22:38:43 INFO mapred.JobClient:   Map-Reduce Framework
> >>>> > 10/07/29 22:38:43 INFO mapred.JobClient:     Map input records=0
> >>>> > 10/07/29 22:38:43 INFO mapred.JobClient:     Spilled Records=0
> >>>> > 10/07/29 22:38:43 INFO mapred.JobClient:     Map input bytes=0
> >>>> > 10/07/29 22:38:43 INFO mapred.JobClient:     Map output records=0
> >>>> >
> >>>> > [sjc1-hadoop8.sjc1:hadoop 3705]ls -l
> >>>> > /usr/local/hadoop/trunk.80-275066/hbase-0.20.5/rc/part-00000
> >>>> > -rwxrwxrwx 1 hadoop users 0 Jul 29 22:38
> >>>> > /usr/local/hadoop/trunk.80-275066/hbase-0.20.5/rc/part-00000
> >>>> >
> >>>> > But there are many records in the table I was querying.
> >>>> >
> >>>> > Can someone comment ?
> >>>> >
> >>>> > On Thu, Jul 29, 2010 at 2:26 PM, Jean-Daniel Cryans <[email protected]> wrote:
> >>>> >
> >>>> >> In 0.89 you can specify CACHE for the count command. Set it higher
> >>>> >> (it defaults to 10 rows per call).
> >>>> >>
> >>>> >> Also you can use the RowCounter MR job.
> >>>> >>
> >>>> >> J-D
> >>>> >>
> >>>> >> On Thu, Jul 29, 2010 at 2:22 PM, Ted Yu <[email protected]> wrote:
> >>>> >> > Hi,
> >>>> >> > The count method in HBase shell is quite slow.
> >>>> >> > Is there a way to obtain count faster ?
> >>>> >> >
> >>>> >> > Thanks
> >>>> >> >
> >>>> >>
> >>>> >
> >>>>
> >>>
> >>
> >>
> >>
> >> --
> >> Regards
> >> Angus
> >
>
>
>
> --
> Regards
> Angus
>
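
The row count versus KeyValue count distinction discussed in the quoted messages can be sketched the same way. This is again a toy Python model under stated assumptions, not HBase code: the `cells` list and both helper functions are hypothetical, and the "first KeyValue per row" logic only imitates the idea of a first-KeyValue filter on a scan that returns cells in sorted order.

```python
# Toy model: each cell is (row_key, column, timestamp), listed in the
# sorted order a scan would return them. Values are omitted for brevity.
cells = [
    ("row1", "d:a", 2),  # newer version of row1/d:a
    ("row1", "d:a", 1),  # older version of row1/d:a
    ("row1", "d:b", 1),
    ("row2", "d:a", 1),
]


def count_key_values(cells):
    """Count every cell and every version: what a KeyValue counter reports."""
    return len(cells)


def count_rows_first_kv(cells):
    """Count distinct rows by keeping only the first cell seen for each row."""
    count = 0
    previous_row = None
    for row, _column, _timestamp in cells:
        if row != previous_row:
            # First cell of a new row: count the row, skip its remaining cells.
            count += 1
            previous_row = row
    return count


print(count_key_values(cells))     # cells and versions in total
print(count_rows_first_kv(cells))  # distinct row keys
```

Keeping one cell per row still produces a row count, because each row contributes exactly once no matter how many columns or versions it holds.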
