The row key is (website, stamp), so the table is GROUP BY website and then
ORDER BY stamp. If you wanted just the recent data, you'd do some kind of
server-side row filter so that only clicks from the range you specified for
that particular MR job get returned.
Does that make sense? Do you think it's more complex than that? Since they
are grouping by the website first, not strictly ordering by the stamp,
there's no way to prevent a full table scan server-side, but you can use
filters to keep all the unnecessary data from moving back to the
client/job.
JG
Schubert Zhang wrote:
Hi all,
I have a periodically scheduled MapReduce job that needs to extract recent
data from an HBase table for analysis, while avoiding scanning/reading the
data that has already been analyzed. Do you have any ideas?
In the Google paper <Bigtable: A Distributed Storage System for Structured
Data>, Section 8.1 (Google Analytics):
The raw click table (~200 TB) maintains a row for each end-user session. The
row name is a tuple containing the website's name and the time at which the
session was created. This schema ensures that sessions that visit the same
web site are contiguous, and that they are sorted chronologically. This
table compresses to 14% of its original size.
The summary table (~20 TB) contains various predefined summaries for each
website. This table is generated from the raw click table by periodically
scheduled MapReduce jobs. Each MapReduce job extracts recent session data
from the raw click table. The overall system's throughput is limited by the
throughput of GFS. This table compresses to 29% of its original size.
Can anybody share ideas about how "Each MapReduce job extracts recent
session data from the raw click table"?
Thanks!
Schubert