Hi all, I have a periodically scheduled MapReduce job that needs to extract recent data from an HBase table for analysis, without scanning or reading data that has already been analyzed. Do you have any ideas?
In the Google paper <Bigtable: A Distributed Storage System for Structured Data>, Section 8.1 (Google Analytics) says:

"The raw click table (200 TB) maintains a row for each end-user session. The row name is a tuple containing the website's name and the time at which the session was created. This schema ensures that sessions that visit the same web site are contiguous, and that they are sorted chronologically. This table compresses to 14% of its original size.

"The summary table (~20 TB) contains various predefined summaries for each website. This table is generated from the raw click table by periodically scheduled MapReduce jobs. Each MapReduce job extracts recent session data from the raw click table. The overall system's throughput is limited by the throughput of GFS. This table compresses to 29% of its original size."

Can anybody share ideas on how the step "Each MapReduce job extracts recent session data from the raw click table" might be done?

Thanks!
Schubert
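P.S. To make the question concrete, here is a minimal sketch of how I read the row-key schema in the quoted section. It is only my assumption, not something the paper describes: the separator byte, the zero-padded timestamp format, and the helper names (row_key, recent_range, scan) are all hypothetical, and a sorted Python list stands in for the table. With a (website, time) key, "recent" rows are contiguous only within each website, so the sketch computes one key range per site.

```python
from bisect import bisect_left

SEP = "\x00"  # assumed separator; sorts before any printable character


def row_key(site, ts):
    # Zero-pad the timestamp so lexicographic order matches numeric order.
    return f"{site}{SEP}{ts:013d}"


def recent_range(site, since_ts):
    # (start, stop) covering rows of `site` created at or after since_ts.
    start = row_key(site, since_ts)
    stop = site + chr(ord(SEP) + 1)  # first possible key after this site's rows
    return start, stop


def scan(sorted_keys, start, stop):
    # Stand-in for a range scan: keys in [start, stop) of a sorted table.
    lo = bisect_left(sorted_keys, start)
    hi = bisect_left(sorted_keys, stop)
    return sorted_keys[lo:hi]


# Toy table, kept sorted by row key as a sorted key-value store would keep it.
table = sorted([
    row_key("a.example", 1000),
    row_key("a.example", 2000),
    row_key("a.example", 3000),
    row_key("b.example", 1500),
    row_key("b.example", 2500),
])

last_run = 2000  # hypothetical timestamp of the previous job run
recent = []
for site in ("a.example", "b.example"):
    start, stop = recent_range(site, last_run)
    recent.extend(scan(table, start, stop))
```

If this reading is right, the job would need the list of websites and the time of its previous run, and would skip everything older without reading it. I would guess each (start, stop) pair could map to the start and stop row of an HBase Scan, but I have not verified that this is how it should be done.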
