> Rowkey is a combination of timestamp+primary key as a string, i.e.
> 1234567890-hostname. Therefore, the byte order of string sorting works fine.
I don't think this is correct. If your row keys are strings, you'd get
an ordering like this:
1000-hostname
200-hostname
3000-hostname
For the use case I was concerned about, I think it would be solved by
making the row key a long timestamp and the data-type a column family.
Then you could do something similar to what you described:
Scan "user_table", { COLUMNS => "<data_type>", STARTROW => 1234567890,
STOPROW => 1234597890 };
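
For what it's worth, here's a rough sketch of that scan with the Java
client API, assuming the row key is the timestamp encoded as a fixed-width
8-byte long. The table and column family names are just the placeholders
from the example above, and exact constructors vary a bit between HBase
versions:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.ResultScanner;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.util.Bytes;

  public class TimeWindowScan {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create(); // new HBaseConfiguration() on 0.20.x
      HTable table = new HTable(conf, "user_table");    // placeholder table name

      // Row key is the timestamp as an 8-byte long, so byte order
      // matches numeric order (unlike the string keys above).
      Scan scan = new Scan(Bytes.toBytes(1234567890L),   // STARTROW, inclusive
                           Bytes.toBytes(1234597890L));  // STOPROW, exclusive
      scan.addFamily(Bytes.toBytes("data_type"));        // one column family per data type

      ResultScanner scanner = table.getScanner(scan);
      try {
        for (Result result : scanner) {
          // process each row in the requested time window here
        }
      } finally {
        scanner.close();
      }
    }
  }
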
I'm not sure how to do the same thing though if you want to partition
by both hostname and datatype.
On Tue, Nov 23, 2010 at 1:54 PM, Eric Yang <[email protected]> wrote:
> It is more efficient because there is no need to wait for the file to be
> closed before the map reduce job can be launched. Data type is grouped into
> an Hbase table or column families. The choice is in the hands of the parser
> developer. Rowkey is a combination of timestamp+primary key as a string,
> i.e. 1234567890-hostname. Therefore, the byte order of string sorting works
> fine.
>
> There are two ways to deal with this problem: scan using the StartRow
> feature in Hbase to narrow down the row range, or use the Hbase timestamp
> field to control the scanning range. The Hbase timestamp is a special
> numeric field.
>
> To translate your query to hbase:
>
> Scan "<data_type>", { STARTROW => 'timestamp' };
>
> Or
>
> Scan "user_table", { COLUMNS => "<data_type>", timestamp => 1234567890 };
>
> The design is up to the parser designer. FYI, the Hbase shell doesn't
> support timestamp range queries, but the Java API does.
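
(For reference, that timestamp-range query would look roughly like the
following with the Java client API -- just a sketch, reusing the
hypothetical "user_table" / "<data_type>" names from the examples above
and assuming an already-opened HTable:

  Scan scan = new Scan();
  scan.addFamily(Bytes.toBytes("<data_type>"));
  // HBase cell timestamps are milliseconds since the epoch by default
  scan.setTimeRange(1234567890000L, 1234597890000L);  // min inclusive, max exclusive
  ResultScanner scanner = table.getScanner(scan);
)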
>
> Regards,
> Eric
>
> On 11/22/10 10:38 PM, "Bill Graham" <[email protected]> wrote:
>
> I see plenty of value in the HBase approach, but I'm still not clear
> on how the time and data type partitioning would be done more
> efficiently within HBase when running a job on a specific 5 minute
> interval for a given data type. I've only used HBase briefly so I
> could certainly be missing something, but I thought the sort for range
> scans is by byte order, which works for string types, but not numbers.
>
> So if your row ids are <timestamp>/<data_type>, how do you fetch
> all the data for a given data_type for a given time period without
> potentially scanning many unnecessary rows? The timestamps will be in
> alphabetical order, not numeric order, and data_types would be mixed.
>
> Under the current scheme, since partitioning is done in HDFS you could
> just get <data_type>/<time>/part-* to get exactly the records you're
> looking for.
>
>
> On Mon, Nov 22, 2010 at 5:00 PM, Eric Yang <[email protected]> wrote:
>> Comparison chart:
>>
>>
>> ----------------------------------------------------------------------------
>> | Chukwa Types          | Chukwa classic         | Chukwa on Hbase         |
>> ----------------------------------------------------------------------------
>> | Installation cost     | Hadoop + Chukwa        | Hadoop + Hbase + Chukwa |
>> ----------------------------------------------------------------------------
>> | Data latency          | fixed n Minutes        | 50-100 ms               |
>> ----------------------------------------------------------------------------
>> | File Management Cost  | Hourly/Daily Roll Up   | Hbase periodically      |
>> |                       | Mapreduce Job          | spill data to disk      |
>> ----------------------------------------------------------------------------
>> | Record Size           | Small needs to fit     | Data node block         |
>> |                       | in java HashMap        | size. (64MB)            |
>> ----------------------------------------------------------------------------
>> | GUI friendly view     | Data needs to be       | drill down to raw       |
>> |                       | aggregated first       | data or aggregated      |
>> ----------------------------------------------------------------------------
>> | Demux                 | Single reducer or      | Write to hbase in       |
>> |                       | creates multiple       | parallel                |
>> |                       | part-nnn files, and    |                         |
>> |                       | unsorted between files |                         |
>> ----------------------------------------------------------------------------
>> | Demux Output          | Sequence file          | Hbase Table             |
>> ----------------------------------------------------------------------------
>> | Data analytics tools  | Mapreduce/Pig          | MR/Pig/Hive/Cascading   |
>> ----------------------------------------------------------------------------
>>
>> Regards,
>> Eric
>>
>> On 11/22/10 3:05 PM, "Ahmed Fathalla" <[email protected]> wrote:
>>
>>> I think what we need to do is create some kind of comparison table
>>> contrasting the merits of each approach (HBase vs Normal Demux
>>> processing).
>>> This exercise will be useful both for making the decision about the
>>> default and for documentation purposes, to illustrate the difference for
>>> new users.
>>>
>>>
>>> On Mon, Nov 22, 2010 at 11:19 PM, Bill Graham <[email protected]>
>>> wrote:
>>>
>>>> We are going to continue to have use cases where we want log data
>>>> rolled up into 5 minute, hourly and daily increments in HDFS to run
>>>> map reduce jobs on them. How will this model work with the HBase
>>>> approach? What process will aggregate the HBase data into time
>>>> increments like the current demux and hourly/daily rolling processes
>>>> do? Basically, what does the time partitioning look like in the HBase
>>>> storage scheme?
>>>>
>>>>> My concern is that the demux process is going to become two parallel
>>>>> tracks, one working in mapreduce and another working in the collector.
>>>>> It becomes difficult to have clean, efficient parsers which work in both
>>>>
>>>> This statement makes me concerned that you're implying the need to
>>>> deprecate the current demux model, which is very different than making
>>>> one or the other the default in the configs. Is that the case?
>>>>
>>>>
>>>>
>>>> On Mon, Nov 22, 2010 at 11:41 AM, Eric Yang <[email protected]> wrote:
>>>>> MySQL support has been removed from Chukwa 0.5. My concern is that the
>>>>> demux process is going to become two parallel tracks, one working in
>>>>> mapreduce and another working in the collector. It becomes difficult to
>>>>> have clean, efficient parsers which work in both places. From an
>>>>> architecture perspective, incremental updates to data are better than
>>>>> batch processing for near-real-time monitoring purposes. I'd like to
>>>>> ensure the Chukwa framework can deliver on Chukwa's mission statement,
>>>>> hence I stand by Hbase as the default. I was playing with the Hbase
>>>>> 0.20.6 + Pig 0.8 branch last weekend, and I was very impressed by both
>>>>> the speed and performance of this combination. I encourage people to
>>>>> try it out.
>>>>>
>>>>> Regards,
>>>>> Eric
>>>>>
>>>>> On 11/22/10 10:50 AM, "Ariel Rabkin" <[email protected]> wrote:
>>>>>
>>>>> I agree with Bill and Deshpande that we ought to make clear to users
>>>>> that they don't need HICC, and therefore don't need either MySQL or
>>>>> HBase.
>>>>>
>>>>> But I think what Eric meant to ask was which of MySQL and HBase ought
>>>>> to be the default *for HICC*. My sense is that the HBase support
>>>>> isn't quite mature enough, but it's getting there.
>>>>>
>>>>> I think HBase is ultimately the way to go. I think we might benefit as
>>>>> a community by doing a 0.5 release first, while waiting for the
>>>>> pig-based aggregation support that's blocking HBase.
>>>>>
>>>>> --Ari
>>>>>
>>>>> On Mon, Nov 22, 2010 at 10:47 AM, Deshpande, Deepak
>>>>> <[email protected]> wrote:
>>>>>> I agree. Making HBase the default would make some Chukwa users' lives
>>>>>> difficult. In my set up, I don't need HDFS. I am using Chukwa merely
>>>>>> as a log streaming framework. I have plugged in my own writer to write
>>>>>> log files in the local file system (instead of HDFS). I evaluated
>>>>>> Chukwa against other frameworks, and Chukwa had better fault tolerance
>>>>>> built in than the others. This made me recommend Chukwa over other
>>>>>> frameworks.
>>>>>>
>>>>>> Making HBase the default option would definitely make my life
>>>>>> difficult :).
>>>>>>
>>>>>> Thanks,
>>>>>> Deepak Deshpande
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Ari Rabkin [email protected]
>>>>> UC Berkeley Computer Science Department
>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Ahmed Fathalla
>>>
>>
>>
>
>