The table was created with two column families: createdAt and event.  The
former holds the timestamp (one entry per entity) and the latter is a
collection of events whose entries take the form event:1524, event:1207,
etc.; for the time being I'm storing only the event time.
The input is a set of text files generated at a rate of about 600 an hour,
with up to 50,000 entries per file.  Each line in a text file contains a
unique entity ID, a timestamp of the first time the entity was seen, an
event code, and a history of the last 100 event codes.  When I haven't seen
an entity before I want to add everything in the history; when the entity
has been seen previously I just want to add the latest event.  I'm keeping
the table design simple to start with while I'm getting familiar with HBase.
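To make the per-record logic concrete, here is a minimal sketch of the upsert described above.  A plain HashMap stands in for the HBase table (so the sketch runs without a cluster); the class name, method names, and the choice of cell values are all illustrative, not taken from the original post:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the per-record upsert logic.  A HashMap stands in for the
// HBase table: keys are entity IDs, values map column names (such as
// "createdAt:" or "event:1524") to stored values.  In the real app the
// stored value would be the event time; here the timestamps passed in
// are used as placeholders.
public class UpsertSketch {
    private final Map<String, Map<String, Long>> table = new HashMap<>();

    // history: the last (up to 100) event codes for the entity;
    // latestEvent: the event code on this input line.
    public void process(String entityId, long firstSeen,
                        List<String> history, String latestEvent,
                        long eventTime) {
        Map<String, Long> row = table.get(entityId);
        if (row == null) {
            // Previously unseen entity: create the row and load the
            // full event history, one column per event code.
            row = new HashMap<>();
            row.put("createdAt:", firstSeen);
            for (String event : history) {
                row.put("event:" + event, firstSeen);
            }
            table.put(entityId, row);
        }
        // Seen or not, record the latest event.
        row.put("event:" + latestEvent, eventTime);
    }

    public Map<String, Long> getRow(String entityId) {
        return table.get(entityId);
    }
}
```

Note that for an already-known entity only the latest event column is written; the history on the input line is ignored, matching the rule above.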

My principal concern is reading from the HBase table during the map/reduce
process to determine whether an entity already exists.  If I'm running the
map/reduce on a single machine it's pretty easy to keep track of previously
unknown entities; but if I'm running in a cluster, a new entity may show up
in the inputs to several concurrent map tasks.
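One way to defuse that race is to drop the check-then-insert entirely and make the write idempotent: because each event lands in its own column (event:CODE), two map tasks that both treat the same entity as new simply write the same cells, and the table converges to the same state either way.  A sketch of that idea, again with an in-memory map standing in for the HBase table (all names here are illustrative):

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: blind, idempotent puts instead of check-then-insert.  If two
// concurrent tasks both see the "same" new entity, they write identical
// columns, so the race does no harm.  A ConcurrentHashMap stands in for
// the HBase table.
public class IdempotentPutSketch {
    private final Map<String, Map<String, Long>> table = new ConcurrentHashMap<>();

    public void put(String entityId, long createdAt, List<String> events) {
        Map<String, Long> row =
            table.computeIfAbsent(entityId, k -> new ConcurrentHashMap<>());
        row.putIfAbsent("createdAt:", createdAt);  // first writer wins
        for (String event : events) {
            row.put("event:" + event, createdAt);  // same cell either way
        }
    }

    public Map<String, Long> getRow(String entityId) {
        return table.get(entityId);
    }
}
```

The trade-off is that a known entity's history gets (harmlessly) rewritten instead of skipped, exchanging a read-per-record for some redundant writes.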


Jean-Daniel Cryans wrote:
> 
> Brian (guessing it's your name from your email address),
> 
> Please be more specific about your table design. For example, a "column" in
> HBase is a very vague word, since it may refer to a column family or a
> column key inside a column family. Also, what kind of load do you expect to
> have?
> 
> Maybe answering this will also help you understand HBase.
> 
> Thx,
> 
> J-D
> 
> On Fri, Jul 18, 2008 at 4:41 PM, imbmay <[EMAIL PROTECTED]> wrote:
> 
>>
>> I want to use hbase to maintain a very large dataset which needs to be
>> updated pretty much continuously.  I'm creating a record for each entity
>> and including a creation timestamp column as well as between 10 and 1000
>> additional columns named for distinct events related to the record entity.
>> Being new to hbase the approach I've taken is to create a map/reduce app
>> that for each input record:
>>
>> Does a lookup in the table using HTable get(row, column) on the timestamp
>> column to determine if there is an existing row for the entity.
>> If there is no existing record for the entity, the event history for the
>> entity is added to the table with one column added per unique event id.
>> If there is an existing record for the entity, it just adds the most
>> recent event to the table.
>>
>> I'd like feedback as to whether this is a reasonable approach in terms of
>> general performance and reliability or if there is a different pattern
>> better suited to hbase with map/reduce or if I should even be using
>> map/reduce for this.
>>
>> Thanks in advance.
>>
>>
>> --
>> View this message in context:
>> http://www.nabble.com/Table-Updates-with-Map-Reduce-tp18537368p18537368.html
>> Sent from the HBase User mailing list archive at Nabble.com.
>>
>>
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Table-Updates-with-Map-Reduce-tp18537368p18548888.html
Sent from the HBase User mailing list archive at Nabble.com.