The table was created with two column families: createdAt and event. The former holds the timestamp, so there is one entry per entity; the latter is a collection of events, with column keys of the form event:1524, event:1207, etc., and for the time being I'm storing only the event time as the value. The input is a set of text files generated at a rate of about 600 per hour, with up to 50,000 entries per file. Each line contains a unique entity ID, a timestamp of the first time it was seen, an event code, and a history of the last 100 event codes. When I haven't seen an entity before I want to add everything in the history; when the entity has been seen previously I just want to add the last event. I'm keeping the table design simple to start with while I'm getting familiar with HBase.
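To make that concrete, the per-record logic I have in mind looks roughly like the sketch below. This is only a sketch, assuming a Get/Put-style HBase client API (exact signatures vary across HBase versions), and the class name, method parameters and the "ts" qualifier are just illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

/** Per-record update: full history for new entities, only the latest event otherwise. */
public class EntityEventWriter {

    private static final byte[] CREATED_AT = Bytes.toBytes("createdAt");
    private static final byte[] EVENT = Bytes.toBytes("event");
    private static final byte[] TS = Bytes.toBytes("ts");   // qualifier name is illustrative

    private final HTable table;

    public EntityEventWriter(Configuration conf, String tableName) throws Exception {
        this.table = new HTable(conf, tableName);
    }

    public void process(String entityId, long firstSeen, String latestEventCode,
                        String[] historyCodes, long eventTime) throws Exception {
        byte[] row = Bytes.toBytes(entityId);

        // Check the createdAt column to decide whether this entity has been seen before.
        Get get = new Get(row);
        get.addColumn(CREATED_AT, TS);
        Result existing = table.get(get);

        Put put = new Put(row);
        if (existing.isEmpty()) {
            // New entity: record the creation timestamp and the full event history,
            // one event:<code> column per code, with the event time as the value.
            put.add(CREATED_AT, TS, Bytes.toBytes(firstSeen));
            for (String code : historyCodes) {
                put.add(EVENT, Bytes.toBytes(code), Bytes.toBytes(eventTime));
            }
        } else {
            // Known entity: add only the most recent event.
            put.add(EVENT, Bytes.toBytes(latestEventCode), Bytes.toBytes(eventTime));
        }
        table.put(put);
    }
}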
The principal area of concern I have is reading from the HBase table during the map/reduce job to determine whether an entity already exists. If I'm running the map/reduce on a single machine it's pretty easy to keep track of previously unknown entities; but if I'm running in a cluster, a new entity may show up in the inputs to several concurrent map tasks.


Jean-Daniel Cryans wrote:
>
> Brian (guessing it's your name from your email address),
>
> Please be more specific about your table design. For example, a "column" in
> HBase is a very vague word since it may refer to a column family or a column
> key inside a column family. Also, what kind of load do you expect to have?
>
> Maybe answering this will also help you understand HBase.
>
> Thx,
>
> J-D
>
> On Fri, Jul 18, 2008 at 4:41 PM, imbmay <[EMAIL PROTECTED]> wrote:
>
>> I want to use HBase to maintain a very large dataset which needs to be
>> updated pretty much continuously. I'm creating a record for each entity and
>> including a creation timestamp column as well as between 10 and 1000
>> additional columns named for distinct events related to the record entity.
>> Being new to HBase, the approach I've taken is to create a map/reduce app
>> that for each input record:
>>
>> Does a lookup in the table using HTable get(row, column) on the timestamp
>> column to determine if there is an existing row for the entity.
>> If there is no existing record for the entity, the event history for the
>> entity is added to the table with one column added per unique event id.
>> If there is an existing record for the entity, it just adds the most recent
>> event to the table.
>>
>> I'd like feedback as to whether this is a reasonable approach in terms of
>> general performance and reliability, or if there is a different pattern
>> better suited to HBase with map/reduce, or if I should even be using
>> map/reduce for this.
>>
>> Thanks in advance.
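Coming back to the concurrency concern above: one possible way to keep two map tasks from both deciding an entity is new might be to let HBase arbitrate with an atomic checkAndPut on the createdAt cell. Again only a rough sketch, assuming a checkAndPut-capable client (passing null as the expected value checks that the cell does not exist yet), and reusing the constants and imports from the sketch above:

    // Builds on CREATED_AT, EVENT and TS from the earlier sketch.
    public void processAtomically(HTable table, String entityId, long firstSeen,
                                  String latestEventCode, String[] historyCodes,
                                  long eventTime) throws Exception {
        byte[] row = Bytes.toBytes(entityId);

        // Try to claim the entity: the Put is applied only if createdAt:ts is absent
        // (a null expected value means "check for non-existence").
        Put claim = new Put(row);
        claim.add(CREATED_AT, TS, Bytes.toBytes(firstSeen));
        boolean isNew = table.checkAndPut(row, CREATED_AT, TS, null, claim);

        Put events = new Put(row);
        if (isNew) {
            // This task created the row: write the full event history.
            for (String code : historyCodes) {
                events.add(EVENT, Bytes.toBytes(code), Bytes.toBytes(eventTime));
            }
        } else {
            // The row already existed, or another task just created it: add only the latest event.
            events.add(EVENT, Bytes.toBytes(latestEventCode), Bytes.toBytes(eventTime));
        }
        table.put(events);
    }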
