Thank you, that supports what I was thinking.

Jean-Daniel Cryans wrote:
>
> Ok, now I have a good picture of your situation (took me a moment).
>
> I guess that even if it's concurrent it will not be that much of a
> problem. Keeping the max version at 1 will ensure that even if 3 mappers
> insert the history of one entity, the data that overlaps will still be
> inserted in your "event:" family and the rest will be discarded. Your
> biggest concern will be the efficiency of reading data from HBase, so
> your mappers should have a local cache.
>
> Hope this helps,
>
> J-D
>
> On Sat, Jul 19, 2008 at 5:22 PM, imbmay <[EMAIL PROTECTED]> wrote:
>
>> The table was created with two column families: createdAt and event.
>> The former is the timestamp, so 1 entry per entity, and the latter is a
>> collection of events. In the latter, entries take the form event:1524,
>> event:1207, etc., and for the time being I'm storing only the event time.
>> The input is a set of text files generated at a rate of about 600 an hour
>> with up to 50,000 entries per file. Each line in the text file contains a
>> unique entity ID, a timestamp of the first time it was seen, an event code
>> and a history of the last 100 event codes. In cases where I haven't seen
>> an entity before I want to add everything in the history; when the entity
>> has been seen previously I just want to add the last event. I'm keeping
>> the table design simple to start with while I'm getting familiar with
>> HBase.
>>
>> The principal area of concern I have is regarding the reading of the data
>> from the HBase table during the map/reduce process to determine if an
>> entity already exists.
>> If I'm running the map/reduce on a single machine then it's pretty easy
>> to keep track of previously unknown entities; but if I'm running in a
>> cluster a new entity may show up in the inputs to several concurrent
>> [EMAIL PROTECTED]
>>
>>
>> Jean-Daniel Cryans wrote:
>> >
>> > Brian (guessing it's your name from your email address),
>> >
>> > Please be more specific about your table design. For example, a "column"
>> > in HBase is a very vague word since it may refer to a column family or a
>> > column key inside a column family. Also, what kind of load do you expect
>> > to have?
>> >
>> > Maybe answering this will also help you understand HBase.
>> >
>> > Thx,
>> >
>> > J-D
>> >
>> > On Fri, Jul 18, 2008 at 4:41 PM, imbmay <[EMAIL PROTECTED]> wrote:
>> >
>> >> I want to use hbase to maintain a very large dataset which needs to be
>> >> updated pretty much continuously. I'm creating a record for each entity
>> >> and including a creation timestamp column as well as between 10 and 1000
>> >> additional columns named for distinct events related to the record
>> >> entity. Being new to hbase, the approach I've taken is to create a
>> >> map/reduce app that for each input record:
>> >>
>> >> Does a lookup in the table using HTable get(row, column) on the
>> >> timestamp column to determine if there is an existing row for the entity.
>> >> If there is no existing record for the entity, the event history for the
>> >> entity is added to the table with one column added per unique event id.
>> >> If there is an existing record for the entity, it just adds the most
>> >> recent event to the table.
>> >>
>> >> I'd like feedback as to whether this is a reasonable approach in terms
>> >> of general performance and reliability, or if there is a different
>> >> pattern better suited to hbase with map/reduce, or if I should even be
>> >> using map/reduce for this.
>> >>
>> >> Thanks in advance.
>> >>
>> >> --
>> >> View this message in context:
>> >> http://www.nabble.com/Table-Updates-with-Map-Reduce-tp18537368p18537368.html
>> >> Sent from the HBase User mailing list archive at Nabble.com.
>> >
>>
>> --
>> View this message in context:
>> http://www.nabble.com/Table-Updates-with-Map-Reduce-tp18537368p18548888.html
>> Sent from the HBase User mailing list archive at Nabble.com.
>

--
View this message in context:
http://www.nabble.com/Table-Updates-with-Map-Reduce-tp18537368p18576436.html
Sent from the HBase User mailing list archive at Nabble.com.
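
The pattern discussed in this thread (look up the createdAt entry, write the full history for a new entity or only the last event for a known one, and keep a local cache in each mapper) can be sketched roughly as below. This is a minimal illustration in Python, not HBase client code. `FakeTable` is a hypothetical in-memory stand-in for a table whose column families keep a single version per cell, `EntityMapper` and its methods are made-up names, and the record layout (event code paired with event time) is an assumption. The point it tries to show is that when two mappers both decide an entity is new, their overlapping history writes land on the same cells, so the table ends up in the same state as with a single writer.

```python
from typing import Dict, List, Optional, Set, Tuple

Event = Tuple[str, str]  # (event code, event time); hypothetical record layout


class FakeTable:
    """In-memory stand-in for a table that keeps one version per cell."""

    def __init__(self) -> None:
        self.rows: Dict[str, Dict[str, str]] = {}
        self.reads = 0  # counts lookups so the effect of the cache is visible

    def get(self, row: str, column: str) -> Optional[str]:
        self.reads += 1
        return self.rows.get(row, {}).get(column)

    def put(self, row: str, column: str, value: str) -> None:
        # With a single version per cell, rewriting a cell just replaces it.
        self.rows.setdefault(row, {})[column] = value


class EntityMapper:
    """One map task: checks for the entity, then writes history or last event."""

    def __init__(self, table: FakeTable) -> None:
        self.table = table
        self.known: Set[str] = set()  # local cache of entity IDs known to exist

    def exists(self, entity_id: str) -> bool:
        if entity_id in self.known:
            return True  # cache hit, no table read
        found = self.table.get(entity_id, "createdAt:") is not None
        if found:
            self.known.add(entity_id)
        return found

    def write(self, entity_id: str, created_at: str, latest: Event,
              history: List[Event], is_new: bool) -> None:
        if is_new:
            # New entity: store the creation time and every event in the history.
            self.table.put(entity_id, "createdAt:", created_at)
            for code, when in history:
                self.table.put(entity_id, "event:" + code, when)
        # In both cases, store the most recent event.
        code, when = latest
        self.table.put(entity_id, "event:" + code, when)
        self.known.add(entity_id)


table = FakeTable()
m1, m2 = EntityMapper(table), EntityMapper(table)
history = [("1207", "t1"), ("1524", "t2")]

# Simulated race: both mappers look up the entity before either has written.
new1 = not m1.exists("e42")
new2 = not m2.exists("e42")
m1.write("e42", "t0", ("1524", "t2"), history, new1)
m2.write("e42", "t0", ("1600", "t3"), history, new2)

reads_before = table.reads
seen_again = m1.exists("e42")  # answered from the local cache
```

In a real job the cache would need a size bound, and the existence check would still go to the table on a miss; the sketch only shows why duplicate history inserts are harmless when each cell holds one version.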
