OK, now I have a good picture of your situation (it took me a moment).

I think that even if it's concurrent, it won't be much of a problem.
Keeping the maximum number of versions at 1 (set on the column family
descriptor) will ensure that even if 3 mappers insert the history of the
same entity, the overlapping data will still land in your "event:" family
and the duplicate copies will simply be discarded. Your biggest concern
will be the efficiency of reading data from HBase, so your mappers should
keep a local cache.
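
For the cache, a minimal sketch in plain Java (all names hypothetical): a
bounded LRU map of the entity IDs a mapper has already confirmed exist, so
each entity costs at most one round trip to HBase.

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Hypothetical mapper-side cache: remembers entity IDs already
    // confirmed to exist in HBase, bounded so a long-running mapper
    // does not exhaust its heap.
    public class SeenEntityCache {
      private static final int MAX_ENTRIES = 100000; // tune to mapper heap

      private final Map<String, Boolean> seen =
          new LinkedHashMap<String, Boolean>(1024, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
              return size() > MAX_ENTRIES;
            }
          };

      public boolean contains(String entityId) { return seen.containsKey(entityId); }
      public void add(String entityId) { seen.put(entityId, Boolean.TRUE); }
    }

On a hit you skip the HBase read entirely and just write the latest event;
on a miss you do the get once and record the answer.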

Hope this helps,

J-D

On Sat, Jul 19, 2008 at 5:22 PM, imbmay <[EMAIL PROTECTED]> wrote:

>
> The table was created with two column families, createdAt and event. The
> former holds the creation timestamp, so there is one entry per entity;
> the latter is a collection of events whose entries take the form
> event:1524, event:1207, etc., and for the time being I'm storing only
> the event time. The input is a set of text files generated at a rate of
> about 600 per hour, with up to 50,000 entries per file. Each line in a
> text file contains a unique entity ID, a timestamp of the first time it
> was seen, an event code, and a history of the last 100 event codes. When
> I haven't seen an entity before I want to add everything in the history;
> when the entity has been seen previously I just want to add the last
> event. I'm keeping the table design simple to start with while I get
> familiar with HBase.
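>
> To make the layout concrete, here is a minimal sketch of writing one
> event (using the newer Get/Put client API for clarity; entityId,
> eventCode and eventTime are placeholders):
>
>     import java.io.IOException;
>     import org.apache.hadoop.hbase.client.HTable;
>     import org.apache.hadoop.hbase.client.Put;
>     import org.apache.hadoop.hbase.util.Bytes;
>
>     public class EventWriter {
>       // One column per event code under the "event" family; the cell
>       // value is the event time. Re-inserting the same event simply
>       // overwrites the cell, which is what makes replays harmless.
>       static void writeEvent(HTable table, String entityId,
>                              String eventCode, long eventTime)
>           throws IOException {
>         Put p = new Put(Bytes.toBytes(entityId));  // row key = entity ID
>         p.add(Bytes.toBytes("event"), Bytes.toBytes(eventCode),
>               Bytes.toBytes(eventTime));
>         table.put(p);
>       }
>     }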
>
> The principal area of concern I have is reading data back from the HBase
> table during the map/reduce job to determine whether an entity already
> exists. If I'm running the map/reduce on a single machine then it's
> pretty easy to keep track of previously unknown entities; but if I'm
> running in a cluster, a new entity may show up in the inputs to several
> concurrent mappers.
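>
> For reference, the per-entity read is just one get against the createdAt
> family, along these lines (a sketch with the newer Get/Put client API;
> names are placeholders):
>
>     import java.io.IOException;
>     import org.apache.hadoop.hbase.client.Get;
>     import org.apache.hadoop.hbase.client.HTable;
>     import org.apache.hadoop.hbase.client.Result;
>     import org.apache.hadoop.hbase.util.Bytes;
>
>     public class EntityCheck {
>       // An entity exists iff its row has a cell in the createdAt family.
>       static boolean exists(HTable table, String entityId) throws IOException {
>         Get g = new Get(Bytes.toBytes(entityId));
>         g.addFamily(Bytes.toBytes("createdAt")); // fetch only the tiny family
>         Result r = table.get(g);
>         return !r.isEmpty();
>       }
>     }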
>
>
> Jean-Daniel Cryans wrote:
> >
> > Brian (I'm guessing that's your name from your email address),
> >
> > Please be more specific about your table design. For example, "column"
> > in HBase is a vague word, since it may refer to a column family or to
> > a column key inside a column family. Also, what kind of load do you
> > expect to have?
> >
> > Answering these questions may also help you understand HBase better.
> >
> > Thx,
> >
> > J-D
> >
> > On Fri, Jul 18, 2008 at 4:41 PM, imbmay <[EMAIL PROTECTED]> wrote:
> >
> >>
> >> I want to use HBase to maintain a very large dataset which needs to
> >> be updated pretty much continuously. I'm creating a record for each
> >> entity, including a creation timestamp column as well as between 10
> >> and 1,000 additional columns named for distinct events related to the
> >> record's entity. Being new to HBase, the approach I've taken is to
> >> create a map/reduce app that, for each input record (sketched in code
> >> below):
> >>
> >> 1. Does a lookup in the table using HTable get(row, column) on the
> >>    timestamp column to determine whether there is an existing row for
> >>    the entity.
> >> 2. If there is no existing record for the entity, adds the entity's
> >>    event history to the table, one column per unique event id.
> >> 3. If there is an existing record, adds just the most recent event to
> >>    the table.
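> >>
> >> In code, that per-record flow would look roughly like this (a sketch
> >> using the newer Get/Put client API for clarity; the Event type and
> >> all names are placeholders):
> >>
> >>     import java.io.IOException;
> >>     import java.util.List;
> >>     import org.apache.hadoop.hbase.client.Get;
> >>     import org.apache.hadoop.hbase.client.HTable;
> >>     import org.apache.hadoop.hbase.client.Put;
> >>     import org.apache.hadoop.hbase.util.Bytes;
> >>
> >>     public class RecordProcessor {
> >>       static class Event {
> >>         final String code;
> >>         final long time;
> >>         Event(String code, long time) { this.code = code; this.time = time; }
> >>       }
> >>
> >>       // Full history for a brand-new entity, latest event otherwise.
> >>       static void processRecord(HTable table, String entityId,
> >>                                 long firstSeen, List<Event> history,
> >>                                 Event latest) throws IOException {
> >>         Get g = new Get(Bytes.toBytes(entityId));
> >>         g.addFamily(Bytes.toBytes("createdAt"));   // existence probe
> >>         boolean isNew = table.get(g).isEmpty();
> >>
> >>         Put p = new Put(Bytes.toBytes(entityId));
> >>         if (isNew) {
> >>           p.add(Bytes.toBytes("createdAt"), Bytes.toBytes(""),
> >>                 Bytes.toBytes(firstSeen));
> >>           for (Event e : history) {                // backfill the history
> >>             p.add(Bytes.toBytes("event"), Bytes.toBytes(e.code),
> >>                   Bytes.toBytes(e.time));
> >>           }
> >>         } else {
> >>           p.add(Bytes.toBytes("event"), Bytes.toBytes(latest.code),
> >>                 Bytes.toBytes(latest.time));
> >>         }
> >>         table.put(p);
> >>       }
> >>     }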
> >>
> >> I'd like feedback on whether this is a reasonable approach in terms
> >> of general performance and reliability, whether there is a different
> >> pattern better suited to HBase with map/reduce, or whether I should
> >> even be using map/reduce for this.
> >>
> >> Thanks in advance.
> >>
> >>
> >>
> >
> >
>
>
>
