OK, I found that HTable#checkAndPut() works perfectly for me. Here is my final code (in Scala):
Thanks, Bruno, for writing the blog article on this topic. It was very informative.

Outerthought :: HBase row locks
http://outerthought.org/blog/blog/380-OTC.html

===============================================
lazy val UniqueIndexQualifier = "unq".getBytes
lazy val AbsenceMarker   = Array[Byte]()      // empty byte array
lazy val ExistenceMarker = Array[Byte](0x01)

def insert(table: HTable, put: Put): Unit = {
  put.add(Family, UniqueIndexQualifier, ExistenceMarker)
  val succeeded = table.checkAndPut(put.getRow, Family,
                                    UniqueIndexQualifier, AbsenceMarker, put)
  if (!succeeded) {
    throw new DuplicateRowException("Tried to insert a duplicate row: "
      + Bytes.toString(put.getRow))
  }
}

def update(table: HTable, put: Put): Unit = {
  val succeeded = table.checkAndPut(put.getRow, Family,
                                    UniqueIndexQualifier, ExistenceMarker, put)
  if (!succeeded) {
    throw new RowNotFoundException("Tried to update a non-existing row: "
      + Bytes.toString(put.getRow))
  }
}
===============================================

Thanks,

--
河野 達也
Tatsuya Kawano (Mr.)
Tokyo, Japan

twitter: http://twitter.com/tatsuya6502


2010/5/1 Tatsuya Kawano <tatsuy...@snowcocoa.info>:
> Thanks all for your responses; they are very helpful.
>
> 4/30/2010 Todd Lipcon <t...@cloudera.com>:
>> Note that your solution is not correct in the case of failure, since the
>> check and put are not atomic with each other.
>>
>> If your client or server fails between the ICV and the put, no other clients
>> will be able to put the row, but there will be no data.
>
> I agree; my solution is a bit fragile. If I stick with this plan, I
> could try to delete the counter after the put fails. However, even the
> delete might not work, because the put failure could be caused by a
> network disruption, a region server problem, etc. So, I'm going to have
> to leave some kind of failure log, so I can remove the reserved key
> later by hand.
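[Editor's note: the check-and-put pattern above can be demonstrated without a running HBase cluster. The sketch below simulates the semantics in plain Scala: a `checkAndPut` that atomically compares the unique-index cell against an expected value before writing, which is what makes the insert safe against concurrent duplicates. `InMemoryTable` and the `String`-keyed rows are hypothetical stand-ins, not part of the HBase API.]

```scala
import java.util.concurrent.ConcurrentHashMap

// Hypothetical in-memory stand-in for an HBase table, holding only the
// unique-index marker cell for each row.
class InMemoryTable {
  private val cells = new ConcurrentHashMap[String, Array[Byte]]()

  // Mimics HTable#checkAndPut: write `update` only if the current cell
  // value equals `expected`. An empty array stands for "cell absent",
  // matching the AbsenceMarker convention in the thread.
  def checkAndPut(row: String,
                  expected: Array[Byte],
                  update: Array[Byte]): Boolean =
    cells.synchronized {
      val current = Option(cells.get(row)).getOrElse(Array.empty[Byte])
      if (current.sameElements(expected)) { cells.put(row, update); true }
      else false
    }
}

val AbsenceMarker   = Array.empty[Byte]
val ExistenceMarker = Array[Byte](0x01)

// insert succeeds only if the marker cell was absent (row did not exist).
def insert(table: InMemoryTable, row: String): Boolean =
  table.checkAndPut(row, AbsenceMarker, ExistenceMarker)

// update succeeds only if the marker cell was present (row exists).
def update(table: InMemoryTable, row: String): Boolean =
  table.checkAndPut(row, ExistenceMarker, ExistenceMarker)
```

Because the compare and the write happen under one lock (one region-server operation, in real HBase), there is no window in which two clients can both see the row as absent and both insert it, which is exactly the gap in the earlier ICV-then-put version.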
>
>
> 4/30/2010 Guilherme Germoglio <germog...@gmail.com>:
>> Can the keys be randomly generated, or must they be incremental? Remember
>> that you can achieve higher throughput if they are randomly generated, since
>> the insertions will possibly load all machines more evenly.
>>
>> Using UUIDs may ensure key uniqueness (I don't expect a UUID clash soon :-)
>> and load balance over the cluster,
>
> 4/30/2010 Michael Segel <michael_se...@hotmail.com>:
>> UUIDs won't clash. Especially if you're using version 5, which is a truncated
>> SHA-1 hash of the UUID.
>
> Thanks for the info. Well, for my case, I'd like to use a combination
> of the business data as the row key, so I can scan them. But I'll
> keep the UUID option for other cases.
>
>
> 4/30/2010 Guilherme Germoglio <germog...@gmail.com>:
>> but if you are paranoid enough, you can
>> also check whether a row already exists by using
>> checkAndPut <http://hadoop.apache.org/hbase/docs/r0.20.3/api/org/apache/hadoop/hbase/client/HTable.html#checkAndPut(byte[], byte[], byte[], byte[], org.apache.hadoop.hbase.client.Put)>
>> (just check for an empty byte array value in a column that you can ensure
>> always has some value).
>
> So, checkAndPut() seems ideal for my case. I didn't realize I could use
> it to check whether a row already exists. I'll give it a try!
>
>
> Thanks,
> Tatsuya
>
> --
> 河野 達也
> Tatsuya Kawano (Mr.)
> Tokyo, Japan
>
> twitter: http://twitter.com/tatsuya6502
>
>
> 2010/4/30 5:09 Michael Segel <michael_se...@hotmail.com>:
>>
>> UUIDs won't clash. Especially if you're using version 5, which is a truncated
>> SHA-1 hash of the UUID.
>>
>>
>>> From: germog...@gmail.com
>>> Date: Thu, 29 Apr 2010 13:58:42 -0300
>>> Subject: Re: Unique row ID constraint
>>> To: hbase-user@hadoop.apache.org
>>>
>>> Hello Tatsuya,
>>>
>>> Can the keys be randomly generated, or must they be incremental?
>>> Remember
>>> that you can achieve higher throughput if they are randomly generated, since
>>> the insertions will possibly load all machines more evenly.
>>>
>>> Using UUIDs may ensure key uniqueness (I don't expect a UUID clash soon :-)
>>> and load balance over the cluster, but if you are paranoid enough, you can
>>> also check whether a row already exists by using
>>> checkAndPut <http://hadoop.apache.org/hbase/docs/r0.20.3/api/org/apache/hadoop/hbase/client/HTable.html#checkAndPut(byte[], byte[], byte[], byte[], org.apache.hadoop.hbase.client.Put)>
>>> (just check for an empty byte array value in a column that you can ensure
>>> always has some value).
>>>
>>> On Thu, Apr 29, 2010 at 1:36 PM, Todd Lipcon <t...@cloudera.com> wrote:
>>>
>>> > Hi Tatsuya,
>>> >
>>> > Note that your solution is not correct in the case of failure, since the
>>> > check and put are not atomic with each other.
>>> >
>>> > If your client or server fails between the ICV and the put, no other
>>> > clients will be able to put the row, but there will be no data.
>>> >
>>> > -Todd
>>> >
>>> >
>>> > On Thu, Apr 29, 2010 at 1:33 AM, Tatsuya Kawano <tatsuy...@snowcocoa.info>
>>> > wrote:
>>> >
>>> > > Hi Stack and Ryan,
>>> > >
>>> > > Thanks for your advice. I knew using a row lock wasn't ideal, but I
>>> > > couldn't find an appropriate atomic operation to do compare-and-swap.
>>> > >
>>> > > So, thanks Stack for helping me find it. I found that the
>>> > > incrementColumnValue() atomic operation works for me, since it
>>> > > automatically initializes the column value to 0 when the column
>>> > > doesn't exist. I can try to increment the column value by 1, and if it
>>> > > returns 1, I can be sure that I'm the first one who has created the
>>> > > column and row.
>>> > >
>>> > > So, my updated code is much simpler and now lock-free.
>>> > >
>>> > > ===============================================
>>> > > def insert(table: HTable, put: Put): Unit = {
>>> > >   val count = table.incrementColumnValue(put.getRow, family,
>>> > >                                          uniqueQual, 1)
>>> > >
>>> > >   if (count == 1) {
>>> > >     table.put(put)
>>> > >
>>> > >   } else {
>>> > >     throw new DuplicateRowException("Tried to insert a duplicate row: "
>>> > >       + Bytes.toString(put.getRow))
>>> > >   }
>>> > > }
>>> > > ===============================================
>>> > >
>>> > > Thanks,
>>> > > Tatsuya
>>> > >
>>> > >
>>> > > 2010/4/29 Ryan Rawson <ryano...@gmail.com>:
>>> > > > I would strongly discourage people from building on top of
>>> > > > lockRow/unlockRow. The problem is, if a row is not available, lockRow
>>> > > > will hold a responder thread, and you can end up with a deadlock
>>> > > > because the lock holder won't be able to unlock. Sure, the expiry
>>> > > > system kicks in, but 60 seconds is kind of infinity in database terms
>>> > > > :-)
>>> > > >
>>> > > > I would probably go with either ICV or CAS to build the tools you
>>> > > > want. With CAS you can accomplish a lot of the things locking
>>> > > > accomplishes, but more efficiently.
>>> > > >
>>> > > > On Wed, Apr 28, 2010 at 9:42 AM, Stack <st...@duboce.net> wrote:
>>> > > >> Would the incrementValue [1] work for this?
>>> > > >> St.Ack
>>> > > >>
>>> > > >> 1. http://hadoop.apache.org/hbase/docs/r0.20.3/api/org/apache/hadoop/hbase/client/HTable.html#incrementColumnValue%28byte[],%20byte[],%20byte[],%20long%29
>>> > > >>
>>> > > >> On Wed, Apr 28, 2010 at 7:40 AM, Tatsuya Kawano
>>> > > >> <tatsuy...@snowcocoa.info> wrote:
>>> > > >>> Hi,
>>> > > >>>
>>> > > >>> I'd like to implement a unique row ID constraint (like the primary key
>>> > > >>> constraint in an RDBMS) in my application framework.
>>> > > >>>
>>> > > >>> Here is a code fragment from my current implementation (HBase
>>> > > >>> 0.20.4rc) written in Scala.
>>> > > >>> It works as expected, but is there any
>>> > > >>> better (shorter) way to do this, like checkAndPut()? I'd like to pass
>>> > > >>> a single Put object to my function (method) rather than passing rowId,
>>> > > >>> family, qualifier, and value separately. I can't do this now because I
>>> > > >>> have to give the rowLock object when I instantiate the Put.
>>> > > >>>
>>> > > >>> ===============================================
>>> > > >>> def insert(table: HTable, rowId: Array[Byte], family: Array[Byte],
>>> > > >>>            qualifier: Array[Byte], value: Array[Byte]): Unit = {
>>> > > >>>
>>> > > >>>   val get = new Get(rowId)
>>> > > >>>
>>> > > >>>   val lock = table.lockRow(rowId)  // will expire in one minute
>>> > > >>>   try {
>>> > > >>>     if (table.exists(get)) {
>>> > > >>>       throw new DuplicateRowException("Tried to insert a duplicate row: "
>>> > > >>>         + Bytes.toString(rowId))
>>> > > >>>
>>> > > >>>     } else {
>>> > > >>>       val put = new Put(rowId, lock)
>>> > > >>>       put.add(family, qualifier, value)
>>> > > >>>
>>> > > >>>       table.put(put)
>>> > > >>>     }
>>> > > >>>
>>> > > >>>   } finally {
>>> > > >>>     table.unlockRow(lock)
>>> > > >>>   }
>>> > > >>>
>>> > > >>> }
>>> > > >>> ===============================================
>>> > > >>>
>>> > > >>> Thanks,
>>> > > >>>
>>> > > >>> --
>>> > > >>> 河野 達也
>>> > > >>> Tatsuya Kawano (Mr.)
>>> > > >>> Tokyo, Japan
>>> > > >>>
>>> > > >>> twitter: http://twitter.com/tatsuya6502
>>> >
>>> >
>>> > --
>>> > Todd Lipcon
>>> > Software Engineer, Cloudera
>>>
>>>
>>> --
>>> Guilherme
>>>
>>> msn: guigermog...@hotmail.com
>>> homepage: http://sites.google.com/site/germoglio/