[ 
https://issues.apache.org/jira/browse/HBASE-7645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guido Serra aka Zeph updated HBASE-7645:
----------------------------------------

    Description: 
If I run SQOOP a couple of times on the same dataset, writing to HBase,
I end up with duplicated data:

{code}
hbase(main):030:0> get "dump_HKFAS.sales_order", "1", {COLUMN => "mysql:created_at", VERSIONS => 4}
COLUMN                             CELL
mysql:created_at                  timestamp=1358853505756, value=2011-12-21 18:07:38.0
mysql:created_at                  timestamp=1358790515451, value=2011-12-21 18:07:38.0
2 row(s) in 0.0040 seconds

today's sqoop run
hbase(main):031:0> Date.new(1358853505756).toString()
=> "Tue Jan 22 11:18:25 UTC 2013"

yesterday's sqoop run
hbase(main):032:0> Date.new(1358790515451).toString()
=> "Mon Jan 21 17:48:35 UTC 2013"
{code}
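
The cell timestamps above are epoch milliseconds. As a minimal sketch for checking them outside the shell, the following decodes them with plain java.time; the class and method names (CellTimestamps, toUtcString) are made up for this illustration and are not part of HBase or SQOOP.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical helper, not part of HBase: decodes a cell timestamp
// given as milliseconds since the Unix epoch.
final class CellTimestamps {
    // Same layout as the shell output above, pinned to UTC and English names.
    private static final DateTimeFormatter FORMAT =
        DateTimeFormatter.ofPattern("EEE MMM dd HH:mm:ss 'UTC' yyyy", Locale.ENGLISH)
            .withZone(ZoneOffset.UTC);

    static String toUtcString(long epochMillis) {
        return FORMAT.format(Instant.ofEpochMilli(epochMillis));
    }

    public static void main(String[] args) {
        System.out.println(toUtcString(1358853505756L)); // Tue Jan 22 11:18:25 UTC 2013
        System.out.println(toUtcString(1358790515451L)); // Mon Jan 21 17:48:35 UTC 2013
    }
}
```

The two timestamps are roughly 17.5 hours apart, which matches one SQOOP run per day writing a new version of the same cell.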

Put.add() writes the KeyValue without checking whether the value, timestamp 
aside, has actually changed. Is that by design, or is it a bug?

In other words, what is the idea behind this? Is SQOOP (the client application) 
supposed to read the current value before calling add()?

from: trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/client/Put.java
{code}

  public Put add(byte [] family, byte [] qualifier, byte [] value) {
    return add(family, qualifier, this.ts, value);
  }

  public Put add(byte [] family, byte [] qualifier, long ts, byte [] value) {
    List<KeyValue> list = getKeyValueList(family);
    KeyValue kv = createPutKeyValue(family, qualifier, ts, value);
    list.add(kv);
    familyMap.put(kv.getFamily(), list);
    return this;
  }
{code}
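
To make the question concrete, here is a toy in-memory model of a single cell's versions. It is only a sketch under stated assumptions, not the HBase client API: VersionedCell, put and putIfChanged are hypothetical names, and the model assumes that a write with a new timestamp adds a version while a write with an already used timestamp replaces that version. It compares a blind re-import, a re-import with a fixed explicit timestamp, and a client that reads before writing.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Map;
import java.util.TreeMap;

// Toy model of the versions of ONE cell (row + family:qualifier).
// Hypothetical class for illustration; it does not use or mimic HBase classes.
final class VersionedCell {
    // timestamp -> value, newest timestamp first
    private final TreeMap<Long, byte[]> versions =
        new TreeMap<>(Comparator.reverseOrder());

    // Unconditional write: a new timestamp adds a version,
    // an existing timestamp overwrites that version.
    void put(long ts, byte[] value) {
        versions.put(ts, value.clone());
    }

    // Client-side guard: read the newest version and skip the write
    // when its value is byte-for-byte identical. Returns true if written.
    boolean putIfChanged(long ts, byte[] value) {
        Map.Entry<Long, byte[]> latest = versions.firstEntry();
        if (latest != null && Arrays.equals(latest.getValue(), value)) {
            return false;
        }
        put(ts, value);
        return true;
    }

    int versionCount() {
        return versions.size();
    }

    public static void main(String[] args) {
        byte[] value = "2011-12-21 18:07:38.0".getBytes(StandardCharsets.UTF_8);

        // Two runs, each stamped with its own write time: two versions.
        VersionedCell blind = new VersionedCell();
        blind.put(1358790515451L, value);
        blind.put(1358853505756L, value);
        System.out.println(blind.versionCount()); // 2

        // Two runs with the same explicit timestamp (arbitrary value here).
        VersionedCell pinned = new VersionedCell();
        pinned.put(1L, value);
        pinned.put(1L, value);
        System.out.println(pinned.versionCount()); // 1

        // Two runs where the client compares before writing.
        VersionedCell guarded = new VersionedCell();
        guarded.putIfChanged(1358790515451L, value);
        guarded.putIfChanged(1358853505756L, value);
        System.out.println(guarded.versionCount()); // 1
    }
}
```

In this model the read-before-write variant costs an extra read per cell, while the explicit-timestamp variant needs a timestamp that stays stable across runs; which trade-off fits SQOOP is exactly the open question above.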

> put without timestamp duplicates the record/row
> -----------------------------------------------
>
>                 Key: HBASE-7645
>                 URL: https://issues.apache.org/jira/browse/HBASE-7645
>             Project: HBase
>          Issue Type: Brainstorming
>          Components: Client
>            Reporter: Guido Serra aka Zeph

