[jira] [Commented] (HBASE-3967) Support deletes in HFileOutputFormat based bulk import mechanism

Bogdan-Alexandru Matican (JIRA) Tue, 14 Jun 2011 23:25:07 -0700

    [ 
https://issues.apache.org/jira/browse/HBASE-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13049645#comment-13049645
 ]


Bogdan-Alexandru Matican commented on HBASE-3967:
-------------------------------------------------

Hello and thank you!

From what I gathered by looking through the Hadoop code, the MapTask class 
will try to get serializers for the respective classes based on their actual 
.class field. This basically means that even if the objects would fail the 
check (i.e. if we take the check out), serialization should still work 
afterwards.
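
To make the check under discussion concrete, here is a small stand-alone Java sketch. It does not use Hadoop. strictCheck models an exact-class comparison of the kind being discussed, and relaxedCheck is a hypothetical looser variant that accepts any subtype of the declared class. Every class and method name in it is made up for illustration.

```java
class TypeCheckSketch {
    // Hypothetical stand-ins for a declared map output class and two subtypes.
    static class Mutation {}
    static class PutLike extends Mutation {}
    static class DeleteLike extends Mutation {}

    // Exact-class comparison: a subtype of the declared class is rejected.
    static boolean strictCheck(Class<?> declared, Object value) {
        return value.getClass() == declared;
    }

    // Looser comparison (an assumption, not existing Hadoop behaviour):
    // any instance of the declared class or of a subtype is accepted.
    static boolean relaxedCheck(Class<?> declared, Object value) {
        return declared.isInstance(value);
    }

    public static void main(String[] args) {
        Object put = new PutLike();
        System.out.println(strictCheck(Mutation.class, put));  // false
        System.out.println(relaxedCheck(Mutation.class, put)); // true
        System.out.println(strictCheck(PutLike.class, put));   // true
    }
}
```

Removing the check altogether would be looser still than relaxedCheck, since nothing would be rejected before serialization.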

I back-traced it through the code:

* SerializationFactory gets configuration from job 
http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/com.cloudera.hadoop/hadoop-core/0.20.2-320/org/apache/hadoop/mapred/MapTask.java#794

* SerializationFactory gathers the proper Serialization from the conf 
(currently only for the WritableSerialization) 
http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/com.cloudera.hadoop/hadoop-core/0.20.2-320/org/apache/hadoop/io/serializer/SerializationFactory.java#50

* WritableSerialization delivers a proper Serializer and Deserializer that will 
work on the underlying objects, since they call write and readFields on the 
respective Writable objects that they handle 
http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/com.cloudera.hadoop/hadoop-core/0.20.2-320/org/apache/hadoop/io/serializer/WritableSerialization.java#89
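
The three steps above can be modelled with a small stand-alone Java sketch. It is a simplified model and not the Hadoop classes themselves: the Writable and Serialization interfaces below are cut-down stand-ins, TaggedRow is a hypothetical value type, and only the write side is covered. The point it shows is that the serializer delegates to the object's own write method, so the bytes depend on the runtime class of the object.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;

class SerializationChainSketch {
    // Cut-down stand-in for Hadoop's Writable (write side only).
    interface Writable {
        void write(DataOutput out) throws IOException;
    }

    // Cut-down stand-in for a Serialization: decides which classes it handles.
    interface Serialization {
        boolean accept(Class<?> c);
        byte[] serialize(Object value);
    }

    // Accepts any class implementing Writable and delegates to its write().
    static class WritableSerializationSketch implements Serialization {
        public boolean accept(Class<?> c) {
            return Writable.class.isAssignableFrom(c);
        }

        public byte[] serialize(Object value) {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try {
                // Dynamic dispatch: the object's own write() runs.
                ((Writable) value).write(new DataOutputStream(buffer));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            return buffer.toByteArray();
        }
    }

    // Factory step: first configured serialization that accepts the class.
    static Serialization getSerialization(List<Serialization> configured, Class<?> c) {
        for (Serialization s : configured) {
            if (s.accept(c)) {
                return s;
            }
        }
        return null;
    }

    // Hypothetical Writable: a one-byte tag followed by a row key.
    static class TaggedRow implements Writable {
        final byte tag;
        final String row;

        TaggedRow(byte tag, String row) {
            this.tag = tag;
            this.row = row;
        }

        public void write(DataOutput out) throws IOException {
            out.writeByte(tag);
            out.writeUTF(row);
        }
    }

    public static void main(String[] args) {
        List<Serialization> configured =
                Arrays.<Serialization>asList(new WritableSerializationSketch());
        TaggedRow value = new TaggedRow((byte) 1, "row1");
        Serialization s = getSerialization(configured, value.getClass());
        // 1 tag byte + 2 length bytes from writeUTF + 4 bytes for "row1"
        System.out.println(s.serialize(value).length); // prints 7
    }
}
```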

I am posting this to make sure I have it right, as I haven't yet tested it by 
actually modifying the Hadoop code myself. (I should probably check out, 
modify and build it to test this...) 

This will also ensure backwards compatibility, as anything previously written 
cannot possibly break. Also, in line with your note about Postel's Law, this 
will increase the number of use cases that get accepted, while not causing any 
potential problems.

For this particular Put+Delete case, I have finished the initial 
implementation with the union approach and it works. But if the above holds 
and removing those two checks would make the MR code more effective in 
general, then that should probably change too :)
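
The union implementation mentioned above is not shown in this comment. As a rough idea of what such a wrapper can look like, here is a stand-alone Java sketch of a tagged record that carries either a put or a delete under a single class. It is an assumption about the general shape, not the actual patch: the class name, the Kind enum and the single row field are all hypothetical, and it uses plain java.io rather than the HBase types.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class MutationUnionSketch {
    // Which kind of mutation this record carries.
    enum Kind { PUT, DELETE }

    Kind kind;
    String row; // hypothetical payload; a real record would carry the full mutation

    // No-arg constructor so a reader can create an empty instance and fill it.
    MutationUnionSketch() {}

    MutationUnionSketch(Kind kind, String row) {
        this.kind = kind;
        this.row = row;
    }

    // The tag is written first so the reader knows how to interpret the payload.
    void write(DataOutput out) throws IOException {
        out.writeByte(kind.ordinal());
        out.writeUTF(row);
    }

    void readFields(DataInput in) throws IOException {
        kind = Kind.values()[in.readByte()];
        row = in.readUTF();
    }

    // Serialize and deserialize through a byte array, as a quick self-check.
    static MutationUnionSketch roundTrip(MutationUnionSketch original) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            original.write(new DataOutputStream(buffer));
            MutationUnionSketch copy = new MutationUnionSketch();
            copy.readFields(new DataInputStream(
                    new ByteArrayInputStream(buffer.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        MutationUnionSketch copy =
                roundTrip(new MutationUnionSketch(Kind.DELETE, "row1"));
        System.out.println(copy.kind + " " + copy.row); // prints "DELETE row1"
    }
}
```

Because every map output value has the same concrete class in this design, an exact-class check on the value type is satisfied without any change on the Hadoop side.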


> Support deletes in HFileOutputFormat based bulk import mechanism
> ----------------------------------------------------------------
>
>                 Key: HBASE-3967
>                 URL: https://issues.apache.org/jira/browse/HBASE-3967
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Kannan Muthukkaruppan
>            Assignee: Bogdan-Alexandru Matican
>
> During bulk imports, it'll be useful to be able to do delete mutations 
> (either to delete data that already exists in HBase or was inserted earlier 
> during this run of the import). 
> For example, we have a use case, where we are processing a log of data which 
> may have both inserts and deletes in the mix and we want to upload that into 
> HBase using the bulk import mechanism.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
