[ 
https://issues.apache.org/jira/browse/HBASE-8593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13791343#comment-13791343
 ] 

Istvan Vajnorak commented on HBASE-8593:
----------------------------------------

Dear All,

As part of a Hadoop POC, I also ran into this issue and decided to write my 
own loader based on the ImportTSV tool.
On top of type safety, I also had key chunks in the file that had to be merged 
into one "compound key", so I decided to extend the DSL of the input pattern 
with type and key-fragment awareness, similar to this:

$HBASE_HOME/bin/hbase com.msci.appdev.hbase.report.job.ReportImportJob \
  -Dcom.msci.reports.mappingRule=KEY_PART1[i],KEY_PART2[i],KEY_PART3[s],o:t[s],o:p[i],o:r[i],o:c[s] \
  -Dcom.msci.reports.tablename=swarm_of_reports_5_billion \
  -Dcom.msci.reports.inputPath=hdfs://ddc-rm-lapp0001.dev.msci.org:8020/opt/data/import/swarm_of_reports/ca737044-8b13-4fb1-b56f-e0ac66f13230.tsv \
  -Dcom.msci.reports.outputPath=hdfs://ddc-rm-lapp0001.dev.msci.org:8020/opt/data/import/swarm_reports_again \
  -Dcom.msci.reports.performBulkLoad=true
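
To make the mapping rule format more concrete, here is a minimal sketch of how 
such a rule could be split into entries. It is not the actual loader code: the 
class names MappingEntry and MappingRuleParser are hypothetical, and treating 
every entry without a ':' as a row-key fragment is an assumption based on the 
example rule above.

```java
import java.util.ArrayList;
import java.util.List;

/** One parsed entry of the mapping rule, e.g. "o:p[i]" gives name "o:p" and type code "i". */
final class MappingEntry {
    final String name;      // key fragment name or "family:qualifier"
    final String typeCode;  // e.g. "i" or "s"; null when no [type] suffix is present
    final boolean keyPart;  // assumption: entries without ':' are row-key fragments

    MappingEntry(String name, String typeCode, boolean keyPart) {
        this.name = name;
        this.typeCode = typeCode;
        this.keyPart = keyPart;
    }
}

/** Hypothetical parser for rules such as "KEY_PART1[i],o:t[s],o:c". */
final class MappingRuleParser {
    static List<MappingEntry> parse(String rule) {
        List<MappingEntry> entries = new ArrayList<>();
        for (String token : rule.split(",")) {
            String trimmed = token.trim();
            String name = trimmed;
            String typeCode = null;
            int open = trimmed.indexOf('[');
            // Split "name[type]" into its two parts; leave plain names untouched.
            if (open >= 0 && trimmed.endsWith("]")) {
                name = trimmed.substring(0, open);
                typeCode = trimmed.substring(open + 1, trimmed.length() - 1);
            }
            entries.add(new MappingEntry(name, typeCode, !name.contains(":")));
        }
        return entries;
    }
}
```

Parsing the rule from the command above this way would yield seven entries, 
the first three being key fragments and the remaining four being columns in 
family "o".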

The system accepts the type parameters i, s, l and d; if it finds no type 
information, it treats the column value as a String.
Type handling then delegates to the Bytes class for the conversion, for 
example:

public enum InputDataType {

    SHORT("s") {
        @Override
        public byte[] toBytes(String value) {
            return Bytes.toBytes(Short.parseShort(value));
        }
    },
...
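
For readers who want to experiment without an HBase dependency, the following 
is a self-contained sketch of the same idea, not the elided code above. It 
uses java.nio.ByteBuffer instead of the Bytes class (big-endian by default, 
which I believe matches what Bytes produces). The name InputDataTypeSketch, 
the reading of "d" as double, and the fromCode helper are assumptions made for 
illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Illustrative stand-in for the enum above; ByteBuffer replaces HBase's Bytes. */
enum InputDataTypeSketch {
    INT("i") {
        @Override
        byte[] toBytes(String value) {
            return ByteBuffer.allocate(Integer.BYTES).putInt(Integer.parseInt(value)).array();
        }
    },
    SHORT("s") {
        @Override
        byte[] toBytes(String value) {
            return ByteBuffer.allocate(Short.BYTES).putShort(Short.parseShort(value)).array();
        }
    },
    LONG("l") {
        @Override
        byte[] toBytes(String value) {
            return ByteBuffer.allocate(Long.BYTES).putLong(Long.parseLong(value)).array();
        }
    },
    // Assumption: "d" stands for double.
    DOUBLE("d") {
        @Override
        byte[] toBytes(String value) {
            return ByteBuffer.allocate(Double.BYTES).putDouble(Double.parseDouble(value)).array();
        }
    },
    // Fallback used when no (or an unknown) type code is given.
    STRING("") {
        @Override
        byte[] toBytes(String value) {
            return value.getBytes(StandardCharsets.UTF_8);
        }
    };

    private final String code;

    InputDataTypeSketch(String code) {
        this.code = code;
    }

    abstract byte[] toBytes(String value);

    /** Hypothetical lookup: maps a type code to a constant, defaulting to STRING. */
    static InputDataTypeSketch fromCode(String code) {
        for (InputDataTypeSketch type : values()) {
            if (type.code.equals(code)) {
                return type;
            }
        }
        return STRING;
    }
}
```

With this sketch, InputDataTypeSketch.fromCode("s").toBytes("258") gives the 
two bytes 0x01 0x02, and a missing type code falls back to the UTF-8 String 
encoding.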

Should this be of any interest, I can share the code to some extent, which 
could help to assess whether this approach is viable at large scale.

One thing I noticed is that the type-safe conversion may add some CPU 
overhead, but this can be offset on the I/O side, because in some cases much 
less data needs to be written out.

Example:
 The number 2147483647 fits into a 4-byte int, while its String form takes 10 
bytes in UTF-8.
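
The arithmetic can be checked with a small standalone sketch. The class and 
method names (EncodedSizeDemo, binarySize, textSize) are made up for this 
illustration, and ByteBuffer is used in place of the HBase Bytes class.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Compares the encoded size of an int in binary form and as UTF-8 text. */
final class EncodedSizeDemo {
    /** Size of the fixed-width binary encoding (always 4 bytes for an int). */
    static int binarySize(int value) {
        return ByteBuffer.allocate(Integer.BYTES).putInt(value).array().length;
    }

    /** Size of the decimal text encoding, one byte per character in UTF-8. */
    static int textSize(int value) {
        return Integer.toString(value).getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        int value = Integer.MAX_VALUE; // 2147483647
        // prints "4 bytes as int, 10 bytes as UTF-8 text"
        System.out.println(binarySize(value) + " bytes as int, "
                + textSize(value) + " bytes as UTF-8 text");
    }
}
```

The saving only applies to values with many digits: a value such as 7 takes 1 
byte as text but still 4 bytes as an int, which is why the gain shows up only 
in some cases.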

Best regards, 
 Istvan



> Type support in ImportTSV tool
> ------------------------------
>
>                 Key: HBASE-8593
>                 URL: https://issues.apache.org/jira/browse/HBASE-8593
>             Project: HBase
>          Issue Type: Sub-task
>          Components: mapreduce
>            Reporter: Anoop Sam John
>            Assignee: rajeshbabu
>             Fix For: 0.96.0
>
>         Attachments: HBASE-8593.patch, HBASE-8593_v2.patch, 
> HBASE-8593_v4.patch
>
>
> Currently the ImportTSV tool treats all table columns as type String. It 
> converts the input data into bytes assuming its type is String. Sometimes a 
> user will need a value of another type, say int or float, to be added to 
> the table by this tool.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
