[
https://issues.apache.org/jira/browse/HBASE-8593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13791343#comment-13791343
]
Istvan Vajnorak commented on HBASE-8593:
----------------------------------------
Dear All,
As part of a Hadoop POC, I also ran into this issue and decided to write my
own loader based on the ImportTSV tool.
Besides type safety, my input file splits the row key into several chunks that
I had to merge into one "compound key". I therefore decided to extend the DSL
of the input pattern so that it is aware of both types and key fragments,
similar to this:
$HBASE_HOME/bin/hbase com.msci.appdev.hbase.report.job.ReportImportJob \
  -Dcom.msci.reports.mappingRule=KEY_PART1[i],KEY_PART2[i],KEY_PART3[s],o:t[s],o:p[i],o:r[i],o:c[s] \
  -Dcom.msci.reports.tablename=swarm_of_reports_5_billion \
  -Dcom.msci.reports.inputPath=hdfs://ddc-rm-lapp0001.dev.msci.org:8020/opt/data/import/swarm_of_reports/ca737044-8b13-4fb1-b56f-e0ac66f13230.tsv \
  -Dcom.msci.reports.outputPath=hdfs://ddc-rm-lapp0001.dev.msci.org:8020/opt/data/import/swarm_reports_again \
  -Dcom.msci.reports.performBulkLoad=true
The system takes i, s, l and d type parameters; if it finds no type info for a
column, it treats the column value as a String.
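A minimal sketch of how a mapping rule in this form could be parsed, for
illustration only. The class and field names (MappingRuleParser, ColumnSpec)
are hypothetical and not taken from the loader described here. It also assumes
that entries without a "family:qualifier" colon are key fragments, and that a
missing type suffix is represented by an empty type code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical parser for rules such as "KEY_PART1[i],o:t[s],o:c". */
final class MappingRuleParser {

    /** One parsed entry of the mapping rule. */
    static final class ColumnSpec {
        final String name;      // column "family:qualifier" or key fragment name
        final String typeCode;  // "" when no type was given (treat as String)
        final boolean keyPart;  // assumption: no ':' in the name means key fragment

        ColumnSpec(String name, String typeCode, boolean keyPart) {
            this.name = name;
            this.typeCode = typeCode;
            this.keyPart = keyPart;
        }
    }

    // A name, optionally followed by a single-letter type code in brackets.
    private static final Pattern ENTRY =
            Pattern.compile("([^\\[\\],]+)(?:\\[([a-zA-Z])\\])?");

    static List<ColumnSpec> parse(String rule) {
        List<ColumnSpec> specs = new ArrayList<>();
        for (String raw : rule.split(",")) {
            Matcher m = ENTRY.matcher(raw.trim());
            if (!m.matches()) {
                throw new IllegalArgumentException("Bad mapping entry: " + raw);
            }
            String name = m.group(1);
            String type = m.group(2) == null ? "" : m.group(2);
            specs.add(new ColumnSpec(name, type, !name.contains(":")));
        }
        return specs;
    }
}
```

Calling MappingRuleParser.parse("KEY_PART1[i],o:t[s],o:c") would yield three
entries, the last one with an empty type code so that it falls back to String.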
Type recognition then delegates to the Bytes class for the conversion, for
example:
public enum InputDataType {
    SHORT("s") {
        @Override
        public byte[] toBytes(String value) {
            return Bytes.toBytes(Short.parseShort(value));
        }
    },
    ...
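The truncated enum above might be filled out roughly as follows. This is a
sketch under stated assumptions, not the actual loader code: only SHORT("s")
comes from the snippet above, while the meaning of the other codes (i = int,
l = long, d = double) and the String fallback are guesses. To keep the example
runnable without HBase on the classpath, java.nio.ByteBuffer is used as a
stand-in for the Bytes class. The concat helper illustrates merging key
fragments into one compound key and is likewise hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Sketch only: type codes other than "s" are assumed, not from the original. */
enum InputDataTypeSketch {
    SHORT("s") {
        @Override
        byte[] toBytes(String value) {
            return ByteBuffer.allocate(Short.BYTES)
                    .putShort(Short.parseShort(value)).array();
        }
    },
    INT("i") {
        @Override
        byte[] toBytes(String value) {
            return ByteBuffer.allocate(Integer.BYTES)
                    .putInt(Integer.parseInt(value)).array();
        }
    },
    LONG("l") {
        @Override
        byte[] toBytes(String value) {
            return ByteBuffer.allocate(Long.BYTES)
                    .putLong(Long.parseLong(value)).array();
        }
    },
    DOUBLE("d") {
        @Override
        byte[] toBytes(String value) {
            return ByteBuffer.allocate(Double.BYTES)
                    .putDouble(Double.parseDouble(value)).array();
        }
    },
    STRING("") {
        @Override
        byte[] toBytes(String value) {
            return value.getBytes(StandardCharsets.UTF_8);
        }
    };

    private final String code;

    InputDataTypeSketch(String code) {
        this.code = code;
    }

    abstract byte[] toBytes(String value);

    /** Unknown or missing codes fall back to STRING. */
    static InputDataTypeSketch fromCode(String code) {
        for (InputDataTypeSketch type : values()) {
            if (type.code.equals(code)) {
                return type;
            }
        }
        return STRING;
    }

    /** Concatenates encoded key fragments into one compound row key. */
    static byte[] concat(byte[]... parts) {
        int total = 0;
        for (byte[] part : parts) {
            total += part.length;
        }
        ByteBuffer buffer = ByteBuffer.allocate(total);
        for (byte[] part : parts) {
            buffer.put(part);
        }
        return buffer.array();
    }
}
```

One design point to keep in mind with binary key fragments: row keys are
compared byte by byte, so signed numbers encoded this way will not sort in
numeric order once negative values are involved.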
Should this be of any interest, I can share the code, to some extent, to help
assess whether this approach is viable at large scale.
One thing I noticed is that type-safe conversion adds some CPU overhead, but
this can be recovered on the I/O side, since in some cases much less data
needs to be written out.
Example:
I can encode the number 2147483647 as an int in 4 bytes, while in String form
it takes 10 bytes in UTF-8.
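The arithmetic can be checked with a few lines of plain Java. This is only an
illustration: the class name EncodedSizeDemo is made up, and ByteBuffer stands
in for whatever encoder the loader actually uses.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Compares the encoded size of an int in binary form and as UTF-8 text. */
final class EncodedSizeDemo {

    /** Size of the fixed-width binary encoding of an int. */
    static int binarySize(int value) {
        return ByteBuffer.allocate(Integer.BYTES).putInt(value).array().length;
    }

    /** Size of the decimal text form of the same int, encoded as UTF-8. */
    static int textSize(int value) {
        return Integer.toString(value).getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        int value = Integer.MAX_VALUE; // 2147483647
        System.out.println("binary: " + binarySize(value) + " bytes");
        System.out.println("text:   " + textSize(value) + " bytes");
    }
}
```

The saving depends on the data: a small value such as 7 takes 1 byte as text
but still 4 bytes as an int, which is why the gain applies only in some cases.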
Best regards,
Istvan
> Type support in ImportTSV tool
> ------------------------------
>
> Key: HBASE-8593
> URL: https://issues.apache.org/jira/browse/HBASE-8593
> Project: HBase
> Issue Type: Sub-task
> Components: mapreduce
> Reporter: Anoop Sam John
> Assignee: rajeshbabu
> Fix For: 0.96.0
>
> Attachments: HBASE-8593.patch, HBASE-8593_v2.patch,
> HBASE-8593_v4.patch
>
>
> Currently the ImportTSV tool treats all table columns as type String. It
> converts the input data into bytes assuming its type is String. Sometimes
> users need values of another type, say int/float, to be added to the table
> using this tool.