[ 
https://issues.apache.org/jira/browse/HBASE-13702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568310#comment-14568310
 ] 

Apekshit Sharma commented on HBASE-13702:
-----------------------------------------

Did a few runs on a single-node cluster.
Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz, single core, hyper threading = 4

Dataset 1: 1000 rows, key length = 20, #columns = 100, column length = 1k
size of input csv file: 100M

Dataset 2: 10000 rows, key length = 20, #columns = 100, column length = 1k
size of input csv file: 1G

*Non Bulk Mode:*

Dataset 1
Dry mode: <1 sec
Non-dry mode: ~4 sec

Dataset 2
Dry mode: ~10 sec
Non-dry mode: ~24 sec
num automatic splits: 8

Verified row count after each run.

*Bulk Mode:*

Dataset 2
10000 rows, key length = 20, #columns = 100, column length = 1k
size of input csv file: 1G

Dry mode: ~40 sec (table did not exist at start; verified that neither the 
table nor the output dir existed after the run)
Non-dry mode: ~60 sec (table did not exist at start; verified that both the 
table and the output dir existed after the run)
num automatic splits: 8

Since the runs take on the order of seconds to minutes, I think we can and 
should test all functionality in dry-run.

> ImportTsv: Add dry-run functionality and log bad rows
> -----------------------------------------------------
>
>                 Key: HBASE-13702
>                 URL: https://issues.apache.org/jira/browse/HBASE-13702
>             Project: HBase
>          Issue Type: New Feature
>            Reporter: Apekshit Sharma
>            Assignee: Apekshit Sharma
>         Attachments: HBASE-13702.patch
>
>
> ImportTSV job skips bad records by default (keeps a count though). 
> -Dimporttsv.skip.bad.lines=false can be used to fail if a bad row is 
> encountered. 
> Being able to easily determine which rows in an input are corrupted, rather 
> than failing on one row at a time, seems like a good feature to have.
> Moreover, there should be 'dry-run' functionality in such kinds of tools, 
> which essentially does a quick run of the tool without making any changes 
> while reporting any errors/warnings and success/failure.
> To identify corrupted rows, simply logging them should be enough. In the worst 
> case, all rows will be logged and the size of the logs will be the same as the 
> input size, which seems fine. However, the user might have to do some work 
> figuring out where the logs are. Is there some link we can show to the user 
> when the tool starts which can help them with that?
> For the dry run, we can simply use if-else to skip over writing out KVs, and 
> any other mutations, if present.
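
The if-else approach described above could be sketched roughly as follows. This 
is a standalone illustration only, not the code in HBASE-13702.patch: the class 
DryRunImportSketch, its run() method, and the Result holder are hypothetical 
names, and an in-memory list stands in for writing KVs to a table. Bad rows are 
logged and counted rather than causing a failure, and the write is skipped when 
dryRun is true.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; none of these names come from ImportTsv itself.
class DryRunImportSketch {

    /** Outcome of one pass over the input. */
    static final class Result {
        final List<String> written = new ArrayList<>(); // rows that reached the stand-in "table"
        final List<String> badRows = new ArrayList<>(); // rows that failed parsing
        int goodRows = 0;                               // rows that parsed cleanly
    }

    /**
     * Parses tab-separated lines. A line is treated as bad if it does not have
     * exactly expectedColumns fields or if its first field (the row key) is empty.
     */
    static Result run(List<String> lines, int expectedColumns, boolean dryRun) {
        Result result = new Result();
        for (String line : lines) {
            String[] fields = line.split("\t", -1); // -1 keeps trailing empty fields
            if (fields.length != expectedColumns || fields[0].isEmpty()) {
                // Log every bad row instead of failing on the first one.
                System.err.println("Bad line: " + line);
                result.badRows.add(line);
                continue;
            }
            result.goodRows++;
            if (!dryRun) {
                // Stand-in for emitting KVs; skipped entirely in dry-run mode.
                result.written.add(line);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("row1\ta\tb", "row2\tc", "\td\te", "row3\tf\tg");
        for (boolean dryRun : new boolean[] {true, false}) {
            Result r = run(input, 3, dryRun);
            System.out.println("dryRun=" + dryRun + " good=" + r.goodRows
                + " bad=" + r.badRows.size() + " written=" + r.written.size());
        }
    }
}
```

Both modes go through the same parsing and validation path, so a dry run 
reports the same bad rows a real run would; only the write step differs.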



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
