[jira] [Commented] (HBASE-13702) ImportTsv: Add dry-run functionality and log bad rows

Apekshit Sharma (JIRA) Thu, 28 May 2015 09:07:19 -0700

    [ 
https://issues.apache.org/jira/browse/HBASE-13702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14563169#comment-14563169
 ]


Apekshit Sharma commented on HBASE-13702:
-----------------------------------------

The way I see dry-run functionality for ad hoc tools like this one is: "check 
that the tool will run successfully on the given data without making any 
(permanent) change to the system". So ideally, users should get all 
errors/warnings in the dry run, and the actual run should go like butter, 
instead of getting stuck in a half-committed state where some things went 
through and others didn't (unless that's acceptable).
On the practical side, I am with you if it makes sense to remove some trivial 
logic when that shaves off a huge amount of run time. I don't have practical 
experience with the runtimes of this tool, but I would guess that any 
processing in the mapper shouldn't take much time compared to the final stage 
of writing Put mutations to the table (non-bulk mode) or HFiles to disk (bulk 
mode), which dry-run already skips. If my assumptions are wrong, please let me 
know.
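
To make the idea concrete, here is a minimal sketch in plain Java, not the 
actual ImportTsv mapper. The class DryRunSketch, the run method, and the 
Consumer standing in for the table/HFile writer are all hypothetical names 
chosen for illustration. The point it tries to show is that parsing and 
validation run identically in both modes, and only the final write is guarded 
by the dry-run flag, so a dry run surfaces the same bad lines a real run would.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

public class DryRunSketch {

    /** Outcome of one pass over the input. */
    public static final class Result {
        public final List<String> badLines = new ArrayList<>();
        public int written = 0;
    }

    /**
     * Parses every line regardless of mode. Only the write step is skipped
     * when dryRun is true. The writer is a hypothetical stand-in for the
     * stage that would emit Puts or HFiles.
     */
    public static Result run(List<String> lines, int expectedColumns,
                             boolean dryRun, Consumer<String[]> writer) {
        Result result = new Result();
        for (String line : lines) {
            // A limit of -1 keeps trailing empty fields in the result.
            String[] fields = line.split("\t", -1);
            if (fields.length != expectedColumns) {
                result.badLines.add(line);
                continue;
            }
            if (!dryRun) {
                writer.accept(fields);
                result.written++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> input =
            Arrays.asList("row1\ta\tb", "row2\tonly-one", "row3\tc\td");
        List<String[]> table = new ArrayList<>();

        Result dry = run(input, 3, true, table::add);
        System.out.println("dry run:  bad=" + dry.badLines.size()
            + " written=" + dry.written);

        Result real = run(input, 3, false, table::add);
        System.out.println("real run: bad=" + real.badLines.size()
            + " written=" + real.written);
    }
}
```

In this sketch the dry run reports the one malformed line and leaves the 
in-memory "table" empty, while the real run reports the same line and writes 
the other two rows.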

> ImportTsv: Add dry-run functionality and log bad rows
> -----------------------------------------------------
>
>                 Key: HBASE-13702
>                 URL: https://issues.apache.org/jira/browse/HBASE-13702
>             Project: HBase
>          Issue Type: New Feature
>            Reporter: Apekshit Sharma
>            Assignee: Apekshit Sharma
>         Attachments: HBASE-13702.patch
>
>
> ImportTSV job skips bad records by default (keeps a count though). 
> -Dimporttsv.skip.bad.lines=false can be used to fail if a bad row is 
> encountered. 
> Being able to easily determine which rows in an input are corrupted, rather 
> than failing on one row at a time, seems like a good feature to have.
> Moreover, there should be 'dry-run' functionality in such kinds of tools, 
> which essentially does a quick run of the tool without making any changes 
> while reporting any errors/warnings and success/failure.
> To identify corrupted rows, simply logging them should be enough. In the 
> worst case, all rows will be logged and the size of the logs will be the same 
> as the input size, which seems fine. However, the user might have to do some 
> work figuring out where the logs are. Is there some link we can show to the 
> user when the tool starts that can help them with that?
> For the dry run, we can simply use if-else to skip over writing out KVs, and 
> any other mutations, if present.
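
The bad-row handling described in the issue could look roughly like the 
following. This is a plain-Java sketch under stated assumptions, not ImportTsv 
code: BadLineSketch, validate, and BadLineException are hypothetical names, and 
the skipBadLines parameter only mimics the behavior the issue attributes to 
-Dimporttsv.skip.bad.lines (skip and count when true, fail on the first bad row 
when false). A "bad" line here simply means a wrong column count.

```java
import java.util.Arrays;
import java.util.List;

public class BadLineSketch {

    /** Thrown when a bad line is hit and skipping is disabled. */
    public static final class BadLineException extends RuntimeException {
        public BadLineException(String message) {
            super(message);
        }
    }

    /**
     * Returns the number of bad lines. With skipBadLines set, every bad line
     * is logged to stderr with its 1-based line number and processing
     * continues. Otherwise the first bad line aborts the run.
     */
    public static int validate(List<String> lines, int expectedColumns,
                               boolean skipBadLines) {
        int badCount = 0;
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i);
            int columns = line.split("\t", -1).length;
            if (columns == expectedColumns) {
                continue;
            }
            String message = "Bad line " + (i + 1) + " (" + columns
                + " columns, expected " + expectedColumns + "): " + line;
            if (!skipBadLines) {
                throw new BadLineException(message);
            }
            System.err.println(message);
            badCount++;
        }
        return badCount;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
            "row1\ta\tb", "row2\tonly-one", "row3", "row4\tc\td");
        System.out.println("bad lines: " + validate(input, 3, true));
    }
}
```

Logging every bad row in one pass is what lets the user fix the whole input at 
once instead of rerunning the job after each failure.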



