Drill can point out the filename and location of corrupted records in a
file, but we don't have a good mechanism to deal with the following
scenario:

Consider a text file with 2 records:
$ cat t4.csv
10,2001
11,http://www.cnn.com

0: jdbc:drill:zk=local> alter session set `exec.errors.verbose` = true;

0: jdbc:drill:zk=local> select cast(columns[0] as int), cast(columns[1] as bigint)
from dfs.`/Users/asinha/data/t4.csv`;

Error: SYSTEM ERROR: NumberFormatException: http://www.cnn.com

Fragment 0:0

[Error Id: 72aad22c-a345-4100-9a57-dcd8436105f7 on 10.250.56.140:31010]

  (java.lang.NumberFormatException) http://www.cnn.com
    org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.nfeL():91
    org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.varCharToLong():62
    org.apache.drill.exec.test.generated.ProjectorGen1.doEval():62
    org.apache.drill.exec.test.generated.ProjectorGen1.projectRecords():62
    org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.doWork():172

The problem is that the user has no clue where this error originated: the
message names neither the file nor the record nor the column. This is a
pain point, especially when dealing with thousands of files.

1.  We can start by providing the column index where the problem occurred.
2.  Can a scan batch keep track of the file it originated from? Since the
    Project in the above query is pushed right above the scan, it could get
    the filename from the record batch (assuming we can store this piece of
    information). This won't be possible for other Projects elsewhere in
    the plan.
3.  What about the location within the file? Unless the projection is
    pushed into the scan itself, I don't see a good way to provide this
    information.
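
To make the three points above concrete, here is a rough sketch in plain
Java of what an error carrying the file, record number and column index
could look like. This is not Drill code: the class RecordCastException and
the helper castToLong are hypothetical names, and the sketch simply assumes
the caller (e.g. a projection sitting right above the scan) already knows
the file name and record position.

```java
public class CastErrorContext {

    /** Hypothetical exception that records where the bad value came from. */
    public static class RecordCastException extends RuntimeException {
        public final String file;
        public final long recordNumber;
        public final int columnIndex;

        public RecordCastException(String file, long recordNumber, int columnIndex,
                                   String value, Throwable cause) {
            super(String.format(
                "Cannot cast '%s' to BIGINT (file=%s, record=%d, column=%d)",
                value, file, recordNumber, columnIndex), cause);
            this.file = file;
            this.recordNumber = recordNumber;
            this.columnIndex = columnIndex;
        }
    }

    /**
     * Hypothetical cast helper: parses the value as a long and, on failure,
     * rethrows with the source location attached.
     */
    public static long castToLong(String file, long recordNumber, int columnIndex,
                                  String value) {
        try {
            return Long.parseLong(value.trim());
        } catch (NumberFormatException e) {
            throw new RecordCastException(file, recordNumber, columnIndex, value, e);
        }
    }

    public static void main(String[] args) {
        // The two records from t4.csv above, already split into columns.
        String file = "/Users/asinha/data/t4.csv";
        String[][] records = {
            {"10", "2001"},
            {"11", "http://www.cnn.com"}
        };

        long recordNumber = 0;
        try {
            for (String[] columns : records) {
                recordNumber++;  // 1-based position of the record in the file
                castToLong(file, recordNumber, 1, columns[1]);
            }
        } catch (RecordCastException e) {
            // The message now names the file, record and column of the bad value.
            System.out.println(e.getMessage());
        }
    }
}
```

The sketch sidesteps the hard part raised in points 2 and 3: it only works
where the file name and record position are still available, which is
exactly what a Project further up the plan would not have.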

A related topic is how to tell Drill to ignore such records when running a
query or a CTAS. That could be a separate discussion.

Thoughts?
Aman
