Drill can point out the filename and location of corrupted records in a file, but we don't have a good mechanism to deal with the following scenario:
Consider a text file with 2 records:

$ cat t4.csv
10,2001
11,http://www.cnn.com

0: jdbc:drill:zk=local> alter session set `exec.errors.verbose` = true;
0: jdbc:drill:zk=local> select cast(columns[0] as int), cast(columns[1] as bigint) from dfs.`/Users/asinha/data/t4.csv`;

Error: SYSTEM ERROR: NumberFormatException: http://www.cnn.com

Fragment 0:0

[Error Id: 72aad22c-a345-4100-9a57-dcd8436105f7 on 10.250.56.140:31010]

  (java.lang.NumberFormatException) http://www.cnn.com
    org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.nfeL():91
    org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.varCharToLong():62
    org.apache.drill.exec.test.generated.ProjectorGen1.doEval():62
    org.apache.drill.exec.test.generated.ProjectorGen1.projectRecords():62
    org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.doWork():172

The problem is that the user has no clue about the original source of this error: the message names neither the file nor the column that held the bad value. This is a pain point, especially when dealing with thousands of files.

1. We can start by providing the column index where the problem occurred.

2. Can a scan batch keep track of the file it originated from? Since the Project in the above query is pushed right above the scan, it could get the filename from the record batch (assuming we can store this piece of information). This won't be possible for other Projects elsewhere in the plan.

3. What about the location within the file? Unless the projection is pushed into the scan itself, I don't see a good way to provide this information.

A related topic is how to tell Drill to ignore such records when doing a query or a CTAS. That could be a separate discussion.

Thoughts?

Aman
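
P.S. To make points 1 to 3 concrete, here is a minimal sketch in Python of the kind of error context being asked for. It is not Drill code and does not reflect how Drill's scan or Project operators are implemented. The names scan, project, and CastError are hypothetical, and it assumes the reader can tag every record with its source file and record number, which is exactly the open question above.

```python
import csv
import io


class CastError(ValueError):
    """Conversion failure annotated with where the bad value came from."""

    def __init__(self, value, target, filename, record_no, column):
        super().__init__(
            f"cannot cast {value!r} to {target}: "
            f"file={filename}, record={record_no}, column={column}"
        )
        self.filename = filename
        self.record_no = record_no
        self.column = column


def scan(filename, text):
    """Hypothetical scan: tag each record with its file and 1-based record number."""
    for record_no, columns in enumerate(csv.reader(io.StringIO(text)), start=1):
        yield filename, record_no, columns


def project(rows, casts):
    """Hypothetical project: apply one cast per column, reporting the source on failure."""
    for filename, record_no, columns in rows:
        out = []
        for column, (value, cast) in enumerate(zip(columns, casts)):
            try:
                out.append(cast(value))
            except ValueError as exc:
                # Re-raise with the file, record, and 0-based column index attached.
                raise CastError(value, cast.__name__, filename, record_no, column) from exc
        yield out


# Same two records as t4.csv above.
data = "10,2001\n11,http://www.cnn.com\n"
try:
    list(project(scan("t4.csv", data), [int, int]))
except CastError as err:
    print(err)
```

The sketch only works because the projection sits directly on top of the scan output and the source tags travel with each record. Once records pass through other operators, that association would have to be carried along explicitly, which is the difficulty raised in points 2 and 3.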
