[ https://issues.apache.org/jira/browse/DRILL-3764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14741912#comment-14741912 ]
Jacques Nadeau commented on DRILL-3764:
---------------------------------------
Let's make sure to put together a design for this before we start to implement.
I want us to be clear about the difference between function errors and
problems with reader conversions (something that doesn't really exist today).
Function exceptions can occur anywhere in the pipeline, possibly on a different
machine from the one where the data was read. In general, the concept of line
number/record number or filename is something that should really stay within
the boundary of the scan operation.
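
To make the distinction concrete, here is a minimal sketch of what a
reader-side conversion could look like if the file name and record number
never leave the scan. This is not Drill code: the class, its methods, and the
skipBadRecords flag (standing in for a session option) are all hypothetical
names invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not Drill code: every name here is invented for illustration.
class ScanConversionSketch {

    // A record the reader could not convert, with context only the scan knows.
    static final class Rejected {
        final String fileName;
        final long recordNumber; // 1-based position within the file
        final String rawRecord;
        final String reason;

        Rejected(String fileName, long recordNumber, String rawRecord, String reason) {
            this.fileName = fileName;
            this.recordNumber = recordNumber;
            this.rawRecord = rawRecord;
            this.reason = reason;
        }
    }

    // Converted values plus the records that were set aside.
    static final class Result {
        final List<Long> values = new ArrayList<>();
        final List<Rejected> rejected = new ArrayList<>();
    }

    // Converts the second comma-separated field of each line to a long.
    // skipBadRecords is an assumed option: when true, bad records are collected
    // instead of aborting; when false, the first failure aborts with context.
    static Result convert(String fileName, List<String> lines, boolean skipBadRecords) {
        Result result = new Result();
        long recordNumber = 0;
        for (String line : lines) {
            recordNumber++;
            String[] columns = line.split(",", -1);
            try {
                if (columns.length < 2) {
                    throw new NumberFormatException("missing column 1");
                }
                result.values.add(Long.parseLong(columns[1].trim()));
            } catch (NumberFormatException e) {
                if (!skipBadRecords) {
                    throw new IllegalStateException("Conversion failed in " + fileName
                            + " at record " + recordNumber + ": " + line, e);
                }
                result.rejected.add(new Rejected(fileName, recordNumber, line, e.getMessage()));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("10,2001", "11,http://www.cnn.com");
        Result r = convert("t4.csv", lines, true);
        System.out.println("converted: " + r.values);
        for (Rejected bad : r.rejected) {
            System.out.println("skipped " + bad.fileName + " record "
                    + bad.recordNumber + ": " + bad.rawRecord);
        }
    }
}
```

The rejected list is where a staging table could be fed from. A function that
fails later in the pipeline has no access to this context unless the scan
carries it forward with the data, which is the design question to settle first.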
> Support the ability to identify and/or skip records when a function
> evaluation fails
> ------------------------------------------------------------------------------------
>
> Key: DRILL-3764
> URL: https://issues.apache.org/jira/browse/DRILL-3764
> Project: Apache Drill
> Issue Type: Improvement
> Components: Functions - Drill
> Affects Versions: 1.1.0
> Reporter: Aman Sinha
> Assignee: Mehant Baid
>
> Drill can point out the filename and location of corrupted records in a file
> but it does not have a good mechanism to deal with the following scenario:
> Consider a text file with 2 records:
> {code}
> $ cat t4.csv
> 10,2001
> 11,http://www.cnn.com
> {code}
> {code}
> 0: jdbc:drill:zk=local> alter session set `exec.errors.verbose` = true;
> 0: jdbc:drill:zk=local> select cast(columns[0] as int), cast(columns[1] as bigint) from dfs.`t4.csv`;
> Error: SYSTEM ERROR: NumberFormatException: http://www.cnn.com
> Fragment 0:0
> [Error Id: 72aad22c-a345-4100-9a57-dcd8436105f7 on 10.250.56.140:31010]
> (java.lang.NumberFormatException) http://www.cnn.com
> org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.nfeL():91
> org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.varCharToLong():62
> org.apache.drill.exec.test.generated.ProjectorGen1.doEval():62
> org.apache.drill.exec.test.generated.ProjectorGen1.projectRecords():62
> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.doWork():172
> {code}
> The problem is that the user has no context for where the error occurred:
> neither the file name nor the record number. This becomes a pain point
> especially when CTAS is used to convert data from (say) text
> format to Parquet format. The CTAS may access thousands of files, and a
> single failure in a cast (or another function) aborts the query.
> It would substantially improve the user experience if we provided:
> 1) the filename and record number where this failure occurred
> 2) the ability to skip such records depending on a session option
> 3) the ability to write such records to a staging table for future ingestion
> Please see discussion on dev list:
> http://mail-archives.apache.org/mod_mbox/drill-dev/201509.mbox/%3cCAFyDVvLuPLgTNZ56S6=J=9Vb=aBs=pdw7nrhkkdupbdxgfa...@mail.gmail.com%3e