[
https://issues.apache.org/jira/browse/DRILL-3764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14949575#comment-14949575
]
Sean Hsuan-Yi Chu commented on DRILL-3764:
------------------------------------------
This option might not be viable if the data type is non-nullable.
Further, we cannot simply cast it to a nullable data type, since batches prior
to the current one might already have been sent to the downstream operator, and
changing the type to nullable would cause SchemaChange issues.
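
As a rough illustration of the concern above, here is a toy Python model. It assumes "the option" means emitting a NULL where an evaluation fails, which this message does not spell out. `ColumnType`, `Downstream`, and `SchemaChangeError` are hypothetical names for this sketch and are not Drill classes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnType:
    """Hypothetical stand-in for a column's type and nullability."""
    name: str
    nullable: bool


class SchemaChangeError(Exception):
    """Raised by the toy operator when a later batch has a different schema."""


class Downstream:
    """Toy downstream operator that fixes its schema on the first batch."""

    def __init__(self):
        self.schema = None
        self.batches = []

    def receive(self, schema, values):
        if self.schema is None:
            self.schema = schema
        elif schema != self.schema:
            raise SchemaChangeError(f"{self.schema} -> {schema}")
        self.batches.append(values)


downstream = Downstream()
required_bigint = ColumnType("BIGINT", nullable=False)
nullable_bigint = ColumnType("BIGINT", nullable=True)

# The first batch has already been sent with a non-nullable type.
downstream.receive(required_bigint, [2001])

# A later batch that needs a NULL would have to switch to a nullable type.
try:
    downstream.receive(nullable_bigint, [None])
except SchemaChangeError as exc:
    print("schema change:", exc)
```

In this model the first batch is already downstream by the time the failing record is seen, so the type cannot be widened retroactively.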
> Support the ability to identify and/or skip records when a function
> evaluation fails
> ------------------------------------------------------------------------------------
>
> Key: DRILL-3764
> URL: https://issues.apache.org/jira/browse/DRILL-3764
> Project: Apache Drill
> Issue Type: Improvement
> Components: Functions - Drill
> Affects Versions: 1.1.0
> Reporter: Aman Sinha
> Fix For: Future
>
>
> Drill can point out the filename and location of corrupted records in a file
> but it does not have a good mechanism to deal with the following scenario:
> Consider a text file with 2 records:
> {code}
> $ cat t4.csv
> 10,2001
> 11,http://www.cnn.com
> {code}
> {code}
> 0: jdbc:drill:zk=local> alter session set `exec.errors.verbose` = true;
> 0: jdbc:drill:zk=local> select cast(columns[0] as int), cast(columns[1] as bigint) from dfs.`t4.csv`;
> Error: SYSTEM ERROR: NumberFormatException: http://www.cnn.com
> Fragment 0:0
> [Error Id: 72aad22c-a345-4100-9a57-dcd8436105f7 on 10.250.56.140:31010]
> (java.lang.NumberFormatException) http://www.cnn.com
> org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.nfeL():91
> org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.varCharToLong():62
> org.apache.drill.exec.test.generated.ProjectorGen1.doEval():62
> org.apache.drill.exec.test.generated.ProjectorGen1.projectRecords():62
> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.doWork():172
> {code}
> The problem is that the user has no context for where the error occurred:
> neither the file name nor the record number. This becomes a pain point
> especially when CTAS is used to convert data from (say) text
> format to Parquet format. The CTAS may access thousands of files, and a
> single failure in a cast (or another function) aborts the query.
> It would substantially improve the user experience if we provided:
> 1) the filename and record number where this failure occurred
> 2) the ability to skip such records depending on a session option
> 3) the ability to write such records to a staging table for future ingestion
> Please see discussion on dev list:
> http://mail-archives.apache.org/mod_mbox/drill-dev/201509.mbox/%3cCAFyDVvLuPLgTNZ56S6=J=9Vb=aBs=pdw7nrhkkdupbdxgfa...@mail.gmail.com%3e
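
The three requested behaviours in the issue description can be sketched in plain Python as follows. This is only a minimal illustration using the `t4.csv` example from the issue; `cast_records`, `skip_bad_records`, and `CastResult` are hypothetical names and do not correspond to any Drill option or API.

```python
from dataclasses import dataclass, field


@dataclass
class CastResult:
    rows: list = field(default_factory=list)      # successfully converted records
    rejected: list = field(default_factory=list)  # (filename, record number, raw line, error)


def cast_records(filename, lines, skip_bad_records=False):
    """Cast each two-column CSV line to (int, int), tracking file and record number."""
    result = CastResult()
    for record_number, line in enumerate(lines, start=1):
        raw = line.strip()
        fields = raw.split(",")
        try:
            result.rows.append((int(fields[0]), int(fields[1])))
        except (ValueError, IndexError) as exc:
            if not skip_bad_records:
                # Item 1: report the file name and record number of the failure.
                raise ValueError(
                    f"{filename}, record {record_number}: cannot cast {raw!r}"
                ) from exc
            # Items 2 and 3: skip the record and keep it aside for later ingestion.
            result.rejected.append((filename, record_number, raw, str(exc)))
    return result


result = cast_records(
    "t4.csv", ["10,2001", "11,http://www.cnn.com"], skip_bad_records=True
)
print(result.rows)
print([(name, number, raw) for name, number, raw, _ in result.rejected])
```

With `skip_bad_records` left at its default, the same input raises an error that names the file and record number instead of aborting without context.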