@Jacques,

On your point a) about expressing failures and the compilation model: I had previously thought about using the interpreter to figure out which expression failed against the current row, once we have caught an exception from somewhere in the complete code-generated expression evaluation. Do you think this would address your concern? Do you think the functions would produce anything more than the problematic input data and the failing expression in this new standardized error format?
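A minimal sketch of that fallback idea, in plain Java rather than Drill's actual generated code or interpreter. `Expr`, `Failure`, `evalCompiled`, and `locateFailure` are hypothetical names used only for illustration: the fast path evaluates everything with no per-expression bookkeeping, and only after it throws is the row re-evaluated one expression at a time to find the culprit.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class FailingExpressionFinder {

    /** One projected expression: its SQL text plus a row evaluator (hypothetical type). */
    public static final class Expr {
        final String text;
        final Function<String[], Object> fn;

        public Expr(String text, Function<String[], Object> fn) {
            this.text = text;
            this.fn = fn;
        }
    }

    /** What the slow path reports: the failing expression, the input row, and the cause. */
    public static final class Failure {
        public final String expression;
        public final String[] row;
        public final RuntimeException cause;

        Failure(String expression, String[] row, RuntimeException cause) {
            this.expression = expression;
            this.row = row;
            this.cause = cause;
        }
    }

    /** Stand-in for the fused, code-generated fast path: no per-expression tracking. */
    public static Object[] evalCompiled(List<Expr> exprs, String[] row) {
        Object[] out = new Object[exprs.size()];
        for (int i = 0; i < out.length; i++) {
            out[i] = exprs.get(i).fn.apply(row);
        }
        return out;
    }

    /** Slow path, run only after the fast path threw: evaluate one expression at a time. */
    public static Failure locateFailure(List<Expr> exprs, String[] row) {
        for (Expr e : exprs) {
            try {
                e.fn.apply(row);
            } catch (RuntimeException ex) {
                return new Failure(e.text, row, ex);
            }
        }
        return null; // nothing failed on re-evaluation
    }

    public static void main(String[] args) {
        List<Expr> exprs = Arrays.asList(
            new Expr("cast(columns[0] as int)", r -> Integer.parseInt(r[0])),
            new Expr("cast(columns[1] as bigint)", r -> Long.parseLong(r[1])));
        String[] row = {"11", "http://www.cnn.com"};
        try {
            evalCompiled(exprs, row);
        } catch (RuntimeException ex) {
            Failure f = locateFailure(exprs, row);
            System.out.println("failed expression: " + f.expression
                + ", input row: " + Arrays.toString(f.row));
        }
    }
}
```

The extra cost is paid only on the failing row, so the compiled path stays free of error-tracking overhead. This assumes the expressions are side-effect free, so re-evaluating them is safe.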
- Jason

On Wed, Sep 2, 2015 at 8:43 PM, Jacques Nadeau <[email protected]> wrote:

> I'd like to propose a few things to solve this:
>
> a) Functions should be able to express failures in a standardized way. I'm
> thinking a new type of injectable and/or a certain type of exception
> (although more dangerous/possibly requires rewrite given compilation
> model).
> b) Users (session/system level) should be able to set a setting where
> function errors are handled a certain way. Options could include query
> failure, ignore + inform as warning/notice, and save records for later
> analysis (maybe in v2).
> c) Readers that have a notorious problem (e.g. Text) should support
> projection/expression pushdown so that they can create these kinds of
> errors and provide additional context as part of that.
> d) We should also implement dot drill files so that users can prescribe
> this projection/data validation process by default for files/directories
> (which would provide the same behavior as c above).
> e) We should get more serious about providing useful virtual fields. This
> should include filename (similar to directory name).
>
> Once a record leaves an operator, I don't think we should carry any
> additional provenance with it. It would be too heavy weight as a default
> behavior.
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Tue, Sep 1, 2015 at 9:08 AM, Aman Sinha <[email protected]> wrote:
>
> > Drill can point out the filename and location of corrupted records in a
> > file but we don't have a good mechanism to deal with the following
> > scenario:
> >
> > Consider a text file with 2 records:
> > $ cat t4.csv
> > 10,2001
> > 11,http://www.cnn.com
> >
> > 0: jdbc:drill:zk=local> alter session set `exec.errors.verbose` = true;
> >
> > 0: jdbc:drill:zk=local> select cast(columns[0] as int), cast(columns[1] as
> > bigint) from dfs.`/Users/asinha/data/t4.csv`;
> >
> > Error: SYSTEM ERROR: NumberFormatException: http://www.cnn.com
> >
> > Fragment 0:0
> >
> > [Error Id: 72aad22c-a345-4100-9a57-dcd8436105f7 on 10.250.56.140:31010]
> >
> > (java.lang.NumberFormatException) http://www.cnn.com
> > org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.nfeL():91
> > org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.varCharToLong():62
> > org.apache.drill.exec.test.generated.ProjectorGen1.doEval():62
> > org.apache.drill.exec.test.generated.ProjectorGen1.projectRecords():62
> > org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.doWork():172
> >
> > The problem is that the user does not have a clue about the original
> > source of this error. This is a pain point especially when dealing with
> > thousands of files.
> >
> > 1. We can start by providing the column index where the problem occurred.
> > 2. Can a scan batch keep track of the file it originated from? Since the
> >    Project in the above query is pushed right above the scan, it could
> >    get the filename from the record batch (assuming we can store this
> >    piece of information). This won't be possible for other Projects
> >    elsewhere in the plan.
> > 3. What about the location within the file? Unless the projection is
> >    pushed into the scan itself, I don't see a good way to provide this
> >    information.
> >
> > A related topic is how to tell Drill to ignore such records when doing a
> > query or a CTAS?
> > That could be a separate discussion.
> >
> > Thoughts?
> > Aman
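
To make the quoted proposals concrete, here is a rough sketch that combines Aman's points 1 to 3 (report the column index, file name, and position in the file) with Jacques's option b) (a setting that decides whether a bad value fails the query or is skipped with a warning). It is plain Java, not Drill code: `OnError`, `Result`, and `castAll` are hypothetical names, and it assumes the cast runs inside the reader, where the file name and line number are still known.

```java
import java.util.ArrayList;
import java.util.List;

public class CastErrorPolicyDemo {

    /** Hypothetical session-level setting: fail the query, or skip the row and warn. */
    public enum OnError { FAIL, SKIP_AND_WARN }

    /** Rows that cast cleanly, plus one warning per skipped row. */
    public static final class Result {
        public final List<long[]> rows = new ArrayList<>();
        public final List<String> warnings = new ArrayList<>();
    }

    /** Casts every comma-separated column to a long, applying the chosen error policy. */
    public static Result castAll(String fileName, List<String> lines, OnError policy) {
        Result out = new Result();
        for (int lineNo = 0; lineNo < lines.size(); lineNo++) {
            String[] cols = lines.get(lineNo).split(",", -1);
            long[] row = new long[cols.length];
            boolean ok = true;
            for (int c = 0; c < cols.length && ok; c++) {
                try {
                    row[c] = Long.parseLong(cols[c].trim());
                } catch (NumberFormatException e) {
                    // Context that is only available at the reader: file, line, column index.
                    String msg = String.format(
                        "%s line %d, columns[%d]: cannot cast '%s' to bigint",
                        fileName, lineNo + 1, c, cols[c]);
                    if (policy == OnError.FAIL) {
                        throw new IllegalArgumentException(msg, e);
                    }
                    out.warnings.add(msg);
                    ok = false; // drop the whole row, keep going
                }
            }
            if (ok) {
                out.rows.add(row);
            }
        }
        return out;
    }
}
```

With the two records from `t4.csv`, `SKIP_AND_WARN` keeps the first row and records a warning that names line 2 and `columns[1]`, while `FAIL` raises an exception carrying the same message.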
