Interesting idea. My question is how this would work when the generated code combines code related to expressions with code that is not related to expressions.

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Thu, Sep 3, 2015 at 11:31 AM, Jason Altekruse <[email protected]> wrote:

> @Jacques,
>
> On your point a) about expressing failures and the compilation model, I had
> thought about previously using the interpreter to figure out which
> expression against the current row failed, once we have caught an exception
> out of some part of the complete code-generated expression evaluation. Do
> you think this would possibly address your concern? Do you think anything
> more than the problematic input data and the expression that failed would
> be produced by the functions in this new standardized error format?
>
> - Jason
>
> On Wed, Sep 2, 2015 at 8:43 PM, Jacques Nadeau <[email protected]> wrote:
>
> > I'd like to propose a few things to solve this:
> >
> > a) Functions should be able to express failures in a standardized way.
> > I'm thinking a new type of injectable and/or a certain type of exception
> > (although more dangerous/possibly requires rewrite given compilation
> > model).
> > b) Users (session/system level) should be able to set a setting where
> > function errors are handled a certain way. Options could include query
> > failure, ignore + inform as warning/notice, and save records for later
> > analysis (maybe in v2).
> > c) Readers that have a notorious problem (e.g. Text) should support
> > projection/expression pushdown so that they can create these kinds of
> > errors and provide additional context as part of that.
> > d) We should also implement dot drill files so that users can prescribe
> > this projection/data validation process by default for files/directories
> > (which would provide the behavior as c above).
> > e) We should get more serious about providing useful virtual fields.
> > This should include filename (similar to directory name).
> >
> > Once a record leaves an operator, I don't think we should carry any
> > additional provenance with it. It would be too heavy weight as a default
> > behavior.
> >
> > --
> > Jacques Nadeau
> > CTO and Co-Founder, Dremio
> >
> > On Tue, Sep 1, 2015 at 9:08 AM, Aman Sinha <[email protected]> wrote:
> >
> > > Drill can point out the filename and location of corrupted records in a
> > > file but we don't have a good mechanism to deal with the following
> > > scenario:
> > >
> > > Consider a text file with 2 records:
> > > $ cat t4.csv
> > > 10,2001
> > > 11,http://www.cnn.com
> > >
> > > 0: jdbc:drill:zk=local> alter session set `exec.errors.verbose` = true;
> > >
> > > 0: jdbc:drill:zk=local> select cast(columns[0] as int), cast(columns[1] as
> > > bigint) from dfs.`/Users/asinha/data/t4.csv`;
> > >
> > > Error: SYSTEM ERROR: NumberFormatException: http://www.cnn.com
> > >
> > > Fragment 0:0
> > >
> > > [Error Id: 72aad22c-a345-4100-9a57-dcd8436105f7 on 10.250.56.140:31010]
> > >
> > >   (java.lang.NumberFormatException) http://www.cnn.com
> > >     org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.nfeL():91
> > >     org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.varCharToLong():62
> > >     org.apache.drill.exec.test.generated.ProjectorGen1.doEval():62
> > >     org.apache.drill.exec.test.generated.ProjectorGen1.projectRecords():62
> > >     org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.doWork():172
> > >
> > > The problem is that the user does not have a clue about the original
> > > source of this error. This is a pain point especially when dealing with
> > > thousands of files.
> > >
> > > 1. We can start by providing the column index where the problem occurred.
> > > 2. Can a scan batch keep track of the file it originated from? Since the
> > >    Project in the above query is pushed right above the scan, it could
> > >    get the filename from the record batch (assuming we can store this
> > >    piece of information). This won't be possible for other Projects
> > >    elsewhere in the plan.
> > > 3. What about the location within the file? Unless the projection is
> > >    pushed into the scan itself, I don't see a good way to provide this
> > >    information.
> > >
> > > A related topic is how to tell Drill to ignore such records when doing a
> > > query or a CTAS?
> > > That could be a separate discussion.
> > >
> > > Thoughts?
> > > Aman
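
To make the proposals in the thread concrete, here is a rough Java sketch of how a standardized cast failure (point a) and a session-level handling option (point b) might fit together, using the t4.csv rows from Aman's example. This is not Drill code: `CastErrorPolicyDemo`, `Policy`, `CastFailure`, `Result`, and `castAll` are all hypothetical names invented for illustration, and for simplicity both columns are cast to a long rather than int and bigint.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch only; none of these types exist in Drill.
class CastErrorPolicyDemo {

    // Two of the options from point (b): fail the query, or skip the record and warn.
    enum Policy { FAIL, WARN_AND_SKIP }

    // A standardized failure carrying the context the thread asks for:
    // file name, row number (1-based), column index (0-based), and the bad value.
    static class CastFailure extends RuntimeException {
        final String file;
        final int row;
        final int column;
        final String value;

        CastFailure(String file, int row, int column, String value, Throwable cause) {
            super("Cannot cast '" + value + "' to BIGINT at " + file
                    + ", row " + row + ", column " + column, cause);
            this.file = file;
            this.row = row;
            this.column = column;
            this.value = value;
        }
    }

    // Rows that were cast successfully, plus warnings for rows that were skipped.
    static class Result {
        final List<long[]> rows = new ArrayList<>();
        final List<String> warnings = new ArrayList<>();
    }

    // Casts every comma-separated field of every line to a long, applying the policy
    // whenever a field cannot be parsed.
    static Result castAll(String file, List<String> lines, Policy policy) {
        Result result = new Result();
        for (int row = 0; row < lines.size(); row++) {
            String[] fields = lines.get(row).split(",", -1);
            long[] out = new long[fields.length];
            try {
                for (int col = 0; col < fields.length; col++) {
                    try {
                        out[col] = Long.parseLong(fields[col].trim());
                    } catch (NumberFormatException e) {
                        // Wrap the raw exception with its source location.
                        throw new CastFailure(file, row + 1, col, fields[col], e);
                    }
                }
                result.rows.add(out);
            } catch (CastFailure failure) {
                if (policy == Policy.FAIL) {
                    throw failure;
                }
                // WARN_AND_SKIP: drop the record and remember why.
                result.warnings.add(failure.getMessage());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> t4 = Arrays.asList("10,2001", "11,http://www.cnn.com");
        Result r = castAll("t4.csv", t4, Policy.WARN_AND_SKIP);
        System.out.println(r.rows.size() + " row(s) kept");
        for (String warning : r.warnings) {
            System.out.println("WARNING: " + warning);
        }
    }
}
```

The sketch attaches the location while the failure is still inside the loop that knows the file and row, which mirrors the point that provenance is only cheaply available while the projection sits next to (or inside) the scan.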
