Jacques,

Thanks for the details. I am trying to understand what the difference is between 3 and 4. Here is how I am thinking of the scenario; it's probably better to discuss this in the hangout.

I have some data coming from an external system, and I expect it to be in a certain format. I checked the first couple of rows and they seem to stick to that format. I have written a Drill query or a view to interpret the data (for example, converting certain fields to a date or timestamp, casting to a specific type, etc.). However, certain records seem to be corrupted, such as being prefixed with non-printable characters. I need special handling for these records (for example, I want to identify what they are so I can either fix them or choose to skip them). I am still in the data exploration phase at this point.

There is an extension use case of ETL/data import where I may have millions of text files coming in and I am using Drill to convert all of them to Parquet using CTAS. Some of the records in these files could be corrupted, and I need special handling for them (potentially skipping them or moving them to a separate file) without interrupting the whole data conversion.

-Neeraja

On Tue, Oct 27, 2015 at 8:52 AM, Jacques Nadeau <[email protected]> wrote:

> There seem to be multiple user requirements being considered in Hsuan's
> and Julien's proposals:
>
> 1. Drill doesn't have enough information to parse my data, and I want to
> give Drill help. (Examples might be: the field delimiter is "|", the proto
> IDL encoding for a protobuf file is "...", or an externally provided Avro
> schema.)
> 2. While Drill can parse my data, the structure of the output is
> incomplete. It may be missing field types and/or field names. I want to
> tell Drill how to interpret the data, since the format itself doesn't
> provide an adequate way to express this (typically text files, as opposed
> to JSON or Parquet).
> 3. I've defined an expected structure for my data files. If some records
> don't match it, I want special handling to manage those records (e.g.
> drop them, warn with the number of drops, or create a separate file with
> the provenance of each failing record).
> 4.
> I have an arbitrary query, and I want any data-specific execution
> failures to be squelched so the query can complete with whatever data
> remains.
>
> My recommendation is that we add four new features:
>
> A. Table with options (what Julien is working on)
> B. .drill files (https://issues.apache.org/jira/browse/DRILL-3572)
> C. ALTER TABLE to ascribe metadata (to create a .drill file through SQL)
> D. Support for using table with options (A) to override settings in
> .drill (B)
>
> I believe that A and B (and C, since it is simply a derivative of B)
> should provide the capability to achieve requirements 1-3 above.
>
> When Neeraja talks of the exploration use case, feature A is probably the
> most common way that people will do this. In the case of use case 3 above,
> if someone wants a "recordPositionAndError" behavior (see DRILL-3572),
> they will most likely want it in the context of a query (as opposed to a
> view or .drill file). As such, you would probably create a .drill file
> that warns or ignores, and then layer a recordPositionAndError over the
> top (via feature D) when you want that for a certain situation.
>
> My main thought on Hsuan's initial proposal is that it seems to provide
> an incomplete resolution of #4 above. It isn't clear to me that use case
> #4 is a critical use case for most users. If it is, can we get some
> concrete examples of it, as opposed to use cases 1-3? If it is a critical
> use case, I think we should solve it in a more general way (for example,
> I don't think we should try to maintain file-based record provenance in
> that context). Among other things, the current proposal has the weird
> problem of not being consistent in how the user experiences the behavior
> (depending on which plan Drill decides to execute).
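[Editor's note: the bad-record policies discussed above (drop, warn, record position and error) can be sketched roughly as follows. This is an illustrative Python sketch of the scan-level behavior being proposed, not Drill's actual implementation; the function names, the mode strings other than "recordPositionAndError", and the sample parser are assumptions made for illustration.]

```python
# Hypothetical sketch of scan-level bad-record handling: parse each record
# against an expected schema and, on failure, apply a configured policy
# instead of aborting the whole scan. Not a real Drill API.
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class BadRecordLog:
    dropped: int = 0
    # (file, line_no, error) provenance for each failing record
    provenance: List[Tuple[str, int, str]] = field(default_factory=list)

def scan_with_schema(filename: str, lines: List[str],
                     parse: Callable[[str], tuple],
                     mode: str, log: BadRecordLog) -> List[tuple]:
    """Parse every line; on failure, apply the configured bad-record policy."""
    good = []
    for line_no, line in enumerate(lines, start=1):
        try:
            good.append(parse(line))
        except ValueError as e:
            if mode == "fail":
                raise                      # default behavior: abort the scan
            log.dropped += 1               # "drop"/"warn" keep going, count drops
            if mode == "recordPositionAndError":
                log.provenance.append((filename, line_no, str(e)))
    return good

def parse_pipe_row(line: str) -> tuple:
    """Expect 'name|int' rows; reject records with non-printable characters."""
    if not line.isprintable():
        raise ValueError("non-printable characters in record")
    name, num = line.split("|")
    return name, int(num)

lines = ["alice|1", "\x00bob|2", "carol|three", "dave|4"]
log = BadRecordLog()
rows = scan_with_schema("data.txt", lines, parse_pipe_row,
                        "recordPositionAndError", log)
print(rows)        # [('alice', 1), ('dave', 4)]
print(log.dropped) # 2
```

Under this sketch, a "warn" policy would report `log.dropped` at the end of the query, while "recordPositionAndError" additionally preserves the file and line number of each failing record so it can be fixed or reprocessed later.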
> Note, there were some questions about how 1-3 could be solved using B,
> so I've provided an example in the JIRA:
> https://issues.apache.org/jira/browse/DRILL-3572
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Mon, Oct 26, 2015 at 4:09 PM, Zelaine Fong <[email protected]> wrote:
>
> > My understanding of Jacques' proposal is that he suggests we use .drill
> > instead of requiring the user to do an explicit cast in their select
> > query. That way, the changes for the enhancement would be restricted to
> > the scanner.
> >
> > Did I interpret the alternative approach correctly?
> >
> > -- Zelaine
> >
> > On Mon, Oct 26, 2015 at 4:05 PM, Hsuan Yi Chu <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > Luckily, we will have the hangout tomorrow.
> > >
> > > Maybe we could have an example to elaborate on how .drill can be used
> > > in a cast query?
> > >
> > > Thanks.
> > >
> > > On Mon, Oct 26, 2015 at 3:31 PM, Neeraja Rentachintala <
> > > [email protected]> wrote:
> > >
> > > > Jacques,
> > > > I have responded to one of your comments on the doc; can you please
> > > > review and comment? I am not clear on the approach you are
> > > > suggesting using .drill and what it would mean for the user
> > > > experience. It would be great if you could add an example.
> > > >
> > > > Similar to the other thread (initiated by Julien) about being able
> > > > to provide file parsing hints from the query itself for
> > > > self-service data exploration, we need this feature to be fairly
> > > > lightweight from a user experience point of view.
> > > > I.e., I, as a business user, got hold of some external data and
> > > > want to take a look by running ad hoc queries in Drill. I should be
> > > > able to do that without having to go through the whole setup of
> > > > .drill, etc., which will come later as the data is
> > > > 'operationalized'.
> > > >
> > > > Thanks,
> > > > -Neeraja
> > > >
> > > > On Mon, Oct 26, 2015 at 2:49 PM, Jacques Nadeau
> > > > <[email protected]> wrote:
> > > >
> > > > > Hsuan was kind enough to put together a provocative discussion on
> > > > > the mailing list about skipping records. I've started a way too
> > > > > long thread in the comments discussion, but I would like to get
> > > > > other feedback from the community. The main point of contention I
> > > > > have is that the big goal of this design is to provide "data
> > > > > import"-like capabilities for Drill. In that context, I suggested
> > > > > a scan-based approach to schema enforcement (and bad-record
> > > > > capture/storage). I think it is a simpler approach and solves the
> > > > > vast majority of user needs. Hsuan's initial proposal was a much
> > > > > broader-reaching proposal that supports an arbitrary number of
> > > > > expression types within project and filter (assuming they are
> > > > > proximate to the scan).
> > > > >
> > > > > I would love to get others' feedback and thoughts on the doc as
> > > > > to what the MVP for this feature really is.
> > > > >
> > > > > https://docs.google.com/document/d/1jCeYW924_SFwf-nOqtXrO68eixmAitM-tLngezzXw3Y/edit
> > > > >
> > > > > --
> > > > > Jacques Nadeau
> > > > > CTO and Co-Founder, Dremio
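[Editor's note: the layering semantics of feature D in Jacques' proposal, where per-query "table with options" settings override a table's .drill file, which in turn overrides format-plugin defaults, can be sketched as a simple option merge. The option names below are assumptions for illustration, not actual Drill configuration keys.]

```python
# Illustrative sketch of feature D's override layering: later (more
# query-specific) layers win over earlier (more persistent) ones.
format_defaults = {"fieldDelimiter": ",", "badRecordPolicy": "fail"}

# Settings persisted in a hypothetical .drill file for the workspace
drill_file = {"fieldDelimiter": "|", "badRecordPolicy": "warn"}

# Per-query "table with options" override for one exploration session
query_options = {"badRecordPolicy": "recordPositionAndError"}

def effective_options(*layers: dict) -> dict:
    """Merge option layers; later layers override earlier ones."""
    merged = {}
    for layer in layers:
        merged.update(layer)
    return merged

opts = effective_options(format_defaults, drill_file, query_options)
print(opts)
# {'fieldDelimiter': '|', 'badRecordPolicy': 'recordPositionAndError'}
```

This matches the workflow described above: a business user exploring new data supplies hints directly in the query (the top layer) without any .drill setup, while an operationalized pipeline pushes those settings down into a persistent .drill file.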
