Jacques,

Thanks for the details. I am trying to understand what the difference is between 3 and 4. Here is how I am thinking of the scenario; it's probably better to discuss this in the hangout.

I have some data coming from an external system, and I expect it to be in a certain format. I checked the first couple of rows and they seem to stick to that format. I have written a Drill query or a view to interpret the data (for example, converting certain fields to a date or timestamp, casting to a specific type, etc.). However, certain records seem to be corrupted, such as being prefixed with non-printable characters. I need special handling for these records (for example, I want to identify what they are so I can either fix them or choose to skip them). I am still in the data exploration phase at this point.

There is an extension use case of ETL/data import where I may have millions of text files coming in and I am using Drill to convert all of them to Parquet using CTAS. Some of the records in these files could be corrupted, and I need special handling for them (potentially skipping them or moving them to a separate file) without interrupting the whole data conversion.

-Neeraja

On Tue, Oct 27, 2015 at 8:52 AM, Jacques Nadeau <[email protected]> wrote:

> There seem to be multiple user requirements being considered in Hsuan's
> and Julien's proposals:
>
> 1. Drill doesn't have enough information to parse my data, and I want to
> give Drill help. (Examples might be: the field delimiter is "|", the proto
> IDL encoding for a protobuf file is "...", or an externally provided Avro
> schema.)
> 2. While Drill can parse my data, the structure of the output is
> incomplete. It may be missing field types and/or field names. I want to
> tell Drill how to interpret the data, since the format itself doesn't
> provide an adequate way to express this (typically text files, as opposed
> to JSON or Parquet).
> 3. I've defined an expected structure for my data files. If some records
> don't match it, I want special handling to manage those records (e.g.
> drop them, warn with the number of drops, or create a separate file with
> the provenance of each failing record).
> 4.
> I have an arbitrary query, and I want any data-specific execution
> failures to be squelched so the query can complete with whatever data
> remains.
>
> My recommendation is that we add four new features:
>
> A. Table with options (what Julien is working on)
> B. .drill files (https://issues.apache.org/jira/browse/DRILL-3572)
> C. ALTER TABLE to ascribe metadata (to create a .drill file through SQL)
> D. Support for using table with options (A) to override settings in
> .drill (B)
>
> I believe that A and B (and C, since it is simply a derivative of B)
> should provide the capability to achieve requirements 1-3 above.
>
> When Neeraja talks of the exploration use case, feature A is probably the
> most common way that people will do this. In the case of use case 3 above,
> if someone wants a "recordPositionAndError" behavior (see DRILL-3572),
> they will most likely want it in the context of a query (as opposed to a
> view or .drill file). As such, you would probably create a .drill file
> that warns or ignores, and then layer a recordPositionAndError over the
> top (via feature D) when you want that for a certain situation.
>
> My main thought on Hsuan's initial proposal is that it seems to provide
> an incomplete resolution of #4 above. It isn't clear to me that use case
> #4 is a critical use case for most users. If it is, can we get some
> concrete examples of it, as opposed to use cases 1-3? If it is a critical
> use case, I think we should solve it in a more general way (for example,
> I don't think we should try to maintain file-based record provenance in
> that context). Among other things, the current proposal has the weird
> problem of not being consistent in how the user experiences the behavior
> (depending on which plan Drill decides to execute).
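[Editor's note: the bad-record policies discussed above (drop, warn, record position and error) can be sketched roughly as follows. This is an illustrative Python sketch of the scan-level behavior being proposed, not Drill's actual implementation; the function names, the mode strings other than "recordPositionAndError", and the sample parser are assumptions made for illustration.]

```python
# Hypothetical sketch of scan-level bad-record handling: parse each record
# against an expected schema and, on failure, apply a configured policy
# instead of aborting the whole scan. Not a real Drill API.
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class BadRecordLog:
    dropped: int = 0
    # (file, line_no, error) provenance for each failing record
    provenance: List[Tuple[str, int, str]] = field(default_factory=list)

def scan_with_schema(filename: str, lines: List[str],
                     parse: Callable[[str], tuple],
                     mode: str, log: BadRecordLog) -> List[tuple]:
    """Parse every line; on failure, apply the configured bad-record policy."""
    good = []
    for line_no, line in enumerate(lines, start=1):
        try:
            good.append(parse(line))
        except ValueError as e:
            if mode == "fail":
                raise                      # default behavior: abort the scan
            log.dropped += 1               # "drop"/"warn" keep going, count drops
            if mode == "recordPositionAndError":
                log.provenance.append((filename, line_no, str(e)))
    return good

def parse_pipe_row(line: str) -> tuple:
    """Expect 'name|int' rows; reject records with non-printable characters."""
    if not line.isprintable():
        raise ValueError("non-printable characters in record")
    name, num = line.split("|")
    return name, int(num)

lines = ["alice|1", "\x00bob|2", "carol|three", "dave|4"]
log = BadRecordLog()
rows = scan_with_schema("data.txt", lines, parse_pipe_row,
                        "recordPositionAndError", log)
print(rows)        # [('alice', 1), ('dave', 4)]
print(log.dropped) # 2
```

Under this sketch, a "warn" policy would report `log.dropped` at the end of the query, while "recordPositionAndError" additionally preserves the file and line number of each failing record so it can be fixed or reprocessed later.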
> Note, there were some questions about how 1-3 could be solved using B,
> so I've provided an example in the JIRA:
> https://issues.apache.org/jira/browse/DRILL-3572
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Mon, Oct 26, 2015 at 4:09 PM, Zelaine Fong <[email protected]> wrote:
>
> > My understanding of Jacques' proposal is that he suggests we use .drill
> > instead of requiring the user to do an explicit cast in their select
> > query. That way, the changes for the enhancement would be restricted to
> > the scanner.
> >
> > Did I interpret the alternative approach correctly?
> >
> > -- Zelaine
> >
> > On Mon, Oct 26, 2015 at 4:05 PM, Hsuan Yi Chu <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > Luckily, we will have the hangout tomorrow.
> > >
> > > Maybe we could have an example to elaborate on how .drill can be used
> > > in a cast query?
> > >
> > > Thanks.
> > >
> > > On Mon, Oct 26, 2015 at 3:31 PM, Neeraja Rentachintala <
> > > [email protected]> wrote:
> > >
> > > > Jacques,
> > > > I have responded to one of your comments on the doc; can you please
> > > > review and comment? I am not clear on the approach you are
> > > > suggesting using .drill and what it would mean for the user
> > > > experience. It would be great if you could add an example.
> > > >
> > > > Similar to the other thread (initiated by Julien) about being able
> > > > to provide file parsing hints from the query itself for
> > > > self-service data exploration, we need this feature to be fairly
> > > > lightweight from a user experience point of view.
> > > > I.e., I, as a business user, got hold of some external data and
> > > > want to take a look by running ad hoc queries in Drill. I should be
> > > > able to do that without having to go through the whole setup of
> > > > .drill, etc., which will come later as the data is
> > > > 'operationalized'.
> > > >
> > > > Thanks,
> > > > -Neeraja
> > > >
> > > > On Mon, Oct 26, 2015 at 2:49 PM, Jacques Nadeau
> > > > <[email protected]> wrote:
> > > >
> > > > > Hsuan was kind enough to put together a provocative discussion on
> > > > > the mailing list about skipping records. I've started a way too
> > > > > long thread in the comments discussion, but I would like to get
> > > > > other feedback from the community. The main point of contention I
> > > > > have is that the big goal of this design is to provide "data
> > > > > import"-like capabilities for Drill. In that context, I suggested
> > > > > a scan-based approach to schema enforcement (and bad-record
> > > > > capture/storage). I think it is a simpler approach and solves the
> > > > > vast majority of user needs. Hsuan's initial proposal was a much
> > > > > broader-reaching proposal that supports an arbitrary number of
> > > > > expression types within project and filter (assuming they are
> > > > > proximate to the scan).
> > > > >
> > > > > I would love to get others' feedback and thoughts on the doc as
> > > > > to what the MVP for this feature really is.
> > > > >
> > > > > https://docs.google.com/document/d/1jCeYW924_SFwf-nOqtXrO68eixmAitM-tLngezzXw3Y/edit
> > > > >
> > > > > --
> > > > > Jacques Nadeau
> > > > > CTO and Co-Founder, Dremio
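[Editor's note: the layering semantics of feature D in Jacques' proposal, where per-query "table with options" settings override a table's .drill file, which in turn overrides format-plugin defaults, can be sketched as a simple option merge. The option names below are assumptions for illustration, not actual Drill configuration keys.]

```python
# Illustrative sketch of feature D's override layering: later (more
# query-specific) layers win over earlier (more persistent) ones.
format_defaults = {"fieldDelimiter": ",", "badRecordPolicy": "fail"}

# Settings persisted in a hypothetical .drill file for the workspace
drill_file = {"fieldDelimiter": "|", "badRecordPolicy": "warn"}

# Per-query "table with options" override for one exploration session
query_options = {"badRecordPolicy": "recordPositionAndError"}

def effective_options(*layers: dict) -> dict:
    """Merge option layers; later layers override earlier ones."""
    merged = {}
    for layer in layers:
        merged.update(layer)
    return merged

opts = effective_options(format_defaults, drill_file, query_options)
print(opts)
# {'fieldDelimiter': '|', 'badRecordPolicy': 'recordPositionAndError'}
```

This matches the workflow described above: a business user exploring new data supplies hints directly in the query (the top layer) without any .drill setup, while an operationalized pipeline pushes those settings down into a persistent .drill file.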
