Hey Parth,

I think I can provide a little clarification on this point you mentioned:

> BTW, I'm not convinced that record level error handling directives belong
> in this. I know Jacques had some thoughts about that, but I wouldn't mind
> if someone explained it to me again :)

I believe Jacques proposed handling this with a dot drill file to address
his concern with the initial planning-time proposal made by Sean. Under
that proposal we could get conflicting results depending on where a
project appeared in the plan: if a project with a cast appeared above a
scan, the proposal was to change the behavior of the read itself.
Unfortunately we don't have a concept of an operation pinned above a
scan. In fact, when pushing a project down gives us no pruning benefit,
we generally push it up the tree, on the assumption that other operations
such as filters, joins and aggregates all reduce the row count. In those
cases we evaluate the project's expressions on fewer rows if we wait for
the other operations to shrink the data first.
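
To make the row-count argument concrete, here is a rough sketch in Python.
It is not Drill code: run_plan, the row layout and the "region" filter are
all made up for illustration. It only counts how often a cast in a project
is evaluated depending on whether the project runs before or after a
filter.

```python
# Hypothetical illustration, not Drill's planner: count how many times a
# project expression (a cast) runs depending on its position in the plan.

def run_plan(rows, project_first):
    evaluations = 0

    def cast_to_int(value):
        # Stand-in for the cast expression inside the project.
        nonlocal evaluations
        evaluations += 1
        return int(value)

    def keep(row):
        # Stand-in for a contracting operation such as a filter.
        return row["region"] == "west"

    if project_first:
        # Project sits directly above the scan: every row is cast.
        projected = [dict(row, amount=cast_to_int(row["amount"])) for row in rows]
        result = [row for row in projected if keep(row)]
    else:
        # Project pushed up past the filter: only surviving rows are cast.
        filtered = [row for row in rows if keep(row)]
        result = [dict(row, amount=cast_to_int(row["amount"])) for row in filtered]
    return result, evaluations


rows = [
    {"region": "west" if i % 10 == 0 else "east", "amount": str(i)}
    for i in range(1000)
]
below, n_below = run_plan(rows, project_first=True)
above, n_above = run_plan(rows, project_first=False)
print(n_below, n_above)
```

Both orderings return the same rows here, but the cast runs on every row
in one case and only on the filtered rows in the other. The same
difference in ordering is why tying read behavior to a cast's position in
the plan could give conflicting results.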

Dot drill files are meant to add additional information for the scan
operation. Assigning a schema to a format that otherwise lacks one, and
defining what happens when materializing into that schema fails (warn,
error, write corrupt rows to a log file), seem like good candidates for
this feature.

On Fri, Nov 6, 2015 at 1:13 PM, Parth Chandra <[email protected]> wrote:

> Hi Julien,
>
>   In an earlier discussion, regarding 'insert into' we had discussed the
> idea of keeping a merged schema (a common schema that applies to all the
> files in the directory) in a .drill file.  The metadata cache file also has
> the same information and, in addition, has stats.  We never did specify
> what a merged schema contains.
>
>   My understanding was that the .drill file, when available, becomes the
> source of schema information. I can see both the metadata cache and the
> insert into functionality using a common format. For these two sets of
> functionality, I don't see a need for the file to be human readable and if
> a more efficient format is available, I think we should use that. This is
> particularly true if we need to keep per file information.
>
>   Is that how we are thinking of the .drill file? Or are we talking about a
> .drill.format (?) file. I guess this is similar to Ted's question.
>
>   BTW, I'm not convinced that record level error handling directives belong
> in this. I know Jacques had some thoughts about that, but I wouldn't mind
> if someone explained it to me again :) . To me record level error handling
> is really a query level directive, not something that applies to all the
> data (in a directory) all the time. Keeping an open mind on this though.
>
>   Something about the inheritance rules based on similar questions
> regarding the metadata cache file - The metadata cache file is built based
> on all the files in the hierarchy under the current directory. So if you
> have a hierarchy
>   A
>    -- B
>       -- C
>    -- D
> there is a metadata cache file in A, B, C and D. The cache file in A
> contains info on all the files in B, C and D. If you update the directory C
> and refresh metadata for C, then _only_ C will get updated and the changes
> are not propagated upwards. If you refresh metadata for A, all the changes
> are seen by A, B, and C. For the use case you're outlining, I would think
> looking only at the directory the files are in should suffice.
>
>
> Parth
>
>
>
>
> On Sun, Nov 1, 2015 at 10:18 PM, Julien Le Dem <[email protected]> wrote:
>
> > Hello,
> > I'd like to capture the requirement for dot drill files.
> > Here is my understanding:
> > A ".drill" file is in JSON format and is a mechanism provided by the
> > FileSystemPlugin to define the format plugin to use collocated with the
> > files containing the data in a file system. It will override any
> > extension or magic number header mapping.
> > It will enable configuring the format plugin and record level error
> > handling mechanism (bad record skipping, etc). It could be extended to
> > support more in the future.
> > Is this correct? Are there inheritance rules if more than one file is
> > found in the hierarchy? Does drill look only at the dir containing the
> > files or also all parent directories?
> >
> > --
> > Julien
> >
>
