Hey Parth, I think I can provide a little clarification on this point you mentioned:
> BTW, I'm not convinced that record level error handling directives belong
> in this. I know Jacques had some thoughts about that, but I wouldn't mind
> if someone explained it to me again :)

I believe the reason Jacques proposed handling this with a dot drill file
was to address his concern with the initial planning-time proposal made by
Sean. With the initial proposal we could get conflicting results depending
on where a project appeared in the plan. If a project with a cast appeared
above a scan, the proposal was to change the behavior of the read itself.

Unfortunately we don't have a concept of an operation pinned above a scan.
In fact, quite frequently, where we cannot get the benefits of pruning by
pushing something down, we push projects up the tree, assuming that other
operations like filters, joins and aggregates are all contracting. In these
cases we need to evaluate the expressions in the project on fewer rows if
we wait for other operations to reduce the overall size.

Dot drill files are meant to add additional information for the scan
operation. Assigning a schema to a format that otherwise lacks one, as well
as defining the behavior when materializing into that schema fails (warn,
error, write corrupt rows to a log file), seem like good candidates for
making use of this feature.

On Fri, Nov 6, 2015 at 1:13 PM, Parth Chandra <[email protected]> wrote:

> Hi Julien,
>
> In an earlier discussion, regarding 'insert into' we had discussed the
> idea of keeping a merged schema (a common schema that applies to all the
> files in the directory) in a .drill file. The metadata cache file also has
> the same information and, in addition, has stats. We never did specify
> what a merged schema contains.
>
> My understanding was that the .drill file, when available, becomes the
> source of schema information. I can see both the metadata cache and the
> insert into functionality using a common format. For these two sets of
> functionality, I don't see a need for the file to be human readable and if
> a more efficient format is available, I think we should use that. This is
> particularly true if we need to keep per file information.
>
> Is that how we are thinking of the .drill file? Or are we talking about a
> .drill.format (?) file. I guess this is similar to Ted's question.
>
> BTW, I'm not convinced that record level error handling directives belong
> in this. I know Jacques had some thoughts about that, but I wouldn't mind
> if someone explained it to me again :) . To me record level error handling
> is really a query level directive, not something that applies to all the
> data (in a directory) all the time. Keeping an open mind on this though.
>
> Something about the inheritance rules based on similar questions
> regarding the metadata cache file - The metadata cache file is built based
> on all the files in the hierarchy under the current directory. So if you
> have a hierarchy
> A
> -- B
> -- C
> -- D
> there is a metadata cache file in A, B, C and D. The cache file in A
> contains info on all the files in B, C and D. If you update the directory C
> and refresh metadata for C, then _only_ C will get updated and the changes
> are not propagated upwards. If you refresh metadata for A, all the changes
> are seen by A, B, and C. For the use case you're outlining, I would think
> looking only at the directory the files are in should suffice.
>
>
> Parth
>
>
> On Sun, Nov 1, 2015 at 10:18 PM, Julien Le Dem <[email protected]> wrote:
>
> > Hello,
> > I'd like to capture the requirement for dot drill files.
> > Here is my understanding:
> > A ".drill" file is in JSON format and is a mechanism provided by the
> > FileSystemPlugin to define the format plugin to use collocated with the
> > files containing the data in a file system. It will override any
> > extension or magic number header mapping.
> > It will enable configuring the format plugin and record level error
> > handling mechanism (bad record skipping, etc). It could be extended to
> > support more in the future.
> > Is this correct? Are there inheritance rules if more than one file is
> > found in the hierarchy? Does drill look only at the dir containing the
> > files or also all parent directories?
> >
> > --
> > Julien
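P.S. To make the scan-level idea concrete, here is a rough sketch in plain Python. It is purely illustrative and not Drill code: the keys `schema` and `on_error`, the `CASTS` table and the `scan` function are all names I made up for this example, and no .drill format has been agreed on. It only shows the failure policy travelling with the data, so the reader behaves the same no matter where a project ends up in the plan.

```python
import json

# Hypothetical .drill contents; these keys are assumptions, not an agreed format.
DOT_DRILL = json.loads("""
{
  "schema": {"id": "int", "price": "float"},
  "on_error": "log"
}
""")

# Assumed mapping from declared type names to Python casts.
CASTS = {"int": int, "float": float, "str": str}


def scan(rows, config, bad_rows):
    """Materialize raw rows into the declared schema.

    When a row does not fit, apply the configured policy:
    "error" re-raises, "warn" prints and skips, "log" collects the row.
    """
    schema = config["schema"]
    policy = config["on_error"]
    for row in rows:
        try:
            yield {col: CASTS[typ](row[col]) for col, typ in schema.items()}
        except (KeyError, ValueError, TypeError) as exc:
            if policy == "error":
                raise
            if policy == "warn":
                print(f"warning: skipping bad row {row!r}: {exc}")
            elif policy == "log":
                bad_rows.append(row)  # stand-in for writing to a log file


raw = [
    {"id": "1", "price": "9.5"},
    {"id": "x", "price": "2"},   # "x" cannot be cast to int
    {"id": "3", "price": "4"},
]
rejected = []
good = list(scan(raw, DOT_DRILL, rejected))
```

With `"on_error": "log"` the bad row lands in `rejected` and the other two rows are materialized; switching the value to `"error"` makes the same scan fail instead, without touching the query.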
