One comment... I think this is an overreach: "The union schema is saved in a .drill file"
There is no point in saving something derived in a .drill file. Maybe a cache file (extending the other work on metadata caching). The idea for .drill files is that they are supplementary to what can be acquired from the data directly.

-- Jacques Nadeau
CTO and Co-Founder, Dremio

On Wed, Jul 29, 2015 at 6:58 PM, Parth Chandra <[email protected]> wrote:

> Notes based on the discussion in the hangout, Jacques's notes, and later
> discussion:
>
> 1. For Parquet files:
>    - Merge schema information into a union schema.
>    - The union schema is saved in a .drill file.
>    - The union schema can be cached.
>    - The union schema should maintain the ordinal position of fields.
> 2. At the time the table is read, we can check the directory timestamp
>    and regenerate the merged schema.
> 3. Insert into does not need to update the merged schema. (Should it
>    check whether an incompatible schema is being written? An incompatible
>    schema is one that cannot be merged successfully.)
>    - Suggestion 1 - allow flexible inserts. (Issue a warning if there is
>      going to be a schema change conflict.) Queries on columns with no
>      conflict can still run.
>    - Suggestion 2 - do not allow the insert unless the schema can be
>      merged.
> 4. Rules for merging can be relaxed as we improve our ability to promote
>    from one type to another.
> 5. The concept of a schema change is different from that of a merged
>    schema. Schema change is needed for reading schema-less sources like
>    JSON.
> 6. If the table's files differ in column order (or in the number of
>    columns), an insert based on ordinal positions cannot be allowed.
> 7. Suggestion - go with a name-based insert. Ordinal-based insert can be
>    enabled via a switch (?).
> 8. Reuse the partitioning information from 'create table' in 'insert
>    into'. No partitioning info is needed in the 'insert into' statement.
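A minimal sketch of the footer merge in point 1 above, using the parquet-mr
API. Package names assume Parquet 1.7, and the class below is an
illustration of the idea, not Drill's actual implementation:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.hadoop.ParquetFileReader;
    import org.apache.parquet.schema.MessageType;

    public class UnionSchemaBuilder {
      /**
       * Read the footer of every Parquet file under 'dir' and merge the
       * schemas into one union schema. Fields are merged by name and keep
       * the ordinal position of their first appearance; conflicting types
       * make union() throw, which is the "incompatible schema" case in
       * point 3.
       */
      public static MessageType buildUnionSchema(Configuration conf, Path dir)
          throws IOException {
        FileSystem fs = dir.getFileSystem(conf);
        MessageType union = null;
        for (FileStatus file : fs.listStatus(dir)) {
          if (!file.getPath().getName().endsWith(".parquet")) {
            continue;  // skip sidecar files such as a .drill or cache file
          }
          MessageType schema = ParquetFileReader
              .readFooter(conf, file.getPath())
              .getFileMetaData()
              .getSchema();
          union = (union == null) ? schema : union.union(schema);
        }
        return union;
      }
    }

The staleness check in point 2 would then roughly reduce to comparing the
cached schema's timestamp against fs.getFileStatus(dir).getModificationTime().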
> On Mon, Jul 27, 2015 at 6:43 PM, Khurram Faraaz <[email protected]>
> wrote:
>
> > I have a question related to insert into partitioned tables. I see that
> > DB2, Oracle, and Postgres do not support a PARTITION BY clause in their
> > INSERT INTO implementation. MySQL does, however, provide support for a
> > PARTITION BY clause in its INSERT INTO implementation. I do not know
> > what the standard ANSI SQL specification says about the INSERT INTO
> > statement. We may want to keep it compliant with standard SQL syntax.
> >
> > >> B ) Support insert into a partitioned table
> > >> ----------------------
> > >>
> > >> Ex: INSERT INTO T1 col1, col2, col3 partition by col1, col2 SELECT
> > >> col4, col5, col6 from T2
> > >>
> > >> * Target column specification is required when inserting data into
> > >>   an already partitioned table
> > >> * Requirements A.3-A.7 above apply for insert into partitioned
> > >>   tables as well
> > >> * A partition by clause along with one or more columns is required
> > >> * All the columns specified in the partition by clause must exist in
> > >>   the target column list
> > >> * The partition by columns specified do not need to match the list
> > >>   of columns that the original table was partitioned with (i.e. if
> > >>   the original table is partitioned with col1, col2, new data during
> > >>   insert can be partitioned by col3, or just with col1 or col2)
> >
> > On Mon, Jul 27, 2015 at 5:40 PM, Mehant Baid <[email protected]>
> > wrote:
> >
> > > I wanted to start a conversation around supporting the "Insert into
> > > Table" feature. As of 1.2 we initially want to support inserting into
> > > a table with Parquet files. Support for JSON, CSV and other sources
> > > will follow as future enhancements.
> > >
> > > Aman, Jinfeng, Neeraja and I had an initial discussion about this and
> > > Neeraja provided a good summary of our discussion (pasted below),
> > > also stating some of the requirements for this feature.
> > >
> > > A ) Support Insert into a non-partitioned table
> > > ---------------------
> > >
> > > Ex: INSERT INTO T1 [col1, col2, col3] SELECT col4, col5, col6 from T2
> > > (Source table: T2, Target table: T1)
> > > Requirements:
> > >
> > > 1. The target table column list specification is optional for an
> > >    Insert statement
> > > 2. When specified, the column list in the Insert statement should
> > >    contain all the columns present in the target table (i.e. no
> > >    support for partial insert)
> > > 3. The column names specified for the source table do not need to
> > >    match the target table column names. Matching is performed by
> > >    ordinal.
> > > 4. The number of source table columns specified must be the same as
> > >    the number of target table columns
> > > 5. The types of the specified source table columns must match the
> > >    types of the target table columns
> > > 6. Specification of * is not allowed in the Select syntax
> > > 7. The Select syntax can specify constant values for one or more
> > >    columns
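Requirements 3-5 above amount to a purely positional check at planning
time. A minimal illustration of that validation, assuming the planner can
hand us the two resolved column lists (the ColumnDesc type and the error
messages here are hypothetical, not Drill's planner API):

    import java.util.List;

    public class InsertValidator {

      /** Hypothetical resolved column: a name plus a type token. */
      public static final class ColumnDesc {
        final String name;
        final String type;  // e.g. "INT", "VARCHAR"
        public ColumnDesc(String name, String type) {
          this.name = name;
          this.type = type;
        }
      }

      /**
       * Validate an INSERT per requirements A.3-A.5: column names are
       * ignored, counts must match, and types are compared ordinal by
       * ordinal.
       */
      public static void validate(List<ColumnDesc> target,
                                  List<ColumnDesc> source) {
        if (target.size() != source.size()) {                    // A.4
          throw new IllegalArgumentException("INSERT expects "
              + target.size() + " columns, got " + source.size());
        }
        for (int i = 0; i < target.size(); i++) {                // A.3, A.5
          if (!target.get(i).type.equals(source.get(i).type)) {
            throw new IllegalArgumentException("Type mismatch at ordinal "
                + i + ": target '" + target.get(i).name + "' is "
                + target.get(i).type + " but source '" + source.get(i).name
                + "' is " + source.get(i).type);
          }
        }
      }
    }

Relaxing the equals() comparison into an "is promotable to" check is where
point 4 of Parth's notes would slot in.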
> > > B ) Support insert into a partitioned table
> > > ----------------------
> > >
> > > Ex: INSERT INTO T1 col1, col2, col3 partition by col1, col2 SELECT
> > > col4, col5, col6 from T2
> > >
> > > * Target column specification is required when inserting data into
> > >   an already partitioned table
> > > * Requirements A.3-A.7 above apply for insert into partitioned
> > >   tables as well
> > > * A partition by clause along with one or more columns is required
> > > * All the columns specified in the partition by clause must exist in
> > >   the target column list
> > > * The partition by columns specified do not need to match the list
> > >   of columns that the original table was partitioned with (i.e. if
> > >   the original table is partitioned with col1, col2, new data during
> > >   insert can be partitioned by col3, or just with col1 or col2)
> > >
> > > A couple of open questions from the design perspective:
> > >
> > > 1. How do we perform validation - of data types, the number of
> > >    columns being inserted, etc.? In addition to validation, we need
> > >    to make sure that when we insert into an existing table we insert
> > >    data with the existing column names (the select column list can
> > >    have different names). This poses problems around needing to know
> > >    the metadata at planning time. Two approaches that have been
> > >    floating around are:
> > >
> > >    * DotDrill files: We can store metadata, partitioning columns and
> > >      other useful information here, and we can perform validation at
> > >      planning time. However, the challenges with introducing DotDrill
> > >      files include:
> > >      - consistency between metadata and the actual data (nothing
> > >        prevents users from copying files in directly)
> > >      - security around DotDrill files (can be dealt with in the same
> > >        way we perform security checks for Drill tables in HDFS)
> > >      - an interface to change the DotDrill file, in case we need to
> > >        add a column to the table or add a new partition, etc.
> > >
> > >    * Explicit syntax / no-metadata approach: Another approach is to
> > >      avoid DotDrill files and use explicit syntax to glean as much
> > >      information as possible from the SQL statement itself. Some of
> > >      the challenges with this approach are:
> > >      - Gathering metadata information: since we have no idea what the
> > >        existing schema is, we would need to perform a "mini scan" to
> > >        learn the schema at planning time in order to perform some
> > >        validation. The problem with this approach is: how do we
> > >        determine how many files we need to read in order to learn the
> > >        schema? If we use a sample set and not all the files have the
> > >        same schema, we could have non-deterministic results based on
> > >        the sample of files read. Also, reading all the files and
> > >        merging the schema seems like an expensive cost to pay. (A
> > >        sketch of such a sampling scan appears below.)
> > >      - From the user's perspective, while inserting into a
> > >        partitioned table, the user will have to specify the
> > >        partitioning columns again in the Insert statement, despite
> > >        having specified the partition columns in the CTAS.
> > >
> > > 2. What is a reasonable assumption for a Drill table in terms of a
> > >    changing schema? Is having the exact same schema for all the files
> > >    in a table too rigid an assumption at this point?
> > >
> > > One thing to remember with the DotDrill file approach is to also
> > > consider the repercussions on Drop table, Show tables, Describe
> > > table, etc.; i.e. it might make it easier to support these
> > > operations.
> > >
> > > Thanks
> > > Mehant
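Picking up the forward reference under question 1: a minimal sketch of the
sampling "mini scan". The class, the sampleSize parameter, and the sorting
choice are illustrative assumptions, not a Drill API (parquet-mr and
Hadoop FS calls as in the sketch earlier in the thread):

    import java.io.IOException;
    import java.util.Arrays;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.hadoop.ParquetFileReader;
    import org.apache.parquet.schema.MessageType;

    public class MiniScan {
      /**
       * Learn a schema at planning time by reading the footers of at most
       * sampleSize files instead of the whole table.
       */
      public static MessageType sampleSchema(Configuration conf, Path dir,
                                             int sampleSize)
          throws IOException {
        FileSystem fs = dir.getFileSystem(conf);
        FileStatus[] files = fs.listStatus(dir);
        // Sort for a stable (if still partial) sample rather than relying
        // on listing order.
        Arrays.sort(files);
        MessageType merged = null;
        int read = 0;
        for (FileStatus file : files) {
          if (read >= sampleSize) {
            break;  // the rest of the table goes unchecked
          }
          if (!file.getPath().getName().endsWith(".parquet")) {
            continue;  // skip sidecar files
          }
          MessageType schema = ParquetFileReader
              .readFooter(conf, file.getPath())
              .getFileMetaData()
              .getSchema();
          merged = (merged == null) ? schema : merged.union(schema);
          read++;
        }
        return merged;
      }
    }

The sort makes the sample stable across runs, but it cannot remove the
underlying risk: any file outside the sample can still disagree with the
merged schema, which is exactly the non-determinism concern raised above.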