Notes based on the discussion in the hangout, Jacques' notes, and later discussion:
1. For Parquet files
- Merge schema information into a union schema (see the example after these notes)
- The union schema is saved in a .drill file
- The union schema can be cached
- Union schema should maintain ordinal position of fields.
2. At the time the table is read, we can check the directory timestamp
and regenerate the merged schema.
3. Insert into does not need to update the merged schema. (Should it
check whether an incompatible schema is being written? An incompatible
schema is one that cannot be merged successfully.)
- Suggestion 1 - Allow flexible insert. (Issue a warning if there is
going to be a schema change conflict.) Queries on columns with no
conflict can still run.
- Suggestion 2 - Do not allow insert unless the schema can be
merged.
4. Rules for merging can be relaxed as we improve our ability to promote
from one type to another.
5. The concept of a schema change is different from a merged schema.
Schema change is needed for reading schema-less sources like JSON.
6. If there is a different column order (or difference in number of
columns) in the table files, an insert based on ordinal positions cannot be
allowed.
7. Suggestion - go with a name-based insert. Ordinal-based insert can be
enabled based on a switch (?). See the sketch after these notes.
8. Reuse partitioning information from 'create table' in 'insert into'.
No partitioning info is needed in the 'insert into' statement.
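
For illustration, a sketch of the union schema merge (the file names,
columns, and types here are hypothetical): if file1.parquet has schema
(a INT, b VARCHAR) and file2.parquet has schema (a BIGINT, c DOUBLE),
and we can promote INT to BIGINT, the union schema is (a BIGINT, b
VARCHAR, c DOUBLE), with each field keeping its ordinal position. If
file2.parquet instead had (a MAP, c DOUBLE) and no promotion rule covers
INT vs. MAP, the merge fails; that is an incompatible schema.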
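
Also for illustration, a sketch of the name-based insert reusing
partitioning from 'create table', written in the proposed syntax
(the tables and columns are hypothetical):

CREATE TABLE T1 (col1, col2, col3) PARTITION BY (col1, col2) AS
  SELECT col4, col5, col6 from T2

INSERT INTO T1 col2, col1, col3 SELECT c2, c1, c3 from T3

Since matching is by name, the insert's column list need not follow
T1's ordinal order, and the partitioning on (col1, col2) carries over
from the 'create table' without being restated.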
On Mon, Jul 27, 2015 at 6:43 PM, Khurram Faraaz <[email protected]>
wrote:
> I have a question related to insert into partitioned tables. I see that
> DB2, Oracle, and Postgres do not support a PARTITION BY clause in their
> INSERT INTO implementations. MySQL does, however, provide support for a
> PARTITION BY clause in its INSERT INTO implementation. I do not know
> what the standard ANSI SQL specification says about the INSERT INTO
> statement. We may want to keep it compliant with standard SQL syntax.
>
> >> B ) Support insert into a partitioned table
> >> ----------------------
>
> >> Ex: INSERT INTO T1 col1, col2, col3 partition by col1, col2 SELECT
> >> col4, col5, col6 from T2
>
> >> * Target column specification is required when inserting data into an
> >> already partitioned table
> >> * Requirements A.3-A.7 above apply for insert into partitioned tables
> >> as well
> >> * A partition by clause along with one or more columns is required
> >> * All the columns specified in partition by clause must exist in the
> >> target column list
> >> * Partition by columns specified do not need to match the list of
> >> columns that the original table was partitioned with (i.e., if the
> >> original table is partitioned with col1, col2, new data during
> >> insert can be partitioned by col3 or just with col1 or col2).
>
> On Mon, Jul 27, 2015 at 5:40 PM, Mehant Baid <[email protected]>
> wrote:
>
> > I wanted to start a conversation around supporting the "Insert into
> > Table" feature. As of 1.2 we initially want to support inserting into
> > a table with Parquet files. Support for JSON, CSV, and other sources
> > will follow as future enhancements.
> >
> > Aman, Jinfeng, Neeraja, and I had an initial discussion about this,
> > and Neeraja provided a good summary of our discussion (pasted below),
> > also stating some of the requirements for this feature.
> >
> > A ) Support Insert into a non-partitioned table
> > ---------------------
> >
> > Ex: INSERT INTO T1 [col1, col2, col3] SELECT col4, col5, col6 from T2
> > (Source table: T2, Target table: T1)
> > Requirements:
> >
> > 1. Target table column list specification is optional for the Insert
> > statement.
> > 2. When specified, the column list in the Insert statement should
> > contain all the columns present in the target table (i.e., no support
> > for partial insert).
> > 3. The column names specified for the source table do not need to
> > match the target table column names. Matching is performed based on
> > ordinal position.
> > 4. The # of source table columns specified must be the same as the #
> > of target table columns.
> > 5. Types of the specified source table columns must match the types
> > of the target table columns.
> > 6. Specification of * is not allowed in the Select table syntax.
> > 7. Select table syntax can specify constant values for one or more
> > columns (see the example below).
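> >
> > For illustration, a hypothetical instance of the above requirements
> > (the source column names and the constant value are assumed):
> > matching is by ordinal, the source names differ from the target's,
> > and one column is a constant:
> >
> > INSERT INTO T1 col1, col2, col3 SELECT c4, c5, 100 from T2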
> >
> >
> > B ) Support insert into a partitioned table
> > ----------------------
> >
> > Ex: INSERT INTO T1 col1, col2, col3 partition by col1, col2 SELECT
> > col4, col5, col6 from T2
> >
> > * Target column specification is required when inserting data into an
> > already partitioned table
> > * Requirements A.3-A.7 above apply for insert into partitioned tables
> > as well
> > * A partition by clause along with one or more columns is required
> > * All the columns specified in partition by clause must exist in the
> > target column list
> > * Partition by columns specified do not need to match the list of
> > columns that the original table was partitioned with (i.e., if the
> > original table is partitioned with col1, col2, new data during
> > insert can be partitioned by col3 or just with col1 or col2; see
> > the example below).
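> >
> > For illustration, a hypothetical sequence showing the last point, in
> > the proposed syntax: the insert's partitioning differs from the
> > partitioning chosen at create time:
> >
> > CREATE TABLE T1 (col1, col2, col3) PARTITION BY (col1, col2) AS
> >   SELECT col4, col5, col6 from T2
> >
> > INSERT INTO T1 col1, col2, col3 partition by col3 SELECT
> > col4, col5, col6 from T2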
> >
> >
> > A couple of open questions from the design perspective are:
> >
> > 1. How do we perform validation? Validation of data types, number of
> > columns being inserted, etc. In addition to validation we need to
> > make sure that when we insert into an existing table we insert data
> > with the existing column names (the select column list can have
> > different names). This poses problems around needing to know the
> > metadata at planning time; two approaches that have been floating
> > around are:
> > * DotDrill files: We can store metadata, partitioning columns, and
> > other useful information here, and we can perform validation during
> > planning time (a hypothetical sketch of such a file appears after
> > this list). However, the challenges with introducing DotDrill files
> > include
> > - consistency between metadata and the actual data (nothing
> > prevents users from copying files directly).
> > - security around DotDrill files (can be dealt with in the same
> > way we perform security checks for Drill tables in HDFS).
> > - an interface to change the DotDrill file, in case we need
> > to add a column to the table or add a new partition, etc.
> >
> > * Explicit syntax / no metadata approach: Another approach is to
> > avoid DotDrill files and use explicit syntax to glean as much
> > information as possible from the SQL statement itself. Some of the
> > challenges with this approach are:
> > - Gathering metadata information: Since we have no idea what
> > the existing schema is, we would need to perform a "mini scan" to
> > learn the schema at planning time to be able to perform some
> > validation. The problem with this approach is: how do we determine
> > how many files we need to read in order to learn the schema? If we
> > use a sample set and not all the files have the same schema, we
> > could have non-deterministic results based on the sample of files
> > read. Also, reading all the files and merging the schema seems like
> > an expensive cost to pay.
> > - From the user's perspective, while inserting into a
> > partitioned table, the user will have to specify the partitioning
> > columns again in the Insert statement, despite having specified the
> > partition columns in the CTAS.
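> >
> > Purely for illustration, a hypothetical sketch of what a DotDrill
> > file might record (the fields and layout are assumptions, not a
> > settled design):
> >
> > table: T1
> > partition columns: col1, col2
> > union schema: col1 INT, col2 VARCHAR, col3 DOUBLE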
> >
> > 2. What is a reasonable assumption for a Drill table in terms of
> > changing schema? Is having the exact same schema for all files in a
> > table too rigid an assumption at this point?
> >
> > One thing to remember with the DotDrill file is to also consider
> > the repercussions on Drop table, Show tables, Describe table, etc.;
> > i.e., it might make it easier to support these operations.
> >
> > Thanks
> > Mehant
> >
>