I wanted to start a conversation around supporting the "Insert into Table" feature. For 1.2 we initially want to support inserting into a table backed by Parquet files; support for JSON, CSV and other sources will follow as future enhancements.

Aman, Jinfeng, Neeraja and I had an initial discussion about this, and Neeraja provided a good summary of our discussion (pasted below) that also states some of the requirements for this feature.

A) Support Insert into a non-partitioned table
---------------------

Ex: INSERT INTO T1 [(col1, col2, col3)] SELECT col4, col5, col6 FROM T2
(source table: T2, target table: T1; the bracketed column list is optional)
Requirements:

1. The target table column list is optional in the Insert statement
2. When specified, the column list in the Insert statement must
   contain all the columns present in the target table (i.e., no
   support for partial inserts)
3. The column names specified for the source table do not need to
   match the target table column names; matching is performed by
   ordinal position
4. The number of source table columns specified must be the same as
   the number of target table columns
5. The types of the specified source table columns must match the
   types of the target table columns (a rough validation sketch for
   requirements 3-5 follows this list)
6. Specifying * is not allowed in the SELECT clause
7. The SELECT clause can specify constant values for one or more columns
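
To make requirements 3-5 concrete, here is a rough sketch of the kind of ordinal-based check the planner would run. This is plain Java for illustration only; Column is a made-up stand-in for whatever metadata class we end up with, not Drill's actual API:

    import java.util.List;

    class InsertValidator {
      // Hypothetical stand-in for a resolved column: a name plus a type tag.
      record Column(String name, String type) {}

      // Requirement 4: column counts must match.
      // Requirement 5: types must match by ordinal.
      // Requirement 3: names are deliberately ignored.
      static void validate(List<Column> target, List<Column> source) {
        if (source.size() != target.size()) {
          throw new IllegalArgumentException("Source has " + source.size()
              + " columns, target expects " + target.size());
        }
        for (int i = 0; i < target.size(); i++) {
          if (!source.get(i).type().equals(target.get(i).type())) {
            throw new IllegalArgumentException("Type mismatch at ordinal " + i
                + ": " + source.get(i).type() + " vs " + target.get(i).type());
          }
        }
      }
    }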


B) Support insert into a partitioned table
----------------------

Ex: INSERT INTO T1 (col1, col2, col3) PARTITION BY (col1, col2) SELECT col4, col5, col6 FROM T2

 * A target column specification is required when inserting data into
   an already partitioned table
 * Requirements A.3-A.7 above apply to inserts into partitioned tables
   as well
 * A PARTITION BY clause with one or more columns is required
 * All the columns specified in the PARTITION BY clause must exist in
   the target column list (a sketch of this check follows the list)
 * The PARTITION BY columns do not need to match the columns the
   original table was partitioned with (i.e., if the original table is
   partitioned by col1 and col2, new data during insert can be
   partitioned by col3, or by just col1 or col2)
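
The additional check for B could then be as simple as the following hypothetical sketch: every PARTITION BY column is checked against the target column list only, and the table's original partitioning is intentionally never consulted:

    import java.util.List;

    class PartitionValidator {
      // Each PARTITION BY column must appear in the target column list;
      // the table's original partitioning is deliberately not compared.
      static void validate(List<String> targetColumns, List<String> partitionColumns) {
        for (String p : partitionColumns) {
          if (!targetColumns.contains(p)) {
            throw new IllegalArgumentException(
                "Partition column '" + p + "' is not in the target column list");
          }
        }
      }
    }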


A couple of open questions from the design perspective:

1. How do we perform validation? We need to validate data types, the number of columns being inserted, and so on. In addition, when inserting into an existing table we must write the data with the existing column names (the SELECT column list can have different names). This means we need to know the table's metadata at planning time. Two approaches have been floating around:

* DotDrill files: We can store metadata, partitioning columns and other useful information in a DotDrill file and perform validation at planning time. The challenges with introducing DotDrill files include:
  - Consistency between the metadata and the actual data (nothing prevents users from copying files into the table directory directly).
  - Security around DotDrill files (this can be handled the same way we perform security checks for Drill tables in HDFS).
  - An interface for changing the DotDrill file when we need to add a column to the table, add a new partition, etc.
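Roughly, the information such a file would need to carry might look like the following. This is a sketch only; all field names are made up and the on-disk format is an open question:

    import java.util.List;

    // Hypothetical in-memory model of a table's DotDrill metadata.
    class TableMetadata {
      List<String> columnNames;      // canonical target column names, by ordinal
      List<String> columnTypes;      // column types, by ordinal
      List<String> partitionColumns; // columns the table is partitioned by
      long version;                  // bumped on every change (add column, new partition, ...)
    }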

* Explicit syntax / no-metadata approach: Avoid DotDrill files and use explicit syntax to glean as much information as possible from the SQL statement itself. Some of the challenges with this approach are:
  - Gathering metadata: since we have no idea what the existing schema is, we would need to perform a "mini scan" to learn the schema at planning time in order to do any validation. The problem is deciding how many files to read in order to learn the schema. If we read only a sample and not all files share the same schema, we could get non-deterministic results depending on which files were sampled; reading all the files and merging their schemas, on the other hand, seems like an expensive cost to pay (a sketch of the sampling problem follows).
  - From the user's perspective, when inserting into a partitioned table the user has to specify the partitioning columns again in the Insert statement, despite having already specified them in the CTAS.
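To illustrate the sampling problem, a mini scan would look roughly like this. readSchema() is a stub standing in for reading a Parquet footer, and the open question is what a safe sampleSize would be:

    import java.util.LinkedHashSet;
    import java.util.List;
    import java.util.Set;

    class MiniScan {
      // Stub for illustration; in Drill this would read a Parquet footer.
      static List<String> readSchema(String file) {
        throw new UnsupportedOperationException("illustration only");
      }

      // Merge the schemas of the first sampleSize files. If files outside the
      // sample carry extra or conflicting columns, the merged schema (and any
      // validation built on it) depends on which files happened to be sampled.
      static Set<String> mergeSchemas(List<String> files, int sampleSize) {
        Set<String> merged = new LinkedHashSet<>();
        for (String file : files.subList(0, Math.min(sampleSize, files.size()))) {
          merged.addAll(readSchema(file));
        }
        return merged;
      }
    }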

2. What is a reasonable assumption for a Drill table in terms of changing schema? Is requiring the exact same schema for all files in a table too rigid an assumption at this point?

One more thing to consider with DotDrill files is their repercussions on DROP TABLE, SHOW TABLES, DESCRIBE TABLE, etc.; having the metadata available might make it easier to support these operations.

Thanks
Mehant
