I wanted to start a conversation around supporting the "Insert into Table" feature. For 1.2 we initially want to support inserting into a table backed by Parquet files; support for JSON, CSV and other sources will follow as future enhancements.

Aman, Jinfeng, Neeraja and I had an initial discussion about this, and Neeraja provided a good summary of our discussion (pasted below) that also states some of the requirements for this feature.

A) Support Insert into a non-partitioned table
---------------------

Ex: INSERT INTO T1 [(col1, col2, col3)] SELECT col4, col5, col6 FROM T2
(source table: T2, target table: T1; the bracketed column list is optional)
Requirements:

1. The target table column list is optional in the Insert statement
2. When specified, the column list in the Insert statement must
   contain all the columns present in the target table (i.e., no
   support for partial inserts)
3. The column names specified for the source table do not need to
   match the target table column names; matching is performed by
   ordinal position
4. The number of source table columns specified must be the same as
   the number of target table columns
5. The types of the specified source table columns must match the
   types of the target table columns (a rough validation sketch for
   requirements 3-5 follows this list)
6. Specifying * is not allowed in the SELECT clause
7. The SELECT clause can specify constant values for one or more columns
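
To make requirements 3-5 concrete, here is a rough sketch of the kind of ordinal-based check the planner would run. This is plain Java for illustration only; Column is a made-up stand-in for whatever metadata class we end up with, not Drill's actual API:

    import java.util.List;

    class InsertValidator {
      // Hypothetical stand-in for a resolved column: a name plus a type tag.
      record Column(String name, String type) {}

      // Requirement 4: column counts must match.
      // Requirement 5: types must match by ordinal.
      // Requirement 3: names are deliberately ignored.
      static void validate(List<Column> target, List<Column> source) {
        if (source.size() != target.size()) {
          throw new IllegalArgumentException("Source has " + source.size()
              + " columns, target expects " + target.size());
        }
        for (int i = 0; i < target.size(); i++) {
          if (!source.get(i).type().equals(target.get(i).type())) {
            throw new IllegalArgumentException("Type mismatch at ordinal " + i
                + ": " + source.get(i).type() + " vs " + target.get(i).type());
          }
        }
      }
    }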


B) Support insert into a partitioned table
----------------------

Ex: INSERT INTO T1 (col1, col2, col3) PARTITION BY (col1, col2) SELECT col4, col5, col6 FROM T2

 * A target column specification is required when inserting data into
   an already partitioned table
 * Requirements A.3-A.7 above apply to inserts into partitioned tables
   as well
 * A PARTITION BY clause with one or more columns is required
 * All the columns specified in the PARTITION BY clause must exist in
   the target column list (a sketch of this check follows the list)
 * The PARTITION BY columns do not need to match the columns the
   original table was partitioned with (i.e., if the original table is
   partitioned by col1 and col2, new data during insert can be
   partitioned by col3, or by just col1 or col2)
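
The additional check for B could then be as simple as the following hypothetical sketch: every PARTITION BY column is checked against the target column list only, and the table's original partitioning is intentionally never consulted:

    import java.util.List;

    class PartitionValidator {
      // Each PARTITION BY column must appear in the target column list;
      // the table's original partitioning is deliberately not compared.
      static void validate(List<String> targetColumns, List<String> partitionColumns) {
        for (String p : partitionColumns) {
          if (!targetColumns.contains(p)) {
            throw new IllegalArgumentException(
                "Partition column '" + p + "' is not in the target column list");
          }
        }
      }
    }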


A couple of open questions from the design perspective:

1. How do we perform validation? We need to validate data types, the number of columns being inserted, and so on. In addition, when inserting into an existing table we must write the data with the existing column names (the SELECT column list can have different names). This means we need to know the table's metadata at planning time. Two approaches have been floating around:

* DotDrill files: We can store metadata, partitioning columns and other useful information in a DotDrill file and perform validation at planning time. The challenges with introducing DotDrill files include:
  - Consistency between the metadata and the actual data (nothing prevents users from copying files into the table directory directly).
  - Security around DotDrill files (this can be handled the same way we perform security checks for Drill tables in HDFS).
  - An interface for changing the DotDrill file when we need to add a column to the table, add a new partition, etc.
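Roughly, the information such a file would need to carry might look like the following. This is a sketch only; all field names are made up and the on-disk format is an open question:

    import java.util.List;

    // Hypothetical in-memory model of a table's DotDrill metadata.
    class TableMetadata {
      List<String> columnNames;      // canonical target column names, by ordinal
      List<String> columnTypes;      // column types, by ordinal
      List<String> partitionColumns; // columns the table is partitioned by
      long version;                  // bumped on every change (add column, new partition, ...)
    }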

* Explicit syntax / no-metadata approach: Avoid DotDrill files and use explicit syntax to glean as much information as possible from the SQL statement itself. Some of the challenges with this approach are:
  - Gathering metadata: since we have no idea what the existing schema is, we would need to perform a "mini scan" to learn the schema at planning time in order to do any validation. The problem is deciding how many files to read in order to learn the schema. If we read only a sample and not all files share the same schema, we could get non-deterministic results depending on which files were sampled; reading all the files and merging their schemas, on the other hand, seems like an expensive cost to pay (a sketch of the sampling problem follows).
  - From the user's perspective, when inserting into a partitioned table the user has to specify the partitioning columns again in the Insert statement, despite having already specified them in the CTAS.
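To illustrate the sampling problem, a mini scan would look roughly like this. readSchema() is a stub standing in for reading a Parquet footer, and the open question is what a safe sampleSize would be:

    import java.util.LinkedHashSet;
    import java.util.List;
    import java.util.Set;

    class MiniScan {
      // Stub for illustration; in Drill this would read a Parquet footer.
      static List<String> readSchema(String file) {
        throw new UnsupportedOperationException("illustration only");
      }

      // Merge the schemas of the first sampleSize files. If files outside the
      // sample carry extra or conflicting columns, the merged schema (and any
      // validation built on it) depends on which files happened to be sampled.
      static Set<String> mergeSchemas(List<String> files, int sampleSize) {
        Set<String> merged = new LinkedHashSet<>();
        for (String file : files.subList(0, Math.min(sampleSize, files.size()))) {
          merged.addAll(readSchema(file));
        }
        return merged;
      }
    }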

2. What is a reasonable assumption for a Drill table in terms of changing schema? Is requiring the exact same schema for all files in a table too rigid an assumption at this point?

One more thing to consider with DotDrill files is their repercussions on DROP TABLE, SHOW TABLES, DESCRIBE TABLE, etc.; having the metadata available might make it easier to support these operations.

Thanks
Mehant
