One comment... I think this is an overreach:

"The union schema is saved in a .drill file"

There is no point in saving something derived in a .drill file.  Maybe a
cache file (extending the other work on metadata caching).  The idea for
.drill files is that they are supplementary to what can be acquired from
the data directly.

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Wed, Jul 29, 2015 at 6:58 PM, Parth Chandra <[email protected]> wrote:

> Notes based on discussion in the hangout, Jacques' notes, and later discussion:
>
>
>    1. For Parquet files
>       - Merge schema information into a union schema
>       - The union schema is saved in a .drill file
>       - The union schema can be cached
>       - The union schema should maintain the ordinal position of fields.
>    2. At the time the table is read, we can check the directory timestamp
>    and regenerate the merged schema.
>    3. Insert into does not need to update the merged schema. (Should it
>    check whether an incompatible schema is being written? An incompatible
>    schema is one that cannot be merged successfully.)
>       -   Suggestion 1 - Allow flexible insert. (Issue a warning if there
>       is going to be a schema change conflict.) Queries on columns with no
>       conflict can still run.
>       -   Suggestion 2 - Do not allow insert unless the schema can be
>       merged.
>    4. Rules for merging can be relaxed as we improve our ability to promote
>    from one type to another
>    5. The concept of a schema change is different from a merged schema.
>    Schema change is needed for reading schemaless sources like JSON.
>    6. If there is a different column order (or a difference in the number
>    of columns) across the table's files, an insert based on ordinal
>    positions cannot be allowed.
>    7. Suggestion - go with a name-based insert. Ordinal-based insert can
>    be enabled via a switch (?). See the sketch after this list.
>    8. Reuse partitioning information from 'create table' in 'insert into'.
>    No partitioning info is needed in the 'insert into' statement.
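>
>    A minimal sketch of the name-based insert in point 7 (table and column
>    names here are hypothetical, and the final syntax may differ):
>
>       -- Each select column is routed to the target column with the same
>       -- name, so the order of the select list does not matter:
>       INSERT INTO t1 SELECT b, a FROM t2;   -- t2.a -> t1.a, t2.b -> t1.b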
>
>
>
> On Mon, Jul 27, 2015 at 6:43 PM, Khurram Faraaz <[email protected]>
> wrote:
>
> > I have a question related to insert into partitioned tables. I see that
> > DB2, Oracle, and Postgres do not support a PARTITION BY clause in their
> > INSERT INTO implementation. MySQL does, however, support such a clause
> > in its INSERT INTO implementation. I do not know what the standard ANSI
> > SQL specification says about the INSERT INTO statement. We may want to
> > keep it compliant with standard SQL syntax.
> >
> > >> B ) Support insert into a partitioned table
> > >> ----------------------
> >
> > >> Ex: INSERT INTO T1 col1, col2, col3  partition by col1, col2 SELECT
> > >> col4, col5, col6 from T2
> >
> > >> * Target column specification is required when inserting data into an
> > >>   already partitioned table
> > >> * Requirements A.3-A.7 above apply for insert into partitioned tables
> > >>   as well
> > >> * A partition by clause along with one or more columns is required
> > >> * All the columns specified in partition by clause must exist in the
> > >>   target column list
> > >> * Partition by columns specified do not need to match the list of
> > >>   columns that the original table was partitioned with (i.e. if the
> > >>   original table is partitioned with col1, col2, new data during
> > >>   insert can be partitioned by col3 or just with col1 or col2...)
> >
> > On Mon, Jul 27, 2015 at 5:40 PM, Mehant Baid <[email protected]>
> > wrote:
> >
> > > I wanted to start a conversation around supporting the "Insert into
> > > Table" feature. For 1.2 we initially want to support inserting into a
> > > table with Parquet files. Support for JSON, CSV, and other sources will
> > > follow as future enhancements.
> > >
> > > Aman, Jinfeng, Neeraja, and I had an initial discussion about this, and
> > > Neeraja provided a good summary of it (pasted below), along with some of
> > > the requirements for this feature.
> > >
> > >  A ) Support Insert into a non-partitioned table
> > > ---------------------
> > >
> > > Ex: INSERT INTO T1 [col1, col2, col3]  SELECT col4, col5, col6 from T2
> > > (Source table: T2, Target table T1)
> > > Requirements:
> > >
> > > 1. The target table column list specification is optional in the Insert
> > >    statement
> > > 2. When specified, the column list in the Insert statement should
> > >    contain all the columns present in the target table (i.e., no support
> > >    for partial insert)
> > > 3. The column names specified for the source table do not need to match
> > >    the target table column names. Matching is performed by ordinal
> > >    position.
> > > 4. The number of source table columns specified must be the same as the
> > >    number of target table columns
> > > 5. The types of the specified source table columns must match the types
> > >    of the target table columns
> > > 6. Specifying * is not allowed in the Select syntax
> > > 7. The Select syntax can specify constant values for one or more columns
> > >    (see the example after this list)
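> > >
> > > A sketch of an insert satisfying these requirements (hypothetical
> > > tables; matching is by ordinal position, and a constant value per
> > > requirement 7 is shown for col3):
> > >
> > >    INSERT INTO T1 (col1, col2, col3)
> > >    SELECT col4, col5, 0 FROM T2;
> > >    -- col4 -> col1, col5 -> col2 by position; col3 gets the constant 0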
> > >
> > >
> > >  B ) Support insert into a partitioned table
> > > ----------------------
> > >
> > > Ex: INSERT INTO T1 col1, col2, col3  partition by col1, col2 SELECT
> > > col4, col5, col6 from T2
> > >
> > >  * Target column specification is required when inserting data into an
> > >    already partitioned table
> > >  * Requirements A.3-A.7 above apply for insert into partitioned tables
> > >    as well
> > >  * A partition by clause along with one or more columns is required
> > >  * All the columns specified in the partition by clause must exist in
> > >    the target column list
> > >  * Partition by columns specified do not need to match the list of
> > >    columns that the original table was partitioned with (i.e. if the
> > >    original table is partitioned with col1, col2, new data during
> > >    insert can be partitioned by col3 or just with col1 or col2; see
> > >    the example after this list)
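> > >
> > > A sketch of repartitioning on insert, per the last bullet (hypothetical
> > > tables; the exact placement of the partition by clause is still open):
> > >
> > >    -- T1 was originally partitioned by col1, col2; the new data is
> > >    -- partitioned by col3 instead.
> > >    INSERT INTO T1 col1, col2, col3 partition by col3
> > >    SELECT col4, col5, col6 FROM T2;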
> > >
> > >
> > > A couple of open questions from the design perspective are:
> > >
> > > 1. How do we perform validation? Validation of data types, the number of
> > > columns being inserted, etc. In addition to validation, we need to make
> > > sure that when we insert into an existing table we insert data with the
> > > existing column names (the select column list can have different names).
> > > This poses problems around needing to know the metadata at planning
> > > time. Two approaches that have been floating around are:
> > >         * DotDrill files: We can store metadata, partitioning columns,
> > > and other useful information here, and we can perform validation at
> > > planning time. However, the challenges with introducing DotDrill files
> > > include
> > >              - consistency between metadata and the actual data (nothing
> > > prevents users from copying files in directly)
> > >              - security around DotDrill files (can be dealt with in the
> > > same way we perform security checks for Drill tables in HDFS)
> > >              - an interface to change the DotDrill file, in case we need
> > > to add a column to the table, add a new partition, etc. (a hypothetical
> > > sketch of such an interface follows)
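> > >
> > > A hypothetical sketch only - none of this syntax exists in Drill today;
> > > it just illustrates what a DotDrill-editing interface could look like:
> > >
> > >    ALTER TABLE T1 ADD COLUMN col7 INT;    -- would rewrite the .drill file
> > >    ALTER TABLE T1 ADD PARTITION (col2);   -- would update partition info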
> > >
> > >         * Explicit syntax / no-metadata approach: Another approach is to
> > > avoid DotDrill files and use explicit syntax to glean as much information
> > > as possible from the SQL statement itself. Some of the challenges with
> > > this approach are
> > >              - Gathering metadata information: Since we have no idea what
> > > the existing schema is, we would need to perform a "mini scan" to learn
> > > the schema at planning time in order to perform some validation. The
> > > problem with this approach is: how do we determine how many files we
> > > need to read in order to learn the schema? If we use a sample set and
> > > not all the files have the same schema, we could have non-deterministic
> > > results based on the sample of files read. Also, reading all the files
> > > and merging their schemas seems like an expensive price to pay.
> > >              - From the user's perspective, while inserting into a
> > > partitioned table, the user will have to specify the partitioning
> > > columns again in the Insert statement, despite having specified them in
> > > the CTAS, as in the sketch below.
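> > >
> > > For example (hypothetical tables; the insert syntax is still a sketch),
> > > the partition columns would appear in both statements:
> > >
> > >    CREATE TABLE T1 PARTITION BY (col1) AS
> > >    SELECT a AS col1, b AS col2 FROM src;
> > >    INSERT INTO T1 col1, col2 partition by col1 SELECT a, b FROM src2;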
> > >
> > > 2. What is a reasonable assumption for a Drill table in terms of
> > > changing schema? Is having the exact same schema for all files in a
> > > table too rigid an assumption at this point?
> > >
> > > One thing to remember with the DotDrill file is also to consider the
> > > repercussions on Drop Table, Show Tables, Describe Table, etc.; i.e.,
> > > it might make it easier to support these operations.
> > >
> > > Thanks
> > > Mehant
> > >
> >
>
