[ https://issues.apache.org/jira/browse/IGNITE-6917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kirill Shirokov updated IGNITE-6917:
------------------------------------
    Description: 
Inspired by Postgres [1]

A common use case is bulk data load through the JDBC/ODBC interface. Currently it is 
only possible to execute single commands one by one. We can already batch them 
to improve performance, but there is still big room for improvement.

We should think of a completely new command - {{COPY}}. It will accept a file 
(or an input stream in the general case) on the client side, transfer the data to the 
cluster, and then execute the update inside the cluster, e.g. through a streamer.
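On the server side the incoming batches could be applied through a data streamer. Below is a minimal sketch assuming IgniteDataStreamer is used; the cache name, key/value types and row format are hypothetical:
{noformat}
// Minimal sketch: applying parsed CSV rows through IgniteDataStreamer.
// Cache name, key/value types and the row format are hypothetical.
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteDataStreamer;
import org.apache.ignite.Ignition;

public class CopyStreamerSketch {
    public static void main(String[] args) {
        Ignite ignite = Ignition.start();

        ignite.getOrCreateCache("PERSON_CACHE");

        try (IgniteDataStreamer<Long, String> streamer = ignite.dataStreamer("PERSON_CACHE")) {
            streamer.allowOverwrite(true); // roughly the REPLACE EXISTING ROWS behavior

            // In the real command these rows would arrive from the client in batches.
            String[] csvRows = {"1,John", "2,Jane"};

            for (String row : csvRows) {
                String[] fields = row.split(",");
                streamer.addData(Long.parseLong(fields[0]), fields[1]);
            }
        } // close() flushes the remaining buffered entries
    }
}
{noformat}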

First of all we need to create a quick and dirty prototype to assess the potential 
performance improvement. If the speedup is confirmed, we should build a base 
implementation which will accept only files. At the same time we should 
understand how it will evolve in the future: multiple file formats (probably 
including Hadoop formats, e.g. Parquet), escape characters, input streams, 
etc.

[1] [https://www.postgresql.org/docs/9.6/static/sql-copy.html]
h1. Proposed syntax

Current implementation:
{noformat}
COPY 
    FROM "file.name"
    INTO <schema>.<table>
    [COLUMNS (col-name, ...)]
    FORMAT <format-name>
{noformat}
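For example, a minimal load through the JDBC driver could look like the following sketch; the connection URL, file, table and column names are hypothetical:
{noformat}
// Minimal sketch: executing the COPY command through the thin JDBC driver.
// Connection URL, file path, table and column names are hypothetical.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CopyCommandSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:ignite:thin://127.0.0.1/");
             Statement stmt = conn.createStatement()) {
            stmt.executeUpdate(
                "COPY FROM \"people.csv\" " +
                "INTO PUBLIC.PERSON " +
                "COLUMNS (ID, NAME) " +
                "FORMAT CSV");
        }
    }
}
{noformat}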
We may want to gradually add features to this command in the future to have 
something like this:
{noformat}
COPY
    FROM 'file.name' [CHARSET <charset-name>]
    INTO <schema>.<table> [CREATE [IF NOT EXISTS]]
    [(REPLACE|IGNORE|ABORT ON [<max-error-number>]) EXISTING ROWS]
    [COLUMNS (col-name [<data-type>] [NULLABLE] [ESCAPES], ...) [MATCH HEADER]]
    FORMAT (csv|tsv)

-- CSV format options:
    [COLSEP='column-separator']
    [ROWSEP='row-separators']
    [QUOTE='quote-char']
    [ESCAPE='escape-char']
    [NULL='null-sequence']

    [ROWS <first>-<last>]
--or--
    [SKIP ROWS <num>] [MAX ROWS <num>]

    [COLS <first>-<last>]
--or--
    [SKIP COLS <num>] [MAX COLS <num>]

    [MATCH HEADER ROW]
    [BATCH (<num> ROWS | <num> (KB|MB|GB))]
    [LOCK (TABLE|ROWS)]
    [NOLOGGING] 
{noformat}
h1. Implementations from other vendors
h2. PostgreSQL
{noformat}
 
COPY table_name [ ( column_name [, ...] ) ]
    FROM { 'filename' | STDIN }
    [ [ WITH ]
          [ BINARY ]
          [ OIDS ]
          [ DELIMITER [ AS ] 'delimiter' ]
          [ NULL [ AS ] 'null string' ]
          [ CSV [ HEADER ]
                [ QUOTE [ AS ] 'quote' ]
                [ ESCAPE [ AS ] 'escape' ]
                [ FORCE NOT NULL column_name [, ...] ] ] ]
{noformat}
([https://www.postgresql.org/docs/9.2/static/sql-copy.html])
h3. Notes
* Server-side file import
* Client-side: only from STDIN
* Protocol implementation: via special command in the protocol
* Special bulk data loader is implemented as part of the JDBC driver package: org.postgresql.copy.CopyManager ([https://jdbc.postgresql.org/documentation/publicapi/org/postgresql/copy/CopyManager.html]); see the sketch after this list
 * Custom loaders available (e.g., 
[https://github.com/bytefish/PgBulkInsert.git])
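Below is a minimal client-side sketch using CopyManager; the connection URL, credentials, table and file names are hypothetical:
{noformat}
// Minimal sketch: client-side bulk load through PostgreSQL's CopyManager.
// Connection URL, credentials, table and file names are hypothetical.
import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;

import org.postgresql.copy.CopyManager;
import org.postgresql.core.BaseConnection;

public class PgCopySketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:postgresql://localhost/test", "user", "password");
             FileReader reader = new FileReader("people.csv")) {

            CopyManager copyMgr = new CopyManager((BaseConnection) conn);

            // Streams the local file to the server as if it were read from STDIN.
            long rows = copyMgr.copyIn("COPY person FROM STDIN WITH CSV", reader);

            System.out.println("Loaded " + rows + " rows");
        }
    }
}
{noformat}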
h2. MySQL
{noformat}
LOAD DATA [LOW_PRIORITY | CONCURRENT] [LOCAL] INFILE 'file_name'
    [REPLACE | IGNORE]
    INTO TABLE tbl_name
    [PARTITION (partition_name [, partition_name] ...)]
    [CHARACTER SET charset_name]
    [{FIELDS | COLUMNS}
        [TERMINATED BY 'string']
        [[OPTIONALLY] ENCLOSED BY 'char']
        [ESCAPED BY 'char']
    ]
    [LINES
        [STARTING BY 'string']
        [TERMINATED BY 'string']
    ]
    [IGNORE number {LINES | ROWS}]
    [(col_name_or_user_var
        [, col_name_or_user_var] ...)]
    [SET col_name={expr | DEFAULT},
        [, col_name={expr | DEFAULT}] ...]
{noformat}
([https://dev.mysql.com/doc/refman/5.7/en/load-data.html])
h3. Notes
* Both client- and server-side import
* Protocol implementation via a hack: if a result set is returned with column count == -1, the client reads the file name from the server and immediately sends the file contents (see the sketch after this list)
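From the JDBC client's perspective this exchange is transparent. A minimal sketch with a hypothetical connection URL, table and file follows; the Connector/J property allowLoadLocalInfile must be enabled for LOCAL loads:
{noformat}
// Minimal sketch: client-side load through MySQL's LOAD DATA LOCAL INFILE.
// Connection URL, credentials, table and file names are hypothetical.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class MySqlLoadDataSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:mysql://localhost/test?allowLoadLocalInfile=true", "user", "password");
             Statement stmt = conn.createStatement()) {

            // The driver performs the special protocol exchange and streams the file to the server.
            int rows = stmt.executeUpdate(
                "LOAD DATA LOCAL INFILE 'people.csv' " +
                "INTO TABLE person " +
                "FIELDS TERMINATED BY ',' " +
                "LINES TERMINATED BY '\\n'");

            System.out.println("Loaded " + rows + " rows");
        }
    }
}
{noformat}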
h2. Microsoft SQL Server
{noformat}
BULK INSERT
   [ database_name . [ schema_name ] . | schema_name . ] [ table_name | view_name ]
      FROM 'data_file'
     [ WITH
    (
   [ [ , ] BATCHSIZE = batch_size ]
   [ [ , ] CHECK_CONSTRAINTS ]
   [ [ , ] CODEPAGE = { 'ACP' | 'OEM' | 'RAW' | 'code_page' } ]
   [ [ , ] DATAFILETYPE =
      { 'char' | 'native'| 'widechar' | 'widenative' } ]
   [ [ , ] DATASOURCE = 'data_source_name' ]
   [ [ , ] ERRORFILE = 'file_name' ]
   [ [ , ] ERRORFILE_DATASOURCE = 'data_source_name' ]
   [ [ , ] FIRSTROW = first_row ]
   [ [ , ] FIRE_TRIGGERS ]
   [ [ , ] FORMATFILE_DATASOURCE = 'data_source_name' ]
   [ [ , ] KEEPIDENTITY ]
   [ [ , ] KEEPNULLS ]
   [ [ , ] KILOBYTES_PER_BATCH = kilobytes_per_batch ]
   [ [ , ] LASTROW = last_row ]
   [ [ , ] MAXERRORS = max_errors ]
   [ [ , ] ORDER ( { column [ ASC | DESC ] } [ ,...n ] ) ]
   [ [ , ] ROWS_PER_BATCH = rows_per_batch ]
   [ [ , ] ROWTERMINATOR = 'row_terminator' ]
   [ [ , ] TABLOCK ]

   -- input file format options
   [ [ , ] FORMAT = 'CSV' ]
   [ [ , ] FIELDQUOTE = 'quote_characters']
   [ [ , ] FORMATFILE = 'format_file_path' ]
   [ [ , ] FIELDTERMINATOR = 'field_terminator' ]
   [ [ , ] ROWTERMINATOR = 'row_terminator' ]
    )]
{noformat}
([https://docs.microsoft.com/en-us/sql/t-sql/statements/bulk-insert-transact-sql])
h3. Notes
* Server-side import
* CLI utility to import from client side
* Protocol implementation: special packet types for column definitions and rows
* Custom bulk data loader supplied in the JDBC driver package: 
com.microsoft.sqlserver.jdbc.SQLServerBulkCopy (see the sketch after this list).
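A minimal client-side sketch using SQLServerBulkCopy; the connection URL, credentials, table, file name and column metadata are hypothetical:
{noformat}
// Minimal sketch: client-side bulk load through the Microsoft JDBC driver.
// Connection URL, credentials, table, file name and column metadata are hypothetical.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Types;

import com.microsoft.sqlserver.jdbc.SQLServerBulkCSVFileRecord;
import com.microsoft.sqlserver.jdbc.SQLServerBulkCopy;

public class SqlServerBulkCopySketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:sqlserver://localhost;databaseName=test", "sa", "password")) {

            // Describe the CSV file: path, encoding, delimiter, header row present.
            SQLServerBulkCSVFileRecord fileRecord =
                new SQLServerBulkCSVFileRecord("people.csv", "UTF-8", ",", true);
            fileRecord.addColumnMetadata(1, "id", Types.INTEGER, 0, 0);
            fileRecord.addColumnMetadata(2, "name", Types.NVARCHAR, 50, 0);

            try (SQLServerBulkCopy bulkCopy = new SQLServerBulkCopy(conn)) {
                bulkCopy.setDestinationTableName("dbo.person");
                bulkCopy.writeToServer(fileRecord);
            }
        }
    }
}
{noformat}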



> SQL: implement COPY command for efficient data loading
> ------------------------------------------------------
>
>                 Key: IGNITE-6917
>                 URL: https://issues.apache.org/jira/browse/IGNITE-6917
>             Project: Ignite
>          Issue Type: New Feature
>          Components: sql
>            Reporter: Vladimir Ozerov
>            Assignee: Kirill Shirokov
>            Priority: Major
>              Labels: iep-1
>


