[ 
https://issues.apache.org/jira/browse/IGNITE-6917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16348588#comment-16348588
 ] 

Kirill Shirokov commented on IGNITE-6917:
-----------------------------------------

> 8. BulkLoadParameters - do we really need this class? It just has two fields. 
> For now I see no point in it. Do you expect many more new fields to come?

Yes, we expect more fields to come.

> 9. BulkLoadParser:
> 9.1 Unused private field.

FIXED. Thanks!

> 9.2 Too long line in processBatch declaration.

FIXED.

> 9.3 It seems plain wrong that arguments of processBatch are related with 
> JDBC. JDBC is a connectivity interface, nothing else, and these references 
> are in deeper "guts". We need to decouple these somehow - one day we might 
> work to support this feature in ODBC, or somewhere else, and then it'd be 
> inevitable. I'd just introduce an enum for various command types.

TODO: Yes, I agree.
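
To make the direction concrete, something along these lines is what I have in 
mind (the enum name and values below are illustrative only, not taken from the 
patch):

{noformat}
/** Illustrative sketch: parser result codes decoupled from JDBC constants. */
public enum BulkLoadCommandType {
    /** More batches are expected from the client. */
    CONTINUE,

    /** The last batch has been received (EOF on the client side). */
    FINISHED_EOF,

    /** The client requested to abort the load. */
    ABORTED
}
{noformat}

The JDBC layer would then map these values to its own protocol constants, so the 
parser internals never reference JDBC classes.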

> 10. BulkLoadFormat:
> 10.1 Unused import

FIXED.

> 10.2 Odd empty lines in the beginning of class, in switch, etc

FIXED.

> 11. BulkLoadParser and BulkLoadFormat seem to be related 1:1 - probably they 
> should be single class and implementations could be merged accordingly too. 
> Can you come up with a case where we would really need to separate this 
> entities? Say, one parser for few formats, or one format yielding more than 
> one type of parser?

Bottom line:
Discussed and fixed.

(Yes, it's 1:1. But this way we would have to create and configure the CSV parser 
in the SQL parsing code -- which is controversial. We may not have enough context 
to create such objects in the SQL parsing code. I'm a little bit afraid of 
creating such an object in the parsing code and having it in an inconsistent 
state until it is fully initialized (configured) in the statement processing 
code, where we have enough context available.)

So I'm trying to keep a separation between 
a. SQL command syntax processing (SqlBulkLoadCommand), which is not responsible 
for semantic actions and doesn't have the full context, and 
b. Semantic processing, which creates the appropriate context and processing 
objects (in DmlStatementsProcessor).
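
A rough sketch of the split I mean, with simplified fields and signatures (the 
class names below only illustrate the two levels; they are not the actual patch):

{noformat}
import java.util.List;

/** Sketch: syntax-level object produced by the SQL grammar. Holds only
 *  what the parser can see; no caches, parsers or streamers here. */
class SqlBulkLoadCommandSketch {
    String localFileName;    // FROM "file.name"
    String schemaName;       // INTO <schema>.<table>
    String tableName;
    List<String> columns;    // optional column list
    String format;           // FORMAT <format-name>, e.g. "CSV"
    Integer batchSize;       // optional BATCH_SIZE
}

/** Sketch: semantic level (DmlStatementsProcessor), where the full
 *  context is available and consistent objects can be built at once. */
class BulkLoadSemanticsSketch {
    /** Creates a fully configured processing pipeline from the bare command. */
    Object createProcessor(SqlBulkLoadCommandSketch cmd) {
        // Here we can resolve the target table, pick a parser for cmd.format,
        // build the row converter and the cache writer -- none of which is
        // possible (or desirable) inside the SQL grammar code.
        return new Object(); // placeholder; the real return type is richer
    }
}
{noformat}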

> 12. BulkLoadCsvParser:
> 12.1 processBatch method looks like it could be actually defined in parent 
> class, and the logic to be changed in subclasses actually lives in 
> parseBatch. This is so because command types like CMD_FINISHED_EOF are 
> defined on higher level. Can we rework hierarchy accordingly?

Basically, yes: if we have another parser in the future, this code would most 
probably be moved to the superclass (although this is not clear for now). But 
OK, let's move it.
FIXED.
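
For the record, the reworked hierarchy looks roughly like this (signatures are 
simplified; the batch payload and return types in the patch differ):

{noformat}
import java.util.Collections;
import java.util.List;

/** Sketch: common batch handling lives in the parent class. */
abstract class BulkLoadParserSketch {
    /** Uniform handling of "last batch"/EOF-style command types. */
    public final List<List<Object>> processBatch(byte[] data, boolean isLastBatch) {
        List<List<Object>> rows = parseBatch(data, isLastBatch);

        if (isLastBatch) {
            // Flush any state kept between batches, release resources, etc.
        }

        return rows;
    }

    /** Format-specific parsing implemented by subclasses. */
    protected abstract List<List<Object>> parseBatch(byte[] data, boolean isLastBatch);
}

/** Sketch: the CSV subclass only overrides parseBatch(). */
class BulkLoadCsvParserSketch extends BulkLoadParserSketch {
    @Override protected List<List<Object>> parseBatch(byte[] data, boolean isLastBatch) {
        return Collections.emptyList(); // RFC 4180 parsing goes here
    }
}
{noformat}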

> 12.2 Unused parameter in ctor ( params ), do we have tests for params at all?

FIXED (already).

> 12.3 Odd empty lines after class header, method declarations

FIXED.

> 13. BulkLoadEntryConverter: do we really expect more than one implementation 
> of this interface? If no, let's just replace it with IgniteClosureX or 
> something like that. We have more than enough interfaces and abstract classes 
> to represent lambdas. Before refactoring this, please look at pt.10.3 as well

Most probably, we don't. Refactored to IgniteClosureX.
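
In other words, the converter is now just a closure from a parsed row to a 
key-value pair. A minimal sketch (using the public IgniteClosure for brevity, 
while the patch uses the internal checked-exception flavor, IgniteClosureX; the 
key/value mapping shown is a placeholder and is really table- and 
column-dependent):

{noformat}
import java.util.List;
import org.apache.ignite.lang.IgniteBiTuple;
import org.apache.ignite.lang.IgniteClosure;

/** Sketch: row-to-entry conversion expressed as a plain closure. */
class RowToEntrySketch implements IgniteClosure<List<?>, IgniteBiTuple<Object, Object>> {
    @Override public IgniteBiTuple<Object, Object> apply(List<?> row) {
        // Assume column 0 is the key and column 1 is the value; the actual
        // mapping is derived from the table metadata and the column list.
        Object key = row.get(0);
        Object val = row.size() > 1 ? row.get(1) : null;

        return new IgniteBiTuple<>(key, val);
    }
}
{noformat}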

> 14. BulkLoadCsvFormat: amount of useful code in this class approaches zero, 
> and this further assures that parser and its format must be one.

TO DISCUSS

> 15. BulkLoadContext: too long comment line.

FIXED.

> 16. BulkLoadCacheWriter: do we expect more than one implementation of this 
> interface? If no, just remove this class and let BulkLoadContext know 
> about streamer. If yes, make it inherit one of Ignite's provided 
> classes/interfaces for lambdas. Before refactoring this, please look at 
> pt.10.3 as well

Yes, we expect multiple backends (IGNITE-7541) -- e.g., cache.putAll(...) vs. the 
streamer, since we can handle duplicates via putAll().

Please note the close() method, which can be useful in certain scenarios. 
Also, we may want the cache writer to gather statistics (how many rows were 
added, replaced, etc.).
So I would keep this as a separate interface.
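
The shape I have in mind, to show why close() and statistics justify a dedicated 
interface (the method names below are illustrative, and the streamer-backed 
variant is just one possible backend):

{noformat}
import org.apache.ignite.IgniteDataStreamer;
import org.apache.ignite.lang.IgniteBiTuple;

/** Sketch: the writer abstraction -- backends differ, the interface stays. */
interface BulkLoadCacheWriterSketch extends AutoCloseable {
    /** Writes one converted key-value pair. */
    void apply(IgniteBiTuple<Object, Object> entry);

    /** Number of entries written so far (statistics hook). */
    long updateCnt();
}

/** Sketch: streamer-backed implementation (the default backend). */
class StreamerWriterSketch implements BulkLoadCacheWriterSketch {
    private final IgniteDataStreamer<Object, Object> streamer;

    private long cnt;

    StreamerWriterSketch(IgniteDataStreamer<Object, Object> streamer) {
        this.streamer = streamer;
    }

    @Override public void apply(IgniteBiTuple<Object, Object> entry) {
        streamer.addData(entry.getKey(), entry.getValue());

        cnt++;
    }

    @Override public long updateCnt() {
        return cnt;
    }

    /** Flushes and releases the underlying streamer when the load finishes. */
    @Override public void close() {
        streamer.close();
    }
}
{noformat}

A putAll(...)-based writer (IGNITE-7541) would implement the same interface, 
batching entries into a map and flushing it in close().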

> 17. JdbcRequestHandler:
> 17.1 Unused import

FIXED.

> 17.2 processBulkLoadFileBatch: all other methods just catch Exception, please 
> do the same here.

FIXED, but:
Does this anti-pattern make sense here? Or are we just keeping similarity with 
the existing code and bugward compatibility?

> 17.3 processBulkLoadFileBatch: it looks like doing actual job (data 
> conversion, feeding to streamer) is just plain wrong. Not only it's logic 
> that could be reused when we'll implement same feature for other interface 
> like JDBC, it's also not the area of responsibility of JDBC request handler. 
> This again shows the need of refactoring of BulkLoadContext: probably we need 
> to introduce a concrete class that will a) wrap streamer and b) do 
> conversion. It will just accept List<Object> and do with it whatever it 
> wants. JDBC must not know about this at all, it must just detect correct 
> context and feed lists of objects to that new class I've mentioned above.

FIXED.
I've refactored JdbcBulkLoadContext into BulkLoadProcessor, which keeps the 
context plus methods to process the incoming batches.
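
Roughly, the resulting class ties the pieces together so that the JDBC request 
handler only hands raw batches over. The sketch below reuses the illustrative 
types from the earlier sketches; the signatures in the patch differ:

{noformat}
import java.util.List;
import org.apache.ignite.lang.IgniteBiTuple;
import org.apache.ignite.lang.IgniteClosure;

/** Sketch: parsing + conversion + writing behind one entry point. */
class BulkLoadProcessorSketch implements AutoCloseable {
    private final BulkLoadParserSketch parser;

    private final IgniteClosure<List<?>, IgniteBiTuple<Object, Object>> converter;

    private final BulkLoadCacheWriterSketch writer;

    BulkLoadProcessorSketch(BulkLoadParserSketch parser,
        IgniteClosure<List<?>, IgniteBiTuple<Object, Object>> converter,
        BulkLoadCacheWriterSketch writer) {
        this.parser = parser;
        this.converter = converter;
        this.writer = writer;
    }

    /** Called once per incoming batch; the protocol layer knows nothing else. */
    public void processBatch(byte[] data, boolean isLastBatch) {
        for (List<Object> row : parser.processBatch(data, isLastBatch))
            writer.apply(converter.apply(row));
    }

    /** Flushes the writer when the load is finished or aborted. */
    @Override public void close() throws Exception {
        writer.close();
    }
}
{noformat}

The same class could later be driven by an ODBC or thin-client front end, which 
is exactly the decoupling asked for in pt. 9.3.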

> 17.4 executeQuery: I'd move whole stuff of if (fieldsCur instanceof 
> BulkLoadContextCursor) above, before if (results.size() == 1) - it would 
> decrease scope depth, make code better readable and decrease overall size of 
> diff (never hurts).

FIXED.

> 17.5 executeQuery: filesToSendResult - doesn't have anything at all to do 
> with files.

FIXED.
Renamed to bulkLoadCmdResult.

> 18. JdbcBulkLoadContext: unused field cmdParsingResult (getter has 0 usages), 
> not to mention that field name is very misleading - what does JDBC have to do 
> with parsing?

FIXED (removed)

> 19. JdbcBulkLoadBatchRequestResult: quite weird name, I don't think we should 
> mix words "request" and "result", we never do so in JDBC. Not to mention that 
> we never send JdbcBulkLoadBatchRequest before we get 
> JdbcBulkLoadBatchRequestResult  I'd just rename this to JdbcBulkLoadAckResult 
> as it's sent as an acknowledgement of COPY command and never else.

I agree. I just couldn't come up with a better name. Need to think again.

> SQL: implement COPY command for efficient data loading
> ------------------------------------------------------
>
>                 Key: IGNITE-6917
>                 URL: https://issues.apache.org/jira/browse/IGNITE-6917
>             Project: Ignite
>          Issue Type: New Feature
>          Components: sql
>            Reporter: Vladimir Ozerov
>            Assignee: Kirill Shirokov
>            Priority: Major
>              Labels: iep-1
>
> Inspired by Postgres [1]
> Common use case - bulk data load through the JDBC/ODBC interface. Currently it is 
> only possible to execute single commands one by one. We can already batch 
> them to improve performance, but there is still big room for improvement.
> We should think of a completely new command - {{COPY}}. It will accept a file 
> (or an input stream in the general case) on the client side, then transfer data to 
> the cluster, and then execute the update inside the cluster, e.g. through the 
> streamer.
> First of all, we need to create a quick and dirty prototype to assess the potential 
> performance improvement. If the speedup is confirmed, we should build a base 
> implementation which will accept only files. But at the same time we should 
> understand how it will evolve in the future: multiple file formats (probably 
> including Hadoop formats, e.g. Parquet), escape characters, input streams, 
> etc.
> [1] [https://www.postgresql.org/docs/9.6/static/sql-copy.html]
> h1. Proposed syntax
> Current implementation:
> {noformat}
> COPY 
>     FROM "file.name"
>     INTO <schema>.<table>
>     [(col-name, ...)]
>     FORMAT <format-name>     -- Only CSV format is supported in the current 
> release
>     [BATCH_SIZE <batch-size-in-bytes>]
> {noformat}
> We may want to gradually add features to this command in the future to have 
> something like this:
> {noformat}
> COPY
>     FROM "file.name"[CHARSET "<charset-name>"]
>     INTO <schema>.<table> [CREATE [IF NOT EXISTS]]
>     [(col-name [<data-type>] [NULLABLE] [ESCAPES], ...) [MATCH HEADER]]
>     FORMAT (csv|tsv|...)
> -- CSV format options:
>     [FIELDSEP='column-separators-regexp']
>     [LINESEP='row-separators-regexp']
>     [QUOTE='quote-chars']
>     [ESCAPE='escape-char']
>     [NULL='null-sequence']
>     [COMMENT='single-line-comment-start-char']
>     [TRIM_LINES]
>     [IMPORT_EMPTY_LINES]
>     [CHARSET "<charset-name>"]
>     [ROWS <first>-<last>]
> --or--
>     [SKIP ROWS <num>] [MAX ROWS <num>]
>     [COLS <first>-<last>]
> --or--
>     [SKIP COLS <num>] [MAX COLS <num>]
>     [(MATCH | SKIP) HEADER]
>     [(REPLACE|IGNORE|ABORT ON [<max-error-number>]) DUPLICATE KEYS]
>     [BATCH SIZE (<num> ROWS | <num>[K|M|G|T|P])]
>     [COMPRESS "codec-name" [codec options]]
>     [LOCK (TABLE|ROWS)]
>     [NOLOGGING]
>     [BACKEND (DIRECT | STREAMER)]
> {noformat}
> h1. Implementation decisions and notes
> h2. Parsing
> * We support the CSV format described in RFC 4180.
> * Custom row and column separators and quoting characters are currently hardcoded
> * Escape sequences and line comment characters are currently not supported
> * We may want to support fixed-length formats (via format descriptors) in the 
> future
> * We may want to strip comments from lines (for example, starting with '#')
> * We may want to allow the user to either ignore empty lines or treat them as a 
> special case of a record having all default values
> * We may allow the user to enable whitespace trimming from the beginning and end 
> of a line
> * We may want to allow the user to specify an error handling strategy: e.g., when 
> only one quote character is present or an escape sequence is invalid.
> h2. File handling
> * File character set to be supported in the future
> * Skipped/imported row number (or first/last line or skip header option), 
> skipped/imported column number (or first/last column): to be supported in the 
> future
> * Line start pattern (as in MySQL): no support planned
> * We currently support only client-side import. No server-side file import.
> * We may want to support client-side stdin import in the future.
> * We do not handle importing multiple files from a single command
> * We don't benefit from any kind of pre-sorting or pre-partitioning of data on 
> the client side.
> * We don't include any metadata, such as the line number, from the client side.
> h3. Transferring data
> * We send file data in batches. In the future we will support a configurable 
> batch size (specified as rows per batch or data block size per batch).
> * We may want to implement data compression in the future.
> * We connect to a single node in the JDBC driver (no multi-node connections).
> h3. Cache/tables/column handling
> * We don't create the table in the bulk load command
> * We may want to have an option for reading a header row, which contains 
> column names, to match columns
> * In the future we may wish to support the COLUMNS (col1, _, col2, _, col3) syntax, 
> where the '_' marker means a skipped column (MySQL uses '@dummy' for this)
> h3. Data types
> * Data types are converted as if they were supplied to the INSERT SQL command.
> * We may want type conversion (automatic, custom via a SQL function, custom 
> via Java code, string auto-trimming) in the future.
> * We will support an optional null sequence ("\N") later
> * We may want to allow the user to specify what to do if the same record exists 
> (e.g., ignore the record, replace it, or report an error with a max error counter 
> before failing the command)
> * We don't currently support any generated/autoincremented row IDs or any 
> custom generators.
> * We don't support any filtering/conditional expressions
> * We don't support collecting rows with conversion errors into separate 
> files/recordsets/tables during import.
> h3. Backend / Transactional / MVCC / other
> * We may want an option to select how we insert the data into the cache: using 
> cache.putAll(...), for example, or via the data streamer interface (see the 
> BACKEND option)
> * We don't use transactions
> * We don't create locks on rows or tables.
> * We don't try to minimize any indexing overhead (it's up to the user)
> * We may want to minimize WAL impact in the future via the NOLOGGING option.
> h3. Miscellanea
> * We don't supply a utility to load data
> * We don't currently supply any Java loaders (as in PG and MSSQL) that stream 
> data (not necessarily from a file)
> * Security-related questions are out of the scope of this JIRA
> * We don't have triggers and constraints in Apache Ignite
> h1. Implementations from other vendors
> h2. PostgreSQL
> {noformat}
> COPY table_name [ ( column_name [, ...] ) ]
>     FROM { 'filename' | STDIN }
>     [ [ WITH ]
>           [ BINARY ]
>           [ OIDS ]
>           [ DELIMITER [ AS ] 'delimiter' ]
>           [ NULL [ AS ] 'null string' ]
>           [ CSV [ HEADER ]
>                 [ QUOTE [ AS ] 'quote' ]
>                 [ ESCAPE [ AS ] 'escape' ]
>                 [ FORCE NOT NULL column_name [, ...] ] ] ]
> {noformat}
> ([https://www.postgresql.org/docs/9.2/static/sql-copy.html])
> h3. Notes
>  * Server-side file import
>  * Client-side: only from STDIN
>  * Protocol implementation: via special command in the protocol
>  * Special bulk data loaders are implemented as part of the JDBC driver package: 
> org.postgresql.copy.CopyManager
>  
> ([https://jdbc.postgresql.org/documentation/publicapi/org/postgresql/copy/CopyManager.html])
>  * Custom loaders available (e.g., 
> [https://github.com/bytefish/PgBulkInsert.git])
> h2. MySQL
> {noformat}
> LOAD DATA [LOW_PRIORITY | CONCURRENT] [LOCAL] INFILE 'file_name'
>     [REPLACE | IGNORE]
>     INTO TABLE tbl_name
>     [PARTITION (partition_name [, partition_name] ...)]
>     [CHARACTER SET charset_name]
>     [{FIELDS | COLUMNS}
>         [TERMINATED BY 'string']
>         [[OPTIONALLY] ENCLOSED BY 'char']
>         [ESCAPED BY 'char']
>     ]
>     [LINES
>         [STARTING BY 'string']
>         [TERMINATED BY 'string']
>     ]
>     [IGNORE number {LINES | ROWS}]
>     [(col_name_or_user_var
>         [, col_name_or_user_var] ...)]
>     [SET col_name={expr | DEFAULT},
>         [, col_name={expr | DEFAULT}] ...]
> {noformat}
> ([https://dev.mysql.com/doc/refman/5.7/en/load-data.html])
> h3. Notes
>  * Both client- and server-side import
>  * Protocol implementation via a hack: if a result set is returned with column 
> count == -1, the client reads the file name from the server and sends the file 
> contents immediately.
> h2. Microsoft SQL Server
> {noformat}
> BULK INSERT
>    [ database_name . [ schema_name ] . | schema_name . ] [ table_name | 
> view_name ]
>       FROM 'data_file'
>      [ WITH
>     (
>    [ [ , ] BATCHSIZE = batch_size ]
>    [ [ , ] CHECK_CONSTRAINTS ]
>    [ [ , ] CODEPAGE = { 'ACP' | 'OEM' | 'RAW' | 'code_page' } ]
>    [ [ , ] DATAFILETYPE =
>       { 'char' | 'native'| 'widechar' | 'widenative' } ]
>    [ [ , ] DATASOURCE = 'data_source_name' ]
>    [ [ , ] ERRORFILE = 'file_name' ]
>    [ [ , ] ERRORFILE_DATASOURCE = 'data_source_name' ]
>    [ [ , ] FIRSTROW = first_row ]
>    [ [ , ] FIRE_TRIGGERS ]
>    [ [ , ] FORMATFILE_DATASOURCE = 'data_source_name' ]
>    [ [ , ] KEEPIDENTITY ]
>    [ [ , ] KEEPNULLS ]
>    [ [ , ] KILOBYTES_PER_BATCH = kilobytes_per_batch ]
>    [ [ , ] LASTROW = last_row ]
>    [ [ , ] MAXERRORS = max_errors ]
>    [ [ , ] ORDER ( { column [ ASC | DESC ] } [ ,...n ] ) ]
>    [ [ , ] ROWS_PER_BATCH = rows_per_batch ]
>    [ [ , ] ROWTERMINATOR = 'row_terminator' ]
>    [ [ , ] TABLOCK ]
>    -- input file format options
>    [ [ , ] FORMAT = 'CSV' ]
>    [ [ , ] FIELDQUOTE = 'quote_characters']
>    [ [ , ] FORMATFILE = 'format_file_path' ]
>    [ [ , ] FIELDTERMINATOR = 'field_terminator' ]
>    [ [ , ] ROWTERMINATOR = 'row_terminator' ]
>     )]
> {noformat}
> ([https://docs.microsoft.com/en-us/sql/t-sql/statements/bulk-insert-transact-sql])
> h3. Notes
>  * Server-side import
>  * CLI utility to import from client side
>  * Protocol implementation: special packet types for column definitions and rows
>  * Custom bulk data loader supplied in the JDBC driver package: 
> com.microsoft.sqlserver.jdbc.SqlServerBulkCopy.
> h2. Oracle
> There is no bulk load SQL command. Bulk loading external data can be achieved 
> via:
>  *  Oracle External Tables
> ([https://docs.oracle.com/en/database/oracle/oracle-database/12.2/sutil/oracle-external-tables-concepts.html])
>  * SQL*Loader
> ([https://docs.oracle.com/database/121/SUTIL/GUID-8D037494-07FA-4226-B507-E1B2ED10C144.htm#SUTIL3311]).
>  * There is a separate utility for the Oracle TimesTen in-memory database:
>   [https://docs.oracle.com/database/121/TTREF/util.htm#TTREF324]
> h2. Vertica DB
> {noformat}
> COPY [ [db-name.]schema-name.]target-table 
> ... [ ( { column-as-expression | column }
> ...... [ DELIMITER [ AS ] 'char' ] 
> ...... [ ENCLOSED [ BY ] 'char' ]
> ...... [ ENFORCELENGTH ]
> ...... [ ESCAPE [ AS ] 'char' | NO ESCAPE ]
> ...... [ FILLER datatype]
> ...... [ FORMAT 'format' ] 
> ...... [ NULL [ AS ] 'string' ]
> ...... [ TRIM 'byte' ]
> ... [, ... ] ) ]
> ... [ COLUMN OPTION ( column 
> ...... [ DELIMITER [ AS ] 'char' ]
> ...... [ ENCLOSED [ BY ] 'char' ]
> ...... [ ENFORCELENGTH ]
> ...... [ ESCAPE [ AS ] 'char' | NO ESCAPE ]
> ...... [ FORMAT 'format' ]
> ...... [ NULL [ AS ] 'string' ]
> ...... [ TRIM 'byte' ]
> ... [, ... ] ) ]
> [ FROM { 
> ...STDIN 
> ...... [ BZIP | GZIP | LZO | UNCOMPRESSED ] 
> ...| 'pathToData' [ ON nodename | ON ANY NODE ] 
> ...... [ BZIP | GZIP | LZO | UNCOMPRESSED ] [, ...] 
> ...| LOCAL {STDIN | 'pathToData'} 
> ...... [ BZIP | GZIP | LZO | UNCOMPRESSED ] [, ...]
> ...| VERTICA source_database.[source_schema.]source_table[(source_column 
> [,...]) ]
> } ]
> ...[ NATIVE 
> .....| NATIVE VARCHAR 
> .....| FIXEDWIDTH COLSIZES (integer [,...]) 
> .....| ORC 
> .....| PARQUET 
> ...]
> ...[ WITH ]
> ......[ SOURCE source([arg=value [,...] ]) ]
> ......[ FILTER filter([arg=value [,...] ]) ] 
> ......[ PARSER parser([arg=value [,...] ]) ]
> ...[ DELIMITER [ AS ] 'char' ]
> ...[ TRAILING NULLCOLS ]
> ...[ NULL [ AS ] 'string' ]
> ...[ ESCAPE [ AS ] 'char' | NO ESCAPE ]
> ...[ ENCLOSED [ BY ] 'char' ]
> ...[ RECORD TERMINATOR 'string' ]
> ...[ SKIP records ]
> ...[ SKIP BYTES integer ]
> ...[ TRIM 'byte' ]
> ...[ REJECTMAX integer ]
> ...[ REJECTED DATA {'path' [ ON nodename ]  [, ...] | AS TABLE reject-table} ]
> ...[ EXCEPTIONS 'path' [ ON nodename ] [, ...] ]
> ...[ ENFORCELENGTH ]
> ...[ ERROR TOLERANCE ]
> ...[ ABORT ON ERROR ]
> ...[ [ STORAGE ] load-method ]
> ...[ STREAM NAME  'streamName']
> ...[ NO COMMIT ]
> {noformat}
> ([https://my.vertica.com/docs/7.2.x/HTML/index.htm#Authoring/SQLReferenceManual/Statements/COPY/COPY.htm])
> h2. Various solutions from vendors not mentioned above
> * Apache Hive:
>   
> [https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Loadingfilesintotables]
> * Apache HBase:
>   [http://dwgeek.com/apache-hbase-bulk-load-csv-examples.html]
> * SAP IQ:
>   
> [https://help.sap.com/viewer/a898a5b484f21015a377cd1ccb6ee9b5/16.0.11.12/en-US/a6209de484f21015bcb2d858c21ab35e.html]
> * VoltDB:
>   [https://docs.voltdb.com/UsingVoltDB/clicsvloader.php]
> * MemSQL:
>   [https://docs.memsql.com/sql-reference/v6.0/load-data/]
>   (creating pipelines and connecting them to the LOAD DATA statement is also a 
> notable feature)
> * Sybase:
>   
> [http://infocenter.sybase.com/help/index.jsp?topic=/com.sybase.help.sqlanywhere.12.0.1/dbreference/input-statement.html]
> * IBM DB2:
>   
> [https://www.ibm.com/support/knowledgecenter/en/SSEPGG_9.5.0/com.ibm.db2.luw.admin.cmd.doc/doc/r0008305.html]
> * IBM Informix:
>   
> [https://www.ibm.com/support/knowledgecenter/en/SSGU8G_12.1.0/com.ibm.sqls.doc/ids_sqs_0884.htm]
> * Apache Derby (AKA Java DB, Apache DB):
>   [https://db.apache.org/derby/docs/10.7/tools/ttoolsimporting.html]
> * Google Cloud Spanner: 
>   [https://cloud.google.com/spanner/docs/bulk-loading]


