[jira] [Commented] (CASSANDRA-9048) Delimited File Bulk Loader

Aleksey Yeschenko (JIRA) Thu, 26 Mar 2015 13:38:50 -0700

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-9048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14382617#comment-14382617
 ]


Aleksey Yeschenko commented on CASSANDRA-9048:
----------------------------------------------

We already have plans for a Spark-based, multiple-format data import/export 
tool. CSV files will be the first supported format, with other Cassandra tables 
supported too (see CASSANDRA-8234).

That tool, once done, will go in the tree, and supersede CQLSH's COPY, among 
other things.

> Delimited File Bulk Loader
> --------------------------
>
>                 Key: CASSANDRA-9048
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9048
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Tools
>            Reporter:  Brian Hess
>             Fix For: 3.0
>
>         Attachments: CASSANDRA-9048.patch
>
>
> There is a strong need for bulk loading data from delimited files into 
> Cassandra.  Starting with delimited files means that the data is not 
> currently in the SSTable format, and therefore cannot be loaded directly with 
> Cassandra's bulk loading tool, sstableloader.
> Data to be loaded is found in delimited files far more often than in the 
> SSTable format itself, so a tool that loads from delimited files is very 
> useful.
> In order for this bulk loader to be more generally useful to customers, it 
> should handle a number of options at a minimum:
> - support specifying the input file or reading the data from stdin (so other 
> command-line programs can pipe into the loader)
> - supply the CQL schema for the input data
> - support all data types other than collections (collections are a stretch 
> goal/need)
> - an option to specify the delimiter
> - an option to specify comma as the decimal delimiter (for international use 
> cases)
> - an option to specify how NULL values are specified in the file (e.g., the 
> empty string or the string NULL)
> - an option to specify how BOOLEAN values are specified in the file (e.g., 
> TRUE/FALSE or 0/1)
> - an option to specify the Date and Time format
> - an option to skip some number of rows at the beginning of the file
> - an option to only read in some number of rows from the file
> - an option to indicate how many parse errors to tolerate
> - an option to specify a file that will contain all the lines that did not 
> parse correctly (up to the maximum number of parse errors)
> - an option to specify the CQL port to connect to (with 9042 as the default).
> Additional options would be useful, but this set of options/features is a 
> start.
> A word on COPY.  COPY comes via CQLSH which requires the client to be the 
> same version as the server (e.g., 2.0 CQLSH does not work with 2.1 Cassandra, 
> etc).  This tool should be able to connect to any version of Cassandra 
> (within reason).  For example, it should be able to handle 2.0.x and 2.1.x.  
> Moreover, CQLSH's COPY command does not support a number of the options 
> above.  Lastly, the performance of COPY in 2.0.x is not high enough to be 
> considered a bulk ingest tool.
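
As an illustration of the parsing options listed in the description above, here is a minimal Python sketch. It is not taken from the attached patch or from any Cassandra tool: the function name parse_delimited, its parameters, and the simplified type names ("int", "decimal", "boolean", anything else treated as text) are all hypothetical. It only parses rows and does not connect to Cassandra.

```python
import csv
import io
from decimal import Decimal, InvalidOperation


def parse_delimited(stream, column_types, delimiter=",", null_value="",
                    true_value="TRUE", false_value="FALSE",
                    decimal_comma=False, skip_rows=0, max_rows=None,
                    max_errors=0):
    """Hypothetical helper: return (rows, bad_lines) parsed from `stream`.

    Raises RuntimeError once more than `max_errors` lines fail to parse.
    """
    def convert(text, type_name):
        if text == null_value:
            return None
        if type_name == "boolean":
            if text == true_value:
                return True
            if text == false_value:
                return False
            raise ValueError("bad boolean: %r" % text)
        if type_name == "int":
            return int(text)
        if type_name == "decimal":
            if decimal_comma:
                text = text.replace(",", ".")
            return Decimal(text)
        return text  # assumption: every other type is kept as text

    rows, bad_lines = [], []
    reader = csv.reader(stream, delimiter=delimiter)
    for index, fields in enumerate(reader):
        if index < skip_rows:
            continue  # skip leading rows such as a header
        if max_rows is not None and len(rows) + len(bad_lines) >= max_rows:
            break  # only read the requested number of rows
        try:
            if len(fields) != len(column_types):
                raise ValueError("wrong field count")
            rows.append([convert(f, t) for f, t in zip(fields, column_types)])
        except (ValueError, InvalidOperation):
            # Keep the rejected line so it could be written to a "bad file".
            bad_lines.append(delimiter.join(fields))
            if len(bad_lines) > max_errors:
                raise RuntimeError("too many parse errors")
    return rows, bad_lines


sample = "id;price;active\n1;3,50;TRUE\n2;NULL;FALSE\nx;1,0;TRUE\n"
rows, bad_lines = parse_delimited(
    io.StringIO(sample), ["int", "decimal", "boolean"], delimiter=";",
    null_value="NULL", decimal_comma=True, skip_rows=1, max_errors=1)
```

Passing sys.stdin instead of io.StringIO(sample) would cover the "pipe into the loader" case. Note that a comma decimal separator only makes sense when the field delimiter is something other than a comma, as in the sample.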



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
