[ 
https://issues.apache.org/jira/browse/CASSANDRA-17831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17583673#comment-17583673
 ] 

Brad Schoening commented on CASSANDRA-17831:
--------------------------------------------

[~aratnofsky] this Jira is aiming to address a problem described in the 
DataStax documentation on CQLSH 'COPY TO': 

     "{_}Note: Only use COPY FROM to import datasets that have less than 2 
million rows{_}."

This leaves users in a quandary: use COPY FROM anyway on larger data sets, or 
try to figure out how to install and configure a special-purpose, one-off tool. 
 This feature would support users, not admins.  Users don't have access to 
SSTables; they have authenticated, authorized access with roles and 
permissions over port 9042.  sstabledump doesn't really work for users in a 
DBaaS environment and, if you were to export SSTables, with RF=3 you would 
have 3X the data volume.

Adding a big-data-friendly, compact binary export format to CQLSH would be 
both space and time efficient.  It would also be portable to other platforms, 
not a proprietary format.  Either Avro or Parquet would be a good choice, but 
it could be something else. 

This article [CSV Files for Storage? No Thanks. There’s a Better Option | by 
Dario Radečić | Towards Data 
Science|https://towardsdatascience.com/csv-files-for-storage-no-thanks-theres-a-better-option-72c78a414d1d]
 found that 1TB of data in CSV files shrank to 130GB in Parquet and was 33X 
faster to read.

Vertica, for example, has an export to parquet in its SQL syntax:  

{{     EXPORT TO PARQUET ( directory = 'path') AS SELECT query‑expression;}}

          See: [EXPORT TO PARQUET 
(vertica.com)|https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Statements/EXPORTTOPARQUET.htm]

Since CQLSH already has COPY FROM/TO with one format and Python offers great 
data wrangling libraries, implementation is mostly straightforward.  Yes, there 
are third-party tools which can be set up and configured, and which will be 
faster on very large data sets (>1TB), but it would be great to have a 
first-class, easy-to-use, efficient option in CQLSH for exporting moderately 
sized data (<1TB).  
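
To illustrate why this looks straightforward, here is a minimal sketch of a writer that dispatches on a format option. It is not cqlsh's actual code: the function names ({{export_rows}}, {{write_csv}}, {{write_parquet}}) and the {{format}} option key are hypothetical, and the Parquet path assumes the third-party pyarrow library is installed.

```python
import csv


def write_csv(rows, columns, path):
    """Write rows (an iterable of dicts) to a CSV file with a header line."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=columns)
        writer.writeheader()
        writer.writerows(rows)


def write_parquet(rows, columns, path):
    """Write rows (a list of dicts) to a Parquet file."""
    # Imported lazily so the CSV path keeps working without pyarrow installed.
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Parquet is columnar, so pivot the row dicts into one list per column.
    table = pa.table({name: [row[name] for row in rows] for name in columns})
    pq.write_table(table, path)


def export_rows(rows, columns, path, options=None):
    """Pick a writer from a hypothetical 'format' option; CSV is the default."""
    fmt = (options or {}).get("format", "csv").lower()
    if fmt == "csv":
        write_csv(rows, columns, path)
    elif fmt == "parquet":
        write_parquet(list(rows), columns, path)
    else:
        raise ValueError(f"unsupported format: {fmt!r}")
```

Calling {{export_rows(rows, ["id", "name"], "out.parquet", {"format": "parquet"})}} would take the Parquet path, while omitting the option keeps today's CSV behaviour. A real implementation would also have to map CQL types to Parquet types and write in batches rather than holding all rows in memory.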

To recap:
 * For small data, it's easy to use COPY FROM with CSV format.
 * Very large data requires a fast, distributed approach such as Spark or 
DSBulk and justifies the effort in installing and setting up those tools.
 * Moderate to large data sets don't have a good option for export/import, and 
it would be useful to have one that doesn't require installing another tool, 
but just works out of the box, even if it's a little slower. 

> Add support in CQLSH for COPY FROM / TO in compact Parquet format
> -----------------------------------------------------------------
>
>                 Key: CASSANDRA-17831
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-17831
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Tool/cqlsh
>            Reporter: Brad Schoening
>            Assignee: Brad Schoening
>            Priority: Normal
>
> CQLSH supports only CSV as a format for import and export. A binary big data 
> format such as Avro and/or Parquet would be more compact and highly portable 
> to other platforms.
> Parquet files embed their own schema, so no separate schema definition is
> needed, which makes Parquet appear to be the easier format to support.
> The existing syntax supports adding key value pair options, such as FORMAT = 
> PARQUET
> {{     COPY table_name ... FROM 'file_name'[, 'file2_name', ...] }}
>                      {{[WITH option = 'value' [AND ...]]}}
> Side-by-side comparisons of CSV and Parquet show a saving of more than 80% 
> in disk space.
> [https://towardsdatascience.com/csv-files-for-storage-no-thanks-theres-a-better-option-72c78a414d1d]



