[
https://issues.apache.org/jira/browse/CASSANDRA-17831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17583673#comment-17583673
]
Brad Schoening commented on CASSANDRA-17831:
--------------------------------------------
[~aratnofsky] this Jira aims to address a problem described in the
DataStax documentation on CQLSH 'COPY FROM':
"{_}Note: Only use COPY FROM to import datasets that have less than 2
million rows{_}."
This leaves users in a quandary: use COPY FROM anyway on larger data sets, or
try to figure out how to install and configure a special-purpose, one-off tool.
This feature would support users, not admins. Users don't have access to
SSTables; they have only authenticated, authorized access, governed by roles
and permissions, over port 9042. sstabledump doesn't really work for users in a
DBaaS environment and, if you were to export SSTables, with RF=3 you would
have 3X the data volume.
Adding a compact, big-data-friendly binary format for export in CQLSH would be
both space and time efficient. It would also be portable to other platforms
rather than a proprietary format. Either Avro or Parquet would be a good
choice, but it could be something else.
This article [CSV Files for Storage? No Thanks. There’s a Better Option | by
Dario Radečić | Towards Data
Science|https://towardsdatascience.com/csv-files-for-storage-no-thanks-theres-a-better-option-72c78a414d1d]
found that 1TB of data in CSV files shrank to 130GB in Parquet and was 33X
faster to read.
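To give a feel for where savings of that kind come from, here is a minimal sketch using only the Python standard library. It is not Parquet and makes no claim about the figures in the article; it is a toy columnar encoding (typed columns, dictionary-encoded strings, per-column compression) compared against CSV text. The table shape, column names, and helper functions are all hypothetical.

```python
import csv
import io
import random
import struct
import zlib


def make_rows(n, seed=0):
    """Generate hypothetical (id, temperature, city) rows for the comparison."""
    rng = random.Random(seed)
    cities = ["Austin", "Berlin", "Chennai", "Denver"]
    return [(i, round(rng.uniform(-10.0, 40.0), 2), rng.choice(cities))
            for i in range(n)]


def encode_csv(rows):
    """Row-oriented text encoding, similar in spirit to a CSV export."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue().encode("utf-8")


def encode_columnar(rows):
    """Toy columnar encoding: one typed, compressed block per column."""
    ids, temps, cities = zip(*rows)
    n = len(rows)
    # Dictionary-encode the low-cardinality string column as one byte per row.
    dictionary = sorted(set(cities))
    index = {city: i for i, city in enumerate(dictionary)}
    blocks = [
        struct.pack(f"<{n}q", *ids),                         # 64-bit integers
        struct.pack(f"<{n}d", *temps),                       # 64-bit floats
        struct.pack(f"<{n}B", *(index[c] for c in cities)),  # dictionary codes
    ]
    # Compress each column separately; similar values sit next to each other.
    return b"".join(zlib.compress(block) for block in blocks)


if __name__ == "__main__":
    rows = make_rows(10_000)
    print("csv bytes:     ", len(encode_csv(rows)))
    print("columnar bytes:", len(encode_columnar(rows)))
```

The sketch omits the dictionary and column lengths a real file format would store, and compressing the CSV would also shrink it, so treat the output as an illustration of the mechanism rather than a benchmark.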
Vertica, for example, has an export to parquet in its SQL syntax:
{{ EXPORT TO PARQUET ( directory = 'path') AS SELECT query‑expression;}}
See: [EXPORT TO PARQUET
(vertica.com)|https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Statements/EXPORTTOPARQUET.htm]
Since CQLSH already has COPY FROM/TO with one format and Python offers great
data wrangling libraries, implementation is mostly straightforward. Yes, there
are third-party tools which can be set up and configured, and which will be
faster on very large data sets (>TB), but it would be great to have a
first-class, easy-to-use, efficient option in CQLSH for exporting data sets of
moderate size (<TB).
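As a rough illustration of why a second format need not be invasive, the sketch below dispatches on a FORMAT option to pick a writer. It does not reflect how cqlsh's COPY code is actually structured; the registry, the copy_to function, and the option handling are hypothetical names made up for this example, and only a CSV writer is registered.

```python
import csv
import io

# Hypothetical registry mapping a FORMAT option value to a writer function.
WRITERS = {}


def register_writer(name):
    """Decorator that registers a writer under an upper-cased format name."""
    def decorator(func):
        WRITERS[name.upper()] = func
        return func
    return decorator


@register_writer("CSV")
def write_csv(columns, rows, out):
    """Write rows as CSV text; the column names are unused (no header row)."""
    csv.writer(out).writerows(rows)


def copy_to(columns, rows, out, options=None):
    """Pick a writer from the FORMAT option, defaulting to CSV."""
    fmt = (options or {}).get("FORMAT", "CSV").upper()
    try:
        writer = WRITERS[fmt]
    except KeyError:
        raise ValueError(f"unsupported FORMAT: {fmt}") from None
    writer(columns, rows, out)


if __name__ == "__main__":
    out = io.StringIO()
    copy_to(["id", "name"], [(1, "a"), (2, "b")], out, {"FORMAT": "csv"})
    print(out.getvalue(), end="")
```

Under this assumed layout, a Parquet writer would be one more registered function that takes the same arguments and writes to a binary stream using a Parquet library, leaving the query and paging logic untouched.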
To recap:
* For small data, it's easy to use COPY FROM with CSV format.
* Very large data requires a fast, distributed approach such as Spark or
DSBulk and justifies the effort in installing and setting up those tools.
* Moderate to large data sets don't have a good option for export/import, and
it would be useful to have one that doesn't require installing another tool,
but just works out of the box, even if it's a little slower.
> Add support in CQLSH for COPY FROM / TO in compact Parquet format
> -----------------------------------------------------------------
>
> Key: CASSANDRA-17831
> URL: https://issues.apache.org/jira/browse/CASSANDRA-17831
> Project: Cassandra
> Issue Type: Improvement
> Components: Tool/cqlsh
> Reporter: Brad Schoening
> Assignee: Brad Schoening
> Priority: Normal
>
> CQL supports only CSV as a format for import and export. A binary big data
> format such as Avro and/or Parquet would be more compact and highly portable
> to other platforms.
> Parquet does not require a schema, so it appears the easier format to support.
> The existing syntax supports adding key value pair options, such as FORMAT =
> PARQUET
> {{ COPY table_name ... FROM 'file_name'[, 'file2_name', ...] }}
> {{[WITH option = 'value' [AND ...]]}}
> Side-by-side comparisons of CSV and Parquet show an 80%-plus saving in disk
> space.
> [https://towardsdatascience.com/csv-files-for-storage-no-thanks-theres-a-better-option-72c78a414d1d]