[ https://issues.apache.org/jira/browse/CASSANDRA-17831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17583776#comment-17583776 ]

David Capwell commented on CASSANDRA-17831:
-------------------------------------------

bq.      "Note: Only use COPY FROM to import datasets that have less than 2 
million rows."
bq. Which leaves users in a quandary, use COPY FROM anyway on larger data sets 
or try to figure out how to install and configure a special purpose, one-off 
tool.

The file format is not the problem; it's that large queries are a problem for 
Cassandra. If your use case is actually "larger datasets", then CQLSH and the 
CQL language may not be the best fit for you, and I would recommend the Spark 
Bulk Reader or the MapReduce InputFormat.

bq. Moderate to large data sets don't have a good option for export/import, and 
it would be useful to have one that doesn't require installing another tool, 
but just works out of the box, even if it's a little slower. 

It may be "moderate" to you, but for Cassandra it may actually be "large".  The 
file being written isn't really the issue, it's how costly the query gets...  
Let's take the example you gave of a 1TB CSV file being 130GB in Parquet... 
now, if you used Postgres to do the same thing, would the query fail?  Most 
likely... there are other tools for working with such a large dataset that 
don't crash the server... in this case Cassandra offers MapReduce and Spark 
support to better work with the database, and those tools can export to 
Parquet today.
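
Concretely, continuing the sketch above, the Parquet export itself is one line 
on the DataFrame; each executor writes its own slice of the data, so the 
server never has to serve one giant query:

{code:python}
# Continuing from the DataFrame `df` loaded above; output path is hypothetical.
# Produces a directory of Parquet part-files, one or more per partition.
df.write.mode("overwrite").parquet("/tmp/my_table.parquet")
{code}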

> Add support in CQLSH for COPY FROM / TO in compact Parquet format
> -----------------------------------------------------------------
>
>                 Key: CASSANDRA-17831
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-17831
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Tool/cqlsh
>            Reporter: Brad Schoening
>            Priority: Normal
>
> CQL supports only CSV as a format for import and export. A binary big data 
> format such as Avro and/or Parquet would be more compact and highly portable 
> to other platforms.
> Parquet files are self-describing, so no separate schema definition is 
> required, which makes it appear the easier format to support.
> The existing syntax supports adding key-value pair options, such as FORMAT = 
> PARQUET:
> {{     COPY table_name ... FROM 'file_name'[, 'file2_name', ...] }}
> {{          [WITH option = 'value' [AND ...]] }}
> Side-by-side comparisons of CSV and Parquet show an 80%-plus saving in disk 
> space.
> [https://towardsdatascience.com/csv-files-for-storage-no-thanks-theres-a-better-option-72c78a414d1d]
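
That figure is easy to sanity-check locally. A minimal sketch using pandas 
with pyarrow installed, with hypothetical file names; the actual saving 
depends heavily on the data and column encodings:

{code:python}
import os
import pandas as pd

# Hypothetical input; any reasonably large CSV will do.
df = pd.read_csv("export.csv")
df.to_parquet("export.parquet")  # requires pyarrow (or fastparquet)

csv_bytes = os.path.getsize("export.csv")
parquet_bytes = os.path.getsize("export.parquet")
print(f"CSV: {csv_bytes:,} bytes  Parquet: {parquet_bytes:,} bytes  "
      f"({100 * (1 - parquet_bytes / csv_bytes):.0f}% smaller)")
{code}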


