[jira] [Commented] (CASSANDRA-17831) Add support in CQLSH for COPY FROM / TO in compact Parquet format

Abe Ratnofsky (Jira) Tue, 23 Aug 2022 09:32:06 -0700


    [ 
https://issues.apache.org/jira/browse/CASSANDRA-17831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17583739#comment-17583739
 ]


Abe Ratnofsky commented on CASSANDRA-17831:
-------------------------------------------

[~bschoeni] thanks for elaborating here.

> Which leaves users in a quandary, use COPY FROM anyway on larger data sets or 
> try to figure out how to install and configure a special purpose, one-off 
> tool.

Many users who depend on this kind of functionality in production use something 
like Spark, via the Spark Connector for example: 
[https://github.com/datastax/spark-cassandra-connector]

Users with large data sets often prefer integrations with tools like Spark, so 
this is the happy path out of that quandary. You can always use Spark to 
generate Parquet files, and the Spark Connector supports connection limiting, 
paging, and throughput limiting to avoid adverse impacts on your cluster.
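
As a rough sketch of what that Spark path can look like from PySpark (not tested against a real cluster): the helper below reads one table through the connector's DataFrame source and writes it out as Parquet. It assumes a SparkSession that was already started with the Spark Cassandra Connector on its classpath and pointed at your cluster. The function name and its parameters are made up for illustration, and throttling would be set separately through the connector's configuration properties, which are not shown here.

```python
# Sketch only: assumes `spark` is a SparkSession already configured with the
# Spark Cassandra Connector. `export_table_to_parquet` is a hypothetical
# helper, not part of Spark or the connector.

# DataFrame source name registered by the Spark Cassandra Connector.
CASSANDRA_FORMAT = "org.apache.spark.sql.cassandra"


def export_table_to_parquet(spark, keyspace, table, out_path):
    """Read one Cassandra table via the connector and write it as Parquet."""
    df = (
        spark.read.format(CASSANDRA_FORMAT)
        .options(keyspace=keyspace, table=table)
        .load()
    )
    # "error" mode makes Spark fail rather than overwrite an existing export.
    df.write.mode("error").parquet(out_path)
    return df
```

With a real session this would be called as something like export_table_to_parquet(spark, "my_keyspace", "my_table", "/data/export/my_table"), where the keyspace, table, and path are placeholders.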

I understand that it feels like there's a missing middle between COPY TO for 
small data and Spark for large data. One reason for this is that 
"medium-sized" datasets often grow into large datasets soon enough, so the 
window in which a medium-scale tool is the right fit is short.

Maybe this request would make more sense as a feature for DataStax's dsbulk, 
which could support Parquet output in addition to CSV and JSON? 
https://docs.datastax.com/en/dsbulk/docs/dsbulkSimpleUnload.html

 

> Add support in CQLSH for COPY FROM / TO in compact Parquet format
> -----------------------------------------------------------------
>
>                 Key: CASSANDRA-17831
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-17831
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Tool/cqlsh
>            Reporter: Brad Schoening
>            Assignee: Brad Schoening
>            Priority: Normal
>
> The cqlsh COPY command supports only CSV as a format for import and export. 
> A binary big data format such as Avro and/or Parquet would be more compact 
> and highly portable to other platforms.
> Parquet files embed their own schema, so no separate schema definition is 
> required, which makes it appear the easier format to support.
> The existing syntax supports adding key value pair options, such as FORMAT = 
> PARQUET
> {{     COPY table_name ... FROM 'file_name'[, 'file2_name', ...] }}
>                      {{[WITH option = 'value' [AND ...]]}}
> Side-by-side comparisons of CSV and Parquet show an 80%-plus saving in disk 
> space.
> [https://towardsdatascience.com/csv-files-for-storage-no-thanks-theres-a-better-option-72c78a414d1d]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

