[jira] [Commented] (CASSANDRA-17831) Add support in CQLSH for COPY FROM / TO in compact Parquet format

Brad Schoening (Jira) Mon, 22 Aug 2022 11:42:05 -0700


    [ 
https://issues.apache.org/jira/browse/CASSANDRA-17831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17583144#comment-17583144
 ]


Brad Schoening commented on CASSANDRA-17831:
--------------------------------------------

[~dcapwell]  while CQL does have a schema, of course, CSV export/import doesn't 
use one and, to my understanding, Parquet doesn't require one.  It's feasible, 
but more complex, to write a format that requires building a schema translation 
and validating it on import.

Apache Arrow (pyarrow) will export a Python table / dataframe to Parquet:

 
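{code:python}
import pyarrow as pa
import pyarrow.parquet as pq
# Build an in-memory Arrow table; column types are inferred from the values.
table = pa.table({
    'n_legs': [2, 2, 4, 4, 5, 100],
    'animal': ["Flamingo", "Parrot", "Dog", "Horse", "Brittle stars", "Centipede"],
})
# Write the table to a Parquet file in the current directory.
pq.write_table(table, 'example.parquet')
{code}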
 

My understanding is that Parquet allows but does not require an explicitly 
declared schema; the example above declares none and runs.

[~aratnofsky] in a SaaS cloud environment you don't necessarily have direct 
access to SSTables on disk, but you can read the data via CQLSH, which already 
has the framework and syntax for exporting and importing data.  Reading SSTables 
is fine for an admin, but it bypasses all database roles and permissions.

Primarily, I'm suggesting this as a compact export format, since it can also be 
compressed automatically.  That Parquet is highly portable is a bonus.  Ideally, 
export to various formats would be part of an existing tool (CQLSH) rather than 
another single-purpose tool for Cassandra.  If you want high performance, you 
may want to use Spark.  If you want ease of use, you'd like to do it in CQLSH, 
just like CSV export.

> Add support in CQLSH for COPY FROM / TO in compact Parquet format
> -----------------------------------------------------------------
>
>                 Key: CASSANDRA-17831
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-17831
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Tool/cqlsh
>            Reporter: Brad Schoening
>            Assignee: Brad Schoening
>            Priority: Normal
>
> CQL supports only CSV as a format for import and export. A binary big data 
> format such as Avro and/or Parquet would be more compact and highly portable 
> to other platforms.
> Parquet does not require a schema, so it appears to be the easier format to support.
> The existing syntax supports adding key value pair options, such as FORMAT = 
> PARQUET
> {{     COPY table_name ... FROM 'file_name'[, 'file2_name', ...] }}
>                      {{[WITH option = 'value' [AND ...]]}}
> Side-by-side comparisons of CSV and Parquet show an 80%-plus saving in disk 
> space.
> [https://towardsdatascience.com/csv-files-for-storage-no-thanks-theres-a-better-option-72c78a414d1d]



