[
https://issues.apache.org/jira/browse/CASSANDRA-17831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17583144#comment-17583144
]
Brad Schoening commented on CASSANDRA-17831:
--------------------------------------------
[~dcapwell] while CQL does have a schema, of course, CSV export/import doesn't
use one, and to my understanding Parquet doesn't require one. A format that
requires building a schema translation on export and validating it on import
is feasible, but more complex to write.
Apache Arrow (pyarrow) can write a Python table or DataFrame to Parquet:
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq
table = pa.table({
    'n_legs': [2, 2, 4, 4, 5, 100],
    'animal': ["Flamingo", "Parrot", "Dog", "Horse", "Brittle stars", "Centipede"],
})
pq.write_table(table, 'example.parquet')
{code}
My understanding is that Parquet allows but does not require an explicit
schema, and the example above runs without declaring one.
[~aratnofsky] in a SaaS cloud environment you don't necessarily have direct
access to SSTables on disk, but you can read the data via CQLSH, which already
has the framework and syntax for exporting and importing data. Reading SSTables
is fine for an admin, but it bypasses all database roles and permissions.
Primarily, I'm suggesting this as a compact export format, since it can be
compressed automatically. That Parquet is highly portable is a bonus. Ideally,
export to various formats would be part of an existing tool (CQLSH), not
another single-purpose tool for Cassandra. If you want high performance, you
may want to use Spark. If you want ease of use, you'd like to do it in CQLSH,
just like CSV export.
> Add support in CQLSH for COPY FROM / TO in compact Parquet format
> -----------------------------------------------------------------
>
> Key: CASSANDRA-17831
> URL: https://issues.apache.org/jira/browse/CASSANDRA-17831
> Project: Cassandra
> Issue Type: Improvement
> Components: Tool/cqlsh
> Reporter: Brad Schoening
> Assignee: Brad Schoening
> Priority: Normal
>
> CQL supports only CSV as a format for import and export. A binary big data
> format such as Avro and/or Parquet would be more compact and highly portable
> to other platforms.
> Parquet does not require a schema, so it appears the easier format to support.
> The existing syntax supports adding key value pair options, such as FORMAT =
> PARQUET
> {{ COPY table_name ... FROM 'file_name'[, 'file2_name', ...] }}
> {{[WITH option = 'value' [AND ...]]}}
> Side-by-side comparisons of CSV and Parquet show a saving of more than 80% in
> disk space.
> [https://towardsdatascience.com/csv-files-for-storage-no-thanks-theres-a-better-option-72c78a414d1d]