[
https://issues.apache.org/jira/browse/CASSANDRA-17831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17583156#comment-17583156
]
Abe Ratnofsky commented on CASSANDRA-17831:
-------------------------------------------
I'd say that Parquet still has a schema, and there would still need to be a
mapping between Cassandra-supported types and Parquet logical types:
[https://github.com/apache/parquet-format/blob/master/LogicalTypes.md]

What cloud environment are you running in? The safest and most efficient way to
handle this is by shipping SSTable files off of Cassandra instances a la
sendfile and reformatting them to Parquet on a separate instance, since
Cassandra has no use for Parquet internally (for now) and reformatting may be
seriously resource-demanding. There's already sstabledump, so I could see an
opportunity to add permission checking and integration with existing tools so
you don't need disk access. If you could export SSTables via cqlsh and there
was a separate tool for reformatting to Parquet, would that meet your needs?

I could see it making sense for the project to provide the tool that reformats
from SSTables to Parquet, since the Cassandra-to-Parquet logical type encoding
would benefit from a standardized approach, and a first-party tool could better
adapt to SSTable format changes, etc.
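
To make the type-mapping point concrete, here is a minimal sketch of what a
Cassandra-to-Parquet mapping table might look like. It is illustrative only
and not an agreed or existing encoding: the names ParquetType, CQL_TO_PARQUET
and parquet_type_for are hypothetical, and the choice of physical and logical
type for each CQL type is an assumption based on the Parquet LogicalTypes
document linked above. Types with no obvious equivalent (varint, decimal,
duration, inet) are deliberately left unmapped, since those are the cases a
standardized approach would have to decide.

```python
from typing import NamedTuple, Optional


class ParquetType(NamedTuple):
    """Hypothetical pairing of a Parquet physical type and logical annotation."""
    physical: str
    logical: Optional[str]  # None means "no logical annotation"


# Assumed mapping for illustration; not a standard and not part of Cassandra.
CQL_TO_PARQUET = {
    "ascii": ParquetType("BYTE_ARRAY", "STRING"),
    "text": ParquetType("BYTE_ARRAY", "STRING"),
    "varchar": ParquetType("BYTE_ARRAY", "STRING"),
    "blob": ParquetType("BYTE_ARRAY", None),
    "boolean": ParquetType("BOOLEAN", None),
    "tinyint": ParquetType("INT32", "INT(8, signed)"),
    "smallint": ParquetType("INT32", "INT(16, signed)"),
    "int": ParquetType("INT32", "INT(32, signed)"),
    "bigint": ParquetType("INT64", "INT(64, signed)"),
    "float": ParquetType("FLOAT", None),
    "double": ParquetType("DOUBLE", None),
    # CQL date is a day count, timestamp is milliseconds since the epoch,
    # and time is nanoseconds since midnight.
    "date": ParquetType("INT32", "DATE"),
    "timestamp": ParquetType("INT64", "TIMESTAMP(MILLIS, UTC)"),
    "time": ParquetType("INT64", "TIME(NANOS)"),
    "uuid": ParquetType("FIXED_LEN_BYTE_ARRAY(16)", "UUID"),
    "timeuuid": ParquetType("FIXED_LEN_BYTE_ARRAY(16)", "UUID"),
}


def parquet_type_for(cql_type: str) -> ParquetType:
    """Look up the assumed Parquet type for a CQL type name (case-insensitive)."""
    key = cql_type.strip().lower()
    try:
        return CQL_TO_PARQUET[key]
    except KeyError:
        # Unmapped types need an explicit encoding decision before export.
        raise ValueError(f"no Parquet mapping chosen for CQL type {cql_type!r}") from None
```

Calling parquet_type_for("timestamp") returns the INT64 / TIMESTAMP pairing,
while parquet_type_for("duration") raises ValueError, which is where an
export tool would have to either reject the table or apply an agreed
encoding.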
> Add support in CQLSH for COPY FROM / TO in compact Parquet format
> -----------------------------------------------------------------
>
> Key: CASSANDRA-17831
> URL: https://issues.apache.org/jira/browse/CASSANDRA-17831
> Project: Cassandra
> Issue Type: Improvement
> Components: Tool/cqlsh
> Reporter: Brad Schoening
> Assignee: Brad Schoening
> Priority: Normal
>
> CQL supports only CSV as a format for import and export. A binary big data
> format such as Avro and/or Parquet would be more compact and highly portable
> to other platforms.
> Parquet does not require a schema, so it appears the easier format to support.
> The existing syntax supports adding key value pair options, such as FORMAT =
> PARQUET
> {{ COPY table_name ... FROM 'file_name'[, 'file2_name', ...] }}
> {{[WITH option = 'value' [AND ...]]}}
> Side-by-side comparisons of CSV and Parquet show an 80%-plus saving in disk
> space.
> [https://towardsdatascience.com/csv-files-for-storage-no-thanks-theres-a-better-option-72c78a414d1d]