[ https://issues.apache.org/jira/browse/CASSANDRA-17831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17583776#comment-17583776 ]
David Capwell commented on CASSANDRA-17831:
-------------------------------------------

bq. "Note: Only use COPY FROM to import datasets that have less than 2 million rows."
bq. Which leaves users in a quandary: use COPY FROM anyway on larger data sets, or try to figure out how to install and configure a special-purpose, one-off tool.

The file format is not the problem; it's that large queries are a problem for Cassandra. If your use case is actually "larger datasets", then cqlsh and the CQL language may not be the best fit for you, and I would recommend the Spark Bulk Reader or the MapReduce InputFormat.

bq. Moderate to large data sets don't have a good option for export/import, and it would be useful to have one that doesn't require installing another tool, but just works out of the box, even if it's a little slower.

It may be "moderate" to you, but for Cassandra it may actually be "large". The file being written isn't really the issue; it's how costly the query gets. Take the example you gave of a 1TB CSV file being 130GB in Parquet: if you used Postgres to do the same thing, would the query fail? Most likely. There are other tools for working with such a large dataset that don't crash the server; in this case Cassandra offers MapReduce and Spark support to better work with the database, and those tools can export to Parquet today.

> Add support in CQLSH for COPY FROM / TO in compact Parquet format
> -----------------------------------------------------------------
>
>                 Key: CASSANDRA-17831
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-17831
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Tool/cqlsh
>            Reporter: Brad Schoening
>            Priority: Normal
>
> CQL supports only CSV as a format for import and export. A binary big data
> format such as Avro and/or Parquet would be more compact and highly portable
> to other platforms.
> Parquet does not require a schema, so it appears the easier format to support.
> The existing syntax supports adding key-value pair options, such as FORMAT = PARQUET:
> {{COPY table_name ... FROM 'file_name'[, 'file2_name', ...]}}
> {{[WITH option = 'value' [AND ...]]}}
> Side-by-side comparisons of CSV and Parquet show an 80%-plus saving in disk
> space.
> [https://towardsdatascience.com/csv-files-for-storage-no-thanks-theres-a-better-option-72c78a414d1d]

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
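For context on the syntax being discussed, today's cqlsh COPY already accepts key-value options via WITH; a FORMAT option would slot into the same grammar. A small illustration using existing, documented options (the keyspace/table name `ks.events` is hypothetical):

```
COPY ks.events TO 'events.csv' WITH HEADER = TRUE AND PAGESIZE = 1000;
COPY ks.events FROM 'events.csv' WITH HEADER = TRUE AND CHUNKSIZE = 5000;
```

The proposal in this ticket would add something like `WITH FORMAT = 'PARQUET'` to that option list; no such option exists in cqlsh today.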
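The comment's recommended route for large exports can be sketched as follows. This assumes pyspark plus the DataStax spark-cassandra-connector package are available; the connection host, keyspace/table name, and output path are hypothetical:

```python
# Sketch only: requires pyspark and the spark-cassandra-connector
# package on the Spark classpath; names below are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cassandra-to-parquet")
    .config("spark.cassandra.connection.host", "127.0.0.1")
    .getOrCreate()
)

# Read the table through the connector's DataSource.
df = (
    spark.read.format("org.apache.spark.sql.cassandra")
    .options(keyspace="ks", table="events")
    .load()
)

# The Parquet write runs in parallel across executors, so no single
# coordinator query has to materialize the whole table.
df.write.mode("overwrite").parquet("/tmp/events_parquet")
```

This is the "export to Parquet today" path the comment refers to: the heavy lifting happens in Spark, partitioned across token ranges, rather than in one large cqlsh-driven query.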
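The disk-space claim above is plausible even before considering Parquet's columnar encodings, because repetitive tabular text compresses well. A minimal stdlib-only sketch (gzip over CSV as a crude stand-in for the compression component of Parquet's savings; the data is synthetic):

```python
import csv
import gzip
import io

# Synthetic, repetitive tabular data: 10,000 rows with 50 distinct ids.
rows = [("sensor-%03d" % (i % 50), i, i * 0.25) for i in range(10_000)]

# Serialize to CSV in memory.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["id", "reading", "value"])
writer.writerows(rows)
raw = buf.getvalue().encode()

# Compress the CSV bytes and compare sizes.
compressed = gzip.compress(raw)
savings = 1 - len(compressed) / len(raw)
print(f"raw={len(raw)}B compressed={len(compressed)}B savings={savings:.0%}")
```

Parquet typically does better still on real data, since columnar layout plus dictionary and run-length encodings apply before general-purpose compression.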