YAN Bo created CASSANDRA-15354:
----------------------------------
Summary: Cassandra CQLSSTableWriter and sstableloader support HDFS
Key: CASSANDRA-15354
URL: https://issues.apache.org/jira/browse/CASSANDRA-15354
Project: Cassandra
Issue Type: New Feature
Components: Legacy/Local Write-Read Paths, Local/SSTable, Tool/sstable
Reporter: YAN Bo
{code:java}
import java.util

import org.apache.cassandra.dht.Murmur3Partitioner
import org.apache.cassandra.io.sstable.CQLSSTableWriter

rdd.foreachPartition(msgIterator => {
  // create one writer per partition, on the executor
  val writer = CQLSSTableWriter.builder()
    .inDirectory(outputDir)
    // set target schema
    .forTable(SCHEMA)
    // set CQL statement to insert data
    .using(INSERT_STMT)
    // set partitioner if needed;
    // the default is Murmur3Partitioner, so set this only if you use a different one
    .withPartitioner(new Murmur3Partitioner())
    .build()
  msgIterator.foreach(msg => {
    // split each CSV message into column values and append a row
    val items = msg.toString().split(",")
    val javaList = new util.ArrayList[Object]()
    items.foreach(t => javaList.add(t))
    writer.addRow(javaList)
  })
  writer.close()
})
{code}
Cassandra provides bulk data export/import via SSTables, which is very convenient for users. In some cases we have TB-scale data to move from HDFS into Cassandra, and we can use Spark to generate the SSTable files through distributed computation with code like the above. Unfortunately, CQLSSTableWriter can only write to a local path, and sstableloader can only load from a local path. So if we use CQLSSTableWriter in a Spark or Hadoop MapReduce program, we need extra code to upload the SSTables produced on the distributed worker nodes to HDFS, and then download all of them to the machine running sstableloader; storing and transferring big data between physical machines like this brings many reliability problems.
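For reference, a minimal sketch of the upload half of that workaround (assuming the Hadoop client libraries are on the classpath; the paths are illustrative) would run after writer.close() on each node:
{code:java}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// illustrative paths; adjust to the real cluster layout
val localSSTableDir = new Path("file:///tmp/sstables/ks/tbl")
val hdfsStagingDir  = new Path("hdfs://namenode:8020/staging/sstables/ks/tbl")

// upload the locally generated SSTables to HDFS so they can later be
// fetched by the single machine that runs sstableloader
val fs = FileSystem.get(hdfsStagingDir.toUri, new Configuration())
fs.copyFromLocalFile(false /* delSrc */, true /* overwrite */, localSSTableDir, hdfsStagingDir)
{code}
Every such copy is an extra failure point, which is exactly the reliability concern described above.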
So it would be better if CQLSSTableWriter could write to HDFS directly (or if there were another writer that supports HDFS), and if sstableloader could load from an HDFS path.
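As a purely hypothetical sketch of the requested feature (an inDirectory overload that accepts an HDFS URI does not exist today), the builder could take an HDFS location directly, so each Spark executor would stream its SSTables there without a local staging step:
{code:java}
import org.apache.cassandra.io.sstable.CQLSSTableWriter

// HYPOTHETICAL: passing an HDFS URI to inDirectory() is the requested
// feature, not an existing API; SCHEMA and INSERT_STMT are as above
val writer = CQLSSTableWriter.builder()
  .inDirectory("hdfs://namenode:8020/staging/sstables/ks/tbl") // hypothetical HDFS target
  .forTable(SCHEMA)
  .using(INSERT_STMT)
  .build()
{code}
sstableloader would then be pointed at the same HDFS directory instead of a local path.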