Github user QiangCai commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2971#discussion_r238507037
--- Diff:
integration/spark-common/src/main/scala/org/apache/carbondata/spark/load/DataLoadProcessBuilderOnSpark.scala
---
@@ -156,4 +158,132 @@ object DataLoadProcessBuilderOnSpark {
Array((uniqueLoadStatusId, (loadMetadataDetails, executionErrors)))
}
}
+
+ /**
+ * 1. range partition the whole input data
+ * 2. for each range, sort the data and write it to CarbonData files
+ */
+ def loadDataUsingRangeSort(
+ sparkSession: SparkSession,
+ dataFrame: Option[DataFrame],
+ model: CarbonLoadModel,
+ hadoopConf: Configuration): Array[(String, (LoadMetadataDetails, ExecutionErrors))] = {
+ val originRDD = if (dataFrame.isDefined) {
--- End diff --
Better, but after refactoring the code logic is still not clear. Now these two
flows already reuse the process steps.
---
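
The two steps described in the quoted doc comment (range-partition the input, then sort within each range) can be sketched in plain Scala, independent of Spark. This is only an illustrative sketch of the range-sort idea, not the actual `DataLoadProcessBuilderOnSpark` implementation: `RangeSortSketch`, `boundaries`, and `rangeSort` are hypothetical names, and real code would use Spark's `RangePartitioner` and sort within partitions rather than in-memory buckets.

```scala
import scala.collection.mutable.ArrayBuffer

object RangeSortSketch {
  // Step 1 (hypothetical helper): derive range boundaries by sorting a
  // sample of the data and picking evenly spaced split points.
  def boundaries(data: Seq[Int], numRanges: Int): Seq[Int] = {
    val sortedSample = data.sorted
    (1 until numRanges).map(i => sortedSample(i * sortedSample.size / numRanges))
  }

  // Step 2: assign each record to its range bucket, then sort each
  // bucket independently (in the real flow, each sorted range would be
  // written out as CarbonData files).
  def rangeSort(data: Seq[Int], numRanges: Int): Seq[Seq[Int]] = {
    val bounds = boundaries(data, numRanges)
    val buckets = Array.fill(numRanges)(ArrayBuffer[Int]())
    data.foreach { v =>
      val idx = bounds.indexWhere(v < _) match {
        case -1 => numRanges - 1 // beyond the last boundary -> last range
        case i  => i
      }
      buckets(idx) += v
    }
    buckets.map(_.sorted.toSeq).toSeq
  }

  def main(args: Array[String]): Unit = {
    // Each range is locally sorted, and ranges are disjoint and ordered,
    // so the concatenation is globally sorted.
    println(rangeSort(Seq(9, 1, 7, 3, 5, 2, 8, 4, 6), 3))
  }
}
```

Because the ranges are disjoint and ordered, sorting each range independently yields globally sorted output without a final merge, which is why per-range sorting parallelizes well.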