wyb commented on a change in pull request #5233:
URL: https://github.com/apache/incubator-doris/pull/5233#discussion_r557872981



##########
File path: 
fe/spark-dpp/src/main/java/org/apache/doris/load/loadv2/dpp/SparkDpp.java
##########
@@ -857,18 +864,39 @@ private Object convertPartitionKey(Object srcValue, Class dstClass) throws Spark
         }
 
         Dataset<Row> dataframe = spark.sql(sql.toString());
+        // Note(wb): the current spark load implementation cannot be fully consistent with doris BE. The reason is as follows.
+        // For stream load in doris BE, processing runs in four steps:
+        // step 1: type check
+        // step 2: expression calculation
+        // step 3: strict mode check
+        // step 4: nullable column check
+        // BE can apply all four steps row by row,
+        // but spark load relies on spark for step 2, so each step can only run over the whole dataset before the next one starts.
+        // Therefore, in spark load we first do steps 1, 3 and 4, and then do step 2.
         dataframe = checkDataFromHiveWithStrictMode(dataframe, baseIndex, fileGroup.columnMappings.keySet(), etlJobConfig.properties.strictMode,
-                    dstTableSchema);
+                dstTableSchema, dictBitmapColumnSet);
         dataframe = convertSrcDataframeToDstDataframe(baseIndex, dataframe, dstTableSchema, fileGroup);
         return dataframe;
     }
 
     private Dataset<Row> checkDataFromHiveWithStrictMode(
-            Dataset<Row> dataframe, EtlJobConfig.EtlIndex baseIndex, Set<String> mappingColKeys, boolean isStrictMode, StructType dstTableSchema) throws SparkDppException {
+            Dataset<Row> dataframe, EtlJobConfig.EtlIndex baseIndex, Set<String> mappingColKeys, boolean isStrictMode, StructType dstTableSchema,
+            Set<String> dictBitmapColumnSet) throws SparkDppException {
         List<EtlJobConfig.EtlColumn> columnNameNeedCheckArrayList = new ArrayList<>();
         List<ColumnParser> columnParserArrayList = new ArrayList<>();
         for (EtlJobConfig.EtlColumn column : baseIndex.columns) {
-            if (!StringUtils.equalsIgnoreCase(column.columnType, "varchar") &&
+            // note(wb): there are three data sources for a bitmap column
+            // case 1: global dict; no check needed
+            // case 2: bitmap hash function; this function is not supported in spark load yet, so it is ignored here
+            // case 3: the origin value is an integer value; it should be checked with a LongParser
+            if (StringUtils.equalsIgnoreCase(column.columnType, "bitmap")) {
+                if (dictBitmapColumnSet.contains(column.columnName)) {

Review comment:
       The column name should be converted to lowercase before this `contains` lookup, so the check stays case-insensitive.
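A minimal sketch of the suggested fix, outside the PR's code: if the entries of `dictBitmapColumnSet` are normalized to lowercase when the set is built, the probe key must be lowercased too, or a column named with different casing silently misses the dict path. The class and helper names below (`DictColumnLookup`, `toLowerCaseSet`, `containsIgnoreCase`) are illustrative, not from the PR:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class DictColumnLookup {
    // Normalize all names to lowercase once, at set-construction time.
    static Set<String> toLowerCaseSet(Set<String> names) {
        Set<String> lowered = new HashSet<>();
        for (String name : names) {
            lowered.add(name.toLowerCase(Locale.ROOT));
        }
        return lowered;
    }

    // Lowercase the probe key as well, so the lookup is case-insensitive.
    static boolean containsIgnoreCase(Set<String> loweredNames, String columnName) {
        return loweredNames.contains(columnName.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        Set<String> dictBitmapColumnSet =
                toLowerCaseSet(new HashSet<>(Arrays.asList("UserId", "device_id")));
        System.out.println(containsIgnoreCase(dictBitmapColumnSet, "USERID"));    // true
        System.out.println(containsIgnoreCase(dictBitmapColumnSet, "device_id")); // true
        System.out.println(containsIgnoreCase(dictBitmapColumnSet, "city"));      // false
    }
}
```

Normalizing on both sides (build time and lookup time) is cheaper and less error-prone than scanning the set with `StringUtils.equalsIgnoreCase` per element.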




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
