codope commented on code in PR #8629:
URL: https://github.com/apache/hudi/pull/8629#discussion_r1184599557


##########
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudObjectsSelectorCommon.java:
##########
@@ -176,6 +179,18 @@ public static Option<Dataset<Row>> loadAsDataset(SparkSession spark, List<CloudO
     totalSize *= 1.1;
     long parquetMaxFileSize = props.getLong(PARQUET_MAX_FILE_SIZE.key(), Long.parseLong(PARQUET_MAX_FILE_SIZE.defaultValue()));
     int numPartitions = (int) Math.max(totalSize / parquetMaxFileSize, 1);
-    return Option.of(reader.load(paths.toArray(new String[cloudObjectMetadata.size()])).coalesce(numPartitions));
+    Dataset<Row> dataset = reader.load(paths.toArray(new String[cloudObjectMetadata.size()])).coalesce(numPartitions);
+
+    // add partition column from source path if configured
+    if (props.containsKey(PATH_BASED_PARTITION_FIELDS)) {
+      String[] partitionKeysToAdd = props.getString(PATH_BASED_PARTITION_FIELDS).split(",");
+      // Add partition column for all path-based partition keys
+      for (String partitionKey : partitionKeysToAdd) {
+        String partitionPathPattern = String.format("%s=", partitionKey);
+        LOG.info(String.format("Adding column %s to dataset", partitionKey));
+        dataset = dataset.withColumn(partitionKey, split(split(input_file_name(), partitionPathPattern).getItem(1), "/").getItem(0));

Review Comment:
   What happens if the partitionPathPattern is not found in the file name?
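
   To make the concern concrete: as far as I know, with Spark's ANSI mode off, `getItem(1)` on the one-element array produced by a non-matching `split` yields null, so the new column would silently be null for such files; with ANSI mode on, the out-of-bounds access may fail instead. The sketch below is a plain-Java analogue of the extraction, not the Spark expression from this PR. The class name, method name, and sample paths are hypothetical, and it matches whole `key=value` path segments rather than splitting on a regex, which makes the missing-key case explicit:

```java
import java.util.Optional;

public class PartitionValueSketch {

    // Hypothetical helper (not part of Hudi): returns the value of the first
    // "key=value" path segment for the given key, or Optional.empty() when
    // no such segment exists in the path.
    static Optional<String> extractPartitionValue(String filePath, String partitionKey) {
        String prefix = partitionKey + "=";
        for (String segment : filePath.split("/")) {
            // Matching on a whole segment avoids picking up keys that merely
            // end with the same text, e.g. "mycountry=" when looking for "country=".
            if (segment.startsWith(prefix)) {
                return Optional.of(segment.substring(prefix.length()));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Path with the partition key present (made-up path for illustration).
        System.out.println(extractPartitionValue(
            "s3://bucket/table/country=US/part-0001.parquet", "country")); // prints Optional[US]
        // Path without the partition key: the caller must decide what to do.
        System.out.println(extractPartitionValue(
            "s3://bucket/table/part-0001.parquet", "country")); // prints Optional.empty
    }
}
```

   Whichever behavior is intended here (null column, default value, or failing fast), it would help to state it in the config documentation and cover it with a test.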


