codope commented on code in PR #8629:
URL: https://github.com/apache/hudi/pull/8629#discussion_r1184599557
##########
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudObjectsSelectorCommon.java:
##########
@@ -176,6 +179,18 @@ public static Option<Dataset<Row>> loadAsDataset(SparkSession spark, List<CloudO
     totalSize *= 1.1;
     long parquetMaxFileSize = props.getLong(PARQUET_MAX_FILE_SIZE.key(), Long.parseLong(PARQUET_MAX_FILE_SIZE.defaultValue()));
     int numPartitions = (int) Math.max(totalSize / parquetMaxFileSize, 1);
-    return Option.of(reader.load(paths.toArray(new String[cloudObjectMetadata.size()])).coalesce(numPartitions));
+    Dataset<Row> dataset = reader.load(paths.toArray(new String[cloudObjectMetadata.size()])).coalesce(numPartitions);
+
+    // add partition column from source path if configured
+    if (props.containsKey(PATH_BASED_PARTITION_FIELDS)) {
+      String[] partitionKeysToAdd = props.getString(PATH_BASED_PARTITION_FIELDS).split(",");
+      // Add partition column for all path-based partition keys
+      for (String partitionKey : partitionKeysToAdd) {
+        String partitionPathPattern = String.format("%s=", partitionKey);
+        LOG.info(String.format("Adding column %s to dataset", partitionKey));
+        dataset = dataset.withColumn(partitionKey, split(split(input_file_name(), partitionPathPattern).getItem(1), "/").getItem(0));
Review Comment:
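What happens if `partitionPathPattern` does not occur in the input file path? In that case the inner `split` yields a single-element array, so `getItem(1)` has no second element to return.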
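
To make the edge case concrete, here is a minimal plain-Java sketch of the same "take what follows `key=` up to the next `/`" extraction. It is not the Spark code from this PR: the class `PartitionPathExample`, the method `extractPartitionValue`, and the sample paths are hypothetical names chosen for illustration. The sketch makes the missing-pattern case explicit by returning an empty `Optional`. As far as I know, the Spark expression would instead produce a null column value when ANSI mode is off, and may fail on the out-of-range index when ANSI mode is on, but that should be verified against the Spark version in use.

```java
import java.util.Optional;

public class PartitionPathExample {

    /**
     * Hypothetical helper: returns the value that follows "partitionKey=" in
     * filePath, up to the next '/', or Optional.empty() if the pattern is absent.
     */
    static Optional<String> extractPartitionValue(String filePath, String partitionKey) {
        String pattern = partitionKey + "=";
        int start = filePath.indexOf(pattern);
        if (start < 0) {
            // Pattern not found: report "no value" instead of indexing past the end.
            return Optional.empty();
        }
        int valueStart = start + pattern.length();
        int end = filePath.indexOf('/', valueStart);
        // If no '/' follows the value, the value runs to the end of the path.
        String value = end < 0
                ? filePath.substring(valueStart)
                : filePath.substring(valueStart, end);
        return Optional.of(value);
    }

    public static void main(String[] args) {
        // Path that contains the partition pattern.
        System.out.println(
                extractPartitionValue("s3://bucket/country=US/part-0001.parquet", "country"));
        // Path that does not contain the partition pattern.
        System.out.println(
                extractPartitionValue("s3://bucket/flat/part-0001.parquet", "country"));
    }
}
```

Whichever behavior the Spark expression ends up having, it may be worth deciding explicitly whether a missing pattern should fail the batch, fall back to a default value, or leave the column null, and covering that case with a test.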