voonhous commented on code in PR #18837:
URL: https://github.com/apache/hudi/pull/18837#discussion_r3340193901
##########
hudi-trino-plugin/src/main/java/io/trino/plugin/hudi/HudiUtil.java:
##########
@@ -321,37 +341,33 @@ public static Schema.Field getFieldFromSchema(String
columnName, Schema schema)
"Failed to get column " + columnName + " from table schema");
}
- public static List<HiveColumnHandle>
prependHudiMetaColumns(List<HiveColumnHandle> dataColumns)
+ public static List<HiveColumnHandle>
prependHudiMetaAndOrderingColumns(HudiTableHandle tableHandle,
List<HiveColumnHandle> dataColumns)
{
- //For efficient lookup
- Set<String> dataColumnNames = dataColumns.stream()
+ Set<String> existingColumns = dataColumns.stream()
.map(HiveColumnHandle::getName)
- .collect(Collectors.toSet());
-
- // If all Hudi meta columns are already present, return the original
list
- if (dataColumnNames.containsAll(HOODIE_META_COLUMNS)) {
- return dataColumns;
- }
-
- // Identify only the meta columns that are missing from dataColumns to
avoid duplicates
- List<String> missingMetaColumns = HOODIE_META_COLUMNS.stream()
- .filter(metaColumn -> !dataColumnNames.contains(metaColumn))
- .toList();
+ .collect(Collectors.toCollection(HashSet::new));
List<HiveColumnHandle> columns = new ArrayList<>();
- // Create and prepend the new HiveColumnHandles for the missing meta
columns
- columns.addAll(IntStream.range(0, missingMetaColumns.size())
- .boxed()
- .map(i -> new HiveColumnHandle(
- missingMetaColumns.get(i),
+ // Add missing Hudi meta columns first
+ for (int i = 0; i < HOODIE_META_COLUMNS.size(); i++) {
+ String metaColumn = HOODIE_META_COLUMNS.get(i);
+ if (existingColumns.add(metaColumn)) { // add() returns false if
already present
+ columns.add(new HiveColumnHandle(
+ metaColumn,
i,
Review Comment:
Only a logical identifier here - it is never used for ordering/position
lookups, so the skipped-duplicate case does not cause a mismatch. Traced
through the page-source path:
- Default resolution is by name: `shouldUseParquetColumnNames` defaults to
`true`, so `createPageSource` resolves columns via `getBaseColumnName()` and
never consults `baseHiveColumnIndex`.
- The by-index path remaps by name anyway: when `useColumnNames=false`,
`remapColumnIndicesToPhysical` overwrites each handle index with the physical
index looked up by column name from the file schema, so the `i` assigned here
is discarded before it is used as an index.
- Output column order is driven by list position (the order handles are
added to the list), and `constructSchema` builds the requested schema from
column names - not from `baseHiveColumnIndex`.
So skipping a duplicate meta column cannot produce a positional mismatch.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]