amogh-jahagirdar commented on code in PR #15268:
URL: https://github.com/apache/iceberg/pull/15268#discussion_r2784990510


##########
spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/source/BaseReader.java:
##########
@@ -252,5 +251,40 @@ protected <V> V getOrLoad(String key, Supplier<V> valueSupplier, long valueSize)
         return cache.getOrLoad(table().name(), key, valueSupplier, valueSize);
       }
     }
+
+    // field lookup for serializable tables that assumes fetching historic schemas is expensive
+    private static class FieldLookup implements Function<Integer, Types.NestedField> {
+      private final Table table;
+
+      private volatile Map<Integer, Types.NestedField> historicSchemaFields;
+
+      private FieldLookup(Table table) {
+        this.table = table;
+      }
+
+      @Override
+      public Types.NestedField apply(Integer id) {
+        Types.NestedField field = table.schema().findField(id);
+        return field != null ? field : historicSchemaFields().get(id);
+      }
+
+      private Map<Integer, Types.NestedField> historicSchemaFields() {

Review Comment:
   I actually like `historic`, but I'm also good with `previous` or `prior`; not very opinionated here



##########
api/src/main/java/org/apache/iceberg/Schema.java:
##########
@@ -629,4 +631,30 @@ public static void checkCompatibility(Schema schema, int formatVersion) {
               formatVersion, Joiner.on("\n- ").join(problems.values())));
     }
   }
+
+  // indexes all fields from schemas, preferring field definitions from higher schema IDs
+  public static Map<Integer, NestedField> indexFields(Collection<Schema> schemas) {
+    if (schemas.size() == 1) {
+      Schema schema = Iterables.getOnlyElement(schemas);
+      return schema.lazyIdToField();
+    }
+
+    Map<Integer, NestedField> fields = Maps.newHashMap();

Review Comment:
   Do we want to pre-size this map to avoid extra allocations/re-hashing as we add to it? We know the size would be the max schema width, so it doesn't seem too bad to compute. It's probably only an issue for super wide schemas combined with many historical schemas (which should be cleaned up over time anyway).
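   
   As a rough sketch of what the pre-sizing could look like (illustrative only, not the PR's code; it assumes Guava's `Maps.newHashMapWithExpectedSize` and uses the widest schema's top-level column count as the estimate, which undercounts nested fields):
   
   ```java
   // estimate capacity from the widest schema's top-level columns (a rough hint, not an exact bound)
   int expectedSize = 0;
   for (Schema schema : schemas) {
     expectedSize = Math.max(expectedSize, schema.columns().size());
   }
   
   // sizes the backing array so the map can hold expectedSize entries without re-hashing
   Map<Integer, NestedField> fields = Maps.newHashMapWithExpectedSize(expectedSize);
   ```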



##########
spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/source/BaseReader.java:
##########
@@ -66,7 +68,6 @@ abstract class BaseReader<T, TaskT extends ScanTask> implements Closeable {
   private static final Logger LOG = LoggerFactory.getLogger(BaseReader.class);
 
   private final Table table;
-  private final Schema tableSchema;

Review Comment:
   Hm, have we considered just building a union schema? 



##########
spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/source/BaseReader.java:
##########
@@ -66,7 +68,6 @@ abstract class BaseReader<T, TaskT extends ScanTask> implements Closeable {
   private static final Logger LOG = LoggerFactory.getLogger(BaseReader.class);
 
   private final Table table;
-  private final Schema tableSchema;

Review Comment:
   Ah nvm, we probably want to do that lazily, since on average we wouldn't expect to have to reference the historical fields.
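   
   A possible shape for the lazy lookup (hypothetical sketch, not the PR's implementation; it swaps the volatile map in `FieldLookup` for Guava's `Suppliers.memoize` just to show the idea, and assumes `Table#schemas()` plus the new `Schema.indexFields` from this PR):
   
   ```java
   // current schema first; the union of all known schemas is built at most once, on the first miss
   private static class FieldLookup implements Function<Integer, Types.NestedField> {
     private final Table table;
     // com.google.common.base.Supplier, memoized so the union index stays lazy
     private final Supplier<Map<Integer, Types.NestedField>> unionFields;
   
     private FieldLookup(Table table) {
       this.table = table;
       this.unionFields = Suppliers.memoize(() -> Schema.indexFields(table.schemas().values()));
     }
   
     @Override
     public Types.NestedField apply(Integer id) {
       Types.NestedField field = table.schema().findField(id);
       return field != null ? field : unionFields.get().get(id);
     }
   }
   ```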



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

