gauravkhatri05 commented on PR #3272:
URL: https://github.com/apache/parquet-java/pull/3272#issuecomment-3249142635

   Hi @gszadovszky,
   
   Thanks for your follow-up! I want to clarify the key distinction between the 
current approach and our patch:
   
   ### What’s happening in the current patch
   
   With the `IdentityHashMap` check in `convertField(...)`:
   
   ```java
   if (seenSchemas.containsKey(schema)) {
     throw new UnsupportedOperationException(
       "Recursive Avro schemas are not supported by parquet-avro: " + schema.getFullName());
   }
   ```
   
   As soon as the converter encounters a recursive reference (e.g., `Employee 
-> Employee`), it **throws immediately**, preventing even common, benign 
recursive structures from working.
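   To make the behavior concrete, here is a minimal, self-contained sketch of that guard. The `Node` class is a hypothetical stand-in for an Avro `Schema` object; the point is that the `IdentityHashMap` check fires on the very first revisit of the same schema instance, so any self-referential schema is rejected no matter how shallow the actual data is:
   
   ```java
   import java.util.ArrayList;
   import java.util.IdentityHashMap;
   import java.util.List;
   
   public class CycleCheckDemo {
     // Hypothetical stand-in for a schema node; a field of type List<Employee>
     // inside Employee makes the node reference itself, just like a recursive
     // Avro record schema.
     static final class Node {
       final String name;
       final List<Node> children = new ArrayList<>();
       Node(String name) { this.name = name; }
     }
   
     // Mirrors the containsKey(...) guard quoted above: the first revisit of
     // the same schema object aborts conversion, regardless of data depth.
     static void convert(Node node, IdentityHashMap<Node, Boolean> seen) {
       if (seen.containsKey(node)) {
         throw new UnsupportedOperationException(
             "Recursive schemas are not supported: " + node.name);
       }
       seen.put(node, Boolean.TRUE);
       for (Node child : node.children) {
         convert(child, seen);
       }
       seen.remove(node);
     }
   
     public static void main(String[] args) {
       Node employee = new Node("Employee");
       employee.children.add(employee); // self-reference: Employee -> Employee
       try {
         convert(employee, new IdentityHashMap<>());
       } catch (UnsupportedOperationException e) {
         System.out.println("rejected: " + e.getMessage());
       }
     }
   }
   ```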
   
   ### What our patch achieves (max-depth)
   
   We don’t block recursion outright. Instead, with a **configurable 
max-depth** (e.g., default = 10), we allow traversal to continue until that 
limit is reached. Only then do we prevent further descent. This allows 
recursive schemas to function *within a bounded, safe depth*.
   
   A typical use case is an `Employee` record with a `List<Employee>` for 
subordinates—common in organizational hierarchies.
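   The bounded-depth idea can be sketched independently of Avro. In this illustrative example (the `Employee` class is hypothetical, not from the patch), traversal simply stops descending once the depth budget is exhausted, which is the same principle the patch applies to schema conversion:
   
   ```java
   import java.util.List;
   
   public class BoundedTraversal {
     // Hypothetical recursive model: an employee with a list of subordinates.
     static final class Employee {
       final String name;
       final List<Employee> subordinates;
       Employee(String name, List<Employee> subordinates) {
         this.name = name;
         this.subordinates = subordinates;
       }
     }
   
     // Counts employees reachable within maxDepth levels, mirroring how the
     // converter stops descending once the configured limit is reached.
     static int countWithinDepth(Employee e, int maxDepth) {
       if (maxDepth <= 0) {
         return 0; // depth exhausted: stop instead of recursing forever
       }
       int count = 1;
       for (Employee sub : e.subordinates) {
         count += countWithinDepth(sub, maxDepth - 1);
       }
       return count;
     }
   
     public static void main(String[] args) {
       // The org chart from the example below: A -> {B, C}, B -> {B1, B2},
       // B1 -> {B1-1, B1-2}, B1-2 -> {B1-2-1}.
       Employee lead = new Employee("B1-2", List.of(new Employee("B1-2-1", List.of())));
       Employee mgr = new Employee("B1", List.of(new Employee("B1-1", List.of()), lead));
       Employee cto = new Employee("B", List.of(mgr, new Employee("B2", List.of())));
       Employee ceo = new Employee("A", List.of(cto, new Employee("C", List.of())));
   
       // With max depth = 4, everything down to B1-2 is included; only the
       // level-5 node B1-2-1 (SDE) falls outside the budget.
       System.out.println(countWithinDepth(ceo, 4)); // prints 7
     }
   }
   ```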
   
   Here’s an example payload (max depth = 4) that works with our patch but 
fails with the current implementation (IdentityHashMap):
   
   ```
   A (CEO)
   |
   |--- B (CTO)
   |    |
   |    |--- B1 (Mgr 1)
   |    |    |--- B1-1 (Architect)
   |    |    |--- B1-2 (Team Lead)
   |    |         |--- B1-2-1 (SDE)
   |    |
   |    |--- B2 (Product Owner)
   |
   |--- C (HR)
   ```
   
   * **Current patch (IdentityHashMap):** Rejects the recursive reference at 
the first encounter.
   * **Max-depth patch:** Supports the structure up to the safe limit, enabling 
real-world recursive models like trees or org charts.
   
   ### For reference: Similar approach advocated elsewhere
   
   This 
   [Medium](https://medium.com/hadoop-noob/recursive-avro-schema-for-parquet-a87c84b6aca1?utm_source=chatgpt.com)
   post advocates the same approach: stop schema generation once a specified
   `max_depth` is reached, so recursive Avro schemas can be handled for Parquet
   gracefully.
   
   ### Here’s the essential snippet from our max-depth patch:
   
   ```java
   private final int maxDepth = Integer.parseInt(
       getApplicationProps().getProperty("parquet.schema.parser.recursion.max.depth", "10"));
   
   // MAX-DEPTH PATCH START
   public MessageType convert(Schema avroSchema) {
     if (!avroSchema.getType().equals(Schema.Type.RECORD)) {
       throw new IllegalArgumentException("Avro schema must be a record.");
     }
     return new MessageType(
         avroSchema.getFullName(),
         convertFields(avroSchema.getFields(), "", maxDepth));
   }
   
   private List<Type> convertFields(List<Schema.Field> fields, String schemaPath, int depth) {
     var types = new ArrayList<Type>();
     --depth;
     for (Schema.Field field : fields) {
       if (field.schema().getType().equals(NULL)) {
         continue;
       }
       // Once the depth budget is spent, record-typed fields are dropped
       // instead of descending further.
       if (depth > 0 || !isRecord(field)) {
         types.add(convertField(field, appendPath(schemaPath, field.name()), depth));
       }
     }
     return types;
   }
   
   private boolean isRecord(Schema.Field field) {
     var fieldSchema = field.schema();
     if (UNION == fieldSchema.getType()) {
       var optionalNonNullUnionElementSchema = fieldSchema.getTypes().stream()
           .filter(schema -> NULL != schema.getType())
           .findFirst();
       if (optionalNonNullUnionElementSchema.isPresent()) {
         var nonNullUnionElementSchema = optionalNonNullUnionElementSchema.get();
         if (ARRAY == nonNullUnionElementSchema.getType()) {
           return checkArrayForRecord(nonNullUnionElementSchema);
         } else if (MAP == nonNullUnionElementSchema.getType()) {
           return checkMapForRecord(nonNullUnionElementSchema);
         }
         return RECORD == nonNullUnionElementSchema.getType();
       }
       return false;
     } else if (ARRAY == fieldSchema.getType()) {
       return checkArrayForRecord(fieldSchema);
     } else if (MAP == fieldSchema.getType()) {
       return checkMapForRecord(fieldSchema);
     }
     return RECORD == fieldSchema.getType();
   }
   
   private boolean checkMapForRecord(Schema mapSchema) {
     var valueSchema = mapSchema.getValueType();
     if (UNION == valueSchema.getType()) {
       return valueSchema.getTypes().stream().anyMatch(schema -> RECORD == schema.getType());
     }
     return RECORD == valueSchema.getType();
   }
   
   private boolean checkArrayForRecord(Schema arraySchema) {
     var elementSchema = arraySchema.getElementType();
     if (UNION == elementSchema.getType()) {
       return elementSchema.getTypes().stream().anyMatch(schema -> RECORD == schema.getType());
     }
     return RECORD == elementSchema.getType();
   }
   
   // ... rest of convertField, convertUnion, etc., updated similarly ...
   // MAX-DEPTH PATCH END
   ```
   
   This enhancement makes recursive schemas **usable and safe**, rather than 
   simply blocked. The depth is configurable, so it can be tuned to the workload 
   and performance needs: the default of 10 has been stable, while values below 
   7 caused issues in our tests.
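   
   For illustration, here is how the configuration resolves, using a plain `Properties` object as a stand-in for the `getApplicationProps()` helper in the patch (the property name is taken from the snippet above; the rest is an assumed minimal sketch):
   
   ```java
   import java.util.Properties;
   
   public class MaxDepthConfig {
     public static void main(String[] args) {
       Properties props = new Properties();
   
       // Unset: falls back to the default of 10.
       int maxDepth = Integer.parseInt(
           props.getProperty("parquet.schema.parser.recursion.max.depth", "10"));
       System.out.println(maxDepth); // prints 10
   
       // Tuned for a shallower workload (staying at or above 7, per the tests
       // noted above).
       props.setProperty("parquet.schema.parser.recursion.max.depth", "7");
       maxDepth = Integer.parseInt(
           props.getProperty("parquet.schema.parser.recursion.max.depth", "10"));
       System.out.println(maxDepth); // prints 7
     }
   }
   ```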
   
   Let me know if my explanation makes sense or if you’d like me to clarify 
further.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

