geruh opened a new issue, #2792:
URL: https://github.com/apache/iceberg-python/issues/2792

   Related to #2775 
   
   In order to support scan planning for the REST catalog, the API returns 
file scan tasks as JSON, and I need to deserialize them into DataFile and 
DeleteFile objects. The response looks like this:
   ```
     {
         "plan-status": "completed",
         "delete-files": [
             {
                 "spec-id": 0,
                 "content": "position-deletes",
                 "file-path": "s3://bucket/deletes.parquet",
                 "file-format": "parquet",
                 "partition": ["test"],
                 "file-size-in-bytes": 1529,
                 "record-count": 1,
                 "column-sizes": {"keys": [2147483546], "values": [134]},
                 "lower-bounds": {"keys": [2147483546], "values": 
["73333A2F..."]},
                 ...
             }
         ],
         "file-scan-tasks": [
             {
                 "data-file": {
                     "spec-id": 0,
                     "content": "data",
                     "file-path": "s3://bucket/data.parquet",
                     ...
                 },
                 "delete-file-references": [0],
                 "residual-filter": true
             }
         ]
     }
   ```
   The format is defined in the [REST catalog OpenAPI spec](https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L4337-L4389), and Java parses it via [ContentFileParser.java](https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/ContentFileParser.java).
   
   
   ## Issue 
   
   The REST API representation differs from our internal representation (the 
conversions are sketched below):
     - Partition is an unbound list like `["test"]` instead of a `Record`
     - Maps are `{"keys": [...], "values": [...]}` instead of `{key: value}`
     - Bounds are serialized primitives (e.g. hex-encoded bytes)
     - `content` is a string like `position-deletes` instead of an enum int
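   
   For illustration, the conversions themselves are mechanical. A minimal 
sketch (helper names are mine; I'm assuming the hex encoding shown in the 
example above):
   
   ```
      def rest_map_to_dict(m: dict) -> dict:
          # {"keys": [...], "values": [...]} -> {key: value}
          return dict(zip(m["keys"], m["values"]))

      def rest_bounds_to_bytes(m: dict) -> dict:
          # hex-encoded bound values -> raw bytes
          return {k: bytes.fromhex(v) for k, v in rest_map_to_dict(m).items()}
   ```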
   
   Our DataFile isn't a Pydantic model, so we can't simply call 
`DataFile.model_validate(json)` with validators to handle these conversions. 
Also, DataFile handles both data files and delete files via the content field, 
so it's really a "content file".
   
    ## Options
   
   **1. Translation layer (`RestContentFile`)**
   
   A Pydantic model that parses the JSON, with validators for the format 
differences, then converts to DataFile.
   
   The downside is that there will be a lot of duplicated logic. I've tested 
this here: 
https://github.com/geruh/iceberg-python/blob/11ea47487e8f428585b17e6dd912df3c57041359/pyiceberg/rest/models.py#L83
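   
   For reference, a stripped-down version of that translation layer (fields 
abbreviated; names are illustrative rather than copied from the linked branch):
   
   ```
      from typing import Dict, Optional

      from pydantic import BaseModel, Field, field_validator

      class RestContentFile(BaseModel):
          content: str
          file_path: str = Field(alias="file-path")
          column_sizes: Optional[Dict[int, int]] = Field(default=None, alias="column-sizes")

          @field_validator("column_sizes", mode="before")
          @classmethod
          def _unzip_map(cls, v):
              # REST maps arrive as {"keys": [...], "values": [...]}
              if isinstance(v, dict) and "keys" in v:
                  return dict(zip(v["keys"], v["values"]))
              return v

          def to_data_file(self) -> DataFile:
              # field-by-field conversion to the internal DataFile;
              # this is where the duplicated logic piles up
              ...
   ```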
   
   **2. Make DataFile Pydantic**
   
   Then it could parse the JSON directly with Pydantic.
   
   The challenge is that DataFile is coupled to the Avro field layout and 
extends Record: the Avro reader constructs objects with positional args like 
`DataFile(None, None, ...)` and then fills fields by index. We'd need to 
reconcile that with Pydantic's keyword-based construction.
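   
   Concretely, the construction pattern the reader relies on looks roughly 
like this (simplified):
   
   ```
      # One positional slot per field in DATA_FILE_TYPE, filled in by index
      data_file = DataFile(*([None] * len(DATA_FILE_TYPE.fields)))
      data_file[0] = DataFileContent.DATA        # content
      data_file[1] = "s3://bucket/data.parquet"  # file_path
   ```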
   
   **3. Manual parsing**
   
   Transform the raw JSON dict manually and construct DataFile without 
Pydantic. No duplication, but we lose the validation benefits.
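   
   A rough sketch of what option 3 would look like (only a few fields shown; 
conversion of the remaining fields omitted, and the uppercasing of 
`file-format` is my assumption):
   
   ```
      _CONTENT = {
          "data": DataFileContent.DATA,
          "position-deletes": DataFileContent.POSITION_DELETES,
          "equality-deletes": DataFileContent.EQUALITY_DELETES,
      }

      def parse_content_file(obj: dict) -> DataFile:
          # Hand-rolled translation; without Pydantic, malformed input
          # surfaces as KeyError/ValueError rather than a validation error
          return DataFile(
              _CONTENT[obj["content"]],
              obj["file-path"],
              FileFormat(obj["file-format"].upper()),
              # ... remaining positional fields, converted one by one
          )
   ```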
   
   ## Recommendation
   
   I'm leaning towards option 2, since it would remove a lot of duplication. 
We can't directly extend both Record and BaseModel due to a metaclass 
conflict, but we can implement the same StructProtocol interface:
   
   ```
      from typing import Any, ClassVar

      from pydantic import Field

      class DataFile(IcebergBaseModel):
          content: DataFileContent = Field(default=DataFileContent.DATA)
          file_path: str = Field(alias="file-path")
          file_format: FileFormat = Field(alias="file-format")
          # ... remaining fields, with validators for the REST JSON conversions

          # Field order must match DATA_FILE_TYPE for Avro StructProtocol
          # compatibility: the Avro reader/writer accesses fields by position,
          # not by name.
          _FIELD_ORDER: ClassVar[tuple[str, ...]] = ("content", "file_path", ...)

          def __new__(cls, *args: Any, **kwargs: Any) -> "DataFile":
              if args and not kwargs:
                  # Positional args come from the Avro reader; bypass validation
                  return cls.model_construct(**dict(zip(cls._FIELD_ORDER, args)))
              return super().__new__(cls)

          def __init__(self, *args: Any, **kwargs: Any) -> None:
              if args and not kwargs:
                  # __init__ runs again on the instance __new__ returned;
                  # skip it so the model_construct state is preserved
                  return
              super().__init__(**kwargs)

          # StructProtocol interface
          def __getitem__(self, pos: int) -> Any:
              return getattr(self, self._FIELD_ORDER[pos])

          def __setitem__(self, pos: int, value: Any) -> None:
              setattr(self, self._FIELD_ORDER[pos], value)

          def __len__(self) -> int:
              return len(self._FIELD_ORDER)
   ```
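   
   With that in place, both construction paths would go through the same 
class. A sketch against the abbreviated fields above, where `task` is one 
entry of `file-scan-tasks` from the response:
   
   ```
      # REST path: validated parse straight from the JSON payload
      data_file = DataFile.model_validate(task["data-file"])

      # Avro path: positional construction filled by index, skipping validation
      data_file = DataFile(*([None] * len(DataFile._FIELD_ORDER)))
      data_file[1] = "s3://bucket/data.parquet"
   ```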
   
   Ultimately, I wanted to get input before making changes, since this touches 
a core model. Open to suggestions on the approach.
   
   cc: @Fokko @kevinjqliu @HonahX 

