Fokko commented on PR #4858:
URL: https://github.com/apache/iceberg/pull/4858#issuecomment-1140056996
> Is this worth it?
I think it is! Allow me to elaborate on why I think it is the case.
> Many of the classes in the REST spec mirror structures from TableMetadata,
> but when we talked last I thought that we were going to create to_dict /
> from_dict methods to serialize and deserialize those. Having autogenerated
> classes for those seems like duplication.
That's correct. The hand-written serialization methods would be made
unnecessary by this PR, because we get them for free:
```python
➜ python git:(fd-open-api) python3
Python 3.9.13 (main, May 19 2022, 13:48:47)
[Clang 13.1.6 (clang-1316.0.21.2)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from iceberg.openapi.rest_catalog import PartitionSpec, PartitionField
>>> spec = PartitionSpec(
... spec_id=123,
... fields=[
... PartitionField(
... field_id=234,
... source_id=1,
... name='date',
... transform='identity'
... )
... ]
... )
>>> spec
PartitionSpec(spec_id=123, fields=[PartitionField(field_id=234, source_id=1,
name='date', transform=Transform(__root__='identity'))])
# Serialize
>>> spec.json()
'{"spec_id": 123, "fields": [{"field_id": 234, "source_id": 1, "name":
"date", "transform": "identity"}]}'
# Deserialize
>>> PartitionSpec.parse_raw(spec.json())
PartitionSpec(spec_id=123, fields=[PartitionField(field_id=234, source_id=1,
name='date', transform=Transform(__root__='identity'))])
```
I think this is much less error-prone than code like this:
https://github.com/apache/iceberg/pull/3677/files#diff-9d9a8492ccd85cbbddfc202b9a954b57a079f2cca33e96eb0a9106cf1a4a8130R37-R58
On top of that, the maintenance cost is much higher when we have to keep the
classes in sync by hand. If we generate them, we can check (using mypy) whether
a change breaks compatibility.
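For contrast, here is a minimal sketch of the kind of hand-written (de)serialization this PR would make unnecessary. The class and field names mirror the `PartitionField` example above; this is not the actual code from the linked PR, just an illustration of the boilerplate that has to be kept in sync with the spec by hand:

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class PartitionField:
    field_id: int
    source_id: int
    name: str
    transform: str

    def to_dict(self) -> Dict[str, Any]:
        # Every field (and its spec alias) has to be listed by hand and
        # kept in sync with the OpenAPI spec manually.
        return {
            "field-id": self.field_id,
            "source-id": self.source_id,
            "name": self.name,
            "transform": self.transform,
        }

    @classmethod
    def from_dict(cls, d: Dict[str, Any]) -> "PartitionField":
        # The reverse mapping duplicates the same knowledge a second time.
        return cls(
            field_id=d["field-id"],
            source_id=d["source-id"],
            name=d["name"],
            transform=d["transform"],
        )
```

With the generated pydantic models, both directions (plus the alias handling) come for free from the spec.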
> Also, I don't think we're getting much out of this. This makes classes
> that are simple and could easily be statically defined. And it requires an
> additional library that looks similar to what we get from @dataclass. I'd
> probably opt not to pull in the additional dependency.
This is correct, but on top of the data classes we also get validators.
Instead of using a builder pattern, we can validate the input directly using
decorators: https://pydantic-docs.helpmanual.io/usage/validators/
This looks like:
```python
>>> from iceberg.openapi.rest_catalog import PartitionSpec, PartitionField
>>> spec = PartitionSpec(
... spec_id=123,
... fields=[
... PartitionField(
... field_id=234,
... name='date',
... transform='identity'
... )
... ]
... )
Traceback (most recent call last):
File "<stdin>", line 4, in <module>
File "pydantic/main.py", line 341, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for
PartitionField
source-id
field required (type=value_error.missing)
```
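Besides the required-field check shown above, custom rules can be added per field. As a stdlib-only approximation (so it runs without pydantic installed), the sketch below shows the kind of check that a pydantic `@validator` method would express declaratively on the model; the field names again mirror the `PartitionField` example and are assumptions, not the generated code:

```python
from dataclasses import dataclass


@dataclass
class PartitionField:
    field_id: int
    source_id: int
    name: str
    transform: str

    def __post_init__(self) -> None:
        # Hand-rolled equivalent of what a pydantic @validator would do:
        # reject invalid input at construction time, no builder needed.
        if not self.name:
            raise ValueError("name may not be empty")
        if self.field_id < 0:
            raise ValueError("field_id must be non-negative")
```

With pydantic these checks live in decorated validator methods on the model, and failures surface as the same structured `ValidationError` shown in the traceback above.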
> I'm generally unsure about using the OpenAPI doc currently for generating
> classes, as it's presently intended more for documentation purposes than for
> usage for models and generating code. To that extent, it seems that the
> generated classes are at times not that helpful (like Updates).
I see your point, but it is a bit of a catch-22: until we start using it, it
won't mature. Instead of adding methods to hand-written Python classes, it is
just as simple to amend the OpenAPI spec and regenerate the code.
> I've also noticed a lot of variance in the generated code depending on
> which library is used to do it. I know one Python generation tool I used to
> generate code made all models just dictionaries... not even classes that were
> secretly dictionaries but just... dictionaries.
Yes, there is a variety of generators, and I found this one to be the best. It
also nicely generates all the aliases. I already ran into some issues with the
plain Python one:
https://github.com/openapi-generators/openapi-python-client/pull/618
> If we get to a place where the OpenAPI doc can be used as a drop-in, that
> would be great. But talking to some people who are very familiar with OpenAPI,
> my understanding is that some projects have behavior that's simply more complex
> than any of the OpenAPI generation tools can really account for. I've been told
> that either the code has to be written in a very specific style or that it can
> take years to get to such a place.
Great point. I think we will always need to extend the generated classes with
convenience methods or additional validation. An example of how to do this is
given in https://github.com/apache/iceberg/pull/4717#discussion_r882941190
> Is pydantic required if we were to move forward with this or is this just
> added in this PR for other reasons?
For this implementation, pydantic is required:
https://github.com/koxudaxi/datamodel-code-generator specifically generates
pydantic models (and brings in the validation as well). It fits our use case
perfectly:
https://python.plainenglish.io/an-introduction-to-the-pydantic-stack-9e490d606c8d
It comes out of the FastAPI ecosystem and is becoming increasingly popular.
Also, pydantic doesn't pull in the whole universe; it has just one dependency:
https://github.com/samuelcolvin/pydantic/blob/master/setup.py#L131-L133
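For reference, regenerating the models is a single CLI call. The file names below are illustrative assumptions, not taken from this PR:

```shell
pip install datamodel-code-generator
# Regenerate the pydantic models from the OpenAPI spec (paths assumed):
datamodel-codegen \
  --input rest-catalog-open-api.yaml \
  --input-file-type openapi \
  --output rest_catalog.py
```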
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]