[ 
https://issues.apache.org/jira/browse/BEAM-8732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17056429#comment-17056429
 ] 

Brian Hulette commented on BEAM-8732:
-------------------------------------

I don't think I'll have the bandwidth to take this on for a while, but I could 
prioritize it if it's important. I have been thinking about it and discussing 
it offline with [~robertwb], though. Some thoughts:

I think what I suggested about deterministically generating UUIDs is a 
non-starter, so we will need to store some serialized Python _somewhere_ in the 
pipeline graph. The simplest thing to do is store it in the PCollection/Coder, 
like Java does with SchemaCoder. This has some drawbacks though: it makes the 
coder look non-standard, so runners can't inspect it, and conversions 
need to be inserted for cross-language (xlang) pipelines.

The alternative is to associate the user type with the transform(s) that use 
it. I think this is what we should try to do long-term, and it would be great if 
we could do it in Python from the outset, but it's also more difficult to 
get right, so it could be left as a future improvement.
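
To make the trade-off concrete, here is a toy model of the two placements described above. It is only a sketch and not Beam's actual data model: the {{Coder}} and {{Transform}} classes, the {{runner_can_inspect}} helper, and the custom URN are all made up for illustration, and the serialized type is just an opaque placeholder blob.

```python
from dataclasses import dataclass

# "beam:coder:row:v1" is the standard row coder URN; the custom URN below is
# a hypothetical name invented for this sketch.
STANDARD_CODER_URNS = {"beam:coder:row:v1"}
CUSTOM_CODER_URN = "example:coder:python_row:v1"


@dataclass(frozen=True)
class Coder:
    """Toy stand-in for a coder in the pipeline graph."""
    urn: str
    schema_id: str
    serialized_type: bytes = b""  # opaque blob, e.g. a pickled class


@dataclass(frozen=True)
class Transform:
    """Toy stand-in for a transform that consumes rows."""
    name: str
    input_coder: Coder
    serialized_type: bytes = b""


def runner_can_inspect(coder):
    """A runner can only look inside coders whose URN it recognizes."""
    return coder.urn in STANDARD_CODER_URNS


blob = b"<serialized user type>"  # placeholder for the serialized Python

# Option 1: the type travels with the coder, so the coder needs its own URN.
coder_with_type = Coder(CUSTOM_CODER_URN, "schema-1", blob)

# Option 2: the coder stays standard and the consuming transform carries the type.
plain_coder = Coder("beam:coder:row:v1", "schema-1")
transform_with_type = Transform("MyDoFn", plain_coder, blob)

print(runner_can_inspect(coder_with_type))                  # option 1
print(runner_can_inspect(transform_with_type.input_coder))  # option 2
```

In this model only the second option leaves the coder recognizable to the runner, at the cost of every transform that needs the user type having to carry it.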

> Add support for additional structured types to Schemas/RowCoders
> ----------------------------------------------------------------
>
>                 Key: BEAM-8732
>                 URL: https://issues.apache.org/jira/browse/BEAM-8732
>             Project: Beam
>          Issue Type: New Feature
>          Components: sdk-py-core
>            Reporter: Chad Dombrova
>            Priority: Major
>
> Currently we can convert between a {{NamedTuple}} type and its {{Schema}} 
> protos using {{named_tuple_from_schema}} and {{named_tuple_to_schema}}. I'd 
> like to introduce a system to support additional types, starting with 
> structured types like {{attrs}}, {{dataclasses}}, and {{TypedDict}}.
> I've only just started digesting the code, but this task seems pretty 
> straightforward. For example, I think the type-to-schema code would look 
> roughly like this:
> {code:python}
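> def typing_to_runner_api(type_):
>   # type: (Type) -> schema_pb2.FieldType
>   # Look up a handler for structured types (attrs, dataclasses, TypedDict).
>   structured_handler = _get_structured_handler(type_)
>   if structured_handler:
>     schema = None
>     # Reuse the registered schema if the type already carries a schema id.
>     if hasattr(type_, 'id'):
>       schema = SCHEMA_REGISTRY.get_schema_by_id(type_.id)
>     if schema is None:
>       # Otherwise build a schema under a fresh random id and register it.
>       fields = structured_handler.get_fields()
>       type_id = str(uuid4())
>       schema = schema_pb2.Schema(fields=fields, id=type_id)
>       SCHEMA_REGISTRY.add(type_, schema)
>     return schema_pb2.FieldType(
>         row_type=schema_pb2.RowType(
>             schema=schema))
>   # Handling of all other types is omitted from this sketch.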
> {code}
> The rest of the work would be in implementing a class hierarchy for working 
> with structured types, such as getting a list of fields from an instance, and 
> instantiation from a list of fields. Eventually we can extend this behavior 
> to arbitrary, unstructured types.  
> Going in the schema-to-type direction, we have the problem of choosing which 
> type to use for a given schema. I believe that as long as 
> {{typing_to_runner_api()}} has been called on our structured type in the 
> current Python session, the type will have been added to the registry and 
> will thus round-trip OK, so I think we just need a public function for 
> registering schemas for structured types.
> [~bhulette] Did you want to tackle this or are you ok with me going after it?
>  
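
For reference, the handler-plus-registry idea in the description above could be sketched roughly as follows. This is a standalone illustration and not Beam code: {{Schema}} is a plain stand-in for {{schema_pb2.Schema}}, and {{DataclassHandler}}, {{register_structured_type}} and {{type_from_schema}} are hypothetical names. Handlers for {{attrs}} or {{TypedDict}} would presumably expose the same four methods.

```python
import dataclasses
import typing
from uuid import uuid4


class Schema(typing.NamedTuple):
    """Stand-in for schema_pb2.Schema: an id plus (name, type) field pairs."""
    id: str
    fields: tuple


class DataclassHandler:
    """Hypothetical handler that knows how to work with dataclass types."""

    @staticmethod
    def matches(type_):
        return isinstance(type_, type) and dataclasses.is_dataclass(type_)

    @staticmethod
    def get_fields(type_):
        return tuple((f.name, f.type) for f in dataclasses.fields(type_))

    @staticmethod
    def to_values(instance):
        # List of field values from an instance, in declaration order.
        return [getattr(instance, f.name) for f in dataclasses.fields(instance)]

    @staticmethod
    def from_values(type_, values):
        # Instantiate the type from a list of field values.
        return type_(*values)


_HANDLERS = [DataclassHandler]
_SCHEMA_BY_TYPE = {}
_TYPE_BY_ID = {}


def register_structured_type(type_):
    """Public entry point: register a type and return its schema."""
    if type_ in _SCHEMA_BY_TYPE:
        return _SCHEMA_BY_TYPE[type_]
    handler = next((h for h in _HANDLERS if h.matches(type_)), None)
    if handler is None:
        raise TypeError("no structured handler for %r" % (type_,))
    schema = Schema(id=str(uuid4()), fields=handler.get_fields(type_))
    _SCHEMA_BY_TYPE[type_] = schema
    _TYPE_BY_ID[schema.id] = type_
    return schema


def type_from_schema(schema):
    """Schema-to-type direction: only works for types registered in this session."""
    return _TYPE_BY_ID[schema.id]


@dataclasses.dataclass
class Point:
    x: int
    y: int


schema = register_structured_type(Point)
values = DataclassHandler.to_values(Point(1, 2))
print(type_from_schema(schema) is Point)
print(DataclassHandler.from_values(Point, values))
```

Because the id is a random UUID, the lookup in {{type_from_schema}} only succeeds in the session that registered the type. That is the same limitation that motivates storing serialized Python in the pipeline graph.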



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
