TheNeuralBit commented on a change in pull request #13005:
URL: https://github.com/apache/beam/pull/13005#discussion_r501372151
##########
File path: website/www/site/content/en/documentation/programming-guide.md
##########
@@ -2546,10 +2611,68 @@ public abstract class TransactionValue {
}
{{< /highlight >}}
+{{< paragraph class="language-java" >}}
This is all that’s needed to generate a simple AutoValue class, and the above
`@DefaultSchema` annotation tells Beam to
infer a schema from it. This also allows AutoValue elements to be used inside
of `PCollection`s.
+{{< /paragraph >}}
+{{< paragraph class="language-java" >}}
`@SchemaFieldName` and `@SchemaIgnore` can be used to alter the schema
inferred.
+{{< /paragraph >}}
+
+{{< paragraph class="language-py" >}}
+Beam has a few different mechanisms for inferring schemas from Python code.
+{{< /paragraph >}}
+
+{{< paragraph class="language-py" >}}
+**NamedTuple classes**
+{{< /paragraph >}}
+
+{{< paragraph class="language-py" >}}
+A [NamedTuple](https://docs.python.org/3/library/typing.html#typing.NamedTuple)
+class is a Python class that wraps a `tuple`, assigning a name to each element
+and restricting it to a particular type. Beam will automatically infer the
+schema for PCollections with `NamedTuple` output types. For example:
+{{< /paragraph >}}
+
+{{< highlight py >}}
+class Transaction(typing.NamedTuple):
+ bank: str
+ purchase_amount: float
+
+pc = input | beam.Map(lambda ...).with_output_types(Transaction)
+{{< /highlight >}}
+
+
+{{< paragraph class="language-py" >}}
+**beam.Row**
+{{< /paragraph >}}
+
+{{< paragraph class="language-py" >}}
+It's also possible to create ad-hoc schema declarations with a simple lambda
+that returns instances of `beam.Row`:
+{{< /paragraph >}}
+
+{{< highlight py >}}
+input_pc = ... # {"bank": ..., "purchase_amount": ...}
+output_pc = input_pc | beam.Map(lambda item: beam.Row(
+    bank=item["bank"],
+    purchase_amount=item["purchase_amount"]))
+{{< /highlight >}}
+
+{{< paragraph class="language-py" >}}
+Note that this declaration doesn't include any specific information about the
+types of the `bank` and `purchase_amount` fields. Beam will attempt to infer
+type information; if it is unable to, it will fall back to the generic type
+`Any`. When this is not ideal, you can use casts to make sure Beam
+correctly infers types with `beam.Row`:
+{{< /paragraph >}}
+
+{{< highlight py >}}
+input_pc = ... # {"bank": ..., "purchase_amount": ...}
+output_pc = input_pc | beam.Map(lambda item: beam.Row(
+    bank=str(item["bank"]),
+    purchase_amount=float(item["purchase_amount"])))
+{{< /highlight >}}
+
Review comment:
Good idea! I'll add it in a separate PR, and we can merge it once Select
is available in 2.25.0.
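
For reference, here is a minimal standard-library sketch of what the `NamedTuple` declaration in the diff exposes: field names and declared types that can be read by ordinary introspection. It does not import Beam and does not show how Beam's schema inference is actually implemented. The helper `to_transaction` and the sample values are hypothetical; the casts mirror the `beam.Row` example with `str(...)` and `float(...)`.

```python
import typing


class Transaction(typing.NamedTuple):
    bank: str
    purchase_amount: float


def to_transaction(item: dict) -> Transaction:
    """Hypothetical helper: build a Transaction from a dict, casting each field.

    The annotations on a NamedTuple are not enforced at runtime, so the
    explicit casts are what guarantee the value types here.
    """
    return Transaction(
        bank=str(item["bank"]),
        purchase_amount=float(item["purchase_amount"]),
    )


# Field names and declared types, read with standard introspection.
field_names = Transaction._fields
field_types = typing.get_type_hints(Transaction)

# Sample input is made up for illustration; note the amount arrives as a string.
row = to_transaction({"bank": "example_bank", "purchase_amount": "12.50"})
```

In a pipeline this would presumably be applied as `beam.Map(to_transaction).with_output_types(Transaction)`, following the pattern in the diff above; that step is not exercised here.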