[
https://issues.apache.org/jira/browse/BEAM-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16191085#comment-16191085
]
Ryan Skraba commented on BEAM-2993:
-----------------------------------
Hello -- I'm chiming in to help clarify our use case, which is a bit
specialized. However, if it's useful for us, it's potentially useful to others!
As part of our work using Beam, we help users assemble pipelines to run using
configured "components". These are eventually translated to PTransforms, of
course, acting on PCollections -- nothing surprising! We've picked Avro
IndexedRecords (not GenericRecords, but that's a detail for the moment) as the
common currency between the PTransforms. This works well, especially if you
know every schema on every collection at design-time, when you're building your
pipeline.
[~jkff] is correct that *if* the schema is known *and* you are using
{{AvroCoder}}, you already have the schema in your hands when you build the
{{AvroIO.write}} and all is well.
We have some advanced functionality, however, where we deduce the schema at
runtime -- either at the start of the pipeline (such as reading from a JDBC
table and converting the row + table metadata into a consistent IndexedRecord)
but also in the middle of a pipeline (we can infer a schema after some
user-defined processing). In brief, we can't directly use AvroCoder in this
case, but we can write our own {{Coder<IndexedRecord>}} that takes care of
sharing the schema between interested nodes when necessary (not with every
record).
In this case, we've managed to create a {{PCollection<IndexedRecord>}} while
designing the Pipeline, using our Coder that doesn't require the schema, but we
still can't attach it to the {{AvroIO.write}}...
Note that "sharing the schema between interested nodes" in our custom coder
introduces a distributed state between nodes, which is not the ideal for
parallelization -- in this case, we've measured the cost to be acceptable since
it only occurs the first time a node tries to write or read an avro-encoded
record from the collection.
That's a long detour to explain why {{AvroIO.write}} without schema would be
interesting to us, but I hope you find it useful. Our technique for sharing
the schema as distributed state in the Coder is a much larger view but I'm very
sure we'd be interested in contributing!
> AvroIO.write without specifying a schema
> ----------------------------------------
>
> Key: BEAM-2993
> URL: https://issues.apache.org/jira/browse/BEAM-2993
> Project: Beam
> Issue Type: Improvement
> Components: sdk-java-extensions
> Reporter: Etienne Chauchot
> Assignee: Etienne Chauchot
>
> Similarly to https://issues.apache.org/jira/browse/BEAM-2677, we should be
> able to write to avro files using {{AvroIO}} without specifying a schema at
> build time. Consider the following use case: a user has a
> {{PCollection<GenericRecord>}} but the schema is only known while running
> the pipeline. {{AvroIO.writeGenericRecords}} needs the schema, but the
> schema is already available in {{GenericRecord}}. We should be able to call
> {{AvroIO.writeGenericRecords()}} with no schema.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)