GitHub user echauchot opened a pull request:
https://github.com/apache/beam/pull/3950
[BEAM-2993] AvroIO.write without specifying a schema
Follow this checklist to help us incorporate your contribution quickly and
easily:
- [X] Make sure there is a [JIRA
issue](https://issues.apache.org/jira/projects/BEAM/issues/) filed for the
change (usually before you start working on it). Trivial changes like typos do
not require a JIRA issue. Your pull request should address just this issue,
without pulling in other changes.
- [X] Each commit in the pull request should have a meaningful subject
line and body.
- [X] Format the pull request title like `[BEAM-XXX] Fixes bug in
ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA
issue.
- [X] Write a pull request description that is detailed enough to
understand what the pull request does, how, and why.
- [X] Run `mvn clean verify` to make sure basic checks pass. A more
thorough check will be performed on your pull request automatically.
- [X] If this contribution is large, please file an Apache [Individual
Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
---
This PR adds the ability to use `AvroIO.write()` and related methods
without specifying a schema.
The schema is determined at the first call of `AvroSink.write()`: the
`DataFileWriter` is lazily initialized (at first write), once we have a value
to get the schema from.
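As a rough illustration of the lazy-initialization idea only (not the actual
`AvroSink` code), the pattern can be sketched with plain Java. The class
`LazyWriterSketch`, the `schemaOf` function, and the `String` schema are
hypothetical stand-ins for the real `DataFileWriter` and Avro `Schema`:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical stand-in for a sink that learns its schema from the first value. */
class LazyWriterSketch<T> {
  // Assumption: some way to derive a schema description from a value.
  private final Function<T, String> schemaOf;
  // Both fields stay null until the first call to write().
  private String schema;
  private List<T> records; // stand-in for the underlying file writer

  LazyWriterSketch(Function<T, String> schemaOf) {
    this.schemaOf = schemaOf;
  }

  void write(T value) {
    if (records == null) {
      // First write: only now do we have a value to take the schema from.
      schema = schemaOf.apply(value);
      records = new ArrayList<>();
    }
    records.add(value);
  }

  boolean isInitialized() {
    return records != null;
  }

  String schema() {
    return schema;
  }

  int count() {
    return records == null ? 0 : records.size();
  }
}
```

The point of the sketch is only that nothing schema-dependent is created in
the constructor; everything is deferred to the first `write()`.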
This PR also makes the schema optional in `ConstantAvroDestination` and
deprecates the write methods that take a schema as a parameter. Tell me if I'm
missing something that prevents deprecation of these methods.
To use `AvroIO.write()` with no schema, all the elements of the input
PCollection must have the same schema. This is already the case with the current
`AvroIO.write(schema)` implementation, because the schema instance is passed to
the `TypedWrite`, then to the `ConstantAvroDestination` that is used in
`AvroSink`. Please tell me if I'm missing something here.
My only concern is with empty bundles: `AvroSink.write()` will not be
called, so the `DataFileWriter` will never be initialized.
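To make the empty-bundle concern concrete, here is a minimal sketch, again
with hypothetical names (`EmptyBundleSketch`, a `StringBuilder` standing in for
the output file) rather than the real Beam or Avro classes. It assumes the
header carrying the schema is emitted only on the first write:

```java
/** Hypothetical sketch: a writer whose header is only emitted on first write. */
class EmptyBundleSketch {
  private final StringBuilder file = new StringBuilder(); // stand-in for the output file
  private boolean headerWritten = false;

  void write(String schema, String record) {
    if (!headerWritten) {
      // Lazy init: the header carrying the schema is written with the first record.
      file.append("schema=").append(schema).append('\n');
      headerWritten = true;
    }
    file.append(record).append('\n');
  }

  /** Returns true if a header was ever written, false for an empty bundle. */
  boolean close() {
    return headerWritten;
  }

  String contents() {
    return file.toString();
  }
}
```

With no call to `write()`, `close()` returns `false` and the output stays
empty, with no header at all. Whether that is acceptable, or whether such
bundles need special handling, is the open question raised above.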
Please merge the PR below before this one, because it is used as a base for
the tests:
https://github.com/apache/beam/pull/3948
R: @jkff
R: @reuvenlax
CC: @lukecwik
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/echauchot/beam AvroIOWriteSchemaLess2
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/beam/pull/3950.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #3950
----
commit 43ef4d42d7d224b1997278832ec645bccb945792
Author: Etienne Chauchot <[email protected]>
Date: 2017-10-05T09:45:12Z
[BEAM-3019] Make AvroIOWriteTransformTest more generic
make runTestWrite() more generic to be able to use GenericRecord[] as input
for writeGenericRecords test in place of AvroGeneratedUser
make readAvroFile() generic to be able to read GenericRecords using
GenericDatumReader for writeGenericRecords test
commit 84074e36085d76f569c89d4a29a647fc40b22531
Author: Etienne Chauchot <[email protected]>
Date: 2017-10-02T15:08:55Z
[BEAM-2993] AvroIO.write without specifying a schema
Lazy init (at first write) of the dataFileWriter once we have the value to
get the schema from.
Make schema optional in ConstantAvroDestination and depreciate write
methods that take schema as parameter
Cleaning
commit d19c2cb3538e5981e8138522d0c2138b455dec46
Author: Etienne Chauchot <[email protected]>
Date: 2017-10-05T12:08:04Z
Add tests of the schema less write methods
Cleaning
commit da95342353bd191c55d6d7768d4c052c531b8cf1
Author: Etienne Chauchot <[email protected]>
Date: 2017-10-05T12:43:29Z
Fixups
----