[jira] [Commented] (BEAM-3771) Unable to write using AvroIO without schema

Etienne Chauchot (JIRA) Tue, 06 Mar 2018 02:04:51 -0800

    [ 
https://issues.apache.org/jira/browse/BEAM-3771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16387554#comment-16387554
 ]


Etienne Chauchot commented on BEAM-3771:
----------------------------------------

Hi [~darshanmehta2], I have a similar use case. I previously opened a PR to 
avoid having to provide an Avro schema at compile time (see 
https://github.com/apache/beam/pull/3950). That PR was closed because, in 
some corner cases, deriving the schema at runtime from the {{GenericRecords}} 
stored in the PCollection can produce wrong output. See my last comment on 
https://issues.apache.org/jira/browse/BEAM-2993 for details on the 
corner case. 

The solution available to you is to create a {{PCollectionView}} in your 
pipeline that stores the result of {{getSchema()}} on your elements, and to use 
it as a side input for your regular PCollection. Here is some sample code: 
[https://github.com/echauchot/beam/blob/671e74b7934050c31e8fb46f1e9ced9da9ca867e/sdks/java/core/src/test/java/org/apache/beam/sdk/io/AvroIOTransformTest.java#L393]

I have marked this ticket as a duplicate of 
https://issues.apache.org/jira/browse/BEAM-2993

> Unable to write using AvroIO without schema
> -------------------------------------------
>
>                 Key: BEAM-3771
>                 URL: https://issues.apache.org/jira/browse/BEAM-3771
>             Project: Beam
>          Issue Type: Bug
>          Components: io-java-avro
>            Reporter: Darshan Mehta
>            Assignee: Chamikara Jayalath
>            Priority: Major
>             Fix For: Not applicable
>
>
> I am working on a specific use case where I don't know the schema while 
> writing a PCollection of GenericRecords to the file system. Here's how the 
> use case works:
>  * My Dataflow pipeline listens to a Pub/Sub subscription and receives 
> messages in this format: 
> {code:java}
> // {"schema" : <schema_id>, "payload" : "<payload>"}
> {code}
>  * It then extracts the id, looks it up in the schema registry and gets the 
> schema for that specific element
>  * The payload is then deserialised into a GenericRecord
>  * The PCollection of these records is forwarded to the BigQuery writer and 
> written to BigQuery
>  * It is then passed to the Storage writer, which writes to the file system 
> using AvroIO
>
> Now, I am struggling with the last step, as AvroIO expects a schema whereas I 
> do not know the schema at compile time. All I have is a bunch of elements 
> with a schema id embedded.
>
> Is there any way for AvroIO to write the records to the file system without 
> a schema? If not, do I have any other alternatives (formats) for writing to 
> the file system?
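
To make the difficulty in the quoted description concrete: an Avro file is written with a single schema, while the incoming elements may carry different schema ids. The following is only a minimal sketch in plain Java, not Beam or AvroIO code. The names {{Envelope}}, {{SchemaGrouping}}, {{groupBySchemaId}} and {{schemaFor}}, as well as the in-memory registry map, are hypothetical and introduced purely for illustration. It assumes messages have already been decoded into a schema id and a payload, and shows one way the elements could be bucketed per schema id so that each bucket has exactly one schema to write with.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a decoded message of the form
// {"schema" : <schema_id>, "payload" : "<payload>"}.
final class Envelope {
    final int schemaId;
    final String payload;

    Envelope(int schemaId, String payload) {
        this.schemaId = schemaId;
        this.payload = payload;
    }
}

final class SchemaGrouping {

    // Buckets payloads by schema id, keeping ids in first-seen order.
    // Each bucket then shares one schema, which is what a single Avro file needs.
    static Map<Integer, List<String>> groupBySchemaId(List<Envelope> envelopes) {
        Map<Integer, List<String>> groups = new LinkedHashMap<>();
        for (Envelope e : envelopes) {
            groups.computeIfAbsent(e.schemaId, id -> new ArrayList<>()).add(e.payload);
        }
        return groups;
    }

    // Looks up a schema definition in a hypothetical in-memory registry
    // (an assumption; a real registry would be a remote service).
    static String schemaFor(Map<Integer, String> registry, int schemaId) {
        String schema = registry.get(schemaId);
        if (schema == null) {
            throw new IllegalArgumentException("Unknown schema id: " + schemaId);
        }
        return schema;
    }

    public static void main(String[] args) {
        Map<Integer, String> registry = new HashMap<>();
        registry.put(1, "{\"type\":\"record\",\"name\":\"A\",\"fields\":[]}");
        registry.put(2, "{\"type\":\"record\",\"name\":\"B\",\"fields\":[]}");

        List<Envelope> input = Arrays.asList(
            new Envelope(1, "a"), new Envelope(2, "b"), new Envelope(1, "c"));

        // Print each schema id with its schema and the payloads that share it.
        for (Map.Entry<Integer, List<String>> g : groupBySchemaId(input).entrySet()) {
            System.out.println(g.getKey() + " -> " + schemaFor(registry, g.getKey())
                + " : " + g.getValue());
        }
    }
}
```

In a real pipeline the bucketing would have to happen inside the pipeline itself rather than on an in-memory list. The sketch only shows why a single compile-time schema cannot cover elements with mixed schema ids.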



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
