[ 
https://issues.apache.org/jira/browse/BEAM-10769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181348#comment-17181348
 ] 

Valentyn Tymofieiev commented on BEAM-10769:
--------------------------------------------

Beam switched to FastAvro as the default Avro library on Python 3. The 
fastavro-based Avro sink expects the schema as a dictionary, while the 
avro-python3-based Avro sink expects a schema that was previously parsed by 
avro.schema.Parse(). Fastavro will not accept a schema parsed by avro-python3.

When a user moves a pipeline that uses the WriteToAvro transform to Python 3 
but does not change how the schema is passed to the transform, and thus passes 
a schema parsed by avro.schema.Parse(), fastavro will not be able to parse the 
schema, since it expects the schema as a dictionary. FastAvro does not require 
a parsed schema, although supplying a schema parsed by fastavro works too.
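
As a minimal sketch of the dictionary form described above: the schema JSON is 
loaded with json.loads() and passed to the transform without going through 
avro.schema.Parse(). The record and field names and the build_write_transform 
helper are hypothetical and only serve as an illustration.

```python
import json

# Hypothetical example schema; the record and field names are illustrative.
SCHEMA_JSON = """
{
  "namespace": "example.avro",
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "favorite_number", "type": ["int", "null"]}
  ]
}
"""

# For the fastavro-based sink, a plain dictionary is enough.
# Do not call avro.schema.Parse() on the JSON string here.
schema_dict = json.loads(SCHEMA_JSON)


def build_write_transform(output_prefix):
    """Return a WriteToAvro transform that receives the dictionary schema.

    Hypothetical helper. Beam is imported lazily so that the schema handling
    above can be run without Beam installed.
    """
    from apache_beam.io.avroio import WriteToAvro
    return WriteToAvro(output_prefix, schema_dict)
```

The returned transform can then be applied to a PCollection of dictionaries 
whose keys match the field names in the schema.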

The error may manifest as follows:

{noformat}
...lib/python3.7/site-packages/apache_beam/io/avroio.py", line 634, in open
    return Writer(file_handle, self._schema, self._codec)
  File "fastavro/_write.pyx", line 522, in fastavro._write.Writer.__init__
  File "fastavro/_schema.pyx", line 71, in fastavro._schema.parse_schema
  File "fastavro/_schema.pyx", line 85, in fastavro._schema._parse_schema
TypeError: unhashable type: 'RecordSchema' [while running 
'SampleInfoToAvro/WriteToAvroFiles/Write/WriteImpl/WriteBundles']
{noformat}

To fix the error, users should pass the schema to the sink as a dictionary. 
https://github.com/apache/beam/pull/12638 is out to fix the documentation and 
catch these errors with a better error message.   
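
To illustrate what catching the problem early could look like, here is a rough 
sketch of a check that rejects a non-dictionary schema before it reaches 
fastavro. This is not the code from the pull request. The function name 
check_schema_is_dict and the message wording are assumptions made for this 
example, and the check only covers the dictionary case discussed in this issue.

```python
def check_schema_is_dict(schema):
    """Return the schema unchanged if it is a dictionary, else raise TypeError.

    Hypothetical helper: the name and the error text are illustrative and are
    not taken from Beam.
    """
    if not isinstance(schema, dict):
        raise TypeError(
            "The fastavro-based Avro sink expects the schema as a dictionary, "
            "but received an object of type %r. If the schema was parsed with "
            "avro.schema.Parse(), pass the unparsed dictionary instead."
            % type(schema).__name__)
    return schema
```

Calling such a check before constructing the sink would turn the "unhashable 
type" failure at write time into an error that names the actual cause.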

> Fix Avro IO documentation: when fastavro is used, do not pass schema parsed 
> by avro-python3.
> --------------------------------------------------------------------------------------------
>
>                 Key: BEAM-10769
>                 URL: https://issues.apache.org/jira/browse/BEAM-10769
>             Project: Beam
>          Issue Type: Bug
>          Components: io-py-gcp
>            Reporter: Valentyn Tymofieiev
>            Assignee: Valentyn Tymofieiev
>            Priority: P2
>          Time Spent: 1h 20m
>  Remaining Estimate: 0h
>




