[
https://issues.apache.org/jira/browse/BEAM-2810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16407962#comment-16407962
]
Chamikara Jayalath commented on BEAM-2810:
------------------------------------------
I only looked into fastavro. Basically, when reading files, we need to be able
to read records between two arbitrary byte positions. For example, assume we
want to read the byte range '[a, b)' of an Avro file. We should be able to seek
to position 'a' and read all Avro blocks that start within this range. It would
be great if pyavroc supports this, or if pyavroc or fastavro can be modified to
support it. Also, what are the cons of pyavroc compared to fastavro? Better
performance is a big plus, but we should also consider other factors, for
example, stability of the code, how well maintained it is, and ease of
deployment. See the following code location for how we do this using the
Apache Avro library.
[https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/avroio.py#L362]
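The assignment rule described above can be sketched in plain Python. This is
only an illustration of the splitting logic, not real fastavro or Beam code:
the block offsets and the split point are made up, and the helper
blocks_for_range is a hypothetical name.

```python
# Hypothetical illustration of the block-assignment rule for splittable Avro
# reads: a worker responsible for byte range [a, b) processes every block
# whose starting offset lies in that range, even if the block's bytes extend
# past b. Offsets below are invented for the example.

def blocks_for_range(block_offsets, a, b):
    """Return the offsets of blocks a worker owning [a, b) must read."""
    return [off for off in block_offsets if a <= off < b]

# Avro blocks (after the file header) starting at these hypothetical offsets:
offsets = [16, 1024, 2048, 3072, 4096]

# Two workers splitting the file at byte 2500:
first = blocks_for_range(offsets, 0, 2500)      # blocks at 16, 1024, 2048
second = blocks_for_range(offsets, 2500, 5000)  # blocks at 3072, 4096

# Every block is read exactly once across the two adjacent ranges.
assert sorted(first + second) == offsets
```

Because each block is assigned by its start offset alone, any library that can
seek to a position and scan forward to the next block boundary (via the sync
marker) can support this kind of parallel read.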
> Consider a faster Avro library in Python
> ----------------------------------------
>
> Key: BEAM-2810
> URL: https://issues.apache.org/jira/browse/BEAM-2810
> Project: Beam
> Issue Type: Bug
> Components: sdk-py-core
> Reporter: Eugene Kirpichov
> Priority: Major
>
> https://stackoverflow.com/questions/45870789/bottleneck-on-data-source
> Seems like this job is reading Avro files (exported by BigQuery) at about 2
> MB/s.
> We use the standard Python "avro" library which is apparently known to be
> very slow (10x+ slower than Java)
> http://apache-avro.679487.n3.nabble.com/Avro-decode-very-slow-in-Python-td4034422.html,
> and there are alternatives e.g. https://pypi.python.org/pypi/fastavro/
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)