[
https://issues.apache.org/jira/browse/BEAM-2810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16407962#comment-16407962
]
Chamikara Jayalath commented on BEAM-2810:
------------------------------------------
I only looked into fastavro. Basically, when reading files, we need to be able
to read records between two arbitrary byte positions. For example, assume we
want to read the byte range '[a, b)' of an Avro file. We should be able to seek
to position 'a' and read all Avro blocks that start within this range. It would
be great if pyavroc supports this, or if pyavroc or fastavro can be modified to
support it. Also, what are the cons of pyavroc compared to fastavro? Better
performance is a big plus, but we should also consider other factors, for
example, stability of the code, how well maintained it is, and ease of
deployment. See the following code location for how we do this using the
Apache Avro library.
[https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/avroio.py#L362]
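The assignment rule described above can be sketched in plain Python. This is
only an illustration of the splitting logic, not real fastavro or Beam code:
the block offsets and the split point are made up, and the helper
blocks_for_range is a hypothetical name.

```python
# Hypothetical illustration of the block-assignment rule for splittable Avro
# reads: a worker responsible for byte range [a, b) processes every block
# whose starting offset lies in that range, even if the block's bytes extend
# past b. Offsets below are invented for the example.

def blocks_for_range(block_offsets, a, b):
    """Return the offsets of blocks a worker owning [a, b) must read."""
    return [off for off in block_offsets if a <= off < b]

# Avro blocks (after the file header) starting at these hypothetical offsets:
offsets = [16, 1024, 2048, 3072, 4096]

# Two workers splitting the file at byte 2500:
first = blocks_for_range(offsets, 0, 2500)      # blocks at 16, 1024, 2048
second = blocks_for_range(offsets, 2500, 5000)  # blocks at 3072, 4096

# Every block is read exactly once across the two adjacent ranges.
assert sorted(first + second) == offsets
```

Because each block is assigned by its start offset alone, any library that can
seek to a position and scan forward to the next block boundary (via the sync
marker) can support this kind of parallel read.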
> Consider a faster Avro library in Python
> ----------------------------------------
>
> Key: BEAM-2810
> URL: https://issues.apache.org/jira/browse/BEAM-2810
> Project: Beam
> Issue Type: Bug
> Components: sdk-py-core
> Reporter: Eugene Kirpichov
> Priority: Major
>
> https://stackoverflow.com/questions/45870789/bottleneck-on-data-source
> Seems like this job is reading Avro files (exported by BigQuery) at about 2
> MB/s.
> We use the standard Python "avro" library which is apparently known to be
> very slow (10x+ slower than Java)
> http://apache-avro.679487.n3.nabble.com/Avro-decode-very-slow-in-Python-td4034422.html,
> and there are alternatives e.g. https://pypi.python.org/pypi/fastavro/
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)