[ https://issues.apache.org/jira/browse/BEAM-2810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16539079#comment-16539079 ]
Barry Hart commented on BEAM-2810:
----------------------------------

I am a fairly frequent contributor to the fastavro library. I'm happy to try and help if it needs some tweaks.

FWIW, the library is pretty mature and works well for our project. It has several small changes from time to time, but generally nothing major. Probably the last big changes were late 2017, when we did some Cython work to make reads about 30% faster and writes about 2x faster.

> Consider a faster Avro library in Python
> ----------------------------------------
>
>                 Key: BEAM-2810
>                 URL: https://issues.apache.org/jira/browse/BEAM-2810
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-py-core
>            Reporter: Eugene Kirpichov
>            Assignee: Ryan Williams
>            Priority: Major
>          Time Spent: 5h 40m
>  Remaining Estimate: 0h
>
> https://stackoverflow.com/questions/45870789/bottleneck-on-data-source
> Seems like this job is reading Avro files (exported by BigQuery) at about 2 MB/s.
> We use the standard Python "avro" library which is apparently known to be very slow (10x+ slower than Java)
> http://apache-avro.679487.n3.nabble.com/Avro-decode-very-slow-in-Python-td4034422.html,
> and there are alternatives e.g. https://pypi.python.org/pypi/fastavro/
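
For anyone who wants to check figures like the 2 MB/s above against their own files, here is a minimal sketch of a read-throughput measurement. It is not taken from Beam or fastavro: `measure_read_throughput`, `open_reader`, and `line_reader` are hypothetical names for illustration. It assumes only that a reader takes a binary file object and returns an iterable of records, which is how `fastavro.reader` is called. The stand-in reader keeps the sketch runnable without any Avro library installed.

```python
import os
import tempfile
import time


def measure_read_throughput(path, open_reader):
    """Read every record from `path` and report throughput.

    `open_reader` is a hypothetical callable: it takes a binary file object
    and returns an iterable of records (for example, `fastavro.reader`).
    """
    size_mb = os.path.getsize(path) / 1e6
    start = time.perf_counter()
    with open(path, "rb") as fo:
        # Consume the whole iterator so decoding cost is included in the timing.
        count = sum(1 for _ in open_reader(fo))
    elapsed = time.perf_counter() - start
    return {
        "records": count,
        "seconds": elapsed,
        "mb_per_s": size_mb / elapsed if elapsed > 0 else float("inf"),
    }


def line_reader(fo):
    """Stand-in reader (one record per line) so the sketch runs without Avro."""
    return iter(fo)


if __name__ == "__main__":
    # Write a small throwaway file and measure it with the stand-in reader.
    with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
        tmp.write(b"record\n" * 1000)
    try:
        print(measure_read_throughput(tmp.name, line_reader))
    finally:
        os.unlink(tmp.name)
```

Passing a real Avro reader in place of `line_reader`, on the same file, would give a like-for-like comparison between libraries. Actual numbers depend on the schema, codec, and hardware.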