[ https://issues.apache.org/jira/browse/BEAM-2810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16407538#comment-16407538 ]
Debasish Das commented on BEAM-2810:
------------------------------------

I will try reading from BigQuery in Beam directly, but during iterative processing an intermediate format like Avro can help. Parquet support would be even better, but I did not see much support for Parquet in the GCP ecosystem.

> Consider a faster Avro library in Python
> ----------------------------------------
>
> Key: BEAM-2810
> URL: https://issues.apache.org/jira/browse/BEAM-2810
> Project: Beam
> Issue Type: Bug
> Components: sdk-py-core
> Reporter: Eugene Kirpichov
> Priority: Major
>
> https://stackoverflow.com/questions/45870789/bottleneck-on-data-source
> Seems like this job is reading Avro files (exported by BigQuery) at about
> 2 MB/s.
> We use the standard Python "avro" library, which is apparently known to be
> very slow (10x+ slower than Java):
> http://apache-avro.679487.n3.nabble.com/Avro-decode-very-slow-in-Python-td4034422.html
> There are alternatives, e.g. https://pypi.python.org/pypi/fastavro/

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)