[ 
https://issues.apache.org/jira/browse/BEAM-2810?focusedWorklogId=112685&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-112685
 ]

ASF GitHub Bot logged work on BEAM-2810:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 18/Jun/18 05:00
            Start Date: 18/Jun/18 05:00
    Worklog Time Spent: 10m 
      Work Description: ryan-williams commented on issue #5496: do not merge! 
[BEAM-2810] use fastavro in Avro IO
URL: https://github.com/apache/beam/pull/5496#issuecomment-397942365
 
 
   OK, I think this is ready for a proper review!
   
   - [x] block-iteration code [merged into 
fastavro](https://github.com/fastavro/fastavro/pull/208) and released in 
[0.19.7](https://github.com/fastavro/fastavro/releases/tag/0.19.7)
   - [x] fastavro vs apache/avro is configurable in `avroio.py` via the 
`use_fastavro` argument to the relevant `PTransform`s 
([`ReadFromAvro`](https://github.com/apache/beam/pull/5496/files#diff-04fef9e0550df0b0c4e1cd0264406eb5R73),
 
[`WriteToAvro`](https://github.com/apache/beam/pull/5496/files#diff-04fef9e0550df0b0c4e1cd0264406eb5R464),
 etc; default: `False`).
   - [x] `avroio_test` runs all tests against 
[apache/avro](https://github.com/apache/beam/pull/5496/files#diff-5282dd5fac1c35c3a7b556447eb694aaR52)
 and 
[fastavro](https://github.com/apache/beam/pull/5496/files#diff-5282dd5fac1c35c3a7b556447eb694aaR447)
   - [x] significant speed boost (**4-5x** less vCPU and memory time, roughly 
1.5x shorter elapsed time) demonstrated in an example pipeline (discussion below)
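
As a rough sketch of how the `use_fastavro` switch described above might be used from pipeline code: the helper name `build_avro_io` and its parameters are hypothetical, and the keyword is assumed to be accepted by both `ReadFromAvro` and `WriteToAvro` as in this PR. The schema is passed through untouched, in whatever form the selected backend expects.

```python
def build_avro_io(input_pattern, output_prefix, schema, use_fastavro=False):
    """Return (read, write) transforms for the selected Avro backend.

    Hypothetical helper: it only forwards `use_fastavro` to the two
    transforms, mirroring the off-by-default behavior described above.
    """
    # Imported lazily so the sketch can be loaded without Beam installed.
    from apache_beam.io.avroio import ReadFromAvro, WriteToAvro

    read = ReadFromAvro(input_pattern, use_fastavro=use_fastavro)
    write = WriteToAvro(output_prefix, schema, use_fastavro=use_fastavro)
    return read, write
```

Constructing `ReadFromAvro` may check that the pattern matches existing files, so the helper is best called with a real input pattern. Keeping the default at `False` means existing pipelines are unaffected unless they opt in.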
   
   ## Example pipeline: 
[`sdks/python/examples/avro_bitcoin.py`](https://github.com/apache/beam/pull/5496/files#diff-3d963380bd7941037fba4ef3932a0cec)
   
   - I exported [the `bigquery-public-data:bitcoin_blockchain.transactions` 
table](https://bigquery.cloud.google.com/table/bigquery-public-data:bitcoin_blockchain.transactions)
 to public Avro files at 
[`gs://beam-avro-test/bitcoin/txns`](https://console.cloud.google.com/storage/browser/beam-avro-test/bitcoin/txns)
   - I ran the `avro_bitcoin` example pipeline (using DataflowRunner) on 
{fastavro,apache/avro} x {compressed,uncompressed}:
     - for example:
   
       ```bash
       python \
         -m apache_beam.examples.avro_bitcoin \
         --runner DataflowRunner \
         --project <project> \
         --temp_location gs://<tmp>/ \
         --sdk_location $PWD/python/dist/apache-beam-2.6.0.dev0.tar.gz \
         --output gs://beam-avro-test/bitcoin/txn-counts/fastavro-compressed \
         --fastavro \
         --compressed
       ```
   
     - outputs can be found at 
[`gs://beam-avro-test/bitcoin/txn-counts`](https://console.cloud.google.com/storage/browser/beam-avro-test/bitcoin/txn-counts/)
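
The {fastavro,apache/avro} x {compressed,uncompressed} matrix above can be enumerated with a short script. This is only a sketch: the `--fastavro` and `--compressed` flags and the `fastavro-compressed` output name come from the example command, while the helper `variant_args`, the `avro-` prefix for the apache/avro runs, and the idea that those runs simply omit the flags are assumptions.

```python
from itertools import product

# Output prefix taken from the example command above.
OUTPUT_ROOT = "gs://beam-avro-test/bitcoin/txn-counts"


def variant_args(use_fastavro, compressed):
    """Build the per-run flags for one cell of the benchmark matrix."""
    # Naming of the non-fastavro and uncompressed outputs is an assumption.
    library = "fastavro" if use_fastavro else "avro"
    codec = "compressed" if compressed else "uncompressed"
    args = ["--output", f"{OUTPUT_ROOT}/{library}-{codec}"]
    if use_fastavro:
        args.append("--fastavro")
    if compressed:
        args.append("--compressed")
    return args


# One entry per (use_fastavro, compressed) combination: four runs in total.
matrix = {
    (use_fastavro, compressed): variant_args(use_fastavro, compressed)
    for use_fastavro, compressed in product([True, False], repeat=2)
}

for key, args in matrix.items():
    print(key, " ".join(args))
```

Each printed line would be appended to the common part of the command (runner, project, temp location, SDK location).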
   
   
   ### Performance Measurements
   
   | Run | Elapsed time | Workers | vCPU-hrs | Memory (GB-hrs) | PD time (GB-hrs) |
   | -- | -- | -- | -- | -- | -- |
   | fastavro (compressed) | 11m31s | 1 | 0.147 | 0.553 | 36.8 |
   | fastavro (uncompressed) | 11m9s | 1 | 0.149 | 0.558 | 37.2 |
   | apache/avro (compressed) | 17m15s | 1→4 | 0.643 | 2.413 | 160.9 |
   | apache/avro (uncompressed) | 17m30s | 1→4 | 0.684 | 2.566 | 171.0 |
   
   The collected metrics were the same in all cases, but the apache/avro 
outputs had 9 shards where the fastavro outputs had 5. I'm guessing that's 
because the former scaled up to 4 workers while the latter used only 1.
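
To make the comparison in the table concrete, the ratios can be worked out directly from its numbers. The snippet below is just arithmetic on the values above; the dictionary layout and the `ratio` helper are illustrative names, not part of the PR.

```python
# Values copied from the table above; elapsed time converted to seconds.
RUNS = {
    "fastavro (compressed)": {"elapsed_s": 11 * 60 + 31, "vcpu_hrs": 0.147, "mem_gb_hrs": 0.553},
    "fastavro (uncompressed)": {"elapsed_s": 11 * 60 + 9, "vcpu_hrs": 0.149, "mem_gb_hrs": 0.558},
    "apache/avro (compressed)": {"elapsed_s": 17 * 60 + 15, "vcpu_hrs": 0.643, "mem_gb_hrs": 2.413},
    "apache/avro (uncompressed)": {"elapsed_s": 17 * 60 + 30, "vcpu_hrs": 0.684, "mem_gb_hrs": 2.566},
}


def ratio(metric, codec):
    """How many times larger the apache/avro value is than the fastavro one."""
    slow = RUNS[f"apache/avro ({codec})"][metric]
    fast = RUNS[f"fastavro ({codec})"][metric]
    return slow / fast


for codec in ("compressed", "uncompressed"):
    for metric in ("elapsed_s", "vcpu_hrs", "mem_gb_hrs"):
        print(f"{codec:>12} {metric:>10}: {ratio(metric, codec):.2f}x")
```

The elapsed-time ratio comes out near 1.5x, while the vCPU and memory ratios land between 4x and 5x, since the apache/avro runs scaled up to 4 workers.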
   
   ### Relevant screenshots from job-pages
   
   #### fastavro compressed:
   
     ![fastavro 
compressed](https://cl.ly/2q1u2b1I381E/Screen%20Shot%202018-06-17%20at%2011.59.15%20PM.png)
 
   
   #### fastavro uncompressed:
   
     ![fastavro 
uncompressed](https://cl.ly/3D1y2p2L242A/Screen%20Shot%202018-06-17%20at%2011.59.39%20PM.png)
 
   
   #### apache/avro compressed:
   
     ![apache 
compressed](https://cl.ly/1c093X0n440t/Screen%20Shot%202018-06-18%20at%2012.00.09%20AM.png)
 
   
   #### apache/avro uncompressed:
   
     ![apache 
uncompressed](https://cl.ly/2C263E3S3w2v/Screen%20Shot%202018-06-18%20at%2012.00.48%20AM.png)
   
   ## Open Questions
   - I'd discussed adding an integration test of this functionality with 
@chamikaramj, but I'm thinking the example pipeline (and off-by-default flag) 
may be enough to make us comfortable with this feature; open to others' thoughts 
there.
   - Is there a more appropriate bucket that the test-data should live in?



Issue Time Tracking
-------------------

    Worklog Id:     (was: 112685)
    Time Spent: 1.5h  (was: 1h 20m)

> Consider a faster Avro library in Python
> ----------------------------------------
>
>                 Key: BEAM-2810
>                 URL: https://issues.apache.org/jira/browse/BEAM-2810
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-py-core
>            Reporter: Eugene Kirpichov
>            Assignee: Ryan Williams
>            Priority: Major
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> https://stackoverflow.com/questions/45870789/bottleneck-on-data-source
> Seems like this job is reading Avro files (exported by BigQuery) at about 2 
> MB/s.
> We use the standard Python "avro" library which is apparently known to be 
> very slow (10x+ slower than Java) 
> http://apache-avro.679487.n3.nabble.com/Avro-decode-very-slow-in-Python-td4034422.html,
>  and there are alternatives e.g. https://pypi.python.org/pypi/fastavro/


