[jira] [Work logged] (BEAM-2810) Consider a faster Avro library in Python

ASF GitHub Bot (JIRA) Mon, 25 Jun 2018 06:38:44 -0700


     [ 
https://issues.apache.org/jira/browse/BEAM-2810?focusedWorklogId=115417&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-115417
 ]


ASF GitHub Bot logged work on BEAM-2810:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 25/Jun/18 13:37
            Start Date: 25/Jun/18 13:37
    Worklog Time Spent: 10m 
      Work Description: ryan-williams commented on issue #5496: [BEAM-2810] use 
fastavro in Avro IO
URL: https://github.com/apache/beam/pull/5496#issuecomment-399954425
 
 
   # Notes on the integration test, `fastavro_it_test`
   
   ## Benchmarks
   
   I set it to write 10MM synthetic records with each of fastavro and avro, 
read them back in (each side reading what it wrote), and then verify that the 
two read `PCollection`s are equal (via a `CoGroupByKey`).
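   
   To make the equality check concrete, here is a minimal plain-Python sketch of the idea. It is not the code from the PR and does not use Beam: `co_group_by_key` is a hypothetical stand-in for what `CoGroupByKey` produces, and the `id` key field and record shape are assumptions for illustration.

```python
from collections import defaultdict


def co_group_by_key(**tagged):
    """Plain-Python stand-in for CoGroupByKey: key -> {tag: [values]}."""
    grouped = defaultdict(lambda: {tag: [] for tag in tagged})
    for tag, pairs in tagged.items():
        for key, value in pairs:
            grouped[key][tag].append(value)
    return dict(grouped)


def mismatched_keys(fastavro_records, avro_records, key_field="id"):
    """Return the keys whose records differ between the two sides."""
    grouped = co_group_by_key(
        fastavro=[(r[key_field], r) for r in fastavro_records],
        avro=[(r[key_field], r) for r in avro_records],
    )
    return sorted(
        key for key, sides in grouped.items()
        # Expect exactly one record per key on each side, and equal contents.
        if len(sides["fastavro"]) != 1 or sides["fastavro"] != sides["avro"]
    )


# Hypothetical synthetic records; order should not matter to the check.
records = [{"id": i, "label": "record-%d" % i} for i in range(5)]
assert mismatched_keys(records, list(reversed(records))) == []
```

   Note that a check of this shape passes vacuously when both inputs are empty, so a read that matches zero files would not be caught by it.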
   
   ### "Write" pipeline: 10MM records
   
   The fastavro side is 3.5x faster:
   
   
![](https://cl.ly/2s3h1j3I1R3c/Screen%20Shot%202018-06-25%20at%209.20.02%20AM.png)
   
   ### "Read" pipeline: 10MM records
   
   The fastavro side is 6.3x faster:
   
   
![](https://cl.ly/3N072b3h0A1u/Screen%20Shot%202018-06-25%20at%209.21.22%20AM.png)
   
   Here are the total resource metrics:
   
   
![](https://cl.ly/0o3802393S1L/Screen%20Shot%202018-06-25%20at%209.26.18%20AM.png)
   
   The files written to disk total 87.9MiB, but the data appears to be 
≈1.38GiB uncompressed; here are the `CoGroupByKey`'s stats:
   
   
![](https://cl.ly/182g3z360p0p/Screen%20Shot%202018-06-25%20at%209.32.22%20AM.png)
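   
   As a rough back-of-the-envelope check, the two sizes quoted above imply a compression ratio of about 16x. The sketch below only redoes that arithmetic; it assumes both figures describe the same set of records, which the screenshots may or may not bear out.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

on_disk_bytes = 87.9 * MIB       # total size of the written files
uncompressed_bytes = 1.38 * GIB  # approximate size seen by the CoGroupByKey

# Assumption: both numbers cover the same records.
compression_ratio = uncompressed_bytes / on_disk_bytes
print("compression ratio: about %.1fx" % compression_ratio)
```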
   
   ## Issues
   
   ### `ReadAllFromAvro` with a glob works in `DataflowRunner`, but not locally via `DirectRunner`
   
   For some reason when I run it locally (with `DirectRunner`), the glob I'm 
using in `ReadAllFromAvro` is picking up 0 files, so the `CoGroupByKey` and 
`check` logic are effectively no-op'ing; I haven't figured out why that is yet.
   
   However, running with `DataflowRunner` against GCS behaves as expected. I 
tried adding a `time.sleep` in between the two pipelines to see whether the 
filesystem (I'm on macOS) is racing itself, but that didn't fix it, so I'm 
thinking maybe it has to do with the handling of `*`-globs on different 
filesystems?
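   
   One way to narrow this down might be to check what the same pattern matches outside of Beam. The sketch below uses only the standard-library `glob` module, so it says nothing about how Beam's own file matching behaves; it just helps separate "the files are not there or the pattern is wrong" from "the local matching differs". The shard-style file names are hypothetical.

```python
import glob
import os
import tempfile


def count_glob_matches(pattern):
    """Return how many local paths the stdlib glob module expands `pattern` to."""
    return len(glob.glob(pattern))


# Hypothetical layout: a few shard-like files in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    for shard in range(3):
        path = os.path.join(tmp_dir, "out-%05d-of-00003.avro" % shard)
        open(path, "wb").close()
    matched = count_glob_matches(os.path.join(tmp_dir, "out-*.avro"))
    unmatched = count_glob_matches(os.path.join(tmp_dir, "missing-*.avro"))
    assert matched == 3 and unmatched == 0
```

   If the standard library finds the files but the pipeline still reads nothing, the pattern itself is probably fine and the difference lies elsewhere.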
   
   ### Doesn't clean up temporary files
   
   I still need to add this; afaict I don't get this for free just by being an 
integration test.
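   
   For local output, the cleanup could look roughly like the following sketch. It is not what the PR does: `run_with_temp_dir` and `fake_pipeline` are hypothetical helpers, and output written to GCS would need a different deletion mechanism that is not shown here.

```python
import os
import shutil
import tempfile


def run_with_temp_dir(body):
    """Call body(tmp_dir) and remove tmp_dir afterwards, even on failure."""
    tmp_dir = tempfile.mkdtemp(prefix="fastavro_it_")
    try:
        return body(tmp_dir)
    finally:
        shutil.rmtree(tmp_dir, ignore_errors=True)


def fake_pipeline(tmp_dir):
    # Stand-in for the write/read pipelines: just drop a file in tmp_dir.
    with open(os.path.join(tmp_dir, "part-0.avro"), "wb") as f:
        f.write(b"placeholder")
    return tmp_dir


used_dir = run_with_temp_dir(fake_pipeline)
assert not os.path.exists(used_dir)
```

   The `try`/`finally` means the directory is removed whether the test body passes or raises.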
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 115417)
    Time Spent: 3.5h  (was: 3h 20m)

> Consider a faster Avro library in Python
> ----------------------------------------
>
>                 Key: BEAM-2810
>                 URL: https://issues.apache.org/jira/browse/BEAM-2810
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-py-core
>            Reporter: Eugene Kirpichov
>            Assignee: Ryan Williams
>            Priority: Major
>          Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> https://stackoverflow.com/questions/45870789/bottleneck-on-data-source
> Seems like this job is reading Avro files (exported by BigQuery) at about 2 
> MB/s.
> We use the standard Python "avro" library which is apparently known to be 
> very slow (10x+ slower than Java) 
> http://apache-avro.679487.n3.nabble.com/Avro-decode-very-slow-in-Python-td4034422.html,
>  and there are alternatives e.g. https://pypi.python.org/pypi/fastavro/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
