[GitHub] [avro] jklamer opened a new pull request #1608: [AVRO-3451] Reuse Resolved Schemas

GitBox Sat, 19 Mar 2022 15:40:31 -0700


jklamer opened a new pull request #1608:
URL: https://github.com/apache/avro/pull/1608



   This PR reuses the resolved schema struct to improve performance where 
possible. Currently, every write/append into a writer uses the same schema but 
resolves it again, indexing all the named schemas, for every object written. 
Because a writer's schema never changes, the same resolved schema can simply be 
reused for every write. This improvement was brought up in collaboration with 
@travisbrown.
   
   The following design constraints made the implementation less 
straightforward:
   - No API-breaking changes at all
   - Keep the resolved schema struct crate-private for simplicity
   
   As a result, initializing the resolved schema within the writer is somewhat 
complex, but I believe it is handled.
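
To illustrate the idea, here is a minimal sketch of a writer that resolves its schema once, on the first append, and reuses the result afterwards. It is not the code in this PR: `Schema`, `Resolved`, `Writer`, and `resolutions_after_appends` are simplified, hypothetical stand-ins, and the real types in the crate differ.

```rust
use std::collections::HashMap;

// Hypothetical, simplified schema type (not the avro crate's `Schema`).
enum Schema {
    Long,
    Record { name: String, fields: Vec<Schema> },
}

// Hypothetical index from schema name to definition, built by walking the schema.
struct Resolved<'s> {
    names: HashMap<&'s str, &'s Schema>,
}

impl<'s> Resolved<'s> {
    fn new(schema: &'s Schema) -> Self {
        let mut names = HashMap::new();
        Self::index(schema, &mut names);
        Resolved { names }
    }

    // Recursively record every named (record) schema.
    fn index(schema: &'s Schema, names: &mut HashMap<&'s str, &'s Schema>) {
        if let Schema::Record { name, fields } = schema {
            names.insert(name.as_str(), schema);
            for field in fields {
                Self::index(field, names);
            }
        }
    }
}

struct Writer<'s> {
    schema: &'s Schema,
    resolved: Option<Resolved<'s>>, // filled lazily on the first append
    resolve_count: usize,           // how many times the schema was resolved
    written: usize,
}

impl<'s> Writer<'s> {
    fn new(schema: &'s Schema) -> Self {
        Writer { schema, resolved: None, resolve_count: 0, written: 0 }
    }

    fn append(&mut self, _value: i64) {
        // Resolve only if no cached resolution exists yet.
        if self.resolved.is_none() {
            self.resolved = Some(Resolved::new(self.schema));
            self.resolve_count += 1;
        }
        let resolved = self.resolved.as_ref().expect("initialized above");
        // A real writer would validate and encode `_value` using `resolved.names`.
        let _ = resolved.names.len();
        self.written += 1;
    }
}

// Appends `n` values with one writer; returns (resolutions, records written).
fn resolutions_after_appends(n: usize) -> (usize, usize) {
    let schema = Schema::Record {
        name: "outer".to_string(),
        fields: vec![
            Schema::Long,
            Schema::Record { name: "inner".to_string(), fields: vec![Schema::Long] },
        ],
    };
    let mut writer = Writer::new(&schema);
    for i in 0..n {
        writer.append(i as i64);
    }
    (writer.resolve_count, writer.written)
}
```

In this sketch the cache is an `Option` filled lazily, so the constructor signature does not change; that is one way to respect a "no API changes" constraint, under the assumptions above.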
   
   ### Benchmark results:
   I used example/benchmark.rs to produce CSV output, running it on the 
pre-#1602 code (which the JIRA is based on) and on the latest commit of this 
branch. The benchmarks were run on a 2016 MacBook Pro. I ran them multiple 
times to confirm that the results were consistent and chose the last run for 
the comparisons. 
   
   #### Pre 1602
   | count |  runs  |  big_or_small |  total_write_secs | 
   |-------|--------|---------------|-------------------| 
   | 10000 | 1      | Small         | 0.080105792       | 
   | 10000 | 1      | Big           | 0.363642778       | 
   | 1     | 100000 | Small         | 5.450658665       | 
   | 100   | 1000   | Small         | 0.844267501       | 
   | 10000 | 10     | Small         | 0.799961709       | 
   | 1     | 100000 | Big           | 13.395100232      | 
   | 100   | 1000   | Big           | 4.254442101       | 
   | 10000 | 10     | Big           | 3.755155395       | 
   
   #### This branch 
   | count |  runs  |  big_or_small |  total_write_secs | 
   |-------|--------|---------------|-------------------| 
   | 10000 | 1      | Small         | 0.019134068       | 
   | 10000 | 1      | Big           | 0.089809544       | 
   | 1     | 100000 | Small         | 4.449467382       | 
   | 100   | 1000   | Small         | 0.307506175       | 
   | 10000 | 10     | Small         | 0.190783385       | 
   | 1     | 100000 | Big           | 11.514703263      | 
   | 100   | 1000   | Big           | 0.931118263       | 
   | 10000 | 10     | Big           | 1.042874368       | 
   
   #### Relative reduction, (initial - final) / initial, as a fraction 
   | count |  runs  |  big_or_small | reduction   | 
   |-------|--------|---------------|-------------| 
   | 10000 | 1      | Small         | 0.761140018 | 
   | 10000 | 1      | Big           | 0.753028110 | 
   | 1     | 100000 | Small         | 0.183682623 | 
   | 100   | 1000   | Small         | 0.635771631 | 
   | 10000 | 10     | Small         | 0.761509354 | 
   | 1     | 100000 | Big           | 0.140379462 | 
   | 100   | 1000   | Big           | 0.781142100 | 
   | 10000 | 10     | Big           | 0.722281968 | 
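
The reduction column above can be reproduced from the two timing tables. A small sketch, where `relative_reduction` is a hypothetical helper name and not part of the benchmark:

```rust
/// Fraction of write time saved: (initial - final) / initial.
/// `initial_secs` must be non-zero.
fn relative_reduction(initial_secs: f64, final_secs: f64) -> f64 {
    (initial_secs - final_secs) / initial_secs
}

/// Example using the first row of each table (count = 10000, runs = 1, Small).
fn first_row_reduction() -> f64 {
    relative_reduction(0.080105792, 0.019134068)
}
```

For that row the result is roughly 0.761, i.e. about a 76% reduction in total write time.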
   
   This is consistent with what's expected: the performance improvement is 
smallest when the writer is constantly being remade (count = 1). The 14%/18% 
improvement in those runs is a result of changes in #1602, which seem to have a 
highly variable performance impact depending on the schema. 
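
A rough way to see why the count = 1 rows benefit least is to count schema resolutions. The sketch below assumes that each run creates a fresh writer and writes `count` records with it, that the old code resolved once per write, and that the new code resolves once per writer; `resolutions` is a hypothetical helper, not part of the benchmark.

```rust
/// Number of schema resolutions for `runs` writers that each write `count` records.
/// With reuse, each writer resolves once; without, every write resolves again.
fn resolutions(count: u64, runs: u64, reuse: bool) -> u64 {
    if reuse {
        runs
    } else {
        runs * count
    }
}

/// Resolutions saved by reusing the resolved schema.
fn resolutions_saved(count: u64, runs: u64) -> u64 {
    resolutions(count, runs, false) - resolutions(count, runs, true)
}
```

Under these assumptions, count = 1 with runs = 100000 saves no resolutions at all, while count = 10000 with runs = 10 drops from 100000 resolutions to 10.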
   
   ### Jira
   
   - [ ] My PR addresses the following [Avro 
Jira](https://issues.apache.org/jira/browse/AVRO-3451) 
   
   ### Tests
   
   My PR does not add tests because it does not add functionality. All tests 
pass as before. 
   
   ### Documentation
   
   No new user-facing changes.
   

