jklamer opened a new pull request #1608: URL: https://github.com/apache/avro/pull/1608
This PR reuses the resolved schema struct to improve performance where possible. Currently, every write/append into a writer takes the same schema and resolves it again, indexing all the named schemas for every object being written. Because a writer has a consistent schema, the same resolved schema can be reused for every write. This improvement was brought up in collaboration with @travisbrown.

The following design constraints made the implementation less straightforward:

- Absolutely no breaking API changes
- Keep the resolved schema struct crate-private for simplicity

As a result, there is some complexity in how the resolved schema is initialized within the writer, but I believe it is handled.

### Benchmark results

I used example/benchmark.rs to get CSV output. I collected data for the pre-#1602 code (which the JIRA is based on) and for the latest commit of this branch. The benchmark was run on a 2016 MacBook Pro. I ran it multiple times to confirm that the results were consistent and chose the last run for the comparison.
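To make the benchmark shape concrete, here is a minimal sketch of why reusing the resolved schema pays off most when one writer appends many records. It is an illustration under stated assumptions, not the code in this PR: `Schema`, `ResolvedSchema`, `CachingWriter`, `sample_schema`, and `count_resolutions` are hypothetical stand-ins rather than the crate's types, and it assumes that `runs` is the number of writers created and `count` is the number of records appended per writer.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for an Avro schema; not the crate's real type.
#[allow(dead_code)]
enum Schema {
    Int,
    Record { name: String, fields: Vec<Schema> },
    Ref(String),
}

// Hypothetical index of named schemas, built by walking the schema tree.
struct ResolvedSchema<'s> {
    names: HashMap<&'s str, &'s Schema>,
}

impl<'s> ResolvedSchema<'s> {
    // `counter` records how many times resolution work is performed.
    fn new(schema: &'s Schema, counter: &mut usize) -> Self {
        *counter += 1;
        let mut names = HashMap::new();
        index_names(schema, &mut names);
        ResolvedSchema { names }
    }
}

// Recursively register every named (record) schema under its name.
fn index_names<'s>(schema: &'s Schema, names: &mut HashMap<&'s str, &'s Schema>) {
    if let Schema::Record { name, fields } = schema {
        names.insert(name.as_str(), schema);
        for field in fields {
            index_names(field, names);
        }
    }
}

// Writer that resolves its schema once, at construction, and reuses the result.
struct CachingWriter<'s> {
    resolved: ResolvedSchema<'s>,
}

impl<'s> CachingWriter<'s> {
    fn new(schema: &'s Schema, counter: &mut usize) -> Self {
        CachingWriter { resolved: ResolvedSchema::new(schema, counter) }
    }

    // Stand-in for a write: only looks the type name up in the cached index.
    fn append(&mut self, type_name: &str) -> bool {
        self.resolved.names.contains_key(type_name)
    }
}

// Small nested schema used for the illustration.
fn sample_schema() -> Schema {
    Schema::Record {
        name: "Outer".to_string(),
        fields: vec![
            Schema::Int,
            Schema::Record { name: "Inner".to_string(), fields: vec![Schema::Int] },
            Schema::Ref("Inner".to_string()),
        ],
    }
}

// Count schema resolutions for `runs` writers that each append `count` records.
fn count_resolutions(schema: &Schema, count: usize, runs: usize, reuse: bool) -> usize {
    let mut calls = 0;
    for _ in 0..runs {
        if reuse {
            // Resolve once per writer, then reuse for every append.
            let mut writer = CachingWriter::new(schema, &mut calls);
            for _ in 0..count {
                let _ = writer.append("Outer");
            }
        } else {
            // Resolve again for every record written.
            for _ in 0..count {
                let resolved = ResolvedSchema::new(schema, &mut calls);
                let _ = resolved.names.contains_key("Outer");
            }
        }
    }
    calls
}
```

In this sketch, `count_resolutions(&sample_schema(), 10_000, 10, true)` performs 10 resolutions, while the same call with `reuse` set to `false` performs 100,000. With `count` equal to 1, both variants resolve once per writer, so caching alone saves nothing in that case.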
#### Pre-#1602

| count | runs   | big_or_small | total_write_secs |
|-------|--------|--------------|------------------|
| 10000 | 1      | Small        | 0.080105792      |
| 10000 | 1      | Big          | 0.363642778      |
| 1     | 100000 | Small        | 5.450658665      |
| 100   | 1000   | Small        | 0.844267501      |
| 10000 | 10     | Small        | 0.799961709      |
| 1     | 100000 | Big          | 13.395100232     |
| 100   | 1000   | Big          | 4.254442101      |
| 10000 | 10     | Big          | 3.755155395      |

#### This branch

| count | runs   | big_or_small | total_write_secs |
|-------|--------|--------------|------------------|
| 10000 | 1      | Small        | 0.019134068      |
| 10000 | 1      | Big          | 0.089809544      |
| 1     | 100000 | Small        | 4.449467382      |
| 100   | 1000   | Small        | 0.307506175      |
| 10000 | 10     | Small        | 0.190783385      |
| 1     | 100000 | Big          | 11.514703263     |
| 100   | 1000   | Big          | 0.931118263      |
| 10000 | 10     | Big          | 1.042874368      |

#### Relative reduction in write time: (initial - final) / initial

| count | runs   | big_or_small | reduction (fraction) |
|-------|--------|--------------|----------------------|
| 10000 | 1      | Small        | 0.761140018          |
| 10000 | 1      | Big          | 0.753028110          |
| 1     | 100000 | Small        | 0.183682623          |
| 100   | 1000   | Small        | 0.635771631          |
| 10000 | 10     | Small        | 0.761509354          |
| 1     | 100000 | Big          | 0.140379462          |
| 100   | 1000   | Big          | 0.781142100          |
| 10000 | 10     | Big          | 0.722281968          |

This is consistent with what is expected: the improvement is smallest when the writer is constantly being remade (count = 1). The roughly 14% and 18% improvements in those cases are a result of changes in #1602 that seem to have a highly variable performance impact depending on the schema.

### Jira

- [ ] My PR addresses the following [Avro Jira](https://issues.apache.org/jira/browse/AVRO-3451)

### Tests

My PR does not add tests because it does not add functionality. All tests pass as before.

### Documentation

No new user-facing changes.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
