[
https://issues.apache.org/jira/browse/AVRO-3451?focusedWorklogId=744660&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-744660
]
ASF GitHub Bot logged work on AVRO-3451:
----------------------------------------
Author: ASF GitHub Bot
Created on: 19/Mar/22 22:40
Start Date: 19/Mar/22 22:40
Worklog Time Spent: 10m
Work Description: jklamer opened a new pull request #1608:
URL: https://github.com/apache/avro/pull/1608
This PR reuses the resolved schema struct to improve performance where
possible. Currently, every write/append into a writer uses the same schema but
re-resolves it, indexing all of the named schemas, for every object written.
Because a writer has a consistent schema, it is trivial to reuse the same
resolved schema for every write. This improvement was brought up in
collaboration with @travisbrown.
What made the implementation less straightforward were the following design
constraints:
- absolutely no API breaking changes
- keep the resolved schema struct crate-private for simplicity
As a result, there is some complexity in how the resolved schema is
initialized within the writer, but I believe it is handled.
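
A minimal sketch of the idea described above, not the code in this PR. The
`Schema`, `Writer`, `index_names`, and `sample_schema` names are hypothetical,
heavily simplified stand-ins rather than the crate's real types. It only shows
the name index being built once when the writer is created and then reused for
every lookup:

```rust
use std::collections::HashMap;

/// Hypothetical, heavily simplified stand-in for an Avro schema.
#[derive(Debug, PartialEq)]
pub enum Schema {
    Long,
    Record { name: String, fields: Vec<Schema> },
    Ref { name: String },
}

/// Walks the schema once and indexes every named schema by its name.
fn index_names<'a>(schema: &'a Schema, names: &mut HashMap<&'a str, &'a Schema>) {
    if let Schema::Record { name, fields } = schema {
        names.insert(name.as_str(), schema);
        for field in fields {
            index_names(field, names);
        }
    }
}

/// Hypothetical writer that resolves names once, at construction time.
pub struct Writer<'a> {
    names: HashMap<&'a str, &'a Schema>,
}

impl<'a> Writer<'a> {
    pub fn new(schema: &'a Schema) -> Self {
        let mut names = HashMap::new();
        index_names(schema, &mut names);
        Writer { names }
    }

    /// Looks a reference up in the index built in `new`; nothing is
    /// re-resolved or cloned per call.
    pub fn resolve(&self, reference: &Schema) -> Option<&'a Schema> {
        match reference {
            Schema::Ref { name } => self.names.get(name.as_str()).copied(),
            _ => None,
        }
    }
}

/// Builds a small recursive schema used to exercise the writer.
pub fn sample_schema() -> Schema {
    Schema::Record {
        name: "Node".to_string(),
        fields: vec![Schema::Long, Schema::Ref { name: "Node".to_string() }],
    }
}
```

In this sketch the writer borrows the schema, so the index can hold references
instead of clones. The real writer has to fit the existing public API, which is
where the extra initialization complexity mentioned above comes from.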
### Benchmark results:
The results below are CSV output from example/benchmark.rs. I ran it on the
pre-#1602 code (which the JIRA is based on) and on the latest commit of this
branch, both on a 2016 MacBook Pro. I ran it multiple times to confirm that the
results were consistent and chose the last run for the comparison.
#### Pre 1602
| count | runs | big_or_small | total_write_secs |
|-------|--------|---------------|-------------------|
| 10000 | 1 | Small | 0.080105792 |
| 10000 | 1 | Big | 0.363642778 |
| 1 | 100000 | Small | 5.450658665 |
| 100 | 1000 | Small | 0.844267501 |
| 10000 | 10 | Small | 0.799961709 |
| 1 | 100000 | Big | 13.395100232 |
| 100 | 1000 | Big | 4.254442101 |
| 10000 | 10 | Big | 3.755155395 |
#### This branch
| count | runs | big_or_small | total_write_secs |
|-------|--------|---------------|-------------------|
| 10000 | 1 | Small | 0.019134068 |
| 10000 | 1 | Big | 0.089809544 |
| 1 | 100000 | Small | 4.449467382 |
| 100 | 1000 | Small | 0.307506175 |
| 10000 | 10 | Small | 0.190783385 |
| 1 | 100000 | Big | 11.514703263 |
| 100 | 1000 | Big | 0.931118263 |
| 10000 | 10 | Big | 1.042874368 |
#### Fractional reduction: (initial - final) / initial
| count | runs | big_or_small | reduction (fraction of initial) |
|-------|--------|---------------|---------------------------------|
| 10000 | 1 | Small | 0.761140018 |
| 10000 | 1 | Big | 0.753028110 |
| 1 | 100000 | Small | 0.183682623 |
| 100 | 1000 | Small | 0.635771631 |
| 10000 | 10 | Small | 0.761509354 |
| 1 | 100000 | Big | 0.140379462 |
| 100 | 1000 | Big | 0.781142100 |
| 10000 | 10 | Big | 0.722281968 |
This is consistent with what's expected: we get the smallest performance
improvement when the writer is constantly being remade (the rows with a count
of 1 and 100000 runs). The 14% and 18% improvements in those rows are a result
of changes in #1602, which seem to have a highly variable performance impact
depending on schema.
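
For reference, the reduction column in the last table can be reproduced from
the two timing tables with the formula in its heading. A minimal sketch follows;
the function name `fractional_reduction` is illustrative only and is not part
of the benchmark:

```rust
/// Fraction of the initial time saved: (initial - final) / initial.
pub fn fractional_reduction(initial_secs: f64, final_secs: f64) -> f64 {
    (initial_secs - final_secs) / initial_secs
}
```

Applied to the first row (0.080105792 s before, 0.019134068 s after), this
gives roughly 0.7611, i.e. about a 76% reduction in write time.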
### Jira
- [ ] My PR addresses the following [Avro
Jira](https://issues.apache.org/jira/browse/AVRO-3451)
### Tests
My PR does not add tests because it does not add functionality. All tests
pass as before.
### Documentation
No new user-facing changes.
Issue Time Tracking
-------------------
Worklog Id: (was: 744660)
Remaining Estimate: 50m (was: 1h)
Time Spent: 10m
> fix poor Avro write performance
> -------------------------------
>
> Key: AVRO-3451
> URL: https://issues.apache.org/jira/browse/AVRO-3451
> Project: Apache Avro
> Issue Type: Improvement
> Components: rust
> Affects Versions: 1.11.0
> Environment: Mac OS X Big Sur
> {code:java}
> installed toolchains
> --------------------
> stable-x86_64-apple-darwin (default)
> nightly-x86_64-apple-darwin
> active toolchain
> ----------------
> stable-x86_64-apple-darwin (default)
> rustc 1.56.1 (59eed8a2a 2021-11-01) {code}
> Reporter: Kevin
> Priority: Major
> Attachments: Screen Shot 2022-03-14 at 7.30.24 PM.png
>
> Original Estimate: 1h
> Time Spent: 10m
> Remaining Estimate: 50m
>
> Rust implementation of Apache Avro library – apache-avro (née avro-rs) –
> demonstrates poor write performance when serializing Rust structures to Avro.
> Profiling indicates that this implementation spends an inordinate amount of
> time in the function {{encode::encode_ref}} performing {{clone()}} and
> {{drop}} operations related to a HashMap<String, Schema> type.
> We modified the function {{encode_ref0}} as follows:
> {code:java}
> -pub fn encode_ref(value: &Value, schema: &Schema, buffer: &mut Vec<u8>) {
> - fn encode_ref0(
> +pub fn encode_ref<'a>(value: &Value, schema: &'a Schema, buffer: &mut Vec<u8>) {
> + fn encode_ref0<'a>(
> value: &Value,
> - schema: &Schema,
> + schema: &'a Schema,
> buffer: &mut Vec<u8>,
> - schemas_by_name: &mut HashMap<String, Schema>,
> + schemas_by_name: &mut HashMap<&'a str, &'a Schema>,
> ) {
> match &schema {
> Schema::Ref { ref name } => {
> - let resolved = schemas_by_name.get(name.name.as_str()).unwrap();
> + let resolved = schemas_by_name.get(&name.name as &str).unwrap();
> return encode_ref0(value, resolved, buffer, &mut schemas_by_name.clone());
> }
> Schema::Record { ref name, .. }
> | Schema::Enum { ref name, .. }
> | Schema::Fixed { ref name, .. } => {
> - schemas_by_name.insert(name.name.clone(), schema.clone());
> + schemas_by_name.insert(&name.name, &schema);
> }
> _ => (),
> }{code}
> to remove any need for Clone in the {{schemas_by_name}} cache and see a
> notable improvement (factor of 4 to 5) in our application with this change.
> After this change, all Cargo Tests still pass and Benchmarks display a very
> significant improvement in Write performance across the board. Attached below
> is one example benchmark for {{big schema, write 10k records}} with Before on
> the Left and After on the Right.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)