[I] Explore optimizing RFile LoadPlan computation [accumulo]

via GitHub Fri, 17 Jan 2025 09:00:17 -0800


keith-turner opened a new issue, #5272:
URL: https://github.com/apache/accumulo/issues/5272

**Is your feature request related to a problem? Please describe.**

In #4898 a new mechanism was added to RFile to compute bulk import load
plans as the RFile is written. This new mechanism was implemented using
completely new code that examines each key/value written. There may be existing
code in RFile that could be leveraged for this computation and that could reduce
the amount of work done per key/value written.

**Describe the solution you'd like**

Determine if this
[code](https://github.com/apache/accumulo/blob/139d850e850277cfc0fd5e0da15abe1467b8fa5c/core/src/main/java/org/apache/accumulo/core/file/rfile/RFile.java#L473-L533)
could be modified to help compute the load plan leveraging its tracking of
first and last keys. Can that code be modified to minimize the total amount of
per key/value work that the rfile write pipeline is doing?
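
To make the question concrete, here is a minimal sketch of what computing a
load plan from per-block first and last rows (instead of from every key) might
look like. It is not Accumulo code: the class `BlockLoadPlanSketch`, the method
`onBlockClosed`, the use of `String` rows, and the assumption that the writer
can report the first and last row of each block are all hypothetical. Whether
the linked RFile code can expose this information cheaply is exactly what this
issue asks to investigate.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical illustration only; none of these names exist in Accumulo.
final class BlockLoadPlanSketch {
  // Sorted tablet split points. Tablet i covers (splits[i-1], splits[i]];
  // the last tablet (index splits.size()) has no upper bound.
  private final List<String> splits;
  private final TreeSet<Integer> touchedTablets = new TreeSet<>();

  BlockLoadPlanSketch(SortedSet<String> splitPoints) {
    this.splits = new ArrayList<>(splitPoints);
  }

  // Index of the tablet containing the row, found with one binary search.
  private int tabletIndex(String row) {
    int pos = Collections.binarySearch(splits, row);
    // A row equal to a split point belongs to the tablet ending at that split.
    return pos >= 0 ? pos : -(pos + 1);
  }

  // Assumed to be called once per block rather than once per key.
  void onBlockClosed(String firstRow, String lastRow) {
    int lo = tabletIndex(firstRow);
    int hi = tabletIndex(lastRow);
    // Keys are written in sorted order, so every key in the block falls in
    // tablets lo..hi. This is conservative: if hi - lo > 1, a tablet strictly
    // between lo and hi may contain no data from this block.
    for (int i = lo; i <= hi; i++) {
      touchedTablets.add(i);
    }
  }

  SortedSet<Integer> touchedTablets() {
    return Collections.unmodifiableSortedSet(touchedTablets);
  }
}
```

The trade-off in this sketch is two binary searches per block against
possibly listing a tablet that receives no data when a block spans more than
two tablets. Whether that is acceptable, or faster than the existing per-key
code in practice, would need to be measured.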

**Describe alternatives you've considered**

It may be best to not make any changes at all for this issue; it needs
investigation.

The following are some reasons why no changes should perhaps be made for this
issue.

1. The performance impact of the per-key examination code added in #4898
is negligible compared to other parts of the RFile write pipeline.
Optimizing something that is not taking much time will not really speed up
the overall write pipeline. The slowest parts need to be optimized to see
measurable improvement.
2. The existing code is not well suited for the new task.
3. There are too many existing layers of abstraction that would need to be
broken to make the change.

Only want to make this change if it gives a measurable performance
improvement and does not add tech debt to the code.
