ctubbsii commented on pull request #2096: URL: https://github.com/apache/accumulo/pull/2096#issuecomment-842443067
@dlmarion wrote:
> I don't think that one of our goals was to create a pluggable external implementation. It's possible it was, I just don't remember that. Is that something we want?

My understanding of the original idea behind "external" compactions was to make them "external" to Accumulo... something Accumulo could "submit" work to, without caring about the implementation, and get back a new compacted file. The current implementation only makes them "external" to the tserver, not to Accumulo, and does so by growing Accumulo by 30,000 lines of code. This isn't necessarily bad, but it seems much more complex and tightly coupled to Accumulo's core than I think is maintainable long-term.

My understanding of the original idea was to be able to submit compactions to some external service that the user could designate... it could be a separate process, a bare metal server farm, EC2 nodes that scale on demand, or something else, depending on user needs. My understanding was that there would also be some flexibility in externalizing this functionality, and that implies making it pluggable in some way. Perhaps my understanding was incorrect, but that's what I had thought.

The bare minimum implementation that I was expecting to review at this point would be a pluggable service API for external compactions, and possibly a rudimentary in-process implementation. I wasn't quite expecting new worker processes and a new coordination service, as well as new ZooKeeper nodes and RPCs. All of that adds a lot of complexity. That complexity seems necessary for a robust implementation like the one provided here (which is great, overall), but I worry about future maintainability if it isn't modularized and pluggable, and about flexibility for divergent use cases.

Don't get me wrong... I have no issue with the basic design of this PR, as far as I can tell.
I'm just wondering how we can improve upon the API/configuration to make it more modular, so it's more maintainable in the future and more flexible for users if they need to swap in alternate implementations of externalized workers to do the actual compacting. Right now, I don't know enough about the implementation to propose suggestions for how to do that... I'm only raising the issue to those who might know. As I try to dive in and learn the implementation, I may have more concrete suggestions. But this is a lot of code, so it may take time to get to that point.
