ctubbsii commented on pull request #2096:
URL: https://github.com/apache/accumulo/pull/2096#issuecomment-842443067


   @dlmarion wrote:
   > I don't think that one of our goals was to create a pluggable external 
implementation. It's possible it was, I just don't remember that. Is that 
something we want?
   
   My understanding of the original idea behind "external" compactions was to 
make them "external" to Accumulo... something Accumulo could "submit" 
work to do and not care about the implementation, and get back a new compacted 
file. The current implementation only makes them "external" to the tserver, not 
to Accumulo, and does so by growing Accumulo by 30,000 lines of code. This 
isn't necessarily bad, but it seems much more complex and tightly coupled to 
Accumulo's core than I think is maintainable long-term. My understanding of the 
original idea was to be able to submit compactions to some external service 
that the user could designate.... it could be a separate process, a bare metal 
server farm, EC2 nodes that scale on demand, or something else, depending on 
user needs. My understanding was that there would also be some flexibility in 
externalizing this functionality, and that implies making it pluggable in some 
way. Perhaps my understanding was incorrect, but that's what I had thought.
   
   The bare minimum implementation that I was expecting to review at this point 
would be a pluggable service API for external compactions, and possibly a 
rudimentary in-process implementation. I wasn't quite expecting new worker 
processes, and a new coordination service, as well as new ZooKeeper nodes and 
RPCs. All of that adds a lot of complexity, and while that complexity seems 
necessary for a robust implementation like the one provided here (which is 
great, overall), I worry about future maintainability, and about flexibility 
for divergent use cases, if it isn't modularized and pluggable.
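   
   To make the "pluggable service API plus rudimentary in-process 
implementation" idea concrete, here is a minimal sketch of the shape I have in 
mind. Every name in it (`ExternalCompactionService`, `CompactionJob`, 
`InProcessCompactionService`) is hypothetical and does not exist in Accumulo or 
in this PR, and the "compaction" is a stand-in that only records the job 
rather than merging any files.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch only: none of these types exist in Accumulo.
class PluggableCompactionSketch {
  public static void main(String[] args) {
    // Accumulo would only see the interface; the implementation is swappable.
    ExternalCompactionService service = new InProcessCompactionService();
    CompactionJob job = new CompactionJob(Arrays.asList("A1.rf", "A2.rf"), "C3.rf");
    String compacted = service.submit(job).join();
    System.out.println("compacted into " + compacted);
  }
}

// Assumed plugin point: submit work, get back the new compacted file.
interface ExternalCompactionService {
  CompletableFuture<String> submit(CompactionJob job);
}

// Assumed job description: the files to compact and the file to produce.
final class CompactionJob {
  final List<String> inputFiles;
  final String outputFile;

  CompactionJob(List<String> inputFiles, String outputFile) {
    this.inputFiles = new ArrayList<>(inputFiles);
    this.outputFile = outputFile;
  }
}

// Rudimentary in-process implementation: completes on the calling thread.
final class InProcessCompactionService implements ExternalCompactionService {
  final List<String> completed = new ArrayList<>();

  @Override
  public CompletableFuture<String> submit(CompactionJob job) {
    // Placeholder: a real implementation would read and merge job.inputFiles.
    completed.add(job.outputFile);
    return CompletableFuture.completedFuture(job.outputFile);
  }
}
```

   With an interface like this, a separate process, a server farm, or 
on-demand EC2 nodes would each be another implementation of the same 
`submit` call, chosen by the user through configuration.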
   
   Don't get me wrong... I have no issue with the basic design of this PR, as 
far as I can tell. I'm just wondering how we can improve upon the 
API/configuration to make it more modular so it's more maintainable in the 
future, and more flexible for users if they need to swap in alternate 
implementations of externalized workers to do the actual compacting. Right now, 
I don't know enough about the implementation to propose suggestions for how to 
do that... I'm only raising the issue to those who might know. As I try to dive 
in and learn the implementation, I may have more concrete suggestions. But, 
this is a lot of code, so it may take time to get to that point.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

