keith-turner opened a new issue #1451: Support external compactions in containers
URL: https://github.com/apache/accumulo/issues/1451
 
 
   For use cases such as large-scale filtering of data in an Accumulo table, it 
may be useful to support running compactions outside of a tserver, in a system 
such as Kubernetes. This feature could support the following operations and 
behaviors.
   
    * A client-side Accumulo API that selects tablets and files to compact and 
returns a serializable, runnable object for each external compaction. The objects 
can be serialized and run anywhere that has access to DFS.
    * A lease for every file that has been selected for external compaction. 
This lease prevents other compactions from processing the files.
    * A client-side API for listing, committing, and canceling outstanding 
external compactions.
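
   The lease and list/commit/cancel behavior in the bullets above could be 
modeled roughly as follows. This is only a minimal in-memory sketch: the class 
`ExternalCompactionLeases` and its methods are hypothetical names invented for 
illustration and are not part of any Accumulo API. A real implementation would 
presumably persist leases in tablet metadata rather than in a map.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical in-memory model of file leases for external compactions; not Accumulo API. */
class ExternalCompactionLeases {
  // compaction id -> files leased to that compaction
  private final Map<String, Set<String>> leasesById = new HashMap<>();
  // every file currently under lease, used for conflict checks
  private final Set<String> leasedFiles = new HashSet<>();

  /** Leases all of the files to the compaction, or none of them if any file is already leased. */
  synchronized boolean tryLease(String compactionId, Set<String> files) {
    if (files.isEmpty() || leasesById.containsKey(compactionId)) {
      return false;
    }
    for (String file : files) {
      if (leasedFiles.contains(file)) {
        return false; // another compaction already holds this file
      }
    }
    leasesById.put(compactionId, new HashSet<>(files));
    leasedFiles.addAll(files);
    return true;
  }

  /** Lists the ids of outstanding external compactions in sorted order. */
  synchronized Set<String> list() {
    return new TreeSet<>(leasesById.keySet());
  }

  /** Commit: a real system would also swap the input files for the output file here. */
  synchronized boolean commit(String compactionId) {
    return release(compactionId);
  }

  /** Cancel: drops the leases so other compactions may process the files again. */
  synchronized boolean cancel(String compactionId) {
    return release(compactionId);
  }

  // Only called from synchronized methods, so it runs under the same lock.
  private boolean release(String compactionId) {
    Set<String> files = leasesById.remove(compactionId);
    if (files == null) {
      return false; // unknown or already finished compaction
    }
    leasedFiles.removeAll(files);
    return true;
  }
}
```

   In this sketch a second compaction that overlaps an existing lease is 
rejected as a whole, which keeps the "lease prevents other compactions from 
processing the files" rule simple to reason about.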
   
   If all of the selection decisions for tablets and files to compact are made 
on the client side, then a user could pass a lambda to Accumulo to make these 
decisions.  This approach would avoid having to put that code on tservers.  The 
Accumulo client code could make RPCs to bring all needed information to the 
client side. Accumulo could also automatically handle the case where tablets or 
files change during selection by calling the lambda again.
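
   One way to picture the "call the lambda again" behavior is an optimistic 
retry loop. The sketch below is an assumption-laden illustration: 
`ClientSideSelection`, `Snapshot`, the version number, and the 
`reserveIfUnchanged` callback are all hypothetical and stand in for whatever 
RPCs and metadata checks Accumulo would actually use.

```java
import java.util.List;
import java.util.Set;
import java.util.function.BiPredicate;
import java.util.function.Function;
import java.util.function.Supplier;

/** Hypothetical sketch of client-side file selection with automatic retry; not Accumulo API. */
class ClientSideSelection {

  /** Assumed snapshot of a tablet's files, with a version that changes when the file set changes. */
  static final class Snapshot {
    final long version;
    final List<String> files;

    Snapshot(long version, List<String> files) {
      this.version = version;
      this.files = files;
    }
  }

  /**
   * Reads the tablet state, runs the user's selector lambda, and tries to reserve the chosen
   * files. If the tablet changed in between (the reservation is refused), the state is re-read
   * and the lambda is called again, up to maxAttempts times.
   */
  static Set<String> selectWithRetry(
      Supplier<Snapshot> readTablet,
      Function<List<String>, Set<String>> selector,
      BiPredicate<Long, Set<String>> reserveIfUnchanged,
      int maxAttempts) {
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      Snapshot snapshot = readTablet.get();
      Set<String> chosen = selector.apply(snapshot.files);
      if (chosen.isEmpty()) {
        return chosen; // the lambda decided there is nothing to compact
      }
      if (reserveIfUnchanged.test(snapshot.version, chosen)) {
        return chosen; // reservation succeeded against the version the lambda saw
      }
      // The tablet changed after the snapshot was read; loop and ask the lambda again.
    }
    throw new IllegalStateException("tablet kept changing after " + maxAttempts + " attempts");
  }
}
```

   The user-supplied code here is only the `selector` lambda; reading state and 
reserving files stay inside the client library, so none of the user's code has 
to be deployed on tservers.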
   
   External compactions could have early or late binding for the set of files 
to compact.  Early binding is much easier to implement and run idempotently on 
a cluster.  There is one problem with early binding: leases could be held on 
lots of small files for a long time, negatively impacting scans.  One possible 
way to avoid this would be to compact all files smaller than size X on the 
tserver before starting an external compaction, and then select only files over 
size X for the external compaction.
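
   The size-based split described above might look like the following sketch. 
`SizeSplit` and `bySize` are hypothetical names, and sending files of exactly 
size X to the external compaction is an assumption made here, since the text 
does not say how the boundary should be handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of splitting files by a size threshold X; not Accumulo API. */
class SizeSplit {
  // Small files: compacted on the tserver first so no long-lived lease is held on them.
  final List<String> compactOnTserver = new ArrayList<>();
  // Large files: candidates for the external compaction.
  final List<String> compactExternally = new ArrayList<>();

  /**
   * Partitions files by size in bytes. Files strictly smaller than the threshold go to the
   * tserver; files at or above it go to the external compaction (boundary handling is an
   * assumption of this sketch).
   */
  static SizeSplit bySize(Map<String, Long> fileSizes, long thresholdBytes) {
    SizeSplit split = new SizeSplit();
    for (Map.Entry<String, Long> entry : fileSizes.entrySet()) {
      if (entry.getValue() < thresholdBytes) {
        split.compactOnTserver.add(entry.getKey());
      } else {
        split.compactExternally.add(entry.getKey());
      }
    }
    return split;
  }
}
```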
