Re: [PR] Modifies the Compactor to be a more generic job execution process [accumulo]

via GitHub Fri, 06 Oct 2023 05:15:59 -0700


dlmarion commented on PR #3801:
URL: https://github.com/apache/accumulo/pull/3801#issuecomment-1750541282


   > @dlmarion if you have some time to chat sometime I would like to discuss 
some questions I have. Wondering about the following.
   > 
   
   Yeah, let's find a time.
   
   > Where do things run? Maybe instead of having a task runner executable, the 
task runner is more of an internal code library that user-facing executable 
components instantiate. For example, thinking through what the user-facing 
accumulo commands could look like and what they could do:
   > 
   > ```
   > accumulo compactor  -- runs compactions, instantiates a task runner in its 
impl to make this happen
   > accumulo tserver    -- hosts tablets and does log sorts, it runs a task 
runner inside of a tablet server process to do log sorts
   > accumulo manager    -- runs fate, assigns tablets, does split 
calculations... non-primary manager processes can run a task runner process to 
do split calculations
   > ```
   
   With on-demand tablets it's possible that a user only has enough 
TabletServers running to host the root and metadata tables. That would be my 
only concern with doing log sorting in the TabletServer. I do like the idea of 
non-primary managers doing some of this work (which, BTW, I had working a while 
back in #3262). If we could leverage the non-primary managers, then this PR 
could likely be closed, leaving the compactors the way they are today. I 
think we should explore that idea sooner rather than later. There was another 
item mentioned in #3796: compaction selection functionality. If we performed 
compaction selection and split calculations in the non-primary managers and 
left compactors the way they are today, then we would just need a server 
component to run log sorting. IIRC from #3262, you can have multiple 
non-primary managers.
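
To make the proposed division of work concrete, here is a minimal sketch of the placement described above. The class, enum, and method names are hypothetical and are not Accumulo APIs; the host for log sorting is marked undecided because the comment only says that some server component would be needed.

```java
import java.util.EnumMap;
import java.util.Map;

public class WorkPlacementSketch {

  /** Kinds of work discussed in this thread (illustrative names only). */
  enum Work { COMPACTION, COMPACTION_SELECTION, SPLIT_CALCULATION, LOG_SORT }

  /** Candidate hosts for that work (illustrative names only). */
  enum Host { COMPACTOR, NON_PRIMARY_MANAGER, UNDECIDED }

  /** Returns the placement floated in the comment, not a settled design. */
  static Map<Work, Host> proposedPlacement() {
    Map<Work, Host> placement = new EnumMap<>(Work.class);
    // Compactors stay the way they are today.
    placement.put(Work.COMPACTION, Host.COMPACTOR);
    // Selection and split calculations move to non-primary managers.
    placement.put(Work.COMPACTION_SELECTION, Host.NON_PRIMARY_MANAGER);
    placement.put(Work.SPLIT_CALCULATION, Host.NON_PRIMARY_MANAGER);
    // Log sorting still needs a server component; which one is open.
    placement.put(Work.LOG_SORT, Host.UNDECIDED);
    return placement;
  }

  public static void main(String[] args) {
    proposedPlacement().forEach((work, host) -> System.out.println(work + " -> " + host));
  }
}
```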
   
   > 
   > What do tasks return? Maybe nothing. Could we structure all tasks such that 
no data is returned to the manager? A task runner gets a task from the 
manager, runs it, and when done gets another task. It never reports completion 
or status to the manager.
   > 
   > * compaction tasks run the compaction and commit it to the metadata table 
as part of the task
   > * compaction tasks do not report stats back to the manager, but only to the 
metrics system.  Could the monitor get the data it needs?  Maybe the monitor 
contacts task runners directly if it wants info?
   > * log sort tasks sort the logs and create the appropriate dirs in HDFS when 
done
   > * split tasks could update the metadata table with the needed information 
instead of reporting back
   > 
   > If tasks do not return anything, then that simplifies the thrift API and 
possibly the manager. We would not need to worry about keeping info in memory 
in the manager, keeping that info consistent, and avoiding using too much memory.
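
The "no return value" idea quoted above could be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: `Task`, `TaskSource`, and `runUntilEmpty` are hypothetical names, the in-memory queue stands in for the manager, and the plain lists stand in for the metadata table and the metrics system.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.Queue;

public class TaskRunnerSketch {

  /** Hypothetical task: it persists its own results and returns nothing. */
  interface Task {
    void run();
  }

  /** Hypothetical stand-in for the manager; it only hands out work. */
  static class TaskSource {
    private final Queue<Task> pending = new ArrayDeque<>();

    void add(Task task) {
      pending.add(task);
    }

    Optional<Task> next() {
      return Optional.ofNullable(pending.poll());
    }
  }

  /**
   * Pulls tasks until none remain and never reports completion or status
   * back to the source. Returns how many tasks ran, for local use only.
   */
  static int runUntilEmpty(TaskSource source) {
    int ran = 0;
    Optional<Task> task;
    while ((task = source.next()).isPresent()) {
      task.get().run();
      ran++;
    }
    return ran;
  }

  public static void main(String[] args) {
    List<String> metadataTable = new ArrayList<>(); // stand-in for the metadata table
    List<String> metrics = new ArrayList<>(); // stand-in for the metrics system

    TaskSource source = new TaskSource();
    // A compaction task commits its result and emits stats itself.
    source.add(() -> {
      metadataTable.add("compaction committed");
      metrics.add("compaction stats");
    });
    // A split task writes what it computed directly to the metadata table.
    source.add(() -> metadataTable.add("split information recorded"));

    int ran = runUntilEmpty(source);
    System.out.println(ran + " tasks ran; metadata=" + metadataTable + "; metrics=" + metrics);
  }
}
```

Because each task writes its own outcome, the only call the runner makes to the source is `next()`, which is the simplification to the thrift API that the quoted paragraph is after.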
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
