[ 
https://issues.apache.org/jira/browse/ACCUMULO-4355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16359008#comment-16359008
 ] 

Michael Miller commented on ACCUMULO-4355:
------------------------------------------

I am wondering if some of these ideas would be better realized prior to calling 
Accumulo API.  For example, looking at how the ingest process is calling 
importDirectory() and restricting the size of directories passed to the API 
could help.

> Provide more granular control for bulk import operations
> --------------------------------------------------------
>
>                 Key: ACCUMULO-4355
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-4355
>             Project: Accumulo
>          Issue Type: Wish
>          Components: master, tserver
>            Reporter: Shawn Walker
>            Assignee: Shawn Walker
>            Priority: Major
>             Fix For: 2.0.0
>
>
> Accumulo currently provides mechanisms to initiate bulk imports and to list 
> bulk imports in progress.  Scheduling of bulk import requests is not entirely 
> deterministic, and most of the execution of a bulk-import request is done in 
> a non-preemptable manner.  As such, any bulk import which takes very long to 
> complete can block bulk imports with higher operational priority for 
> significant periods.
> To better support bulk-import-heavy applications, it would be nice if 
> Accumulo would offer additional mechanisms for controlling the scheduling and 
> execution of bulk imports, such as the abilities to:
> * Pause/resume bulk import in progress.
> * Prioritize/reprioritize bulk import requests.
> * Cancel bulk import in progress.  If possible, cancelling a partially 
> completed bulk import request should result in a rollback of changes.  That 
> is, a bulk import should either succeed or make no changes.
> Additionally, for multitenant situations, it would be nice if Accumulo would:
> * Provide multiple queues for bulk import requests.  Each queue would have 
> its requests worked serially in priority order.  Requests in separate queues 
> should be worked in parallel, or have time distributed among the queues in 
> some manner as to make work appear roughly parallel.
> ----
> Implementation-wise, I'm thinking of rewriting much of the current 
> bulk-loading logic.  While the current logic depends upon multiple threads 
> executing (potentially long-duration) blocking RPC calls, I'd like to move to 
> a more event-driven/message-passing model backed by a persistent state 
> machine.
> Current ideas I'm playing around with (very tentative)
> * Creating a new table {{accumulo.bulk_load_queues}} to keep track of bulk 
> load progress.
> * Distributing bulk load orchestration via a mechanism similar to tablet 
> assignment instead of the current blocking RPC calls (LoadFiles.java:156).
> * Implementing something akin to a two-phase commit to achieve rollback 
> behavior on failure.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to