[ 
https://issues.apache.org/jira/browse/HADOOP-13403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15400523#comment-15400523
 ] 

Subramanyam Pattipaka commented on HADOOP-13403:
------------------------------------------------

[~cnauroth],

Regarding your question on synchronization and idiomatic usage of 
ThreadPoolExecutor:

Yes, but the approach is very simple and worth using for the benefits 
explained below.
1. The input data set is final and doesn't change. To ensure this, I have 
changed the input parameter to be final. Further, we don't have any code path 
that can change this input list.
2. As the input array is final, the threads share one index that starts at 0 
and advances to the end of the array. Synchronization is done by generating 
the "next index" ATOMICALLY. This is efficient for the following reasons:
        a) As the input set doesn't change, converting the array to a 
synchronized queue is pure overhead. If the number of files is in the 
millions, doubling the memory is overkill.
        b) Each thread would have to spend extra CPU to dequeue from a 
synchronized queue, which is costly compared to a simple atomic increment.
3. With the above benefits, this method still ensures that work doesn't get 
delayed even if some threads get stuck for some reason.
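
To make the points above concrete, here is a minimal, self-contained sketch 
of the shared atomic index pattern. It is only an illustration and not the 
patch code: the class name AtomicIndexSketch, the method processAll, and the 
use of String in place of FileMetadata are assumptions, and the rename/delete 
call is replaced by a per-file visit counter.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicIntegerArray;

// Hypothetical illustration; names are not taken from the patch.
class AtomicIndexSketch {

  /** Visits every entry of files once and returns the visit count per index. */
  static AtomicIntegerArray processAll(final String[] files, int threadCount) {
    final AtomicInteger fileIndex = new AtomicInteger(0);
    final AtomicIntegerArray visits = new AtomicIntegerArray(files.length);
    ExecutorService pool = Executors.newFixedThreadPool(threadCount);
    for (int t = 0; t < threadCount; t++) {
      pool.submit(() -> {
        int currentIndex;
        // Each worker claims the next unprocessed index with one atomic
        // increment; no queue and no copy of the input array is needed.
        while ((currentIndex = fileIndex.getAndIncrement()) < files.length) {
          // The rename/delete of files[currentIndex] would go here.
          visits.incrementAndGet(currentIndex);
        }
      });
    }
    pool.shutdown();
    try {
      if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
        throw new IllegalStateException("workers did not finish in time");
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting", e);
    }
    return visits;
  }

  public static void main(String[] args) {
    String[] files = new String[10];
    for (int i = 0; i < files.length; i++) {
      files[i] = "file-" + i;
    }
    AtomicIntegerArray visits = processAll(files, 4);
    System.out.println("visits per file: " + visits);
  }
}
```

A slow or stuck worker only holds the single index it has already claimed; 
the other workers keep claiming the remaining indices.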


Regarding your question on use of futures:

The reason for tracking lastException or operationStatus is to ensure the 
following:
a) Failure of the operation on a single file means the overall operation has 
failed anyway. There is no point in any thread processing further, so they 
bail out right away.
b) If some threads are yet to be submitted, we bail out there as well. This 
is likely when we use a large number of threads, such as 128, and processing 
of the very first file hits an issue.

Is there any way to achieve this through futures?
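
A minimal sketch of the bail-out behaviour described in a) and b), under 
stated assumptions: the class FailFastSketch, the FileTask interface and the 
runAll method are hypothetical names for illustration, only IOException is 
tracked, and the real patch may track its status differently.

```java
import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical illustration; names are not taken from the patch.
class FailFastSketch {

  /** Stand-in for a rename or delete of one file. */
  interface FileTask {
    void apply(String file) throws IOException;
  }

  /** Returns null on success, otherwise the first recorded failure. */
  static IOException runAll(String[] files, int threadCount, FileTask task,
      AtomicInteger processed) {
    AtomicInteger fileIndex = new AtomicInteger(0);
    AtomicReference<IOException> lastException = new AtomicReference<>();
    ExecutorService pool = Executors.newFixedThreadPool(threadCount);
    for (int t = 0; t < threadCount; t++) {
      if (lastException.get() != null) {
        break; // (b) a worker already failed: do not submit more workers
      }
      pool.submit(() -> {
        int i;
        // (a) every worker re-checks the shared status before taking a file
        while (lastException.get() == null
            && (i = fileIndex.getAndIncrement()) < files.length) {
          try {
            task.apply(files[i]);
            processed.incrementAndGet();
          } catch (IOException e) {
            lastException.compareAndSet(null, e); // keep the first failure
          }
        }
      });
    }
    pool.shutdown();
    try {
      if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
        throw new IllegalStateException("workers did not finish in time");
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting", e);
    }
    return lastException.get();
  }

  public static void main(String[] args) {
    String[] files = {"a", "b", "c", "d"};
    AtomicInteger processed = new AtomicInteger();
    IOException failure = runAll(files, 2, file -> {
      if (file.equals("c")) {
        throw new IOException("simulated failure on " + file);
      }
    }, processed);
    System.out.println("failure: " + failure + ", processed: " + processed.get());
  }
}
```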


Regarding your question on RejectedExecutionException:

After thinking about it more, this exception may never be raised, because 
threadPool.shutdown is called inline on the same thread after the submit 
calls are done, so all submit requests must be accepted. But having this code 
ensures that the operation completes cleanly even if this exception occurs 
for any reason. I have unit tests to validate the behavior as well. Please 
let me know if you see any issues with keeping this code.
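
A minimal sketch of this defensive handling, assuming hypothetical names 
(RejectedSubmitSketch, processAll) and a counter in place of the real 
rename/delete call. It follows the failure scenario in the issue 
description: if no worker could be scheduled, the work is done serially on 
the calling thread.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration; names are not taken from the patch.
class RejectedSubmitSketch {

  /** Processes all files and returns how many were handled. */
  static int processAll(String[] files, ExecutorService pool, int threadCount) {
    AtomicInteger fileIndex = new AtomicInteger(0);
    AtomicInteger processed = new AtomicInteger(0);
    Runnable worker = () -> {
      while (fileIndex.getAndIncrement() < files.length) {
        processed.incrementAndGet(); // the rename/delete would go here
      }
    };
    int scheduled = 0;
    for (int t = 0; t < threadCount; t++) {
      try {
        pool.submit(worker);
        scheduled++;
      } catch (RejectedExecutionException e) {
        break; // already scheduled workers can still finish the job
      }
    }
    pool.shutdown();
    if (scheduled == 0) {
      worker.run(); // serial fallback on the calling thread
      return processed.get();
    }
    try {
      if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
        throw new IllegalStateException("workers did not finish in time");
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting", e);
    }
    return processed.get();
  }

  public static void main(String[] args) {
    String[] files = {"a", "b", "c"};
    ExecutorService closed = Executors.newFixedThreadPool(2);
    closed.shutdown(); // forces submit() to throw RejectedExecutionException
    System.out.println("processed: " + processAll(files, closed, 2));
  }
}
```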




> AzureNativeFileSystem rename/delete performance improvements
> ------------------------------------------------------------
>
>                 Key: HADOOP-13403
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13403
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: azure
>    Affects Versions: 2.7.2
>            Reporter: Subramanyam Pattipaka
>            Assignee: Subramanyam Pattipaka
>             Fix For: 2.9.0
>
>         Attachments: HADOOP-13403-001.patch, HADOOP-13403-002.patch
>
>
> WASB Performance Improvements
> Problem
> -----------
> Azure Native File System operations like rename/delete on a source 
> directory with a large number of directories and/or files are experiencing 
> performance issues. Here are the possible reasons:
> a)    We first list all files under the source directory hierarchically. 
> This is a serial operation. 
> b)    After collecting the entire list of files under a folder, we delete or 
> rename the files one by one, serially.
> c)    There is no logging information available for these costly operations, 
> even in DEBUG mode, which makes it difficult to understand WASB performance 
> issues.
> Proposal
> -------------
> Step 1: Rename and delete operations generate a list of all files under the 
> source folder. We need to use the Azure flat listing option to get the list 
> with a single request to the Azure store. We have introduced the config 
> fs.azure.flatlist.enable to enable this option. The default value is 'false', 
> which means flat listing is disabled.
> Step 2: Create the thread pool and threads dynamically based on user 
> configuration. The thread pool is deleted after the operation is over. 
> We are introducing two new configs:
>       a)      fs.azure.rename.threads : Config to set number of rename 
> threads. Default value is 0 which means no threading.
>       b)      fs.azure.delete.threads: Config to set number of delete 
> threads. Default value is 0 which means no threading.
>       We log, at debug level, the number of threads not used for the 
> operation, which can be useful.
>       Failure Scenarios:
>       If we fail to create the thread pool for ANY reason (for example, 
> trying to create it with a very large thread count such as 1000000), we fall 
> back to serial operation. 
> Step 3: Blob operations can be done in parallel using multiple threads 
> executing the following snippet:
>       while ((currentIndex = fileIndex.getAndIncrement()) < files.length) {
>               FileMetadata file = files[currentIndex];
>               Rename/delete(file);
>       }
>       The above strategy depends on the fact that all files are stored in a 
> final array and each thread obtains a synchronized next index to do its 
> work. The advantage of this strategy is that even if the user configures a 
> large number of unusable threads, we always ensure that work doesn’t get 
> serialized due to lagging threads. 
>       We are logging the following information, which can be useful for 
> tuning the number of threads:
>       a) Number of unusable threads
>       b) Time taken by each thread
>       c) Number of files processed by each thread
>       d) Total time taken for the operation
>       Failure Scenarios:
>       Failure to queue a thread execute request shouldn’t be an issue if we 
> can ensure at least one thread has completed execution successfully. If we 
> couldn't schedule even one thread, we should take the serial path. 
> Exceptions raised while executing threads are still considered regular 
> exceptions and are returned to the client as an operation failure. 
> Exceptions raised while stopping threads and deleting the thread pool can be 
> ignored if the operations on all files completed without any issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
