[jira] [Commented] (HADOOP-13403) AzureNativeFileSystem rename/delete performance improvements

Chris Nauroth (JIRA) Thu, 28 Jul 2016 17:02:09 -0700

    [ 
https://issues.apache.org/jira/browse/HADOOP-13403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15398461#comment-15398461
 ]


Chris Nauroth commented on HADOOP-13403:
----------------------------------------

Hello [~pattipaka].  Thank you for the patch.

I started code reviewing this while doing a full test run with the patch 
against my Azure Storage account.  I saw these failures:

{code}
Tests run: 67, Failures: 1, Errors: 2, Skipped: 0, Time elapsed: 591.899 sec 
<<< FAILURE! - in org.apache.hadoop.fs.azure.TestFileSystemOperationsWithThreads
testRenameSigleRenameException(org.apache.hadoop.fs.azure.TestFileSystemOperationsWithThreads)
  Time elapsed: 7.062 sec  <<< ERROR!
java.lang.Exception: Unexpected exception, expected<java.io.IOException> but 
was<java.lang.AssertionError>
        at org.junit.Assert.fail(Assert.java:86)
        at org.junit.Assert.assertTrue(Assert.java:41)
        at org.junit.Assert.assertTrue(Assert.java:52)
        at 
org.apache.hadoop.fs.azure.TestFileSystemOperationsWithThreads.testRenameSigleRenameException(TestFileSystemOperationsWithThreads.java:630)

testDeleteSigleDeleteException(org.apache.hadoop.fs.azure.TestFileSystemOperationsWithThreads)
  Time elapsed: 4.289 sec  <<< ERROR!
java.lang.Exception: Unexpected exception, expected<java.io.IOException> but 
was<java.lang.AssertionError>
        at org.junit.Assert.fail(Assert.java:86)
        at org.junit.Assert.assertTrue(Assert.java:41)
        at org.junit.Assert.assertTrue(Assert.java:52)
        at 
org.apache.hadoop.fs.azure.TestFileSystemOperationsWithThreads.testDeleteSigleDeleteException(TestFileSystemOperationsWithThreads.java:479)

testDeleteSigleDeleteFailure(org.apache.hadoop.fs.azure.TestFileSystemOperationsWithThreads)
  Time elapsed: 4.403 sec  <<< FAILURE!
java.lang.AssertionError: null
        at org.junit.Assert.fail(Assert.java:86)
        at org.junit.Assert.assertTrue(Assert.java:41)
        at org.junit.Assert.assertFalse(Assert.java:64)
        at org.junit.Assert.assertFalse(Assert.java:74)
        at 
org.apache.hadoop.fs.azure.TestFileSystemOperationsWithThreads.testDeleteSigleDeleteFailure(TestFileSystemOperationsWithThreads.java:455)
{code}

I'll halt my code review at this point so that you can investigate the test 
failures.  Here is my feedback so far, based on what I've read of the patch.

{code}
      threadCount = conf.getInt(config, defaultThreadCount);
{code}

I recommend restructuring this so that the thread count for delete and rename 
are read just once in {{NativeAzureFileSystem#initialize}} and stored to member 
variables.  Accessing {{Configuration}} can be expensive, and contributors 
reading the code generally expect to find all config access in {{initialize}}, 
not throughout other methods.

{code}
      // Enable flat listing option only if depth is unboubded and config 
{code}

Typo: "unbounded".

{code}
Date start = new Date();
// Do something.
Date end = new Date();
long duration = end.getTime() - start.getTime();
{code}

For all occurrences of logic like the above, please switch to using 
{{org.apache.hadoop.util.Time#monotonicNow()}}, which is a wrapper over the JDK 
[{{java.lang.System#nanoTime()}}|http://docs.oracle.com/javase/7/docs/api/java/lang/System.html#nanoTime()].
  By switching to this, the logic won't be prone to issues caused by clock 
drift or resetting time on the host.

{code}
    private String threadIdPrefix = "AzureFileSytemThread";
{code}

Typo: "AzureFileSystemThread".

{code}
  enum AzureFileSystemOperation{
    Unkown,
    Delete,
    Rename
  }
{code}

Typo: "Unknown".

{code}
  class AzureFileSystemThreadFactory implements ThreadFactory {
{code}

{code}
  public class AzureFileSystemThreadRunnable implements Runnable {
{code}

{{NativeAzureFileSystem}} is a pretty long class already.  I propose 
refactoring {{AzureFileSystemThreadFactory}} and 
{{AzureFileSystemThreadRunnable}} to top-level classes with package-private 
visibility in separate files.

{code}
        if ((ioThreadPool = getThreadPool(threadCount, threadNamePrefix)) != 
null) {
{code}

I don't understand the null check.  {{getThreadPool}} cannot return {{null}}, 
because it simply calls the {{ThreadPoolExecutor}} constructor and returns the 
new object.  I see some tests mock it to return {{null}}, but since this 
condition cannot really happen, why do we need test coverage for it?  The same 
feedback applies to the exception handling.  I do not see any checked 
exceptions thrown from {{getThreadPool}}.  The {{ThreadPoolExecutor}} 
constructor can throw unchecked exceptions if the arguments don't make sense 
(e.g. negative thread pool size), but we can validate the sanity of those 
configuration parameters during {{initialize}} after following the feedback 
above about configuration handling.

{code}
          lastException = new IOException(operation + " failed as operation on 
subfolers and files failed.");
{code}

Typo: "subfolders".

{code}
  @Test(expected=IOException.class)
{code}

Please don't use this technique for writing tests that expect a certain 
exception.  The problem with this is that it's very coarse-grained.  It allows 
the test to pass if an {{IOException}} is thrown from any point in the method.  
There are multiple points in these test methods where the exception can be 
thrown.  Even if it's thrown from the wrong place, JUnit will still think the 
test passed.  Instead, I recommend using the JUnit {{ExpectedException}} rule, 
which gives you fine-grained control of declaring exactly where you expect the 
exception to be thrown.  See {{TestS3AAWSCredentialsProvider}} for an example 
of an existing test suite that uses {{ExpectedException}}.

{code}
  public void testDeleteSigleDeleteFailure() throws Exception {
{code}

{code}
  public void testDeleteSigleDeleteException() throws Exception {
{code}

{code}
  public void testRenameSigleRenameException() throws Exception {
{code}

Same typo on all 3: "Single".

The FindBugs and Checkstyle warnings from the last pre-commit run look 
legitimate, so please fix those.  You can check these reports locally by 
running {{mvn -o findbugs:findbugs}} and {{mvn -o checkstyle:checkstyle}}.

I recommend updating the documentation at src/site/markdown/index.md to 
describe this new feature and its configuration properties.

> AzureNativeFileSystem rename/delete performance improvements
> ------------------------------------------------------------
>
>                 Key: HADOOP-13403
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13403
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: azure
>    Affects Versions: 2.7.2
>            Reporter: Subramanyam Pattipaka
>            Assignee: Subramanyam Pattipaka
>             Fix For: 2.9.0
>
>         Attachments: HADOOP-13403-001.patch, HADOOP-13403-002.patch
>
>
> WASB Performance Improvements
> Problem
> -----------
> Azure Native File system operations like rename/delete which has large number 
> of directories and/or files in the source directory are experiencing 
> performance issues. Here are possible reasons
> a)    We first list all files under source directory hierarchically. This is 
> a serial operation. 
> b)    After collecting the entire list of files under a folder, we delete or 
> rename files one by one serially.
> c)    There is no logging information available for these costly operations 
> even in DEBUG mode leading to difficulty in understanding wasb performance 
> issues.
> Proposal
> -------------
> Step 1: Rename and delete operations will generate a list all files under the 
> source folder. We need to use azure flat listing option to get list with 
> single request to azure store. We have introduced config 
> fs.azure.flatlist.enable to enable this option. The default value is 'false' 
> which means flat listing is disabled.
> Step 2: Create thread pool and threads dynamically based on user 
> configuration. These thread pools will be deleted after operation is over.  
> We are introducing introducing two new configs
>       a)      fs.azure.rename.threads : Config to set number of rename 
> threads. Default value is 0 which means no threading.
>       b)      fs.azure.delete.threads: Config to set number of delete 
> threads. Default value is 0 which means no threading.
>       We have provided debug log information on number of threads not used 
> for the operation which can be useful .
>       Failure Scenarios:
>       If we fail to create thread pool due to ANY reason (for example trying 
> create with thread count with large value such as 1000000), we fall back to 
> serialization operation. 
> Step 3: Bob operations can be done in parallel using multiple threads 
> executing following snippet
>       while ((currentIndex = fileIndex.getAndIncrement()) < files.length) {
>               FileMetadata file = files[currentIndex];
>               Rename/delete(file);
>       }
>       The above strategy depends on the fact that all files are stored in a 
> final array and each thread has to determine synchronized next index to do 
> the job. The advantage of this strategy is that even if user configures large 
> number of unusable threads, we always ensure that work doesn’t get serialized 
> due to lagging threads. 
>       We are logging following information which can be useful for tuning 
> number of threads
>       a) Number of unusable threads
>       b) Time taken by each thread
>       c) Number of files processed by each thread
>       d) Total time taken for the operation
>       Failure Scenarios:
>       Failure to queue a thread execute request shouldn’t be an issue if we 
> can ensure at least one thread has completed execution successfully. If we 
> couldn't schedule one thread then we should take serialization path. 
> Exceptions raised while executing threads are still considered regular 
> exceptions and returned to client as operation failed. Exceptions raised 
> while stopping threads and deleting thread pool shouldn't can be ignored if 
> operation all files are done with out any issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (HADOOP-13403) AzureNativeFileSystem rename/delete performance improvements

Reply via email to