[jira] [Work logged] (HADOOP-17873) ABFS: Fix transient failures in ITestAbfsStreamStatistics and ITestAbfsRestOperationException

ASF GitHub Bot (Jira) Tue, 07 Sep 2021 16:49:06 -0700


     [ 
https://issues.apache.org/jira/browse/HADOOP-17873?focusedWorklogId=647626&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-647626
 ]


ASF GitHub Bot logged work on HADOOP-17873:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 07/Sep/21 23:48
            Start Date: 07/Sep/21 23:48
    Worklog Time Spent: 10m 
      Work Description: sumangala-patki commented on pull request #3341:
URL: https://github.com/apache/hadoop/pull/3341#issuecomment-914703372


   I agree that any given process executes its tests sequentially. 
   
   Some findings:
   The failure can be consistently reproduced by adding a dummy test (in the 
same class) that calls the existing test, and running the full test suite. It 
was verified that each of the two tests makes the correct number of calls to 
the read/write op increment function for its respective paths, with no extra 
updates. Moreover, the statistics reset method works: all read/write operation 
counts were verified to be 0 right after reset was called, which happens before 
each section (small file, large file). This rules out leftover state and reset 
issues. As stats updates happen within the driver before the store calls for 
read, the response from the remote store does not affect the values.
   Therefore, this test run should have passed, given that there was no 
interference between the statistics reset and the value assertion, and that 
only the expected number of operation increments occurred.
   
   For one failing scenario:
   Expected value for large file read op count: 102 or 103
   Actual value in streamOps test: 99
   Actual value in dummy test: 198
   Value according to logs for each test: 103
   
   One way this could have happened is that the two tests (possibly along with 
any other test class involving read) were running in different processes, but 
around the same time. The tests would then have been modifying the same 
statistics variable, which could also explain the drop in read count despite 
the test having executed the expected number of read ops - the statistics reset 
was called in one test while the other test was in the middle of executing 
read.
   Hence, any test that performs read/write while running in parallel with the 
stats test may affect it. To avoid this scenario, we can introduce an 
additional filesystem level statistics variable that is not static, alongside 
the current static one that records operations globally from all filesystems 
created in a session.
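
   The interference described above can be sketched in a few lines of Java. 
This is a minimal illustration, not the ABFS code: `SketchFileSystem`, 
`GLOBAL_READ_OPS` and `StatsInterferenceSketch` are hypothetical names, and the 
counts 50, 53 and 10 are made up so that the first filesystem performs 103 
reads in total. It replays one possible interleaving deterministically on a 
single thread, and it assumes the interfering code runs in the same JVM as the 
stats test, since a static field is only shared within one JVM.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for a filesystem; not the real ABFS classes.
class SketchFileSystem {
    // Shared by every instance in the JVM, like a global statistics object.
    static final AtomicLong GLOBAL_READ_OPS = new AtomicLong();
    // Owned by this instance only; other filesystems cannot modify it.
    private final AtomicLong instanceReadOps = new AtomicLong();

    void read() {
        GLOBAL_READ_OPS.incrementAndGet();
        instanceReadOps.incrementAndGet();
    }

    void resetStatistics() {
        GLOBAL_READ_OPS.set(0);
        instanceReadOps.set(0);
    }

    long getInstanceReadOps() {
        return instanceReadOps.get();
    }
}

class StatsInterferenceSketch {
    /** Returns {global count, fsA instance count} after one interleaving. */
    static long[] run() {
        SketchFileSystem fsA = new SketchFileSystem(); // used by the stats test
        SketchFileSystem fsB = new SketchFileSystem(); // used by another test

        fsA.resetStatistics();
        for (int i = 0; i < 50; i++) {
            fsA.read();
        }
        // The other test resets while fsA is mid-way: the global count drops to 0.
        fsB.resetStatistics();
        for (int i = 0; i < 53; i++) {
            fsA.read();
        }
        // The other test's reads are added to the global count as well.
        for (int i = 0; i < 10; i++) {
            fsB.read();
        }
        return new long[] {
            SketchFileSystem.GLOBAL_READ_OPS.get(),
            fsA.getInstanceReadOps()
        };
    }

    public static void main(String[] args) {
        long[] counts = run();
        System.out.println("global=" + counts[0] + " instance=" + counts[1]);
    }
}
```

   In this replay the global counter ends at 63 even though fsA performed 103 
reads, while the per-instance counter reports 103. An assertion on the 
instance-level value is unaffected by what other filesystems do, which is the 
motivation for the non-static variable proposed above.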


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 647626)
    Time Spent: 3h 40m  (was: 3.5h)

> ABFS: Fix transient failures in ITestAbfsStreamStatistics and 
> ITestAbfsRestOperationException
> ---------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-17873
>                 URL: https://issues.apache.org/jira/browse/HADOOP-17873
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/azure
>    Affects Versions: 3.3.1
>            Reporter: Sumangala Patki
>            Assignee: Sumangala Patki
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> To address transient failures in the following test classes:
>  * ITestAbfsStreamStatistics: Uses a filesystem level instance to record 
> read/write statistics, which also tracks these operations in other tests 
> running in parallel. To be marked for sequential run only to avoid transient 
> failure
>  * ITestAbfsRestOperationException: The use of a static member to track retry 
> count causes transient failures when two tests of this class happen to run 
> together. Switch to a non-static variable for assertions on retry count



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

