[ 
https://issues.apache.org/jira/browse/IMPALA-14462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18022922#comment-18022922
 ] 

ASF subversion and git services commented on IMPALA-14462:
----------------------------------------------------------

Commit 775f73f03ea59401ca2752383182185599b9777d in impala's branch 
refs/heads/master from Joe McDonnell
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=775f73f03 ]

IMPALA-14462: Fix tie-breaking for sorting scan ranges oldest to newest

TestTupleCacheFullCluster.test_scan_range_distributed is flaky on S3
builds. The addition of a single file is changing scheduling significantly
even with scan ranges sorted oldest to newest. This is because modification
times on S3 have a granularity of one second. Multiple files have the
same modification time, and the fix for IMPALA-13548 did not properly
break ties for sorting.

This adds logic to break ties for files with the same modification
time. It compares the path (absolute path or relative path + partition)
as well as the offset within the file. These should be enough to break
all conceivable ties, as it is not possible to have two scan ranges with
the same file at the same offset. In debug builds, this does additional
validation to make sure that when a != b, comp(a, b) != comp(b, a).
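
As a rough illustration of the tie-breaking described above, here is a
Python sketch. It is not the actual C++ scheduler code: the ScanRange
class, its field names, and the function names are all hypothetical. It
compares modification time first, then path, then offset, and checks that
the comparator gives opposite answers for (a, b) and (b, a) whenever
a != b.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class ScanRange:
    # Hypothetical stand-in for a scan range; field names are assumptions.
    mtime: int   # modification time in whole seconds (S3 granularity)
    path: str    # absolute path, or relative path plus partition
    offset: int  # byte offset of the range within the file


def older_first(a: ScanRange, b: ScanRange) -> bool:
    """Return True if a must be ordered strictly before b."""
    if a.mtime != b.mtime:
        return a.mtime < b.mtime
    # Same modification time: break the tie on path, then on offset.
    if a.path != b.path:
        return a.path < b.path
    return a.offset < b.offset


def check_comparator(ranges) -> None:
    """Debug-style validation: distinct ranges must never tie."""
    for a, b in permutations(ranges, 2):
        if a != b:
            assert older_first(a, b) != older_first(b, a), (a, b)


def sort_oldest_to_newest(ranges):
    ranges = list(ranges)
    check_comparator(ranges)
    # The tuple key yields the same ordering as older_first().
    return sorted(ranges, key=lambda r: (r.mtime, r.path, r.offset))
```

With a total order like this, the sorted result does not depend on the
order in which the scan ranges were supplied, which is the property the
scheduler relies on.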

The test requires that adding a single file to the table changes exactly
one cache key. If that final file has the same modification time as
an existing file, scheduling may still mix up the files and change more
than one cache key, even with tie-breaking. This adds a sleep just before
generating the final file to guarantee that it gets a newer modification
time.
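
One way to picture the sleep is the Python sketch below. It is an
assumption-laden illustration, not the code in the test: the helper name
wait_for_next_second and its clock/sleep parameters are hypothetical. It
waits until the clock has moved into a later whole second than the
previous file's modification time, so a file written afterwards cannot
share that one-second timestamp.

```python
import time


def wait_for_next_second(last_mtime: float,
                         clock=time.time,
                         sleep=time.sleep) -> None:
    """Block until clock() is in a later whole second than last_mtime.

    clock and sleep are injectable (hypothetical parameters) so the
    helper can be exercised without really waiting.
    """
    target = int(last_mtime) + 1  # first instant of the next second
    while True:
        remaining = target - clock()
        if remaining <= 0:
            return
        sleep(remaining)
```

This sketch assumes the local clock and the storage system's clock agree.
A fixed sleep comfortably longer than one second is simpler and tolerates
small clock differences.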

Testing:
 - Ran TestTupleCacheFullCluster.test_scan_range_distributed for 15
   iterations on S3

Change-Id: I3f2e40d3f975ee370c659939da0374675a28cd38
Reviewed-on: http://gerrit.cloudera.org:8080/23458
Tested-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Reviewed-by: Michael Smith <michael.sm...@cloudera.com>
Reviewed-by: Riza Suminto <riza.sumi...@cloudera.com>


> TestTupleCacheFullCluster.test_scan_range_distributed fails on S3
> -----------------------------------------------------------------
>
>                 Key: IMPALA-14462
>                 URL: https://issues.apache.org/jira/browse/IMPALA-14462
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend, Test
>    Affects Versions: Impala 5.0.0
>            Reporter: Joe McDonnell
>            Assignee: Joe McDonnell
>            Priority: Blocker
>             Fix For: Impala 5.0.0
>
>
> TestTupleCacheFullCluster.test_scan_range_distributed expects that 
> inserting a single file changes the tuple cache's runtime hash for at most 
> a single executor. This should hold because scan ranges are now scheduled 
> oldest to newest. However, the test is failing on S3:
> {noformat}
> custom_cluster/test_tuple_cache.py:905: in test_scan_range_distributed
>     assert len(after_insert_unique_cache_keys - unique_cache_keys) == 1
> E   assert 3 == 1
> E    +  where 3 = len(({'6e2682fb793acd7b689a8d69aab01675_1266802730', 
> '6e2682fb793acd7b689a8d69aab01675_2730281323', 
> '6e2682fb793acd7b689a8d69aab01675_3027502829'} - 
> {'6e2682fb793acd7b689a8d69aab01675_1885657991', 
> '6e2682fb793acd7b689a8d69aab01675_2939791479', 
> '6e2682fb793acd7b689a8d69aab01675_3685468122'})){noformat}
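
To make the failing assertion concrete, the set difference from the
traceback above can be reproduced in Python with the same six keys. The
variable names follow the assertion; new_keys and prefixes are
illustrative. All three keys seen after the insert are absent from the
earlier set, so the difference has three elements instead of the expected
one. The part before the underscore is identical in all six keys; only
the suffix differs.

```python
# Cache keys copied from the traceback above.
unique_cache_keys = {
    "6e2682fb793acd7b689a8d69aab01675_1885657991",
    "6e2682fb793acd7b689a8d69aab01675_2939791479",
    "6e2682fb793acd7b689a8d69aab01675_3685468122",
}
after_insert_unique_cache_keys = {
    "6e2682fb793acd7b689a8d69aab01675_1266802730",
    "6e2682fb793acd7b689a8d69aab01675_2730281323",
    "6e2682fb793acd7b689a8d69aab01675_3027502829",
}

# Keys that appear only after the insert.
new_keys = after_insert_unique_cache_keys - unique_cache_keys

# Prefixes (text before the underscore) across all six keys.
prefixes = {
    key.split("_")[0]
    for key in unique_cache_keys | after_insert_unique_cache_keys
}

print(len(new_keys))  # the test asserts this equals 1
print(prefixes)
```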



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

