Kim gichan created HADOOP-19270:
-----------------------------------

             Summary: Use stable sort in commandQueue
                 Key: HADOOP-19270
                 URL: https://issues.apache.org/jira/browse/HADOOP-19270
             Project: Hadoop Common
          Issue Type: Bug
          Components: tools
    Affects Versions: 3.4.0, 3.4.1
            Reporter: Kim gichan
             Fix For: 3.5.0


h2. Purpose
- Remove the possibility that the simulation replays audit log commands in the wrong order.

h2. Why does this happen?
- private DelayQueue<AuditReplayCommand> commandQueue is backed internally by a PriorityQueue, which is not stable: elements that compare as equal are returned in arbitrary order.
- As a result, commandQueue can return commands in an order that differs from the original audit log order.
- In real production logs, some commands occur at the same time and must be replayed in a fixed order.
{code:bash}
# getfileinfo before open
2024-07-01 19:27:12,886 INFO FSNamesystem.audit: allowed=true   ugi=xx-xx (auth:TOKEN) via hive/[email protected] (auth:TOKEN)   ip=/10.xx.xxx.xxx       cmd=getfileinfo src=/user/hive/warehouse/a.db/b/date_id=2024-06-16/part-xxxx.gz.parquet     dst=null        perm=null       proto=rpc
2024-07-01 19:27:12,886 INFO FSNamesystem.audit: allowed=true   ugi=xx-xx (auth:TOKEN) via hive/[email protected] (auth:TOKEN)   ip=/10.xx.xxx.xxx       cmd=open        src=/user/hive/warehouse/a.db/b/date_id=2024-06-16/part-xxxx.gz.parquet     dst=null        perm=null       proto=rpc

# create before setPermission
# these examples do not have exactly the same timestamp, but the timestamps
# could become the same when the rate factor is high enough
2024-07-01 17:25:30,867 INFO FSNamesystem.audit: allowed=true   [email protected] (auth:KERBEROS)    ip=/10.xxx.xx.xxx       cmd=create      src=/user/yy-yy/.staging/job_1716867484406_290658/job.xml dst=null        perm=yy-yy:zzz:rw-rw-r--  proto=rpc
2024-07-01 17:25:30,871 INFO FSNamesystem.audit: allowed=true   [email protected] (auth:KERBEROS)    ip=/10.xxx.xx.xxx       cmd=setPermission       src=/user/yy-yy/.staging/job_1716867484406_290658/job.xml dst=null        perm=yy-yy:zzz:rw-r--r--  proto=rpc
{code}
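
The tie behavior described above can be reproduced outside Hadoop. The following is a minimal sketch, not the actual replay code: the Command class is a hypothetical stand-in for AuditReplayCommand that is ordered by timestamp only, and getDelay always returns 0 so that every element is immediately available. Because all three commands compare as equal, the order in which they come back is unspecified and may differ from the order in which they were offered.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

class UnstableTieDemo {
  /** Hypothetical stand-in for AuditReplayCommand, ordered by timestamp only. */
  static final class Command implements Delayed {
    final long timestampMs;
    final String cmd;

    Command(long timestampMs, String cmd) {
      this.timestampMs = timestampMs;
      this.cmd = cmd;
    }

    @Override
    public long getDelay(TimeUnit unit) {
      return 0; // already expired, so poll() can return it immediately
    }

    @Override
    public int compareTo(Delayed other) {
      // Equal timestamps compare as 0, so the queue has no basis for ordering them.
      return Long.compare(timestampMs, ((Command) other).timestampMs);
    }
  }

  /** Offers the commands in list order and returns the order in which they are polled. */
  static List<String> replay(List<Command> commands) {
    DelayQueue<Command> queue = new DelayQueue<>();
    for (Command c : commands) {
      queue.offer(c);
    }
    List<String> out = new ArrayList<>();
    Command next;
    while ((next = queue.poll()) != null) {
      out.add(next.cmd);
    }
    return out;
  }

  public static void main(String[] args) {
    // Three commands that share one timestamp, offered in audit log order.
    List<String> replayed = replay(Arrays.asList(
        new Command(1000L, "getfileinfo"),
        new Command(1000L, "open"),
        new Command(1000L, "delete")));
    // The printed order is not guaranteed to match the offer order.
    System.out.println(replayed);
  }
}
```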

h2. How much does stable sort improve test accuracy?
- With stable sort, the queue can no longer replay same-timestamp commands in the wrong order.
-- I changed the code to include the audit log line number in the sort criteria,
-- because it is not simple to change the DelayQueue data structure itself to sort stably.
- Multi-threading or client-IP-based partitioning can also occur in real production and affect log order, but this is not critical.
- The graph
-- compares stable sort and unstable sort at different rates (1~4)
-- uses a 5-minute simulation (at rate 1) of an IP-partitioned audit log
-- shows total valid commands, total read latency, and total write latency
(image will be uploaded soon)
- Conclusion
-- Stable sort keeps the number of valid commands almost the same.
-- Unstable sort sometimes shows extremely high latency because of wrong-ordered log simulation.
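
The line-number tie-break mentioned above could look roughly like the sketch below. This is an illustration under assumptions, not the actual patch: the Command class and its lineNumber field are hypothetical names, and getDelay returns 0 only so that the example runs without waiting. Because every (timestamp, line number) pair is unique, the comparison never returns 0 for two different commands, so the queue returns same-timestamp commands in audit log order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

class StableTieBreakDemo {
  /** Hypothetical command that remembers its line number in the audit log. */
  static final class Command implements Delayed {
    final long timestampMs;
    final long lineNumber; // assumed field: position of the entry in the audit log
    final String cmd;

    Command(long timestampMs, long lineNumber, String cmd) {
      this.timestampMs = timestampMs;
      this.lineNumber = lineNumber;
      this.cmd = cmd;
    }

    @Override
    public long getDelay(TimeUnit unit) {
      return 0; // already expired, so poll() can return it immediately
    }

    @Override
    public int compareTo(Delayed other) {
      Command o = (Command) other;
      int byTime = Long.compare(timestampMs, o.timestampMs);
      // Fall back to the log line number when timestamps are equal.
      return byTime != 0 ? byTime : Long.compare(lineNumber, o.lineNumber);
    }
  }

  /** Offers the commands in list order and returns the order in which they are polled. */
  static List<String> replay(List<Command> commands) {
    DelayQueue<Command> queue = new DelayQueue<>();
    for (Command c : commands) {
      queue.offer(c);
    }
    List<String> out = new ArrayList<>();
    Command next;
    while ((next = queue.poll()) != null) {
      out.add(next.cmd);
    }
    return out;
  }

  public static void main(String[] args) {
    // Offered out of log order on purpose; the line number restores the order.
    List<String> replayed = replay(Arrays.asList(
        new Command(1000L, 3, "delete"),
        new Command(1000L, 1, "getfileinfo"),
        new Command(1000L, 2, "open"),
        new Command(900L, 0, "create")));
    System.out.println(replayed); // [create, getfileinfo, open, delete]
  }
}
```

The trade-off of this approach is one extra long per command, in exchange for leaving the DelayQueue itself untouched.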



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
