[ 
https://issues.apache.org/jira/browse/HADOOP-19270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Krogen resolved HADOOP-19270.
----------------------------------
    Resolution: Fixed

Merged to trunk.

> Use stable sort in commandQueue
> -------------------------------
>
>                 Key: HADOOP-19270
>                 URL: https://issues.apache.org/jira/browse/HADOOP-19270
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: tools
>    Affects Versions: 3.4.0, 3.4.1
>            Reporter: Kim gichan
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.5.0
>
>         Attachments: image-2024-09-19-16-40-34-947.png
>
>
> h2. Purpose
>  - Eliminate the possibility of replaying audit log commands in the wrong 
> order during simulation.
> h2. Why does this happen?
>  - private DelayQueue<AuditReplayCommand> commandQueue is backed by a 
> PriorityQueue, which does not preserve insertion order among elements that 
> compare equal (its ordering is not stable).
>  - As a result, commandQueue can yield commands in an order that differs from 
> the original audit log order.
>  - In real production logs, some commands occur at the same timestamp and 
> must keep a fixed order.
> {code:bash}
> # getfileinfo before open
> 2024-07-01 19:27:12,886 INFO FSNamesystem.audit: allowed=true   ugi=xx-xx 
> (auth:TOKEN) via hive/hadoop.example....@example.prod (auth:TOKEN)   
> ip=/10.xx.xxx.xxx       cmd=getfileinfo 
> src=/user/hive/warehouse/a.db/b/date_id=2024-06-16/part-xxxx.gz.parquet     
> dst=null        perm=null       proto=rpc
> 2024-07-01 19:27:12,886 INFO FSNamesystem.audit: allowed=true   ugi=xx-xx 
> (auth:TOKEN) via hive/hadoop.example....@example.prod (auth:TOKEN)   
> ip=/10.xx.xxx.xxx       cmd=open        
> src=/user/hive/warehouse/a.db/b/date_id=2024-06-16/part-xxxx.gz.parquet     
> dst=null        perm=null       proto=rpc
> # create before setPermission
> # these examples do not have exactly the same timestamp, but the timestamps 
> could coincide when the rate factor is high enough
> 2024-07-01 17:25:30,867 INFO FSNamesystem.audit: allowed=true   
> ugi=yy...@example.prod (auth:KERBEROS)    ip=/10.xxx.xx.xxx       cmd=create  
>     src=/user/yy-yy/.staging/job_1716867484406_290658/job.xml dst=null        
> perm=yy-yy:zzz:rw-rw-r--  proto=rpc
> 2024-07-01 17:25:30,871 INFO FSNamesystem.audit: allowed=true   
> ugi=yy...@example.prod (auth:KERBEROS)    ip=/10.xxx.xx.xxx       
> cmd=setPermission       
> src=/user/yy-yy/.staging/job_1716867484406_290658/job.xml dst=null        
> perm=yy-yy:zzz:rw-r--r--  proto=rpc
> {code}
> h2. How much does stable sort improve test accuracy?
>  - With stable sort, wrong-ordered simulation can no longer occur.
>  -- I changed the code to include the audit log line number in the sorting 
> criteria,
>  -- because it is not simple to change the DelayQueue data structure itself 
> to use a stable sort.
>  - Multithreading or client-IP-based partitioning can occur in real 
> production and affect log order, but this is not critical.
>  -- Client-IP-based partitioning even resembles the chaotic log order of real 
> production.
>  - The graph below
>  -- uses a real production HDFS audit log
>  -- compares stable sort and unstable sort at different rate factors (1 to 4)
>  -- uses an IP-partitioned audit log covering 5 minutes of simulation (at 
> rate 1)
>  -- shows total valid commands, total read latency, and total write latency
> !image-2024-09-19-16-40-34-947.png|width=615,height=740!
>  - Conclusion
>  -- With stable sort, the number of valid commands stays nearly the same.
>  -- Unstable sort sometimes shows extremely high latency because of 
> wrong-ordered log simulation.
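
The tie-breaking approach described in the issue (adding the audit log line
number to the sorting criteria) can be sketched roughly as follows. This is a
minimal illustration, not the merged patch: the Command class is a
hypothetical stand-in for AuditReplayCommand, and its field names are
assumptions. Only DelayQueue, Delayed, and TimeUnit are real JDK types.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

final class StableReplayQueueSketch {

  /** Hypothetical stand-in for AuditReplayCommand (not the real Hadoop class). */
  static final class Command implements Delayed {
    final long lineNumber;   // position in the audit log, used as tie-breaker
    final long timestampMs;  // absolute replay time in epoch milliseconds
    final String cmd;

    Command(long lineNumber, long timestampMs, String cmd) {
      this.lineNumber = lineNumber;
      this.timestampMs = timestampMs;
      this.cmd = cmd;
    }

    @Override
    public long getDelay(TimeUnit unit) {
      long remainingMs = timestampMs - System.currentTimeMillis();
      return unit.convert(remainingMs, TimeUnit.MILLISECONDS);
    }

    @Override
    public int compareTo(Delayed other) {
      Command o = (Command) other;
      // Primary key: replay timestamp. Secondary key: log line number, so
      // commands with equal timestamps keep their original log order.
      int byTime = Long.compare(timestampMs, o.timestampMs);
      return byTime != 0 ? byTime : Long.compare(lineNumber, o.lineNumber);
    }
  }

  /** Enqueues commands out of order and returns the order they are dequeued in. */
  static List<String> replayOrder() {
    DelayQueue<Command> queue = new DelayQueue<>();
    // Timestamps lie far in the past, so every command is already due and
    // poll() returns it immediately instead of null.
    queue.add(new Command(4, 200L, "setPermission"));
    queue.add(new Command(2, 100L, "open"));
    queue.add(new Command(3, 200L, "create"));
    queue.add(new Command(1, 100L, "getfileinfo"));

    List<String> order = new ArrayList<>();
    Command next;
    while ((next = queue.poll()) != null) {
      order.add(next.cmd);
    }
    return order;
  }

  public static void main(String[] args) {
    // prints [getfileinfo, open, create, setPermission]
    System.out.println(replayOrder());
  }
}
```

Because the comparator never reports two distinct log lines as equal, the
queue has no ties left to break arbitrarily, which gives the effect of a
stable sort without replacing DelayQueue itself.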



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-dev-h...@hadoop.apache.org
