[ https://issues.apache.org/jira/browse/HADOOP-19270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Erik Krogen resolved HADOOP-19270.
----------------------------------
    Resolution: Fixed

Merged to trunk.

> Use stable sort in commandQueue
> -------------------------------
>
>                 Key: HADOOP-19270
>                 URL: https://issues.apache.org/jira/browse/HADOOP-19270
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: tools
>    Affects Versions: 3.4.0, 3.4.1
>            Reporter: Kim gichan
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.5.0
>
>         Attachments: image-2024-09-19-16-40-34-947.png
>
>
> h2. Purpose
> - Remove the possibility of replaying audit log commands in the wrong order during simulation.
>
> h2. Why does this happen?
> - private DelayQueue<AuditReplayCommand> commandQueue is backed by a PriorityQueue, whose ordering is not stable: elements that compare as equal are not guaranteed to come out in insertion order.
> - As a result, commandQueue can return commands in an order that differs from the original audit log order.
> - In real production there are commands that occur at the same time and must keep a fixed order.
> {code:bash}
> # getfileinfo before open
> 2024-07-01 19:27:12,886 INFO FSNamesystem.audit: allowed=true ugi=xx-xx (auth:TOKEN) via hive/hadoop.example....@example.prod (auth:TOKEN) ip=/10.xx.xxx.xxx cmd=getfileinfo src=/user/hive/warehouse/a.db/b/date_id=2024-06-16/part-xxxx.gz.parquet dst=null perm=null proto=rpc
> 2024-07-01 19:27:12,886 INFO FSNamesystem.audit: allowed=true ugi=xx-xx (auth:TOKEN) via hive/hadoop.example....@example.prod (auth:TOKEN) ip=/10.xx.xxx.xxx cmd=open src=/user/hive/warehouse/a.db/b/date_id=2024-06-16/part-xxxx.gz.parquet dst=null perm=null proto=rpc
> # create before setPermission
> # these two entries do not have exactly the same timestamp, but they could
> # collapse to the same time when the rate factor is high enough
> 2024-07-01 17:25:30,867 INFO FSNamesystem.audit: allowed=true ugi=yy...@example.prod (auth:KERBEROS) ip=/10.xxx.xx.xxx cmd=create src=/user/yy-yy/.staging/job_1716867484406_290658/job.xml dst=null perm=yy-yy:zzz:rw-rw-r-- proto=rpc
> 2024-07-01 17:25:30,871 INFO FSNamesystem.audit: allowed=true ugi=yy...@example.prod (auth:KERBEROS) ip=/10.xxx.xx.xxx cmd=setPermission src=/user/yy-yy/.staging/job_1716867484406_290658/job.xml dst=null perm=yy-yy:zzz:rw-r--r-- proto=rpc
> {code}
>
> h2. How much does stable sort improve test accuracy?
> - With a stable sort, wrongly ordered simulation cannot occur.
> -- I changed the code to use the audit log line number as an additional sorting criterion,
> -- because it is not simple to change the DelayQueue data structure itself to sort stably.
> - Multithreading or client-IP-based partitioning can occur in real production and affect log order, but this is not critical.
> -- Client-IP-based partitioning even resembles the chaotic log order of real production.
> - The graph below
> -- uses a real production HDFS audit log
> -- compares stable sort and unstable sort at different rates (1~4)
> -- uses an IP-partitioned audit log covering a 5-minute simulation (at rate 1)
> -- shows total valid commands, total read latency, and total write latency
> !image-2024-09-19-16-40-34-947.png|width=615,height=740!
> - Conclusion
> -- Stable sort keeps the number of valid commands nearly the same.
> -- Unstable sort sometimes shows extremely high latency because of wrongly ordered log simulation.
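The fix described in the issue (using the audit log line number as a tie-breaker, since DelayQueue itself cannot be made stable) can be sketched roughly as follows. This is a minimal illustration, not the merged patch: the Command class, its fields lineNumber and absoluteTimeMs, and the replayOrder helper are hypothetical stand-ins for AuditReplayCommand and the real replay loop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

public class StableReplayOrder {

  /** Hypothetical stand-in for AuditReplayCommand; not the Hadoop class. */
  static final class Command implements Delayed {
    final long lineNumber;      // position of the entry in the original audit log
    final long absoluteTimeMs;  // wall-clock time at which the command becomes due
    final String cmd;

    Command(long lineNumber, long absoluteTimeMs, String cmd) {
      this.lineNumber = lineNumber;
      this.absoluteTimeMs = absoluteTimeMs;
      this.cmd = cmd;
    }

    @Override
    public long getDelay(TimeUnit unit) {
      return unit.convert(absoluteTimeMs - System.currentTimeMillis(),
          TimeUnit.MILLISECONDS);
    }

    @Override
    public int compareTo(Delayed other) {
      // The queue in this sketch only ever holds Command instances.
      Command o = (Command) other;
      int byTime = Long.compare(absoluteTimeMs, o.absoluteTimeMs);
      // Tie-breaker: fall back to the audit log line number so that commands
      // with identical timestamps keep their original relative order.
      return byTime != 0 ? byTime : Long.compare(lineNumber, o.lineNumber);
    }
  }

  /** Drains all already-due commands and returns their names in dequeue order. */
  static List<String> replayOrder(List<Command> commands) {
    DelayQueue<Command> queue = new DelayQueue<>(commands);
    List<String> order = new ArrayList<>();
    Command next;
    // poll() returns null once no element with an expired delay remains.
    while ((next = queue.poll()) != null) {
      order.add(next.cmd);
    }
    return order;
  }

  public static void main(String[] args) {
    long sameTime = 1000L; // far in the past, so every command is already due
    List<Command> shuffled = Arrays.asList(
        new Command(4, sameTime, "setPermission"),
        new Command(2, sameTime, "open"),
        new Command(1, sameTime, "getfileinfo"),
        new Command(3, sameTime, "create"));
    // prints [getfileinfo, open, create, setPermission]
    System.out.println(replayOrder(shuffled));
  }
}
```

Without the second comparison in compareTo, the four commands above would compare as equal and the underlying heap would be free to return them in any order, which is the behavior the issue reports.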