[ https://issues.apache.org/jira/browse/HADOOP-19270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Kim gichan updated HADOOP-19270:
--------------------------------
Attachment: image-2024-09-19-16-40-34-947.png
Description:
h2. Purpose
- Remove the possibility of replaying audit log commands in the wrong order during simulation.
h2. Why does this happen?
- private DelayQueue<AuditReplayCommand> commandQueue is backed by a PriorityQueue, which does not keep insertion order for elements that compare as equal (it is not a stable sort).
- As a result, commandQueue can return commands in an order that differs from the original audit log order.
- In real production logs, some commands occur at the same time and must keep a fixed order.
{code:bash}
# getfileinfo before open
2024-07-01 19:27:12,886 INFO FSNamesystem.audit: allowed=true ugi=xx-xx
(auth:TOKEN) via hive/[email protected] (auth:TOKEN)
ip=/10.xx.xxx.xxx cmd=getfileinfo
src=/user/hive/warehouse/a.db/b/date_id=2024-06-16/part-xxxx.gz.parquet
dst=null perm=null proto=rpc
2024-07-01 19:27:12,886 INFO FSNamesystem.audit: allowed=true ugi=xx-xx
(auth:TOKEN) via hive/[email protected] (auth:TOKEN)
ip=/10.xx.xxx.xxx cmd=open
src=/user/hive/warehouse/a.db/b/date_id=2024-06-16/part-xxxx.gz.parquet
dst=null perm=null proto=rpc
# create before setPermission
# these two entries do not have exactly the same timestamp, but the timestamps
# can become equal when the rate factor is high enough
2024-07-01 17:25:30,867 INFO FSNamesystem.audit: allowed=true
[email protected] (auth:KERBEROS) ip=/10.xxx.xx.xxx cmd=create
src=/user/yy-yy/.staging/job_1716867484406_290658/job.xml dst=null
perm=yy-yy:zzz:rw-rw-r-- proto=rpc
2024-07-01 17:25:30,871 INFO FSNamesystem.audit: allowed=true
[email protected] (auth:KERBEROS) ip=/10.xxx.xx.xxx
cmd=setPermission
src=/user/yy-yy/.staging/job_1716867484406_290658/job.xml dst=null
perm=yy-yy:zzz:rw-r--r-- proto=rpc
{code}
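To illustrate how the second pair of log lines above can end up with the same replay time, here is a minimal sketch. The `scale` helper and its rounding rule are assumptions made for illustration, not the actual Dynamometer code; only the millisecond values 867 and 871 come from the log lines.

```java
public class RateFactorCollapse {
    // Hypothetical helper: scales a millisecond offset by the replay rate factor.
    // Rounding to whole milliseconds is an assumption of this sketch.
    static long scale(long relativeMs, double rateFactor) {
        return Math.round(relativeMs / rateFactor);
    }

    public static void main(String[] args) {
        long create = 867;         // ms part of the create entry
        long setPermission = 871;  // ms part of the setPermission entry
        for (double rate : new double[] {1, 4, 10}) {
            System.out.printf("rate=%.0f create=%d setPermission=%d%n",
                rate, scale(create, rate), scale(setPermission, rate));
        }
    }
}
```

Under this assumed rounding, the two entries stay distinct at rates 1 and 4 but both map to 87 at rate 10, so their relative order would then depend only on how the queue breaks ties.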
h2. How much does a stable sort improve test accuracy?
- With a stable sort, wrongly ordered simulation cannot occur.
-- I changed the code to include the audit log line number in the sort criteria,
-- because it is not simple to change the DelayQueue data structure itself to use a stable sort.
- Multithreading or client-IP-based partitioning can also occur in real production and affect log order, but this is not critical.
- The graph below
-- uses a real production HDFS audit log
-- compares stable sort and unstable sort at different rates (1 to 4)
-- uses a 5-minute simulation (at rate 1) of an IP-partitioned audit log
-- shows total valid commands, total read latency, and total write latency
!image-2024-09-19-16-40-34-947.png!
- Conclusion
-- With stable sort, the number of valid commands stays almost the same.
-- Unstable sort sometimes shows extremely high latency because of wrongly ordered log simulation.
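The line-number tie-breaker described above can be sketched as follows. This is not the actual patch: `Command`, `sequence`, and `replay` are hypothetical names standing in for AuditReplayCommand and its surroundings, and every timestamp is set to 0 so that all entries are already expired.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

public class StableDelayQueueSketch {
    /** Hypothetical stand-in for AuditReplayCommand. */
    static final class Command implements Delayed {
        final long sequence;     // line number in the audit log
        final long timestampMs;  // absolute replay time
        final String name;

        Command(long sequence, long timestampMs, String name) {
            this.sequence = sequence;
            this.timestampMs = timestampMs;
            this.name = name;
        }

        @Override
        public long getDelay(TimeUnit unit) {
            long remainingMs = timestampMs - System.currentTimeMillis();
            return unit.convert(remainingMs, TimeUnit.MILLISECONDS);
        }

        @Override
        public int compareTo(Delayed other) {
            Command o = (Command) other;
            int byTime = Long.compare(timestampMs, o.timestampMs);
            // Equal timestamps fall back to the log line number,
            // which makes the ordering total and therefore deterministic.
            return byTime != 0 ? byTime : Long.compare(sequence, o.sequence);
        }
    }

    /** Drains all expired commands and returns their names in queue order. */
    static List<String> replay(List<Command> commands) {
        DelayQueue<Command> queue = new DelayQueue<>();
        queue.addAll(commands);
        List<String> order = new ArrayList<>();
        Command next;
        while ((next = queue.poll()) != null) {
            order.add(next.name);
        }
        return order;
    }

    public static void main(String[] args) {
        List<Command> commands = new ArrayList<>();
        // Inserted out of order on purpose; all share the same timestamp.
        commands.add(new Command(2, 0, "open"));
        commands.add(new Command(4, 0, "setPermission"));
        commands.add(new Command(1, 0, "getfileinfo"));
        commands.add(new Command(3, 0, "create"));
        System.out.println(replay(commands));
    }
}
```

Because no two commands compare as equal any more, the queue's arbitrary tie-breaking never comes into play, and the data structure itself does not have to change.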
> Use stable sort in commandQueue
> -------------------------------
>
> Key: HADOOP-19270
> URL: https://issues.apache.org/jira/browse/HADOOP-19270
> Project: Hadoop Common
> Issue Type: Bug
> Components: tools
> Affects Versions: 3.4.0, 3.4.1
> Reporter: Kim gichan
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.5.0
>
> Attachments: image-2024-09-19-16-40-34-947.png