[ 
https://issues.apache.org/jira/browse/HADOOP-18257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17768659#comment-17768659
 ] 

ASF GitHub Bot commented on HADOOP-18257:
-----------------------------------------

mehakmeet commented on PR #6000:
URL: https://github.com/apache/hadoop/pull/6000#issuecomment-1733516788

   > Have you tested this on actual files? And specifically so many files... 
total size in GB's kind of scale testing?
   
   I have tested this, but not at scale. I will do that.
   Example test:
   
   ```
   ❯ bin/hadoop org.apache.hadoop.fs.s3a.audit.AuditTool s3a://mehakmeet-singh-data/logdir2/ s3a://mehakmeet-singh-data/logsdir/
   16:48:41,319 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
   16:49:58,339 INFO mapreduce.S3AAuditLogMergerAndParser: Successfully generated avro data
   16:49:58,839 INFO mapreduce.S3AAuditLogMergerAndParser: Successfully parsed :7547 audit logs and 6718 referrer headers logs in the logs
   16:49:58,854 INFO impl.MetricsSystemImpl: Stopping s3a-file-system metrics system...
   16:49:58,854 INFO impl.MetricsSystemImpl: s3a-file-system metrics system stopped.
   16:49:58,854 INFO impl.MetricsSystemImpl: s3a-file-system metrics system shutdown complete.
   ```
   Since we're reading each file serially, line by line, I would assume this
would be a lot slower at that scale. Optimisation can be a follow-up patch.
   
   > If there are so many files, for example 1000, does it launch multiple mappers to 
process x files by each mapper based on the splits?

   Not currently. Is that something we would have to write the logic for? I'll
have to check the code. For the number of mappers specifically, maybe we could
set a threshold on the number of files and then paginate them based on that.
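
The threshold idea in the reply above could be sketched roughly as follows. This is only an illustration, not code from the PR: `FileBatcher`, `paginate`, and `maxFilesPerBatch` are hypothetical names, the paths are made up, and how each batch would actually be handed to a mapper is left open.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of the PR: groups file paths into fixed-size batches. */
final class FileBatcher {

  private FileBatcher() {
  }

  /**
   * Splits {@code files} into consecutive batches of at most
   * {@code maxFilesPerBatch} entries; the last batch may be smaller.
   */
  static List<List<String>> paginate(List<String> files, int maxFilesPerBatch) {
    if (maxFilesPerBatch <= 0) {
      throw new IllegalArgumentException("maxFilesPerBatch must be positive");
    }
    List<List<String>> batches = new ArrayList<>();
    for (int start = 0; start < files.size(); start += maxFilesPerBatch) {
      int end = Math.min(start + maxFilesPerBatch, files.size());
      // Copy the sublist so each batch is independent of the input list.
      batches.add(new ArrayList<>(files.subList(start, end)));
    }
    return batches;
  }

  public static void main(String[] args) {
    List<String> files = new ArrayList<>();
    for (int i = 0; i < 1000; i++) {
      files.add("s3a://example-bucket/logs/file-" + i); // made-up paths
    }
    // With a threshold of 300 files, 1000 files yield four batches.
    System.out.println(paginate(files, 300).size());
  }
}
```

Each batch could then be processed by one worker, so the number of workers follows from the file count and the threshold rather than being fixed.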




> Analyzing S3A Audit Logs 
> -------------------------
>
>                 Key: HADOOP-18257
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18257
>             Project: Hadoop Common
>          Issue Type: Task
>          Components: fs/s3
>            Reporter: Sravani Gadey
>            Assignee: Mehakmeet Singh
>            Priority: Major
>              Labels: pull-request-available
>
> The main aim is to analyze S3A audit logs to give better insights into Hive and 
> Spark jobs.
> Steps involved are:
>  * Merging audit log files containing a huge number of audit logs collected 
> from a job containing various S3 requests.
>  * Parsing audit logs using regular expressions, i.e., dividing them into 
> key-value pairs.
>  * Converting the key-value pairs into CSV and Avro file formats.
>  * Querying the data to give better insights into different jobs.
>  * Visualizing the audit logs in a Zeppelin or Jupyter notebook with graphs.
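
The parsing and CSV conversion steps in the description could look roughly like the sketch below. It is not the tool's implementation. The actual audit log layout is not shown in this message, so the sketch assumes the pairs arrive as `key=value` separated by `&`, and the sample string in `main` is made up. `KeyValueParser`, `parse`, and `toCsvRow` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: regex-based key-value extraction and CSV row output. */
final class KeyValueParser {

  // One key=value pair; pairs are assumed to be separated by '&'.
  private static final Pattern PAIR = Pattern.compile("([^&=]+)=([^&]*)");

  private KeyValueParser() {
  }

  /** Returns the pairs found in {@code text}, in order of appearance. */
  static Map<String, String> parse(String text) {
    Map<String, String> pairs = new LinkedHashMap<>();
    Matcher matcher = PAIR.matcher(text);
    while (matcher.find()) {
      pairs.put(matcher.group(1), matcher.group(2));
    }
    return pairs;
  }

  /** Renders the values as one CSV row, quoting every field. */
  static String toCsvRow(Map<String, String> pairs) {
    List<String> fields = new ArrayList<>();
    for (String value : pairs.values()) {
      // Double any embedded quotes, then wrap the field in quotes.
      fields.add("\"" + value.replace("\"", "\"\"") + "\"");
    }
    return String.join(",", fields);
  }

  public static void main(String[] args) {
    // Made-up input, only to exercise the parser.
    Map<String, String> pairs = parse("op=op_open&p1=data/file.csv&t0=12");
    System.out.println(pairs);
    System.out.println(toCsvRow(pairs));
  }
}
```

A `LinkedHashMap` keeps the keys in the order they appear, so the CSV columns stay stable for lines that share the same layout.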



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
