[GitHub] [hadoop] mehakmeet commented on pull request #6000: HADOOP-18257. Merging and Parsing S3A audit logs into Avro format for analysis.

via GitHub Mon, 25 Sep 2023 04:55:12 -0700


mehakmeet commented on PR #6000:
URL: https://github.com/apache/hadoop/pull/6000#issuecomment-1733516788


   > Have you tested this on actual files? And specifically with many files, 
i.e. scale testing with a total size in GBs?
   
   I have tested this, but not at scale. Will do that.
   Example test:
   
   ```
   ❯ bin/hadoop org.apache.hadoop.fs.s3a.audit.AuditTool 
s3a://mehakmeet-singh-data/logdir2/ s3a://mehakmeet-singh-data/logsdir/
   16:48:41,319 WARN util.NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
   16:49:58,339 INFO mapreduce.S3AAuditLogMergerAndParser: Successfully 
generated avro data
   16:49:58,839 INFO mapreduce.S3AAuditLogMergerAndParser: Successfully parsed 
:7547 audit logs and 6718 referrer headers logs in the logs
   16:49:58,854 INFO impl.MetricsSystemImpl: Stopping s3a-file-system metrics 
system...
   16:49:58,854 INFO impl.MetricsSystemImpl: s3a-file-system metrics system 
stopped.
   16:49:58,854 INFO impl.MetricsSystemImpl: s3a-file-system metrics system 
shutdown complete.
   ```
   Since we're reading each file serially, line by line, I would assume this 
would be a lot slower at that scale. Optimisation can be a follow-up patch. 
   
   > If there are so many files, e.g. 1000, does it launch multiple mappers to 
process x files by each mapper based on the splits?
   
   Not currently. Is that logic we would have to write ourselves? I'll have to 
check the code for it. For the number of mappers specifically, maybe we could 
set a threshold on the number of files and then paginate them based on that.
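
The threshold-and-paginate idea could look roughly like the sketch below. This is only an illustration and not code from the PR: the class `AuditLogBatcher`, the method `partition`, the parameter `maxFilesPerBatch`, and the `example-bucket` paths are all hypothetical names, and how each batch would then be handed to a mapper is left open.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of the PR): groups audit log paths into batches. */
public class AuditLogBatcher {

  /**
   * Splits the given paths into consecutive batches holding at most
   * maxFilesPerBatch entries each. The last batch may be smaller.
   */
  public static List<List<String>> partition(List<String> paths, int maxFilesPerBatch) {
    if (maxFilesPerBatch <= 0) {
      throw new IllegalArgumentException("maxFilesPerBatch must be positive");
    }
    List<List<String>> batches = new ArrayList<>();
    for (int start = 0; start < paths.size(); start += maxFilesPerBatch) {
      int end = Math.min(start + maxFilesPerBatch, paths.size());
      // Copy the sublist so each batch is independent of the input list.
      batches.add(new ArrayList<>(paths.subList(start, end)));
    }
    return batches;
  }

  public static void main(String[] args) {
    // Made-up paths, only to exercise the batching logic.
    List<String> paths = new ArrayList<>();
    for (int i = 0; i < 1000; i++) {
      paths.add("s3a://example-bucket/logs/audit-" + i + ".log");
    }
    List<List<String>> batches = partition(paths, 300);
    System.out.println(batches.size() + " batches; last batch has "
        + batches.get(batches.size() - 1).size() + " files");
  }
}
```

With 1000 files and a threshold of 300, this gives four batches (300, 300, 300 and 100 files). Each batch could then be processed by its own task, but that wiring would still need to be checked against the existing code.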


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

