mehakmeet commented on PR #6000: URL: https://github.com/apache/hadoop/pull/6000#issuecomment-1733516788
> Have you tested this on actual files? And specifically so many files... total size in GB's kind of scale testing?

I have tested this, but not at scale. Will do that. Example test:

```
❯ bin/hadoop org.apache.hadoop.fs.s3a.audit.AuditTool s3a://mehakmeet-singh-data/logdir2/ s3a://mehakmeet-singh-data/logsdir/
16:48:41,319 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16:49:58,339 INFO mapreduce.S3AAuditLogMergerAndParser: Successfully generated avro data
16:49:58,839 INFO mapreduce.S3AAuditLogMergerAndParser: Successfully parsed :7547 audit logs and 6718 referrer headers logs in the logs
16:49:58,854 INFO impl.MetricsSystemImpl: Stopping s3a-file-system metrics system...
16:49:58,854 INFO impl.MetricsSystemImpl: s3a-file-system metrics system stopped.
16:49:58,854 INFO impl.MetricsSystemImpl: s3a-file-system metrics system shutdown complete.
```

Since we're reading each file serially, line by line, I would expect this to be a lot slower at that scale. Optimisation can be a follow-up patch.

> If there are so many files, for example 1000, does it launch multiple mappers to process x files by each mapper based on the splits?

Not currently. Is that something we would have to write the logic for ourselves? I'll have to check the code for it. For the number of mappers specifically, maybe we could set a threshold on the number of files and then paginate them based on that.
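A minimal sketch of the threshold-and-paginate idea mentioned above, assuming the audit log files are already listed as a flat collection of paths. The class `AuditLogBatcher`, the method `paginate`, the bucket name, and the batch size of 300 are all hypothetical and not part of this PR; this is plain Java with no Hadoop APIs, and only shows how a file list could be cut into per-mapper batches.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not in the PR): groups files into fixed-size batches. */
public class AuditLogBatcher {

  /**
   * Splits the given files into consecutive batches, each holding at most
   * maxFilesPerBatch entries. Each batch could then go to one mapper.
   */
  public static <T> List<List<T>> paginate(List<T> files, int maxFilesPerBatch) {
    if (maxFilesPerBatch <= 0) {
      throw new IllegalArgumentException("maxFilesPerBatch must be positive");
    }
    List<List<T>> batches = new ArrayList<>();
    for (int start = 0; start < files.size(); start += maxFilesPerBatch) {
      int end = Math.min(start + maxFilesPerBatch, files.size());
      // Copy the sublist so each batch is independent of the source list.
      batches.add(new ArrayList<>(files.subList(start, end)));
    }
    return batches;
  }

  public static void main(String[] args) {
    // Placeholder paths standing in for a listing of 1000 audit log files.
    List<String> files = new ArrayList<>();
    for (int i = 0; i < 1000; i++) {
      files.add("s3a://example-bucket/logs/file-" + i);
    }
    List<List<String>> batches = paginate(files, 300);
    System.out.println(batches.size() + " batches; last has "
        + batches.get(batches.size() - 1).size() + " files");
  }
}
```

Batching by file count is the simplest option; batching by cumulative file size would balance the mappers better when log files vary a lot in size.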
