[
https://issues.apache.org/jira/browse/HADOOP-17943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Steve Loughran reassigned HADOOP-17943:
---------------------------------------
Assignee: Mehakmeet Singh
> Add s3a tool to convert S3 server logs to avro/csv files
> --------------------------------------------------------
>
> Key: HADOOP-17943
> URL: https://issues.apache.org/jira/browse/HADOOP-17943
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/s3
> Affects Versions: 3.3.2
> Reporter: Steve Loughran
> Assignee: Mehakmeet Singh
> Priority: Major
>
> Add s3a tool to convert S3 server logs to avro/csv files
> With S3A Auditing, we have code in hadoop-aws to parse s3 log entries,
> including splitting up the referrer into its fields.
> But we don't have an easy way of using it. I've done some early work in Spark
> ([https://github.com/hortonworks-spark/cloud-integration/blob/master/spark-cloud-integration/src/main/scala/com/cloudera/spark/cloud/s3/S3LogRecordParser.scala]),
> but as well as that code not working, it doesn't do the audit splitting.
> And given that the S3 audit logs can be small on a lightly loaded store,
> a Spark job is not always justified.
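> The audit splitting step amounts to parsing the referrer header's query
> string back into its context attributes. A minimal sketch of that idea
> (class name and sample keys are hypothetical, not the hadoop-aws code):
> {code:java}
> import java.util.LinkedHashMap;
> import java.util.Map;
>
> /** Sketch: split an audit referrer's query string into key=value attributes. */
> public class ReferrerSplitter {
>   public static Map<String, String> split(String referrer) {
>     Map<String, String> attrs = new LinkedHashMap<>();
>     int q = referrer.indexOf('?');
>     if (q < 0) {
>       return attrs; // no query string, e.g. a plain "-" field
>     }
>     for (String pair : referrer.substring(q + 1).split("&")) {
>       int eq = pair.indexOf('=');
>       if (eq > 0) {
>         attrs.put(pair.substring(0, eq), pair.substring(eq + 1));
>       }
>     }
>     return attrs;
>   }
> }
> {code}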
> Proposed: we add
> # utility parser class to take a row and split it into a record
> # which can be saved to avro through a schema we define
> # or exported to CSV with/without headers. (with: easy to understand,
> without: can cat files)
> # add a mapper so this can be used in MR jobs (could even make it a
> committer test ...)
> # and a "hadoop s3guard/hadoop s3" entry point so you can do it on the cli
> {code:java}
> hadoop s3 parselogs -format avro -out s3a://dest/path -recursive
> s3a://stevel-london/logs/bucket1/*
> {code}
> This would take all files under the path, load and parse them, and emit the
> output.
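> The row-splitting core is a tokenizer that treats "quoted" and [bracketed]
> spans in an S3 server log line as single fields. A rough sketch of the idea,
> assuming nothing about the real hadoop-aws parser (class name hypothetical):
> {code:java}
> import java.util.ArrayList;
> import java.util.List;
>
> /** Sketch of a tokenizer for S3 server access log lines. */
> public class S3LogTokenizer {
>   public static List<String> tokenize(String line) {
>     List<String> fields = new ArrayList<>();
>     int i = 0;
>     final int n = line.length();
>     while (i < n) {
>       char c = line.charAt(i);
>       if (c == ' ') {
>         i++; // skip field separators
>       } else if (c == '"' || c == '[') {
>         // "quoted" and [bracketed] spans are a single field
>         char close = (c == '"') ? '"' : ']';
>         int start = ++i;
>         while (i < n && line.charAt(i) != close) {
>           i++;
>         }
>         fields.add(line.substring(start, i));
>         i++; // step over the closing delimiter
>       } else {
>         int start = i;
>         while (i < n && line.charAt(i) != ' ') {
>           i++;
>         }
>         fields.add(line.substring(start, i));
>       }
>     }
>     return fields;
>   }
> }
> {code}
> Each field list would then be mapped onto the named record for the Avro/CSV
> output.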
> Design issues:
> * would you combine all files, or emit a new .avro or .csv file for each one?
> * what's a good avro schema to cope with new context attributes
> * CSV nuances: tabs vs spaces, use opencsv or implement the (escaping?)
> writer ourselves.
> me: TSV and do a minimal escaping and quoting emitter. Can use opencsv in
> the test suite.
> * would you want an initial filter during processing, especially for error
> status codes?
> me: no, though I could see the benefit for 503s. Best to let you load it
> into a notebook or spreadsheet and go from there.
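> The TSV-with-minimal-escaping option above could be as small as escaping the
> separator and line-break characters. A sketch under that assumption
> (hypothetical class, backslash escaping chosen arbitrarily):
> {code:java}
> import java.util.List;
>
> /** Sketch: TSV emitter with minimal escaping. */
> public class TsvEmitter {
>   /** Escape backslash, tab, CR and LF so a record stays on one line. */
>   public static String escape(String field) {
>     StringBuilder sb = new StringBuilder(field.length());
>     for (int i = 0; i < field.length(); i++) {
>       char c = field.charAt(i);
>       switch (c) {
>         case '\\': sb.append("\\\\"); break;
>         case '\t': sb.append("\\t"); break;
>         case '\n': sb.append("\\n"); break;
>         case '\r': sb.append("\\r"); break;
>         default: sb.append(c);
>       }
>     }
>     return sb.toString();
>   }
>
>   /** Join escaped fields with tabs into one output row. */
>   public static String toRow(List<String> fields) {
>     StringBuilder sb = new StringBuilder();
>     for (int i = 0; i < fields.size(); i++) {
>       if (i > 0) {
>         sb.append('\t');
>       }
>       sb.append(escape(fields.get(i)));
>     }
>     return sb.toString();
>   }
> }
> {code}
> opencsv could then verify round-trips in the test suite, as suggested above.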
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]