Steve Loughran created HADOOP-17943:
---------------------------------------
Summary: Add s3a tool to convert S3 server logs to avro/csv files
Key: HADOOP-17943
URL: https://issues.apache.org/jira/browse/HADOOP-17943
Project: Hadoop Common
Issue Type: Sub-task
Components: fs/s3
Affects Versions: 3.3.2
Reporter: Steve Loughran
Add s3a tool to convert S3 server logs to avro/csv files
With S3A Auditing, we have code in hadoop-aws to parse s3 log entries,
including splitting up the referrer into its fields.
But we don't have an easy way of using it. I've done some early work in Spark
([https://github.com/hortonworks-spark/cloud-integration/blob/master/spark-cloud-integration/src/main/scala/com/cloudera/spark/cloud/s3/S3LogRecordParser.scala]),
but that code doesn't work, and it doesn't do the audit splitting.
And, given that the S3 audit logs can be small on a lightly loaded store, a
Spark job is not always justified.
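To make "audit splitting" concrete: it amounts to breaking the query string of the referrer URL into key/value pairs. The following is only a minimal sketch of that idea, not the hadoop-aws code; the class name {{ReferrerSplitter}} and its method are hypothetical, and it assumes Java 10+ for the {{URLDecoder.decode(String, Charset)}} overload.

```java
import java.net.URI;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: not the parser shipped in hadoop-aws. */
public class ReferrerSplitter {

  /**
   * Split the query string of a referrer URL into decoded key/value pairs,
   * preserving the order in which the parameters appear.
   */
  public static Map<String, String> splitReferrer(String referrer) {
    Map<String, String> fields = new LinkedHashMap<>();
    String query = URI.create(referrer).getRawQuery();
    if (query == null || query.isEmpty()) {
      return fields;
    }
    for (String pair : query.split("&")) {
      int eq = pair.indexOf('=');
      // A parameter without '=' is kept with an empty value.
      String key = eq < 0 ? pair : pair.substring(0, eq);
      String value = eq < 0 ? "" : pair.substring(eq + 1);
      fields.put(URLDecoder.decode(key, StandardCharsets.UTF_8),
          URLDecoder.decode(value, StandardCharsets.UTF_8));
    }
    return fields;
  }
}
```

Each resulting entry could become one column (or one map entry) in the output record. A malformed referrer makes {{URI.create}} throw {{IllegalArgumentException}}, so a real tool would need to decide whether to skip or preserve such rows.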
Proposed

We add
# a utility parser class to take a row and split it into a record
# which can be saved to Avro through a schema we define
# or exported to CSV with/without headers (with: easy to understand; without: can cat files)
# a mapper so this can be used in MR jobs (could even make it a committer test ..)
# and a "hadoop s3guard/hadoop s3" entry point so you can do it on the CLI
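As a rough illustration of the first step, here is a sketch of a row splitter. It assumes the usual S3 server access log layout: space-separated fields, with the timestamp wrapped in square brackets and some fields wrapped in double quotes. The class name {{LogRowSplitter}} is hypothetical, and the sketch does not handle quotes embedded inside a quoted field.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a row splitter; not the hadoop-aws parser. */
public class LogRowSplitter {

  /**
   * Split one log row into fields. Fields are separated by spaces;
   * a [bracketed] or "quoted" group is kept as a single field and
   * its delimiters are dropped.
   */
  public static List<String> split(String row) {
    List<String> fields = new ArrayList<>();
    int i = 0;
    int n = row.length();
    while (i < n) {
      char c = row.charAt(i);
      if (c == ' ') {
        i++;
        continue;
      }
      // Pick the character which terminates this field.
      char close = c == '[' ? ']' : (c == '"' ? '"' : ' ');
      int start = close == ' ' ? i : i + 1;
      int end = row.indexOf(close, start);
      if (end < 0) {
        // Unterminated group or last field: take the rest of the row.
        end = n;
      }
      fields.add(row.substring(start, end));
      i = end + 1;
    }
    return fields;
  }
}
```

The resulting list of strings could then be mapped onto named fields of an Avro record or written out as a CSV/TSV row.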
{code:java}
hadoop s3 parselogs -format avro -out s3a://dest/path -recursive s3a://stevel-london/logs/bucket1/*
{code}
This would take all files under the path, load and parse them, and emit the
output.
Design issues
* Would you combine all files, or emit a new .avro or .csv file for each one?
* What's a good Avro schema to cope with new context attributes?
* CSV nuances: tabs vs spaces; use opencsv or implement the (escaping?) writer ourselves.
me: TSV and do a minimal escaping and quoting emitter. Can use opencsv in the test suite.
* Would you want an initial filter during processing, especially for exit codes?
me: no, though I could see the benefit for 503s. Best to let you load it into a notebook or spreadsheet and go from there.
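One possible shape for the "minimal escaping and quoting" TSV emitter is sketched below. The rule it applies is an assumption, not something settled in this issue: a field is wrapped in double quotes only if it contains a tab, a double quote, or a line break, and embedded double quotes are doubled (RFC 4180 style with a tab separator). The class name {{TsvEmitter}} is hypothetical.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical sketch of a minimal TSV emitter. */
public class TsvEmitter {

  /**
   * Quote a field only when it needs it: fields containing a tab,
   * a double quote, or a line break are wrapped in double quotes,
   * with embedded double quotes doubled.
   */
  public static String quote(String field) {
    boolean needsQuoting = field.indexOf('\t') >= 0
        || field.indexOf('"') >= 0
        || field.indexOf('\n') >= 0
        || field.indexOf('\r') >= 0;
    if (!needsQuoting) {
      return field;
    }
    return "\"" + field.replace("\"", "\"\"") + "\"";
  }

  /** Join the quoted fields with tabs; no trailing newline is added. */
  public static String toRow(List<String> fields) {
    return fields.stream()
        .map(TsvEmitter::quote)
        .collect(Collectors.joining("\t"));
  }
}
```

A header line, when wanted, is just {{toRow}} applied to the list of column names. Output in this form should be readable by a standard CSV parser configured with a tab separator, which is what would let a test suite round-trip it.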