Auditing and accounting with Hadoop

Brian Bockelman Wed, 07 Jan 2009 07:04:29 -0800

Hey,

One of our charges is to do auditing and accounting with our filesystems (we use the simplifying assumption that the users are non-malicious).

Auditing can be done by going through the namenode logs and utilizing the UGI information to track opens/reads/writes back to the users. Accounting can be done by adding up the byte counts from the datanode traces (or via the lovely metrics interfaces). However, joining them together appears to be impossible! The namenode audits record originating IP and UGI; the datanode audits contain the originating IP and DFSClient ID. With 8 clients (and possibly 8 users) opening multiple files all from the same IP, it becomes a mess to untangle.

For example, in other filesystems, we've been able to construct a database with one row representing a file access from open to close. We record the username, amount of time the file was open, number of bytes read, the remote IP, and the server which served the file (the previous filesystem saved an entire file on a server, not blocks). Already, that model quickly becomes problematic, as several servers take part in serving the file to the client. The depressing, horrible file access pattern (Worse than random! To read a 1MB record entirely with a read-buffer size of 10MB, you can possibly read up to 2GB) of some jobs means that recording each read is not practical.
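As a minimal sketch of the one-row-per-access model described above, assuming SQLite as the store: the table name, column names, and helper functions are hypothetical, and the `filename` column is an assumption added so a row can be tied to a file.

```python
import sqlite3

# Hypothetical schema: one row per file access, from open to close.
# Names are illustrative only; they do not come from any existing tool.
SCHEMA = """
CREATE TABLE IF NOT EXISTS file_access (
    id           INTEGER PRIMARY KEY,
    username     TEXT    NOT NULL,
    filename     TEXT    NOT NULL,  -- assumption: not listed in the text
    remote_ip    TEXT    NOT NULL,
    server       TEXT    NOT NULL,  -- the single server that served the file
    open_seconds REAL    NOT NULL,  -- how long the file was open
    bytes_read   INTEGER NOT NULL
)
"""

def record_access(conn, username, filename, remote_ip, server,
                  open_seconds, bytes_read):
    """Insert one open-to-close access row."""
    conn.execute(
        "INSERT INTO file_access "
        "(username, filename, remote_ip, server, open_seconds, bytes_read) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (username, filename, remote_ip, server, open_seconds, bytes_read),
    )

def bytes_per_user(conn):
    """Accounting query: total bytes read, grouped by user."""
    rows = conn.execute(
        "SELECT username, SUM(bytes_read) FROM file_access GROUP BY username"
    )
    return dict(rows.fetchall())
```

The single `server` column is exactly where this model stops fitting a block-based filesystem, since several datanodes may serve one file.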

I'd like to record audit records and transfer accounting (at some level) into the DB. Does anyone have any experience in doing this? It seems that, if I can add the DFSClient ID into the namenode logs, I can record:

1) Each open (but miss the corresponding close) of a file at the namenode, along with the UGI, timestamp, and IP.
2) Each read/write on a datanode, recording the datanode, remote IP, DFSClient, and bytes written/read (but I miss the overall transaction time! Possibly could be logged). Don't record the block ID, as I can't map block ID -> file name in a cheap/easy manner (I'd have to either do this synchronously, causing a massive performance hit -- or do this asynchronously, and trip up over any files which were deleted after they were read).

This would allow me to see who is accessing what files, and how much that client is reading - but not necessarily which files they read from, if the same client ID is used for multiple files. This also will allow me to trace reads back to specific users (so I can tell who has the worst access patterns and beat them).
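The join described above could be sketched roughly as follows. This is only an illustration: the record types and field names are hypothetical and do not reflect the actual namenode or datanode log formats, and it assumes the DFSClient ID has already been added to the namenode records and that each client ID belongs to a single user.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class OpenRecord:
    """Hypothetical namenode audit entry, with the DFSClient ID added."""
    timestamp: float
    ugi: str
    ip: str
    client_id: str
    path: str

@dataclass(frozen=True)
class TransferRecord:
    """Hypothetical datanode trace entry for one read or write."""
    datanode: str
    ip: str
    client_id: str
    nbytes: int

def join_records(opens, transfers):
    """Attribute datanode byte counts to users via (IP, client ID)."""
    users = {}                # (ip, client_id) -> UGI
    files = defaultdict(set)  # (ip, client_id) -> paths opened
    for o in opens:
        key = (o.ip, o.client_id)
        users[key] = o.ugi
        files[key].add(o.path)

    bytes_by_user = defaultdict(int)
    bytes_by_file = defaultdict(int)
    unattributed = 0
    for t in transfers:
        key = (t.ip, t.client_id)
        if key not in users:
            # No matching open was seen for this client.
            unattributed += t.nbytes
            continue
        bytes_by_user[users[key]] += t.nbytes
        # Bytes map to a file only when the client opened exactly one.
        if len(files[key]) == 1:
            bytes_by_file[next(iter(files[key]))] += t.nbytes
    return dict(bytes_by_user), dict(bytes_by_file), unattributed
```

Per-user totals always come out, but per-file totals are only available for clients that opened a single file, which is the limitation noted above.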


So, my questions are:
a) Is anyone doing anything remotely similar which I can reuse?

b) Is there some hole in my logic which would render the approach useless?

c) Is my approach reasonable? I.e., should I really be looking at inserting hooks into the DFSClient, as that's the only thing which can tell me information like "when did the client close the file?"?


Advice is welcome.

Brian
