Re: Auditing and accounting with Hadoop

Doug Cutting Wed, 07 Jan 2009 10:16:27 -0800

The notion of a client/task ID, independent of IP or username, seems useful for log analysis. DFS's client ID is probably currently your best bet, but we might improve its implementation, and make the notion more generic.


It is currently implemented as:


    String taskId = conf.get("mapred.task.id");
    if (taskId != null) {
      this.clientName = "DFSClient_" + taskId;
    } else {
      this.clientName = "DFSClient_" + r.nextInt();
    }

This hardwires a mapred dependency, which is fragile, and it's fairly useless outside of mapreduce, degenerating to a random number.

Rather, we should probably have a configuration property that's explicitly used to indicate the user-level task, that's different than the username, IP, etc. For MapReduce jobs this could default to the job's ID, but applications might override it. So perhaps we could add static methods FileSystem.{get,set}TaskId(Configuration), then change the logging code to use this?
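
A minimal sketch of what such accessors might look like. Everything here is an assumption for illustration: the property name "fs.task.id" is hypothetical, java.util.Properties stands in for Hadoop's Configuration so the snippet runs on its own, and the methods live in a stand-alone class rather than in FileSystem.

```java
import java.util.Properties;
import java.util.UUID;

// Sketch only: Properties stands in for org.apache.hadoop.conf.Configuration.
final class TaskIds {
  // Hypothetical property name; not an existing Hadoop configuration key.
  static final String TASK_ID_KEY = "fs.task.id";

  private TaskIds() {}

  // Called by the application, or by the MapReduce framework with the job's ID.
  static void setTaskId(Properties conf, String taskId) {
    conf.setProperty(TASK_ID_KEY, taskId);
  }

  // Returns the explicit task ID if one was set. Otherwise generates a
  // random one and stores it, so later calls on the same conf agree.
  static String getTaskId(Properties conf) {
    String id = conf.getProperty(TASK_ID_KEY);
    if (id == null) {
      id = "client_" + UUID.randomUUID();
      conf.setProperty(TASK_ID_KEY, id);
    }
    return id;
  }
}
```

With something like this, the DFS client name could be derived from getTaskId(conf) instead of reading "mapred.task.id" directly, and the MapReduce framework would be one caller of setTaskId among others.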


What do others think?

Doug

Brian Bockelman wrote:

Hey,
One of our charges is to do auditing and accounting with our file systems (we use the simplifying assumption that the users are non-malicious).
Auditing can be done by going through the namenode logs and utilizing the UGI information to track opens/reads/writes back to the users. Accounting can be done by adding up the byte counts from the datanode traces (or via the lovely metrics interfaces). However, joining them together appears to be impossible! The namenode audits record originating IP and UGI; the datanode audits contain the originating IP and DFSClient ID. With 8 clients (and possibly 8 users) opening multiple files all from the same IP, it becomes a mess to untangle.
For example, in other filesystems, we've been able to construct a database with one row representing a file access from open to close. We record the username, amount of time the file was open, number of bytes read, the remote IP, and the server which served the file (the previous filesystem saved an entire file on a server, not blocks). Already, that model quickly becomes problematic, as several servers take part in serving the file to the client. The depressing, horrible file access pattern of some jobs (Worse than random! To read a 1MB record entirely with a read-buffer size of 10MB, you can possibly read up to 2GB) means that recording each read is not practical.
I'd like to record audit records and transfer accounting (at some level) into the DB. Does anyone have any experience in doing this? It seems that, if I can add the DFSClient ID into the namenode logs, I can record:
1) Each open (but miss the corresponding close) of a file at the namenode, along with the UGI, timestamp, IP
2) Each read/write on a datanode records the datanode, remote IP, DFSClient, bytes written/read (but I miss the overall transaction time! Possibly could be logged). Don't record the block ID, as I can't map block ID -> file name in a cheap/easy manner (I'd have to either do this synchronously, causing a massive performance hit -- or do this asynchronously, and trip up over any files which were deleted after they were read).
This would allow me to see who is accessing what files, and how much that client is reading - but not necessarily which files they read from, if the same client ID is used for multiple files. This also will allow me to trace reads back to specific users (so I can tell who has the worst access patterns and beat them).
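
The join described above could be sketched roughly as follows, assuming the DFSClient ID has been added to the namenode audit records. The Open and Transfer classes and their fields are hypothetical parsed forms of the two logs, not actual Hadoop log formats.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class AuditJoin {
  // Hypothetical parsed namenode audit record (with the client ID added).
  static final class Open {
    final String ugi, ip, clientId, path;
    Open(String ugi, String ip, String clientId, String path) {
      this.ugi = ugi; this.ip = ip; this.clientId = clientId; this.path = path;
    }
  }

  // Hypothetical parsed datanode transfer record.
  static final class Transfer {
    final String datanode, ip, clientId;
    final long bytes;
    Transfer(String datanode, String ip, String clientId, long bytes) {
      this.datanode = datanode; this.ip = ip; this.clientId = clientId; this.bytes = bytes;
    }
  }

  private AuditJoin() {}

  // Sums datanode bytes per user by joining on (IP, client ID).
  // Bytes are attributed to a user, not to a file: one client ID may
  // cover several open files, so per-file totals are not recoverable.
  static Map<String, Long> bytesPerUser(List<Open> opens, List<Transfer> transfers) {
    Map<String, String> userByClient = new LinkedHashMap<>();
    for (Open o : opens) {
      userByClient.put(o.ip + "|" + o.clientId, o.ugi);
    }
    Map<String, Long> totals = new LinkedHashMap<>();
    for (Transfer t : transfers) {
      String user = userByClient.get(t.ip + "|" + t.clientId);
      if (user == null) {
        user = "(unknown)"; // transfer with no matching open in the namenode log
      }
      totals.merge(user, t.bytes, Long::sum);
    }
    return totals;
  }
}
```

Two users on the same IP stay separate here only because their client IDs differ, which is the piece the namenode log is currently missing.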
So, my questions are:
a) Is anyone doing anything remotely similar which I can reuse?
b) Is there some hole in my logic which would render the approach useless?
c) Is my approach reasonable? I.e., should I really be looking at inserting hooks into the DFSClient, as that's the only thing which can tell me information like "when did the client close the file?"?
Advice is welcome.

Brian
