[ 
https://issues.apache.org/jira/browse/HDFS-6634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14080112#comment-14080112
 ] 

Colin Patrick McCabe commented on HDFS-6634:
--------------------------------------------

James, Andrew, and I talked a bit about this and gathered some feedback from 
other projects that might want to use this functionality.

There are a lot of use-cases for this functionality.  One example is Lucene or 
other indexing and search systems.  They'd like to keep track of what new files 
are added to HDFS so they can index them faster.  Another is query systems like 
Impala or Hive that might also want to build separate indices or sync up 
metadata.

The way these systems work today is that they periodically do a full scan of 
all files in HDFS.  This is obviously not efficient, but it's the only thing 
they can rely on.  It is true that you could build a system to do this via 
audit logs, a queueing system, and some kind of client library.  But that 
system would be complex and hard to maintain, and pull in unnecessary 
dependencies.  These higher-level projects can't rely on it being 
present.  And as you yourself noted, there is no effective way to enforce 
security in such a system, since it's outside of HDFS.  File systems like ext4 
have inotify built-in.  They don't require additional daemons or software to 
use it.  HDFS ought to have at least that level of functionality.
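
To make the cost difference concrete, here is a rough sketch of the two 
approaches.  It is illustrative only: the function names, the paths, and the 
(kind, path) event tuples are made up for this example and are not an HDFS API 
or the format proposed in the design doc.

```python
def new_files_by_full_scan(previous_listing, current_listing):
    """Periodic full scan: diff two complete listings.

    The work grows with the total number of files in the namespace,
    even if only one file was added since the last scan.
    """
    return sorted(set(current_listing) - set(previous_listing))


def new_files_from_events(events):
    """Event stream: look only at what changed since the last poll.

    The work grows with the number of events, not the namespace size.
    The (kind, path) tuple shape is hypothetical.
    """
    return sorted(path for kind, path in events if kind == "create")


# Hypothetical data for illustration.
before = ["/logs/a", "/logs/b"]
after = ["/logs/a", "/logs/b", "/logs/c"]
events = [("create", "/logs/c"), ("delete", "/tmp/x")]

print(new_files_by_full_scan(before, after))
print(new_files_from_events(events))
```

Both calls find the same new file, but the first one has to walk every path 
in both listings to do it.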

There are a lot of downsides to audit logs.  Some of them have been brought up 
in this very thread.  They are text, so they're slow to generate and slow to 
parse.  You can't add optional fields while maintaining backwards 
compatibility, like you can with protobuf.  Since everyone is writing their own 
parsers, it's very likely that they'll get some little detail wrong or not 
handle some corner case like an equals sign or a space in a filename.  If you 
have to pick the audit log lines out of the main log, that's yet another major 
inefficiency.
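
As a small illustration of the corner-case problem, here is the kind of parser 
someone might write on a first attempt.  The line layout below is a simplified, 
made-up key=value format, not the exact audit log format, and naive_parse is a 
hypothetical helper.

```python
def naive_parse(line):
    """Split on whitespace, then on '='.  A plausible first attempt."""
    fields = {}
    for token in line.split():
        parts = token.split("=")
        # Tokens that do not look like exactly one key=value pair are dropped.
        if len(parts) == 2:
            fields[parts[0]] = parts[1]
    return fields


# Hypothetical lines for illustration.
simple = "cmd=rename src=/data/a.txt dst=/data/b.txt"
tricky = "cmd=rename src=/data/my file.txt dst=/data/k=v.txt"

print(naive_parse(simple))
print(naive_parse(tricky))
```

The first line parses as expected.  On the second, the space truncates src to 
"/data/my" and the equals sign in the filename makes dst disappear entirely, 
with no error raised.  A structured format with typed fields avoids this class 
of bug.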

I'd also like to add that I changed the audit log before and nobody got upset.  
There's no "public, stable" annotation on the audit log format and we often 
discover little things that are missing or incorrect.  A lot of the fields 
don't even make sense for many operations.  For example, many operations don't 
have both a src and a dst, but the text format requires that they both be 
present.  So it's easy to get people changing their minds about what goes 
where.  Talking about "killing people in their sleep" over changes in the audit 
log is both unprofessional and disingenuous.

> inotify in HDFS
> ---------------
>
>                 Key: HDFS-6634
>                 URL: https://issues.apache.org/jira/browse/HDFS-6634
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: hdfs-client, namenode, qjm
>            Reporter: James Thomas
>            Assignee: James Thomas
>         Attachments: inotify-intro.2.pdf, inotify-intro.pdf
>
>
> Design a mechanism for applications like search engines to access the HDFS 
> edit stream.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
