[
https://issues.apache.org/jira/browse/HDFS-6634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14080112#comment-14080112
]
Colin Patrick McCabe commented on HDFS-6634:
--------------------------------------------
James, Andrew, and I talked a bit about this and gathered some feedback from
other projects that might want to use this functionality.
There are a lot of use-cases for this functionality. One example is Lucene or
other indexing and search systems. They'd like to keep track of what new files
are added to HDFS so they can index them faster. Another is query systems like
Impala or Hive that might also want to build separate indices or sync up
metadata.
The way these systems work today is that they periodically do a full scan of
all files in HDFS. This is obviously not efficient, but it's the only thing
they can rely on. It is true that you could build a system to do this via
audit logs, a queueing system, and some kind of client library. But that
system would be complex and hard to maintain, and pull in unnecessary
dependencies. These higher-level projects can't rely on it being
present. And as you yourself noted, there is no effective way to enforce
security in such a system, since it's outside of HDFS. File systems like ext4
have inotify built-in. They don't require additional daemons or software to
use it. HDFS ought to have at least that level of functionality.
There are a lot of downsides to audit logs. Some of them have been brought up
in this very thread. They are text, so they're slow to generate and slow to
parse. You can't add optional fields while maintaining backwards
compatibility, like you can with protobuf. Since everyone is writing their own
parsers, it's very likely that they'll get some little detail wrong or not
handle some corner case like an equals sign or a space in a filename. If you
have to pick the audit log lines out of the main log, that's yet another major
inefficiency.
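
To make the corner-case problem concrete, here is a minimal sketch of the kind
of ad hoc parser people tend to write. The log lines are a hypothetical,
simplified key=value format, not the exact HDFS audit log layout, and
naive_parse is an illustrative name, not anything that exists in Hadoop.

```python
def naive_parse(line):
    """Split on whitespace, then on '=': the shortcut many ad hoc parsers take.

    Assumes a hypothetical, simplified "key=value key=value" line; this is
    not the real HDFS audit log format.
    """
    fields = {}
    for token in line.split():
        parts = token.split("=")
        if len(parts) == 2:  # anything that isn't exactly key=value is dropped
            fields[parts[0]] = parts[1]
    return fields


# An ordinary filename parses as expected.
simple = "allowed=true cmd=create src=/data/report.txt dst=null"

# A filename containing both '=' and a space ("/data/a=b final.txt").
tricky = "allowed=true cmd=create src=/data/a=b final.txt dst=null"

print(naive_parse(simple)["src"])
print(naive_parse(tricky).get("src"))
```

In this sketch the first line yields the right src, while the second silently
loses the src field entirely, with no error raised. A structured format such
as protobuf avoids this class of problem because field boundaries are not
inferred from the characters in the value.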
I'd also like to add that I changed the audit log before and nobody got upset.
There's no "public, stable" annotation on the audit log format and we often
discover little things that are missing or incorrect. A lot of the fields
don't even make sense for many operations. For example, many operations don't
have both a src and a dst, but the text format requires that they both be
present. So it's easy for people to change their minds about what goes
where. Talking about "killing people in their sleep" over changes in the audit
log is both unprofessional and disingenuous.
> inotify in HDFS
> ---------------
>
> Key: HDFS-6634
> URL: https://issues.apache.org/jira/browse/HDFS-6634
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: hdfs-client, namenode, qjm
> Reporter: James Thomas
> Assignee: James Thomas
> Attachments: inotify-intro.2.pdf, inotify-intro.pdf
>
>
> Design a mechanism for applications like search engines to access the HDFS
> edit stream.