Hadoop currently has no notion of user-private data. I believe that this capability is under development, but I don't know what the time-line for completion is (if there is one). You should not expect these capabilities to be fully usable when they are first released.
In spite of this gap, I think you could meet your requirements in a few different ways.

For instance, you could emulate user isolation by running multiple Hadoop clusters on the same machines under different uids. Virtualization would be another option, but I would guess that you would rather share disk space, so that a large cluster like your public data set can expand to nearly all of the available disk, and so that clusters creating large numbers of temporary files can use more disk space. These approaches only address user isolation, not the general problem of user permissions; in particular, they do not cover the case where a single program needs to read some public files and some private files. Running several clusters sounds much more difficult than it actually is: a single command launches a cluster, and all of the instances can share most of their configuration (see the configuration sketch below the quoted message).

Another approach would be to use document encryption and build a custom input format. This is very easy to do. You would leave public files in plain text and encrypt private files with customer-specific keys. That way, programs accessing private files could only read the files you intend them to. The standard encryption utilities available in Java are fast enough that you shouldn't see a major speed penalty; we use AES for many of our input files, and while I would prefer a plain-text format for speed, our systems run at a respectable pace. (A rough sketch of the decryption side is below.)

You are right, btw, that your problem sounds ideally suited for map-reduce. I would recommend that you batch your documents many to a single file; that will massively improve throughput since you avoid a seek per document. (A SequenceFile sketch is below.)

On 11/19/07 8:22 AM, "glashammer" <[EMAIL PROTECTED]> wrote:

> Hi,
> I am currently working on a system design and I am interested in hearing
> some ideas how hadoop/hbase can be used to solve a couple of tricky issues.
>
> Basically I have a data set consisting of roughly 1-5 million documents
> growing at a rate of about 100-500 thousand a year. Each document is not of
> significant size (say 100k). Documents describe dependencies that are
> analyzed, intermediate results are cached as new documents, and documents
> and intermediate results are indexed. Intermediate results are about a
> factor 3 times the number of documents. Seems like a perfect thing to use
> map reduce to handle analytics and indexing, both on an initial set, and
> when changes occur.
>
> My question regards being able to handle multiple sets of artifacts, and
> privacy.
>
> I would like to be able to handle multiple sets of private documents and
> private intermediate results. When analyzing and indexing these the
> documents and partial results must run on servers that are private to the
> client and naturally also be stored in a (to our client) private and secure
> fashion.
> A client would have a lot less private data than what is stored in our
> global and publicly available data set.
>
> Interested in hearing ideas regarding how this can be done.
>
> Regards
> - henrik
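Configuration sketch for multiple clusters: the main trick when clusters share machines is giving each one its own ports and directories so the daemons don't collide. Here is a minimal hadoop-site.xml for the private cluster; the host names, ports, and paths are made up for illustration, and the public cluster would use a parallel file with different values.

<!-- hadoop-site.xml for the private cluster; all values illustrative -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode-host:9100</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker-host:9101</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/data/private/dfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/data/private/dfs/data</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/data/private/tmp</value>
  </property>
</configuration>

Everything else (slave lists, heap sizes, and so on) can be shared between the clusters.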
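Decryption sketch for the custom input format idea: this is only the stream-level piece, assuming AES-128 in CBC mode with a 16-byte IV written at the front of each encrypted file. The KEY_PROPERTY name and the decodeKey helper are hypothetical; in practice you would fetch customer keys from a secure store rather than the job configuration, and your record reader would layer whatever record parsing you need on top of the returned stream.

import java.io.InputStream;
import javax.crypto.Cipher;
import javax.crypto.CipherInputStream;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Sketch: layer AES decryption over an HDFS file so that a custom
 * input format can hand plain text to the mapper.
 */
public class EncryptedFileOpener {

  // Hypothetical property used to pass per-customer key material.
  public static final String KEY_PROPERTY = "customer.aes.key";

  public static InputStream open(Configuration conf, Path file) throws Exception {
    FileSystem fs = file.getFileSystem(conf);
    InputStream raw = fs.open(file);

    // 16 bytes of key material = AES-128.
    byte[] keyBytes = decodeKey(conf.get(KEY_PROPERTY));
    SecretKeySpec key = new SecretKeySpec(keyBytes, "AES");

    // Assumes the encryptor wrote a 16-byte IV at the start of the file.
    byte[] iv = new byte[16];
    int read = 0;
    while (read < iv.length) {
      int n = raw.read(iv, read, iv.length - read);
      if (n < 0) throw new java.io.EOFException("missing IV in " + file);
      read += n;
    }

    Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
    cipher.init(Cipher.DECRYPT_MODE, key, new IvParameterSpec(iv));

    // Everything read from this stream is decrypted on the fly.
    return new CipherInputStream(raw, cipher);
  }

  // Hypothetical helper: decode a hex-encoded key string into bytes.
  private static byte[] decodeKey(String hex) {
    byte[] out = new byte[hex.length() / 2];
    for (int i = 0; i < out.length; i++) {
      out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
    }
    return out;
  }
}

With whole-file encryption like this, the input format should report the files as not splittable so each map task reads its file from the beginning.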
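Batching sketch: a SequenceFile with the document id as the key and the document body as the value is the usual way to pack many small documents into one file, so maps read long sequential runs instead of seeking once per document. The output path and document ids below are made up.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

/** Sketch: pack many small documents into a single SequenceFile. */
public class DocumentPacker {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path out = new Path("/data/public/docs-00001.seq");  // illustrative path
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, out, Text.class, Text.class);
    try {
      // In real use these would come from your document store.
      writer.append(new Text("doc-42"), new Text("...document body..."));
      writer.append(new Text("doc-43"), new Text("...document body..."));
    } finally {
      writer.close();
    }
  }
}

SequenceFiles also support record or block compression if you would rather trade a little CPU for space.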
