Hadoop currently has no notion of user-private data. I believe that this capability is under development, but I don't know what the time-line for completion is (if there is one). You should not expect these capabilities to be fully usable when they are first released.
In spite of this gap, I think you could meet your requirements in a few different ways.

For instance, you could emulate user isolation by running multiple Hadoop clusters on the same machines under different uids. Virtualization would be another option, but I would guess that you would rather share disk space, so that a large cluster like your public data set can expand to nearly all of the available disk, and so that clusters creating large numbers of temporary files can use more disk space. These approaches only address user isolation, not the general problem of user permissions; in particular, they do not cover the case where a single program needs to read some public files and some private files. Running several clusters sounds much more difficult than it actually is: a single command launches a cluster, and all of the instances can share most of their configuration (see the configuration sketch below the quoted message).

Another approach would be to use document encryption and build a custom input format. This is very easy to do. You would leave public files in plain text and encrypt private files with customer-specific keys. That way, programs accessing private files could only read the files you intend them to. The standard encryption utilities available in Java are fast enough that you shouldn't see a major speed penalty; we use AES for many of our input files, and while I would prefer a plain-text format for speed, our systems run at a respectable pace. (A rough sketch of the decryption side is below.)

You are right, btw, that your problem sounds ideally suited for map-reduce. I would recommend that you batch your documents many to a single file; that will massively improve throughput since you avoid a seek per document. (A SequenceFile sketch is below.)

On 11/19/07 8:22 AM, "glashammer" <[EMAIL PROTECTED]> wrote:

> Hi,
> I am currently working on a system design and I am interested in hearing
> some ideas how hadoop/hbase can be used to solve a couple of tricky issues.
>
> Basically I have a data set consisting of roughly 1-5 million documents
> growing at a rate of about 100-500 thousand a year. Each document is not of
> significant size (say 100k). Documents describe dependencies that are
> analyzed, intermediate results are cached as new documents, and documents
> and intermediate results are indexed. Intermediate results are about a
> factor 3 times the number of documents. Seems like a perfect thing to use
> map reduce to handle analytics and indexing, both on an initial set, and
> when changes occur.
>
> My question regards being able to handle multiple sets of artifacts, and
> privacy.
>
> I would like to be able to handle multiple sets of private documents and
> private intermediate results. When analyzing and indexing these the
> documents and partial results must run on servers that are private to the
> client and naturally also be stored in a (to our client) private and secure
> fashion.
> A client would have a lot less private data than what is stored in our
> global and publicly available data set.
>
> Interested in hearing ideas regarding how this can be done.
>
> Regards
> - henrik
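Configuration sketch for multiple clusters: the main trick when clusters share machines is giving each one its own ports and directories so the daemons don't collide. Here is a minimal hadoop-site.xml for the private cluster; the host names, ports, and paths are made up for illustration, and the public cluster would use a parallel file with different values.

<!-- hadoop-site.xml for the private cluster; all values illustrative -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode-host:9100</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker-host:9101</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/data/private/dfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/data/private/dfs/data</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/data/private/tmp</value>
  </property>
</configuration>

Everything else (slave lists, heap sizes, and so on) can be shared between the clusters.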
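Decryption sketch for the custom input format idea: this is only the stream-level piece, assuming AES-128 in CBC mode with a 16-byte IV written at the front of each encrypted file. The KEY_PROPERTY name and the decodeKey helper are hypothetical; in practice you would fetch customer keys from a secure store rather than the job configuration, and your record reader would layer whatever record parsing you need on top of the returned stream.

import java.io.InputStream;
import javax.crypto.Cipher;
import javax.crypto.CipherInputStream;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Sketch: layer AES decryption over an HDFS file so that a custom
 * input format can hand plain text to the mapper.
 */
public class EncryptedFileOpener {

  // Hypothetical property used to pass per-customer key material.
  public static final String KEY_PROPERTY = "customer.aes.key";

  public static InputStream open(Configuration conf, Path file) throws Exception {
    FileSystem fs = file.getFileSystem(conf);
    InputStream raw = fs.open(file);

    // 16 bytes of key material = AES-128.
    byte[] keyBytes = decodeKey(conf.get(KEY_PROPERTY));
    SecretKeySpec key = new SecretKeySpec(keyBytes, "AES");

    // Assumes the encryptor wrote a 16-byte IV at the start of the file.
    byte[] iv = new byte[16];
    int read = 0;
    while (read < iv.length) {
      int n = raw.read(iv, read, iv.length - read);
      if (n < 0) throw new java.io.EOFException("missing IV in " + file);
      read += n;
    }

    Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
    cipher.init(Cipher.DECRYPT_MODE, key, new IvParameterSpec(iv));

    // Everything read from this stream is decrypted on the fly.
    return new CipherInputStream(raw, cipher);
  }

  // Hypothetical helper: decode a hex-encoded key string into bytes.
  private static byte[] decodeKey(String hex) {
    byte[] out = new byte[hex.length() / 2];
    for (int i = 0; i < out.length; i++) {
      out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
    }
    return out;
  }
}

With whole-file encryption like this, the input format should report the files as not splittable so each map task reads its file from the beginning.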
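Batching sketch: a SequenceFile with the document id as the key and the document body as the value is the usual way to pack many small documents into one file, so maps read long sequential runs instead of seeking once per document. The output path and document ids below are made up.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

/** Sketch: pack many small documents into a single SequenceFile. */
public class DocumentPacker {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path out = new Path("/data/public/docs-00001.seq");  // illustrative path
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, out, Text.class, Text.class);
    try {
      // In real use these would come from your document store.
      writer.append(new Text("doc-42"), new Text("...document body..."));
      writer.append(new Text("doc-43"), new Text("...document body..."));
    } finally {
      writer.close();
    }
  }
}

SequenceFiles also support record or block compression if you would rather trade a little CPU for space.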
