Hi, I am currently working on a system design and I am interested in hearing some ideas on how Hadoop/HBase could be used to solve a couple of tricky issues.
Basically I have a data set of roughly 1-5 million documents, growing at a rate of about 100-500 thousand per year. Each document is not of significant size (say 100k). Documents describe dependencies that are analyzed; intermediate results are cached as new documents, and both the documents and the intermediate results are indexed. The intermediate results number about 3 times the documents. This seems like a perfect fit for MapReduce, both for analytics and indexing on an initial set, and for incremental updates when changes occur.

My question concerns handling multiple sets of artifacts, and privacy. I would like to be able to handle multiple sets of private documents and private intermediate results. When analyzing and indexing these, the documents and partial results must be processed on servers that are private to the client, and naturally also be stored in a fashion that is private and secure from the client's point of view. A client would have far less private data than what is stored in our global, publicly available data set.

Interested in hearing ideas on how this could be done.

Regards - henrik
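To make the intended data flow concrete, here is a minimal sketch of the analyze-then-index pipeline expressed as map/reduce steps. It is plain Python with no Hadoop dependencies, purely illustrative; the document shape (id mapped to a list of dependency names) and all function names are assumptions, not part of the actual design:

```python
from collections import defaultdict

# Toy documents: id -> list of dependency names (shape is an assumption).
documents = {
    "doc1": ["libA", "libB"],
    "doc2": ["libB", "libC"],
    "doc3": ["libA"],
}

def map_phase(doc_id, deps):
    # Emit (dependency, doc_id) pairs, analogous to a Hadoop Mapper.
    for dep in deps:
        yield dep, doc_id

def reduce_phase(dep, doc_ids):
    # Aggregate into an intermediate result: one reverse-index
    # entry per dependency, analogous to a Hadoop Reducer.
    return {"dependency": dep, "used_by": sorted(doc_ids)}

# Shuffle: group mapper output by key, as the framework would do.
grouped = defaultdict(list)
for doc_id, deps in documents.items():
    for key, value in map_phase(doc_id, deps):
        grouped[key].append(value)

# Reduce: produce the cached intermediate results, which in the design
# above would themselves be stored and indexed as new documents.
intermediate_results = [reduce_phase(k, v) for k, v in sorted(grouped.items())]

for r in intermediate_results:
    print(r["dependency"], r["used_by"])
```

In a real deployment each client's job would run the same logic, but against that client's private input set and writing to that client's private storage.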
