Hi,
I am currently working on a system design and would be interested in hearing
some ideas on how Hadoop/HBase could be used to solve a couple of tricky issues.

Basically I have a data set of roughly 1-5 million documents, growing at a
rate of about 100-500 thousand per year. Each document is fairly small (say
100 KB). Documents describe dependencies that are analyzed; intermediate
results are cached as new documents, and both the documents and the
intermediate results are indexed. The intermediate results number about
three times the documents. This seems like a perfect fit for MapReduce to
handle the analytics and indexing, both on an initial set and when changes
occur.
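To make the indexing step concrete, here is a rough sketch of the map/reduce
shape I have in mind, in plain Python rather than an actual Hadoop job (all
names and the toy documents are made up for illustration):

```python
# Rough sketch of the indexing step as map and reduce phases.
# Illustrative only -- not a real Hadoop MapReduce job.
from collections import defaultdict

def map_phase(doc_id, text):
    # Emit one (term, doc_id) pair per distinct term in the document.
    for term in set(text.lower().split()):
        yield term, doc_id

def reduce_phase(pairs):
    # Group postings by term to build an inverted index.
    index = defaultdict(list)
    for term, doc_id in pairs:
        index[term].append(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {"d1": "build depends on libfoo", "d2": "libfoo depends on libbar"}
pairs = [p for doc_id, text in docs.items() for p in map_phase(doc_id, text)]
index = reduce_phase(pairs)
# index["depends"] -> ["d1", "d2"]
```

The same shape would apply to the dependency analysis, with the reducer
producing the cached intermediate results instead of index postings.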

My question concerns handling multiple sets of artifacts, and privacy.

I would like to be able to handle multiple sets of private documents and
private intermediate results. When these are analyzed and indexed, the
documents and partial results must be processed on servers that are private
to the client, and naturally they must also be stored in a fashion that is
private and secure for that client.
A client would have far less private data than what is stored in our global,
publicly available data set.
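One isolation scheme I have been considering is prefixing every row key with
a client id, so that per-client scans and access control can be enforced at
the key level. The sketch below is illustrative only (the function names and
sample data are made up); a real deployment might instead use separate HBase
tables, namespaces, or even separate clusters per client:

```python
# Sketch of client isolation via row-key prefixing.
# Illustrative only; stands in for HBase row keys and prefix scans.

def row_key(client_id, doc_id):
    # Compose a row key that sorts all of one client's rows together.
    return f"{client_id}/{doc_id}"

def client_scan(store, client_id):
    # Return only rows belonging to one client, as a prefix scan would.
    prefix = client_id + "/"
    return {k: v for k, v in store.items() if k.startswith(prefix)}

store = {
    row_key("public", "doc1"): "shared data",
    row_key("acme", "doc1"): "private data",
}
# client_scan(store, "acme") -> {"acme/doc1": "private data"}
```

Whether key prefixes alone give strong enough isolation, or whether the
private sets really need physically separate storage and compute, is exactly
the kind of thing I would like input on.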

I would be interested in hearing ideas on how this could be done.

Regards
- henrik
