Hi all,
We are a small software development firm working on data backup
software. We have a backup product which copies data from client
machines to a data store. Currently we provide specialized hardware to
store the data (1-3 TB disks and servers). We now want to provide a
solution to a customer (a mining company) with the following requirements:
1] Huge storage capacity (starting at 100 TB, but it should be easy to grow)
2] Initially the facility is used purely for storage, but in future the
company plans to add data processing (some MapReduce jobs)
3] Most of the data is unstructured (mostly images, text files and videos)
4] Much of the data is a duplicate of some original, so we need deduplication
5] Data is mostly written (daily backups) and only occasionally read
(new data is written every day, reads happen roughly weekly)
6] Data is copied as files (every backup is ~100,000 files, each a few
MB, some only KB); see the packing sketch after this list
7] This is archival storage, so latency requirements are not strict
8] Some of the data has very high HA requirements and should be copied
to data centers outside the country on a regular basis (weekly; that
subset is small, a few TB)
9] We currently provide a form of HSM (Hierarchical Storage
Management); the company needs something similar in the new solution
10] A single namespace and versioning of files is another requirement
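
For requirement 6], the usual HDFS workaround for large numbers of
small files is to pack each backup into a container format such as a
SequenceFile (or a Hadoop Archive), so the NameNode tracks one large
file per backup instead of ~100,000 small ones. Here is a minimal
sketch assuming Hadoop's SequenceFile API; the class name and output
path are hypothetical:

import java.io.File;
import java.nio.file.Files;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Packs the local files named on the command line into one SequenceFile
// on HDFS, keyed by file name, so one backup = one HDFS file.
public class BackupPacker {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/backups/daily.seq"); // hypothetical path
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, out, Text.class, BytesWritable.class);
        try {
            for (String name : args) {
                byte[] data = Files.readAllBytes(new File(name).toPath());
                writer.append(new Text(name), new BytesWritable(data));
            }
        } finally {
            writer.close();
        }
    }
}

The trade-off is that individual files are then reached by key lookup
or a MapReduce scan rather than as separate HDFS paths, so a separate
index would be needed to keep the single-namespace view of requirement 10].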

As I understand it, HDFS doesn't directly suit such storage, due to the
following design considerations:
1] Large number of small files
2] Duplicate data (see the dedup sketch after this list)
3] Our write-often, read-rarely workload (HDFS is designed for
write-once, read-many)
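
On the duplicate-data point: HDFS has no built-in deduplication, but
since our backups are file-based, the backup layer could do
content-addressed dedup before anything is written. A minimal sketch of
the idea, assuming whole-file (not block-level) dedup and an in-memory
index; a real system would persist the index:

import java.io.FileInputStream;
import java.io.InputStream;
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;

// Files with identical bytes hash to the same digest, so only the first
// copy needs to be stored; later copies just record a pointer.
public class DedupIndex {
    private final Map<String, String> stored = new HashMap<String, String>();

    // Returns true if an identical file was already stored.
    public boolean isDuplicate(String file) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        InputStream in = new FileInputStream(file);
        try {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        } finally {
            in.close();
        }
        String digest = toHex(md.digest());
        if (stored.containsKey(digest)) return true;
        stored.put(digest, file);
        return false;
    }

    private static String toHex(byte[] d) {
        StringBuilder sb = new StringBuilder();
        for (byte b : d) sb.append(String.format("%02x", b));
        return sb.toString();
    }
}

With such an index, requirement 4] reduces to storing each unique file
once plus lightweight per-backup metadata that points at the stored copies.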

Here are my questions:
1] Does HDFS support our client's requirements, or can it at least be
configured to suit them?
2] If not out of the box, is there any customization of HDFS that would
serve the purpose?

Is there any other solution that would work?

All thoughts/suggestions are welcome

Regards,
Hemant.
