Hi all,

We are a small software development firm working on data backup software. Our backup product copies data from client machines to a data store. Currently we provide specialized hardware to store the data (1-3 TB disks and servers). We now want to offer a solution to a customer (a mining company) with the following requirements (rough sketches of how we imagine handling 4] and 6] follow the list):

1] Huge storage capacity: we start at around 100 TB, and it should be easy to grow.
2] Initially the facility is used only for storage, but in the future the company plans to add data-processing software (some MapReduce jobs).
3] Most of the data is unstructured: mostly images, text files, and videos.
4] Much of the data is a duplicate of some original, so we need deduplication.
5] Data is mostly written and only occasionally read: new backup data is added every day, while reads happen roughly weekly.
6] Data is copied as files: every backup is around 100,000 files, each a few MB, with some only a few KB.
7] This is a backup store, so latency requirements are not strict.
8] Some of the data has very high HA requirements and must be copied to data centers outside the country on a schedule (weekly; that subset is small, only a few TB).
9] We currently provide a form of HSM (Hierarchical Storage Management), and the company needs something similar in the new solution.
10] A single namespace and versioning of files are also required.
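To make requirement 4] concrete, here is a minimal sketch of the deduplication we have in mind: hash each file's content (SHA-256 here) and write a copy to the store only when the hash has not been seen before. The names (DedupIndex, shouldStore) are placeholders, not part of our product, and a real index would need to be persistent rather than an in-memory map:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.HashMap;
    import java.util.Map;

    public class DedupIndex {
        // content hash -> where the single stored copy lives (hypothetical store path)
        private final Map<String, String> stored = new HashMap<>();

        // Returns true if this content is new and must be written to the store;
        // false means an identical copy exists, so only a reference is needed.
        public boolean shouldStore(Path file, String storeLocation)
                throws IOException, NoSuchAlgorithmException {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            String key = toHex(sha.digest(Files.readAllBytes(file)));
            return stored.putIfAbsent(key, storeLocation) == null;
        }

        private static String toHex(byte[] bytes) {
            StringBuilder sb = new StringBuilder(bytes.length * 2);
            for (byte b : bytes) sb.append(String.format("%02x", b));
            return sb.toString();
        }
    }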
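For requirement 6] (and the small-files concern below), we have been experimenting with packing each backup run into a single Hadoop SequenceFile, keyed by file name, so HDFS sees one large file per run instead of ~100,000 small ones. A rough sketch, assuming Hadoop 2.x; SmallFilePacker and the target path are hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    import java.io.File;
    import java.io.IOException;
    import java.nio.file.Files;

    public class SmallFilePacker {
        // Pack every regular file under localDir into one SequenceFile on HDFS,
        // keyed by file name, so each backup run becomes a single large file.
        public static void pack(File localDir, String hdfsTarget) throws IOException {
            Configuration conf = new Configuration();
            File[] files = localDir.listFiles(File::isFile);
            if (files == null) throw new IOException("Cannot list " + localDir);
            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(new Path(hdfsTarget)),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(BytesWritable.class))) {
                for (File f : files) {
                    byte[] content = Files.readAllBytes(f.toPath());
                    writer.append(new Text(f.getName()), new BytesWritable(content));
                }
            }
        }
    }

A MapFile or a HAR archive might be alternatives if fast lookup of individual files by name matters, but we have not evaluated those yet.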
As I understand it, HDFS does not directly suit such storage because of the following design considerations:

1] Large numbers of small files (each file's metadata is held in NameNode memory).
2] Duplicate data (HDFS has no built-in deduplication).
3] Our write-many, read-once access pattern, which is the reverse of the write-once, read-many workload HDFS is designed for.

Here are my questions:

1] Does HDFS support our client's requirements, or can it at least be configured to suit them?
2] If not out of the box, is there a customization of HDFS that would serve the purpose, or is there another solution that would work?

All thoughts and suggestions are welcome.

Regards,
Hemant.