Owen O'Malley wrote:
On Sep 24, 2008, at 1:50 AM, Trinh Tuan Cuong wrote:

We are developing a project and intend to use Hadoop to process vast amounts of data. But to convince our customers to use Hadoop in our project, we must show them the advantages (and maybe the disadvantages) of deploying the project with Hadoop compared to the Oracle Database platform.

The primary advantage of Hadoop is scalability. On an equivalent hardware budget, Hadoop can handle much, much larger databases. We had a process that was run once a week on Oracle that is now run once an hour on Hadoop. Additionally, Hadoop scales out much, much farther. We can store petabytes of data in a single Hadoop cluster and have jobs that read and generate hundreds of terabytes.

That said, what a database gives you, on the right hardware, is very fast responses, especially if the indices are set up right and the data is denormalised when appropriate. There is also really good integration with tools and application servers, with things like Java EE designed to make running code against a database easy.

Not using Oracle means you don't have to work with an Oracle DBA, which, in my experience, can only be a good thing. DBAs and developers never seem to see eye-to-eye.



Hadoop only has very primitive security at the moment, although I expect that to change in the next 6 months.


Right now you need to trust everyone else on the network where you run Hadoop not to be malicious; the filesystem and job tracker interfaces are insecure. The forthcoming 0.19 release will ask who you are, but the far end trusts you to be who you say you are. In that respect, it's as secure as NFS over UDP.

To secure Hadoop you'd probably need to
 -sign every IPC request, with a CPU time cost at both ends.
 -require some form of authentication for the HTTP-exported parts of the system, such as digest authentication, or issue lots of HTTPS private keys and use those instead, which gives everyone a key management problem as well as extra communications overhead.
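
To give a feel for what signing every request might involve, here is a minimal sketch in Python. It is not Hadoop's IPC layer or any existing Hadoop API; it assumes a shared secret per user and swaps in HMAC-SHA256 as one possible signing scheme. The names sign_request and verify_request, the field layout, and the 300-second clock-skew window are all hypothetical choices made for illustration.

```python
import hashlib
import hmac
import struct
import time


def _frame(*fields: bytes) -> bytes:
    # Length-prefix each field so that field boundaries are unambiguous.
    return b"".join(struct.pack(">I", len(f)) + f for f in fields)


def sign_request(secret: bytes, user: str, payload: bytes, timestamp: int) -> str:
    # Hypothetical helper: the caller computes this for every request (CPU cost #1).
    message = _frame(user.encode("utf-8"), str(timestamp).encode("ascii"), payload)
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_request(secret: bytes, user: str, payload: bytes, timestamp: int,
                   signature: str, now=None, max_skew: int = 300) -> bool:
    # Hypothetical helper: the far end recomputes the signature (CPU cost #2).
    now = int(time.time()) if now is None else now
    if abs(now - timestamp) > max_skew:
        # Reject stale or far-future requests to limit replay of old messages.
        return False
    expected = sign_request(secret, user, payload, timestamp)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected, signature)
```

Both ends compute a hash over the whole payload, which is where the CPU cost at both ends comes from, and the shared secrets still have to be distributed somehow, so this does not make the key management problem go away.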

An easier approach would be to lock down remote access to the filesystem and job submission so that only authenticated users can upload jobs and data. The cluster would continue to trust everything else on its network, but the system would not accept work from people unless they could prove who they were.
