Owen O'Malley wrote:
On Sep 24, 2008, at 1:50 AM, Trinh Tuan Cuong wrote:
We are developing a project and we intend to use Hadoop to handle
the processing of vast amounts of data. But to convince our customers
about using Hadoop in our project, we must show them the
advantages (and maybe the disadvantages) of deploying the project
with Hadoop compared to the Oracle Database platform.
The primary advantage of Hadoop is scalability. On an equivalent
hardware budget, Hadoop can handle much much larger databases. We had a
process that was run once a week on Oracle that is now run once an hour
on Hadoop. Additionally, Hadoop scales out much much farther. We can
store petabytes of data in a single Hadoop cluster and have jobs that
read and generate hundreds of terabytes.
That said, what a database gives you -on the right hardware- is very
fast responses, especially if the indices are set up right and the data
denormalised when appropriate. There is also really good integration
with tools and application servers, with things like Java EE designed to
make running code against a database easy.
Not using Oracle means you don't have to work with an Oracle DBA, which,
in my experience, can only be a good thing. DBAs and developers never
seem to see eye-to-eye.
Hadoop only has very primitive security at the moment, although I expect
that to change in the next 6 months.
Right now you need to trust everyone else on the network where you run
Hadoop to not be malicious; the filesystem and job tracker interfaces
are insecure. The forthcoming 0.19 release will ask who you are, but the
far end trusts you to be who you say you are. In that respect, it's as
secure as NFS over UDP.
To secure Hadoop you'd probably need to:
- sign every IPC request, with a CPU time cost at both ends.
- require some form of authentication for the HTTP-exported parts of
  the system, such as digest authentication, or issue lots of HTTPS
  private keys and use those instead, which gives everyone a key
  management problem as well as extra communications overhead.
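To make the "sign every IPC request" idea concrete, here is a minimal sketch in Python. It is not how Hadoop does (or will do) anything; the shared key, the JSON message layout, and the function names are all assumptions made up for illustration. It only shows the general shape: the client computes an HMAC over the request, and the far end recomputes it before trusting the contents, so both ends pay the hashing cost.

```python
import hashlib
import hmac
import json

# Hypothetical shared secret. Distributing and rotating this key is exactly
# the key management problem mentioned above.
SHARED_KEY = b"example-cluster-secret"


def sign_request(payload: dict, key: bytes = SHARED_KEY) -> dict:
    """Client side: serialize the request and attach an HMAC-SHA256 tag."""
    body = json.dumps(payload, sort_keys=True).encode("utf-8")
    tag = hmac.new(key, body, hashlib.sha256).hexdigest()
    return {"body": body.decode("utf-8"), "sig": tag}


def verify_request(message: dict, key: bytes = SHARED_KEY) -> dict:
    """Server side: recompute the tag and reject the request on mismatch."""
    body = message["body"].encode("utf-8")
    expected = hmac.new(key, body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking the match position through timing.
    if not hmac.compare_digest(expected, message["sig"]):
        raise PermissionError("bad signature")
    return json.loads(body)
```

A request whose claimed user is edited in transit no longer matches its tag, so the receiver does not have to take the sender's word for who they are. A single shared key is the simplest possible choice here; per-user keys would be needed to tell users apart from each other.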
What would be easier is to lock down remote access to the filesystem/job
submission so that only authenticated users are able to upload jobs
and data. The cluster would continue to trust everything else on its
network, but the system wouldn't trust people to submit work unless they
could prove who they were.
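As a rough sketch of that "authenticate at the edge, trust inside" arrangement, again in Python: the SubmissionGateway class, its methods, and the forward callable are hypothetical stand-ins, not part of Hadoop. The gateway checks a password before handing the job to the cluster, and nothing behind it changes.

```python
import hashlib
import hmac
import os


def hash_password(password: str, salt: bytes) -> bytes:
    """Derive a salted hash so the gateway never stores plain passwords."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)


class SubmissionGateway:
    """Hypothetical front door: authenticates users, then forwards jobs."""

    def __init__(self, forward):
        self._forward = forward  # callable standing in for the trusted cluster
        self._users = {}         # username -> (salt, password hash)

    def add_user(self, username: str, password: str) -> None:
        salt = os.urandom(16)
        self._users[username] = (salt, hash_password(password, salt))

    def submit(self, username: str, password: str, job: str):
        record = self._users.get(username)
        if record is None:
            raise PermissionError("unknown user")
        salt, stored = record
        if not hmac.compare_digest(stored, hash_password(password, salt)):
            raise PermissionError("bad credentials")
        # Past this point the cluster trusts the request, as it does today.
        return self._forward(username, job)
```

This only helps if the gateway is the sole route in from outside; anything already on the cluster's network can still bypass it.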