I have forwarded this discussion to my mentors so they are informed
and I hope they will provide better input regarding email storage.

> I second what Todd said, even with FuseHDFS, mounting HDFS as a regular file
> system, it won't give you the immediate response about the file status that
> you need. I believe Google implemented Gmail with HBase. Here is an example
> of implementing a mail store with Cassandra:
> http://ewh.ieee.org/r6/scv/computer/nfic/2009/IBM-Jun-Rao.pdf
>
> Mark

Thanks Mark, I will look into that. I am currently watching the
Cloudera Hadoop Training [1] to get a better view of how things work.

I have one question: what is the defining difference between Cassandra
and HBase? Also, Eric, one of my mentors, suggested I use Gora for
this, and after a quick look at Gora I saw that it is an ORM for HBase
and Cassandra which will allow me to switch between them. The downside
is that Gora is still incubating, so any advice about whether to use
it is welcome. I will also ask on the Gora mailing list to see how
things are there.

>> I would encourage you to look at a system like HBase for your mail
>> backend. HDFS doesn't work well with lots of little files, and also
>> doesn't support random update, so existing formats like Maildir
>> wouldn't be a good fit.

I don't think I correctly understand what you mean by random updates.
E-mails are immutable, so once written they are not going to be
updated. But if you are referring to the fact that lots of (small)
files will be written in a directory and that this can be a problem,
then I get it. This would also mean that the mailbox format (all
emails in one file) is less appropriate than Maildir. But since
e-mails are immutable and adding a mail to the mailbox means appending
a small piece of data to the file, this should not be a problem if
Hadoop has append.

The presentation on Vimeo stated that HDFS 0.19 did not have append.
I don't know yet what the current status is, but if append is
available things look a little brighter. You could have a mailbox file
that grows to a very large size. This would put all of a user's emails
into one big file that is easy to manage; the only thing missing is
fetching the emails. Since emails are appended to the file (inbox) as
they come, and you are usually interested in the latest emails
received, you could just read the tail of the file and do some
indexing based on that. Should I post this on the HDFS mailing-list
also?
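
To make the idea concrete, here is a minimal sketch of an append-only
mailbox with an offset index, assuming a plain local file stands in
for an HDFS file that supports append. The class name, the 4-byte
length prefix and the in-memory index are all hypothetical choices for
illustration, not anything HDFS or an existing mail store provides.

```python
import os
import struct

# Assumption: each message is stored as a 4-byte big-endian length
# followed by the raw message bytes.
HEADER = struct.Struct(">I")


class AppendOnlyMailbox:
    """Hypothetical single-file mailbox; messages are only appended."""

    def __init__(self, path):
        self.path = path
        open(path, "ab").close()  # create the file if it is missing
        self.offsets = []         # byte offset of each message, oldest first
        self._rebuild_index()

    def _rebuild_index(self):
        # Scan the file once and record where every message starts.
        self.offsets = []
        size = os.path.getsize(self.path)
        with open(self.path, "rb") as f:
            pos = 0
            while pos < size:
                f.seek(pos)
                (length,) = HEADER.unpack(f.read(HEADER.size))
                self.offsets.append(pos)
                pos += HEADER.size + length

    def append(self, message):
        # Single-writer assumption: the current size is the new offset.
        offset = os.path.getsize(self.path)
        with open(self.path, "ab") as f:
            f.write(HEADER.pack(len(message)))
            f.write(message)
        self.offsets.append(offset)

    def read(self, index):
        # Seek straight to one message using the index.
        with open(self.path, "rb") as f:
            f.seek(self.offsets[index])
            (length,) = HEADER.unpack(f.read(HEADER.size))
            return f.read(length)

    def latest(self, n):
        # Newest first: walk the index backwards from the tail.
        last = len(self.offsets) - 1
        first = max(0, len(self.offsets) - n)
        return [self.read(i) for i in range(last, first - 1, -1)]
```

Existing bytes are never rewritten, so the file only needs append and
sequential or positioned reads. The index is kept in memory here and
rebuilt by a full scan on open; a real store would have to persist it
somewhere, which is where something like HBase might come in.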

I'm talking without real experience with Hadoop so shut me up if I'm wrong.

>> --
>> Todd Lipcon
>> Software Engineer, Cloudera

You are from Cloudera, nice. Answers straight from the source :).

[1] http://vimeo.com/3591321

Thanks,

-- 
Ioan-Eugen Stan
