I have forwarded this discussion to my mentors so that they are informed, and I hope they will provide better input regarding email storage.

> I second what Todd said, even with FuseHDFS, mounting HDFS as a regular file
> system, it won't give you the immediate response about the file status that
> you need. I believe Google implemented Gmail with HBase. Here is an example
> of implementing a mail store with Cassandra:
> http://ewh.ieee.org/r6/scv/computer/nfic/2009/IBM-Jun-Rao.pdf
>
> Mark

Thanks Mark, I will look into that. I am currently watching the Cloudera
Hadoop Training [1] to get a better view of how things work. I have one
question: what is the defining difference between Cassandra and HBase?

Also, Eric, one of my mentors, suggested I use Gora for this. After a quick
look at Gora I saw that it is an ORM for HBase and Cassandra, which would
allow me to switch between them. The downside is that Gora is still
incubating, so any advice on whether or not to use it is welcome. I will also
ask on the Gora mailing list to see how things are there.

>> I would encourage you to look at a system like HBase for your mail
>> backend. HDFS doesn't work well with lots of little files, and also
>> doesn't support random update, so existing formats like Maildir
>> wouldn't be a good fit.

I don't think I understand correctly what you mean by random updates. E-mails
are immutable, so once written they are not going to be updated. But if you
are referring to the fact that lots of (small) files will be written in a
directory and that this can be a problem, then I get it.

This would also mean that the mailbox format (all emails in one file) is even
less appropriate than Maildir. But since e-mails are immutable, and adding a
mail to the mailbox means appending a small piece of data to the file, this
should not be a problem if Hadoop has append. The presentation on Vimeo
stated that HDFS 0.19 did not have append. I don't know yet what the status
of that is, but things look a little brighter.
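
To make the append idea a bit more concrete, here is a rough sketch of what I
have in mind. It works against a plain local file, not HDFS, and the function
names (append_message, read_message, latest_messages) and the in-memory list
of (offset, length) pairs are made up for this example. It only assumes that
the storage layer can append bytes and read from a given offset.

```python
import os
import tempfile


def append_message(mbox_path, raw_message):
    """Append one immutable message and return its (offset, length)."""
    data = raw_message.encode("utf-8")
    with open(mbox_path, "ab") as f:
        # Make sure we measure the offset at the current end of the file.
        f.seek(0, os.SEEK_END)
        offset = f.tell()
        f.write(data)
    return offset, len(data)


def read_message(mbox_path, offset, length):
    """Read a single message back using its recorded offset and length."""
    with open(mbox_path, "rb") as f:
        f.seek(offset)
        return f.read(length).decode("utf-8")


def latest_messages(mbox_path, index, count):
    """Return the last `count` messages, i.e. the tail of the mailbox."""
    return [read_message(mbox_path, off, ln) for off, ln in index[-count:]]


if __name__ == "__main__":
    # Hypothetical mailbox file in a temporary directory.
    path = os.path.join(tempfile.mkdtemp(), "inbox")
    index = [append_message(path, m) for m in ("mail one\n", "mail two\n")]
    print(latest_messages(path, index, 1))
```

Existing messages are never rewritten here: writes only ever go to the end of
the file, and reads only need an offset and a length.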

You could have a mailbox file that could grow to a very large size. This
would put all the user's emails into one big file that is easy to manage; the
only thing missing is a way to fetch the emails. Since emails are appended to
the file (inbox) as they arrive, and you are usually interested in the latest
emails received, you could just read the tail of the file and do some
indexing based on that.

Should I post this on the HDFS mailing list as well? I'm talking without real
experience with Hadoop, so shut me up if I'm wrong.

>> --
>> Todd Lipcon
>> Software Engineer, Cloudera

You are from Cloudera, nice. Answers straight from the source :).

[1] http://vimeo.com/3591321

Thanks,
-- 
Ioan-Eugen Stan
