On Thu, May 19, 2011 at 12:04 PM, Ioan Eugen Stan <[email protected]> wrote:
> I have forwarded this discussion to my mentors so they are informed

(I've hopped onto this list so no need to remember to copy me into the thread ;-)

<snip>

> Eric, one of my mentors, suggested I use Gora for
> this and after a quick look at Gora I saw that it is an ORM for HBase
> and Cassandra which will allow me switch between them. The downside
> with this is that Gora is still incubating so a piece of advice about
> using it or not is welcomed. I will also ask on the Gora mailing list
> to see how things are there.

(I suspect there will be a measure of experimentation required in this project, so don't be afraid to try a spike or two)

>>> I would encourage you to look at a system like HBase for your mail
>>> backend. HDFS doesn't work well with lots of little files, and also
>>> doesn't support random update, so existing formats like Maildir
>>> wouldn't be a good fit.

(Apache James is closer to the Microsoft Exchange space than traditional *nix mail user agents)

> I don't think I understand correctly what you mean by random updates.
> E-mails are immutable so once written they are not going to be
> updated. But if you are referring to the fact that lots of (small)
> files will be written in a directory and that this can be a problem
> then I get it. This will also mean that mailbox format (all emails in
> one file) will be more inappropriate than Maildir. But since e-mails
> are immutable and adding a mail to the mailbox means appending a small
> piece of data to the file this should not be a problem if Hadoop has
> append.

Essentially, there are two classes of data that mail storage requires:

1. read only MIME documents (mail messages) embedding meta-data (headers)
2. read-write meta-data sets about each document including flags for each (virtual) mail directory containing the document

The documents are searched rarely. The meta-data sets are read often but written rarely. I suspect that emails are relatively small in Hadoop terms, and are often numerous.

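To make the split between the two classes of data concrete, here is a rough sketch in Python, purely for illustration. Every name in it (MimeDocument, DocumentMetadata, store_message) is hypothetical and not taken from James, Gora or HBase, and using a content hash as the document id is my own assumption rather than a settled design.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class MimeDocument:
    """Class 1: a read-only MIME message, written once and never updated."""
    doc_id: str
    raw: bytes  # full message, headers included


@dataclass
class DocumentMetadata:
    """Class 2: read-write meta-data about one document."""
    doc_id: str
    # Flags are kept per (virtual) mail directory containing the document.
    flags_by_mailbox: dict = field(default_factory=dict)

    def set_flag(self, mailbox, flag):
        self.flags_by_mailbox.setdefault(mailbox, set()).add(flag)

    def clear_flag(self, mailbox, flag):
        self.flags_by_mailbox.get(mailbox, set()).discard(flag)


def store_message(raw, mailbox, documents, metadata):
    """Store the message once and record that `mailbox` contains it.

    `documents` and `metadata` are plain dicts standing in for the two
    stores (for example a blob store and an HBase/Cassandra table).
    """
    # Assumption: a content-derived id, so the same bytes are stored once.
    doc_id = hashlib.sha256(raw).hexdigest()
    documents.setdefault(doc_id, MimeDocument(doc_id, raw))
    meta = metadata.setdefault(doc_id, DocumentMetadata(doc_id))
    meta.flags_by_mailbox.setdefault(mailbox, set())
    return doc_id
```

The point of the sketch is only that marking a message as seen touches the small meta-data record and never rewrites the stored MIME document.
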
Might be interesting to see how a tuned HDFS instance performs when storing large numbers of small MIME documents. Should be easy enough to set up an experiment to benchmark. (I wonder whether a RESTful distributed storage solution might end up working better.)

I suspect that the read-write meta-data sets will need HBase (or Cassandra). Would need to think carefully about design, I think.

> The presentation on Vimeo it stated that HDFS 0.19 did not had append,
> I don't know yet what is the status on that, but things are a little
> brighter. You could have a mailbox file that could grow to a very
> large size. This will lead to all the users emails into one big file
> that is easy to manage, the only thing that it's missing is the
> fetching the emails. Since emails are appended to the file (inbox) as
> they come, and you usually are interested in the latest emails
> received you could just read the tail of the file and do some indexing
> based on that.

I'm not hopeful about adopting an append based approach. (Might be made to work but I suspect that the locking required for IMAP or POP3 is likely to kill performance.)

Robert
