Hi,
Yes, we need to store immutable mails and their associated r/w metadata.
I was wondering how a solution like the one presented in [1]
can help. Twitter seems to use Protocol Buffers to store tweets.
Would a solution based on Avro be a better fit for our needs (mail storage)?
With the Avro option, would each "mail" be an Avro file, or should we
consider making the "folder" an Avro file and running some map/reduce jobs?
Thanks,
- Eric
[1]
http://www.slideshare.net/kevinweil/protocol-buffers-and-hadoop-at-twitter
On 19/05/2011 20:53, Robert Burrell Donkin wrote:
On Thu, May 19, 2011 at 12:04 PM, Ioan Eugen Stan<[email protected]> wrote:
I have forwarded this discussion to my mentors so they are informed
(I've hopped onto this list so no need to remember to copy me into the
thread ;-)
<snip>
Eric, one of my mentors, suggested I use Gora for
this, and after a quick look at Gora I saw that it is an ORM for HBase
and Cassandra, which will allow me to switch between them. The downside
is that Gora is still incubating, so advice on whether or not to
use it is welcome. I will also ask on the Gora mailing list
to see how things are there.
(I suspect there will be a measure of experimentation required in this
project, so don't be afraid to try a spike or two)
I would encourage you to look at a system like HBase for your mail
backend. HDFS doesn't work well with lots of little files, and also
doesn't support random update, so existing formats like Maildir
wouldn't be a good fit.
(Apache James is closer to the Microsoft Exchange space than traditional
*nix mail user agents)
I don't think I understand correctly what you mean by random updates.
E-mails are immutable so once written they are not going to be
updated. But if you are referring to the fact that lots of (small)
files will be written in a directory and that this can be a problem
then I get it. This will also mean that mailbox format (all emails in
one file) will be more inappropriate than Maildir. But since e-mails
are immutable and adding a mail to the mailbox means appending a small
piece of data to the file this should not be a problem if Hadoop has
append.
Essentially, there are two classes of data that mail storage requires:
1. read only MIME documents (mail messages) embedding meta-data (headers)
2. read-write meta-data sets about each document including flags for
each (virtual) mail directory containing the document
The documents are searched rarely. The meta-data sets are read often
but written rarely.
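
To make the split concrete, here is a minimal sketch of the two classes of
data in Python. It is only an illustration: every name in it (MailDocument,
MailboxEntry, MetadataStore) is hypothetical and not taken from James, and
the in-memory dict merely stands in for whatever read-write store (HBase,
Cassandra, ...) ends up being used.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class MailDocument:
    """Class 1: a read-only MIME document; headers live inside raw_mime."""
    message_id: str
    raw_mime: bytes


@dataclass
class MailboxEntry:
    """Class 2: read-write meta-data for one document in one (virtual) mailbox."""
    message_id: str
    mailbox: str
    flags: set = field(default_factory=set)


class MetadataStore:
    """Hypothetical in-memory stand-in for the read-write meta-data store."""

    def __init__(self):
        self._entries = {}  # (mailbox, message_id) -> MailboxEntry

    def add(self, doc, mailbox):
        entry = MailboxEntry(doc.message_id, mailbox)
        self._entries[(mailbox, doc.message_id)] = entry
        return entry

    def set_flag(self, mailbox, message_id, flag):
        self._entries[(mailbox, message_id)].flags.add(flag)

    def flags(self, mailbox, message_id):
        return frozenset(self._entries[(mailbox, message_id)].flags)


# One immutable document, referenced from two mailboxes with separate flags.
doc = MailDocument("<1@example.org>", b"Subject: hi\r\n\r\nbody\r\n")
store = MetadataStore()
store.add(doc, "INBOX")
store.add(doc, "Archive")
store.set_flag("INBOX", doc.message_id, "\\Seen")
```

The document is stored once and never changes; only the small per-mailbox
entries are ever rewritten, which matches the "read often, written rarely"
access pattern described above.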
I suspect that emails are relatively small in Hadoop terms, and are
often numerous. Might be interesting to see how a tuned HDFS instance
performs when storing large numbers of small MIME documents. Should be
easy enough to set up an experiment to benchmark. (I wonder whether a
RESTful distributed storage solution might end up working better.)
I suspect that the read-write meta-data sets will need HBase (or
Cassandra). Would need to think carefully about design, I think.
The presentation on Vimeo stated that HDFS 0.19 did not have append.
I don't know yet what the status on that is, but things are a little
brighter. You could have a mailbox file that could grow to a very
large size. This would put all of the user's emails into one big file
that is easy to manage; the only thing missing is a way to
fetch the emails. Since emails are appended to the file (inbox) as
they come, and you are usually interested in the latest emails
received, you could just read the tail of the file and do some indexing
based on that.
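
A rough sketch of that append-and-read-the-tail idea, in Python against a
local file rather than HDFS. The length-prefixed record layout, the in-memory
offset index, and the function names are all assumptions made for
illustration; the sketch also ignores locking and concurrent readers entirely.

```python
import os
import struct
import tempfile

_LEN = struct.Struct(">I")  # assumed layout: 4-byte big-endian length prefix


def append_mail(path, index, raw_mime):
    """Append one message to the mailbox file and remember its offset."""
    with open(path, "ab") as f:
        offset = f.seek(0, os.SEEK_END)  # where this record will start
        f.write(_LEN.pack(len(raw_mime)))
        f.write(raw_mime)
    index.append(offset)


def read_latest(path, index, n):
    """Return the n most recently appended messages, newest first."""
    if n <= 0:
        return []
    mails = []
    with open(path, "rb") as f:
        for offset in reversed(index[-n:]):
            f.seek(offset)
            (size,) = _LEN.unpack(f.read(_LEN.size))
            mails.append(f.read(size))
    return mails


# Usage: append three messages, then fetch the two newest.
path = os.path.join(tempfile.mkdtemp(), "inbox")
index = []
for i in range(3):
    append_mail(path, index, b"mail %d" % i)
latest = read_latest(path, index, 2)
```

Note that the index itself is read-write data, so it would still have to live
somewhere other than the append-only file.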
I'm not hopeful about adopting an append based approach. (Might be
made to work but I suspect that the locking required for IMAP or POP3
is likely to kill performance.)
Robert