Hi, I am testing my Hadoop-based FreeEed <http://freeeed.org/>, an open source tool for eDiscovery, and I am using the Enron data set <http://www.edrm.net/resources/data-sets/edrm-enron-email-data-set-v2> for that. In my processing, each email with its attachments becomes a map, which is later collected by a reducer and written to the output. With PST mailboxes of around 2-5 gigabytes, I am beginning to see email counts of about 50,000. I remember from the Yahoo best practices that the number of maps should not exceed 75,000, and I can see that I will break that barrier soon.
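To make the setup concrete, here is a minimal sketch of the kind of mapper I mean. The class name EmailMapper is hypothetical, and I am assuming for illustration that each input value arrives as the raw bytes of one email with its attachments; the mapper emits one record per email, keyed by the MD5 hash I describe below:

import java.io.IOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: input value is assumed to be the raw bytes of one
// email (with attachments); output key is the MD5 hex digest of those bytes.
public class EmailMapper extends Mapper<Text, BytesWritable, Text, BytesWritable> {

    @Override
    protected void map(Text path, BytesWritable rawEmail, Context context)
            throws IOException, InterruptedException {
        try {
            // Hash only the valid region of the backing array.
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            md5.update(rawEmail.getBytes(), 0, rawEmail.getLength());
            StringBuilder hex = new StringBuilder();
            for (byte b : md5.digest()) {
                hex.append(String.format("%02x", b));
            }
            // One map output per email: exact duplicates share the same key
            // and therefore land on the same reducer.
            context.write(new Text(hex.toString()), rawEmail);
        } catch (NoSuchAlgorithmException e) {
            throw new IOException(e); // MD5 is standard, but the API requires handling
        }
    }
}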
I could, potentially, combine a few emails into one map, but I would be doing that only to circumvent the size problem, not because my processing requires it. Besides, my keys are the MD5 hashes of the files, and I use them to find duplicates. If I combined a few emails into one map, I could no longer use the hashes as keys in a meaningful way. So my question is: can't I have millions of maps, if that is how many artifacts I need to process, and if not, why not? Thank you. Sincerely, Mark
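P.S. For completeness, here is the matching reducer sketch (again with a hypothetical class name, DedupeReducer). Since all values arriving under one MD5 key are byte-identical, writing only the first value per key is what drops the duplicates; this is why combining several emails into one map record would break the scheme:

import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical reducer: all values under one MD5 key are identical emails,
// so emitting just the first one de-duplicates the collection.
public class DedupeReducer extends Reducer<Text, BytesWritable, Text, BytesWritable> {

    @Override
    protected void reduce(Text md5Key, Iterable<BytesWritable> emails, Context context)
            throws IOException, InterruptedException {
        for (BytesWritable first : emails) {
            context.write(md5Key, first); // keep one copy per hash
            break;                        // skip the duplicates
        }
    }
}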
