Thank you, Sonal. At least that big job I was looking at just finished :)
Mark

On Tue, Sep 6, 2011 at 11:56 PM, Sonal Goyal <[email protected]> wrote:
> Mark,
>
> Having a large number of emitted key/value pairs from the mapper should not
> be a problem. Just make sure that you have enough reducers to handle the
> data, so that the reduce stage does not become a bottleneck.
>
> Best Regards,
> Sonal
> Crux: Reporting for HBase <https://github.com/sonalgoyal/crux>
> Nube Technologies <http://www.nubetech.co>
> <http://in.linkedin.com/in/sonalgoyal>
>
> On Wed, Sep 7, 2011 at 8:44 AM, Mark Kerzner <[email protected]> wrote:
>
> > Harsh,
> >
> > I read one PST file, which contains many emails. But then I emit many
> > maps, like this:
> >
> > MapWritable mapWritable = createMapWritable(metadata, fileName);
> > // use the MD5 of the input file as the Hadoop key
> > FileInputStream fileInputStream = new FileInputStream(fileName);
> > MD5Hash key = MD5Hash.digest(fileInputStream);
> > fileInputStream.close();
> > // emit the map
> > context.write(key, mapWritable);
> >
> > and it is these context.write calls that I have a great number of. Is
> > that a problem?
> >
> > Mark
> >
> > On Tue, Sep 6, 2011 at 10:06 PM, Harsh J <[email protected]> wrote:
> >
> > > You can use an input format that lets you read multiple files per map
> > > (say, all local files; see CombineFileInputFormat for one
> > > implementation that does this). This way you get a reduced number of
> > > maps, and you don't really have to clump your files. One record reader
> > > would be initialized per file, so I believe you should be free to
> > > generate unique identities per file/email with this approach (whenever
> > > a new record reader is initialized)?
> > >
> > > On Wed, Sep 7, 2011 at 7:12 AM, Mark Kerzner <[email protected]> wrote:
> > > > Hi,
> > > >
> > > > I am testing my Hadoop-based FreeEed <http://freeeed.org/>, an open
> > > > source tool for eDiscovery, and I am using the Enron data set
> > > > <http://www.edrm.net/resources/data-sets/edrm-enron-email-data-set-v2>
> > > > for that. In my processing, each email with its attachments becomes a
> > > > map, and it is later collected by a reducer and written to the output.
> > > > With (PST) mailboxes of around 2-5 GB, I begin to see email counts of
> > > > about 50,000. I remember from the Yahoo best practices that the number
> > > > of maps should not exceed 75,000, and I can see that I may break this
> > > > barrier soon.
> > > >
> > > > I could, potentially, combine a few emails into one map, but I would
> > > > be doing it only to circumvent the size problem, not because my
> > > > processing requires it. Besides, my keys are the MD5 hashes of the
> > > > files, and I use them to find duplicates. If I combine a few emails
> > > > into one map, I cannot use the hashes as keys in a meaningful way
> > > > anymore.
> > > >
> > > > So my question is: can't I have millions of maps, if that's how many
> > > > artifacts I need to process, and why not?
> > > >
> > > > Thank you. Sincerely,
> > > > Mark
> > >
> > > --
> > > Harsh J
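
A minimal sketch of Sonal's suggestion about sizing the reduce stage, using the new (mapreduce) API that Mark's snippet implies. The driver class name, job name, and reducer count are illustrative, not taken from FreeEed:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.MD5Hash;
    import org.apache.hadoop.io.MapWritable;
    import org.apache.hadoop.mapreduce.Job;

    public class FreeEedDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "freeeed-enron"); // Job.getInstance(conf) on Hadoop 2+
        job.setJarByClass(FreeEedDriver.class);
        job.setMapOutputKeyClass(MD5Hash.class);     // MD5 of each email file
        job.setMapOutputValueClass(MapWritable.class);
        // A common rule of thumb is roughly 0.95-1.75 x (nodes x reduce
        // slots per node); 20 here is only a placeholder value.
        job.setNumReduceTasks(20);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }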
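
A rough sketch of the CombineFileInputFormat approach Harsh describes, assuming each file is small enough to buffer in memory as a single record. Class names are illustrative; because one reader is created per file, a per-file identity (such as an MD5 hash of the bytes) can still be computed downstream:

    import java.io.IOException;

    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
    import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

    public class CombinedEmailInputFormat
        extends CombineFileInputFormat<Text, BytesWritable> {

      @Override
      public RecordReader<Text, BytesWritable> createRecordReader(
          InputSplit split, TaskAttemptContext context) throws IOException {
        // CombineFileRecordReader instantiates one WholeFileReader per file
        // in the combined split, which is the per-file initialization Harsh
        // mentions.
        return new CombineFileRecordReader<Text, BytesWritable>(
            (CombineFileSplit) split, context, WholeFileReader.class);
      }

      // One whole file per record: key = file path, value = file bytes.
      public static class WholeFileReader
          extends RecordReader<Text, BytesWritable> {

        private final CombineFileSplit split;
        private final int index; // which file of the combined split is ours
        private TaskAttemptContext context;
        private Text key;
        private BytesWritable value;
        private boolean done = false;

        // CombineFileRecordReader requires exactly this constructor shape.
        public WholeFileReader(CombineFileSplit split,
            TaskAttemptContext context, Integer index) {
          this.split = split;
          this.context = context;
          this.index = index;
        }

        @Override
        public void initialize(InputSplit s, TaskAttemptContext ctx) {
          this.context = ctx;
        }

        @Override
        public boolean nextKeyValue() throws IOException {
          if (done) {
            return false;
          }
          Path path = split.getPath(index);
          int length = (int) split.getLength(index); // assumes file fits in memory
          byte[] contents = new byte[length];
          FSDataInputStream in =
              path.getFileSystem(context.getConfiguration()).open(path);
          try {
            IOUtils.readFully(in, contents, 0, length);
          } finally {
            in.close();
          }
          key = new Text(path.toString());
          value = new BytesWritable(contents);
          done = true;
          return true;
        }

        @Override public Text getCurrentKey() { return key; }
        @Override public BytesWritable getCurrentValue() { return value; }
        @Override public float getProgress() { return done ? 1.0f : 0.0f; }
        @Override public void close() { }
      }
    }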
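
And a minimal sketch of the reduce-side deduplication that Mark's MD5 keys make possible: identical files all arrive under one key, so every value after the first is a duplicate. The class and counter names are hypothetical, not FreeEed's:

    import java.io.IOException;

    import org.apache.hadoop.io.MD5Hash;
    import org.apache.hadoop.io.MapWritable;
    import org.apache.hadoop.mapreduce.Reducer;

    public class DedupReducer
        extends Reducer<MD5Hash, MapWritable, MD5Hash, MapWritable> {

      @Override
      protected void reduce(MD5Hash key, Iterable<MapWritable> values,
          Context context) throws IOException, InterruptedException {
        boolean first = true;
        for (MapWritable email : values) {
          if (first) {
            context.write(key, email); // keep one copy of each unique email
            first = false;
          } else {
            // any further value under the same MD5 key is a duplicate
            context.getCounter("freeeed", "duplicates").increment(1);
          }
        }
      }
    }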
