Hi Folks,

I have a bunch of binary files which I've stored in a SequenceFile. The name of the file is the key, the data is the value, and I've stored them sorted by key. (I'm not tied to using a SequenceFile for this.) The current test data is only 50MB, but the real data will be 500MB - 1GB.

My M/R job requires that its input be several of these records from the sequence file, and which records belong together is determined by the key. The sorting mentioned above keeps each such group of records packed together.

1. Any reason not to use a sequence file for this? Perhaps a MapFile? Since I've sorted it, I don't need "random" access, but I do need to be aware of the keys, as I need to be sure that all of the relevant keys get sent to a given mapper.

2. Looks like I want a custom InputFormat for this, extending SequenceFileInputFormat. Do you agree? I'll gladly take some opinions on this, as I ultimately want to split the file based on what's in it, which might be a little unorthodox.

3. Another idea might be to create separate sequence files for each chunk of records and make them non-splittable, ensuring that each one goes to a single mapper. Assuming I can get away with this, do you see any pros/cons with that approach?

Thanks,
Tom

--
===================
Skybox is hiring.
http://www.skyboximaging.com/careers/jobs
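
To make the constraint behind questions 2 and 3 concrete, here is a rough sketch in plain Java (no Hadoop classes) of how split boundaries could be chosen so that a group of related keys is never cut in half. Everything in it is an assumption made for illustration: the class name KeyGroupSplits, the plan method, the groupOf function that maps a key to its group, and the target size counted in records. A real custom InputFormat would also have to translate these record boundaries into byte offsets in the file, which this sketch does not attempt.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical helper, not part of Hadoop: plans splits over sorted keys.
final class KeyGroupSplits {

    /** Half-open range [start, end) of record indices forming one split. */
    static final class Range {
        final int start;
        final int end;

        Range(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Walks the sorted keys once and closes a split only where the group
     * changes, and only once the split holds at least targetRecords records.
     * Records of one group therefore always land in the same split.
     */
    static List<Range> plan(List<String> sortedKeys,
                            Function<String, String> groupOf,
                            int targetRecords) {
        List<Range> splits = new ArrayList<>();
        int start = 0;
        for (int i = 1; i <= sortedKeys.size(); i++) {
            boolean atEnd = (i == sortedKeys.size());
            // A boundary is only legal between two keys of different groups.
            boolean groupChanges = !atEnd
                    && !groupOf.apply(sortedKeys.get(i))
                               .equals(groupOf.apply(sortedKeys.get(i - 1)));
            boolean bigEnough = (i - start) >= targetRecords;
            if (atEnd || (groupChanges && bigEnough)) {
                splits.add(new Range(start, i));
                start = i;
            }
        }
        return splits;
    }
}
```

For example, if the keys were named like sceneA_1, sceneA_2, sceneB_1 (an invented naming scheme), groupOf could be k -> k.substring(0, k.indexOf('_')). Idea 3 from the post amounts to making every such range its own non-splittable file, while idea 2 keeps one file and moves this decision into the InputFormat.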
