Well, I want something like TeraSort, but for SequenceFiles instead of lines of text. My goal is efficiency, and I'm currently working with Hadoop only.
Thanks for your suggestions,
Mark

On Thu, May 26, 2011 at 8:34 AM, Robert Evans <[email protected]> wrote:

> Also, if you want something that is fairly fast and a lot less dev work to
> get going, you might want to look at Pig. It can do a distributed ORDER BY
> that is fairly good.
>
> --Bobby Evans
>
> On 5/26/11 2:45 AM, "Luca Pireddu" <[email protected]> wrote:
>
> On May 25, 2011 22:15:50 Mark question wrote:
> > I'm using SequenceFileInputFormat, but then what do I write in my mappers?
> >
> > Each mapper takes a split from the SequenceInputFile and then sorts its
> > own split?! I don't want that.
> >
> > Thanks,
> > Mark
> >
> > On Wed, May 25, 2011 at 2:09 AM, Luca Pireddu <[email protected]> wrote:
> > > On May 25, 2011 01:43:22 Mark question wrote:
> > > > Thanks Luca, but what other way is there to sort a directory of
> > > > sequence files?
> > > >
> > > > I don't plan to write a sorting algorithm in mappers/reducers, but
> > > > am hoping to use SequenceFile.Sorter instead.
> > > >
> > > > Any ideas?
> > > >
> > > > Mark
>
> If you want to achieve a global sort, then look at how TeraSort does it:
>
> http://sortbenchmark.org/YahooHadoop.pdf
>
> The idea is to partition the data so that all keys in part[i] are < all keys
> in part[i+1]. Each partition is individually sorted, so to read the data in
> globally sorted order you simply have to traverse it starting from the first
> partition and working your way to the last one.
>
> If your keys are already what you want to sort by, then you don't even need a
> mapper (just use the default identity map).
>
> --
> Luca Pireddu
> CRS4 - Distributed Computing Group
> Loc. Pixina Manna Edificio 1
> Pula 09010 (CA), Italy
> Tel: +39 0709250452
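
For reference, the partitioning idea Luca describes in the quoted message (all keys in part[i] are < all keys in part[i+1], each part sorted on its own) can be sketched in plain Java without any Hadoop dependency. This is only an illustration under assumptions: the class and method names (RangePartitionSketch, chooseSplitPoints, partitionFor, globalSort, readInOrder) are made up for this example, keys are plain Strings, and everything runs in memory. In Hadoop the sampling and range-partitioning roles are normally played by InputSampler and TotalOrderPartitioner, which this sketch does not use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical, in-memory illustration of range partitioning for a global sort.
// Not TeraSort itself and not Hadoop API code.
class RangePartitionSketch {

    // Pick numPartitions - 1 split points from a (non-empty) sample of the keys.
    static String[] chooseSplitPoints(List<String> sample, int numPartitions) {
        List<String> sorted = new ArrayList<>(sample);
        Collections.sort(sorted);
        String[] splits = new String[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            splits[i - 1] = sorted.get(i * sorted.size() / numPartitions);
        }
        return splits;
    }

    // Partition p receives keys with splits[p-1] <= key < splits[p],
    // so every key in partition p is smaller than every key in partition p+1.
    static int partitionFor(String key, String[] splits) {
        int pos = Arrays.binarySearch(splits, key);
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }

    // Route each key to its partition, then sort each partition independently
    // (in a real job, each reducer would sort its own partition).
    static List<List<String>> globalSort(List<String> keys, String[] splits) {
        List<List<String>> parts = new ArrayList<>();
        for (int i = 0; i <= splits.length; i++) {
            parts.add(new ArrayList<>());
        }
        for (String key : keys) {
            parts.get(partitionFor(key, splits)).add(key);
        }
        for (List<String> part : parts) {
            Collections.sort(part);
        }
        return parts;
    }

    // Reading the partitions first to last yields the globally sorted order.
    static List<String> readInOrder(List<List<String>> parts) {
        List<String> out = new ArrayList<>();
        for (List<String> part : parts) {
            out.addAll(part);
        }
        return out;
    }
}
```

To use the sketch, one would call chooseSplitPoints on a sample of the keys, pass the result to globalSort together with the full key list, and then walk the returned partitions in order with readInOrder. The split points only affect how evenly the partitions are filled; the output order is correct for any sorted set of split points.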
