ZIP is not a "native" file format in Hadoop, and there is no built-in option to ignore some of the files in an archive. But here is a good alternative: http://cotdp.com/blog/2011/03/reading-zip-files-from-hadoop-mapreduce.html
This, with accept/reject file patterns, would do the trick.

(p.s. The FPGrowth Writable format is TopKStringPatterns.java.)

On Fri, Aug 26, 2011 at 3:02 AM, Sean Owen <[email protected]> wrote:

> Yeah but then our output is not interoperable with other stuff off the bat,
> over something pretty minor.
>
> On Fri, Aug 26, 2011 at 10:59 AM, Lance Norskog <[email protected]> wrote:
>
> > Yeah; it would have to be a variant of SequenceFile managed by some kind of
> > 'MahoutFileUtility' class. No, I haven't written it. "Tagged Sequence File"
> > with a suffix .tsq?
> >
> > A second use case crops up with the FPGrowth code. FPGrowth develops a
> > multi-level data structure of its results, then writes this out in a
> > SequenceFile. In the Mahout code base there is no code that reads this data
> > for any reason. It would be more useful to have different "flattenings"
> > written out in commonly used Writables.
> >
>

--
Lance Norskog
[email protected]
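
A minimal sketch of the accept/reject idea mentioned at the top of this message. It is not the input format from the linked post and touches no Hadoop classes; it uses only java.util.zip and java.util.regex to show the filtering step. The class name ZipEntryFilter, its methods, and the two sample patterns are hypothetical, picked for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper: filters ZIP entries by accept/reject name patterns.
public class ZipEntryFilter {

    /** An entry is kept if its full name matches accept and does not match reject. */
    public static boolean isAccepted(String name, Pattern accept, Pattern reject) {
        return accept.matcher(name).matches() && !reject.matcher(name).matches();
    }

    /** Returns the names of non-directory entries that pass the filter, in archive order. */
    public static List<String> acceptedEntries(InputStream in, Pattern accept, Pattern reject)
            throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.isDirectory() && isAccepted(entry.getName(), accept, reject)) {
                    // A real record reader would read the entry's bytes from 'zip' here.
                    names.add(entry.getName());
                }
            }
        }
        return names;
    }

    /** Builds a tiny in-memory ZIP so the sketch runs without any files on disk. */
    public static byte[] sampleZip(String... entryNames) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String name : entryNames) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write(name.getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] archive = sampleZip("data/part-00000", "data/_SUCCESS", "data/.part-00000.crc");
        Pattern accept = Pattern.compile("data/.*");        // assumed: keep everything under data/
        Pattern reject = Pattern.compile(".*/[_.][^/]*");   // assumed: skip names starting with _ or .
        System.out.println(acceptedEntries(new ByteArrayInputStream(archive), accept, reject));
    }
}
```

In a MapReduce job the same check would presumably sit inside whatever record reader walks the archive, so that rejected entries are skipped before any key/value pair is emitted.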
