Yeah; it would have to be a variant of SequenceFile managed by some kind of
'MahoutFileUtility' class. No, I haven't written it. "Tagged Sequence File"
with a suffix .tsq?
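
Roughly what I mean, as a minimal sketch only. Everything here is
hypothetical (the class name, the magic number, the layout), and it uses plain
java.io rather than the Hadoop SequenceFile API: a self-describing header of
key/value tags, followed by the data records.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of a "tagged" file; not Mahout or Hadoop code.
final class TaggedFileSketch {
    // Assumed magic number: the ASCII bytes "TSQ1".
    static final int MAGIC = 0x54535131;

    // Descriptive tags stored in the header, in insertion order.
    final Map<String, String> metadata = new LinkedHashMap<>();
    // Data records, all of one type (strings here, for simplicity).
    final List<String> records = new ArrayList<>();

    // Serialize as: magic, tag count, tags, record count, records.
    byte[] toBytes() {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeInt(MAGIC);
            out.writeInt(metadata.size());
            for (Map.Entry<String, String> e : metadata.entrySet()) {
                out.writeUTF(e.getKey());
                out.writeUTF(e.getValue());
            }
            out.writeInt(records.size());
            for (String r : records) {
                out.writeUTF(r);
            }
            out.flush();
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Read the header tags first, then the records that follow them.
    static TaggedFileSketch fromBytes(byte[] bytes) {
        try {
            DataInputStream in =
                new DataInputStream(new ByteArrayInputStream(bytes));
            if (in.readInt() != MAGIC) {
                throw new IOException("not a tagged file");
            }
            TaggedFileSketch f = new TaggedFileSketch();
            int tagCount = in.readInt();
            for (int i = 0; i < tagCount; i++) {
                String name = in.readUTF();
                f.metadata.put(name, in.readUTF());
            }
            int recordCount = in.readInt();
            for (int i = 0; i < recordCount; i++) {
                f.records.add(in.readUTF());
            }
            return f;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The point is only that a reader can learn what the file contains from the file
itself, before touching any data record.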

A second use case crops up with the FPGrowth code. FPGrowth builds a
multi-level data structure from its results, then writes it out to a
SequenceFile. Nothing in the Mahout code base reads this data back. It would
be more useful to write out different "flattenings" of the results in
commonly used Writables.

On Fri, Aug 26, 2011 at 2:27 AM, Sean Owen <[email protected]> wrote:

> I think the problem here is that the world of Hadoop tends to treat all
> files as streams of pure data records. The file boundaries -- and hence any
> per-file header or metadata -- doesn't have a meaning. It's unnatural-ish to
> put metadata in data files in this land.
>
> More specifically I don't quite see how you have a SequenceFile with one
> header record of a different type? they all have to be of the same type.
> Sure you can make a CommentOrDataWritable wrapper class but that's ugly.
>
> On Fri, Aug 26, 2011 at 10:24 AM, Lance Norskog <[email protected]> wrote:
>
> > If my camera uploaded raw image files and metadata files separately, I'd go
> > mad. The sound sample people got this right 20 years ago, when I wrote SoX.
> >
> > The difference between throwaway data files and permanently archivable data
> > files is having metadata inside the file. Letting Mahout make a
> > permanently archivable file opens up its utility tremendously, and
> > self-description is the key.
> >
> >
>



-- 
Lance Norskog
[email protected]
