I talked to the IBM guys about this problem with JSON-like formats. Their answer was that, if you care enough about size, just about any compression algorithm will compress away the repeated type information.
So if you use a splittable compressed format (bz2 works with Hadoop), you are set, apart from the cost of compression. The cost of decompression is usually offset by the reduced I/O.

On Wed, Sep 3, 2008 at 3:52 PM, Jay Kreps <[EMAIL PROTECTED]> wrote:

> ...
>
> Thanks for the pointer to jaql, that seems very cool, but I believe
> jaql would have the same problem if they tried to implement any kind
> of compact structured storage. Jaql would return a JArray or JRecord
> which might have a variety of fields and you would want to store the
> data about what kinds of fields separately.
>
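
For what it's worth, here is a rough sketch of the compression argument above. It writes the same made-up records twice, once as self-describing JSON with the field names repeated in every record and once as bare value lists, then compares raw and bz2 sizes. The field names (`user_id`, `page_url`, `duration_ms`) and the random values are purely illustrative assumptions, not anything from jaql or a real data set, and the repeated field names only stand in for per-record type information.

```python
import bz2
import json
import random


def make_records(n, seed=0):
    """Build n synthetic records; the fields are hypothetical examples."""
    rng = random.Random(seed)
    return [
        {
            "user_id": rng.randrange(10**6),
            "page_url": "/item/%d" % rng.randrange(10**4),
            "duration_ms": rng.randrange(60000),
        }
        for _ in range(n)
    ]


def sizes(records):
    """Return raw and bz2-compressed byte counts for both encodings."""
    # Self-describing: every record repeats its field names.
    with_names = "\n".join(json.dumps(r) for r in records).encode("utf-8")
    # Compact: values only, with the schema assumed to be stored elsewhere.
    values_only = "\n".join(
        json.dumps(list(r.values())) for r in records
    ).encode("utf-8")
    return {
        "raw_with_names": len(with_names),
        "raw_values_only": len(values_only),
        "bz2_with_names": len(bz2.compress(with_names)),
        "bz2_values_only": len(bz2.compress(values_only)),
    }


s = sizes(make_records(10000))
raw_overhead = s["raw_with_names"] / s["raw_values_only"]
bz2_overhead = s["bz2_with_names"] / s["bz2_values_only"]
print(s)
print("overhead of field names, raw: %.2fx, bz2: %.2fx"
      % (raw_overhead, bz2_overhead))
```

On data like this, the relative cost of repeating the field names should be much smaller after compression than before it, which is the point the IBM folks were making. The exact numbers depend on the data, so treat this as a way to measure your own records rather than as a benchmark.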
