Yes, I mean this is just the trade-off between structured and unstructured data. In my case 99% of my data sources are structured. So if I am expecting List<String> and get List<Integer>, then something is broken and I want to catch the bug before someone writes the bad data. I agree that in principle a compression algorithm should be able to give me comparable compactness with some CPU trade-off.
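As a rough sketch of the kind of check I mean (the names here are made up for illustration, not from any real code), a writer that carries the declared element type can reject a bad record at write time, before it ever reaches storage:

    import java.util.Arrays;
    import java.util.List;

    // Hypothetical writer that validates each element against the
    // declared schema type before anything is written out.
    public class TypedListWriter<T> {
        private final Class<T> elementType;

        public TypedListWriter(Class<T> elementType) {
            this.elementType = elementType;
        }

        public void write(List<?> values) {
            for (Object v : values) {
                if (v != null && !elementType.isInstance(v)) {
                    // Fail here, before the bad data is persisted.
                    throw new IllegalArgumentException(
                        "Expected " + elementType.getSimpleName()
                        + " but got " + v.getClass().getSimpleName());
                }
            }
            // ... serialize the validated values here ...
        }

        public static void main(String[] args) {
            TypedListWriter<String> writer = new TypedListWriter<>(String.class);
            writer.write(Arrays.asList("a", "b"));  // fine
            writer.write(Arrays.asList(1, 2));      // throws: bug caught before the write
        }
    }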
-Jay

---------- Forwarded message ----------
From: "Ted Dunning" <[EMAIL PROTECTED]>
To: [email protected]
Date: Wed, 3 Sep 2008 21:24:00 -0700
Subject: Re: Serialization with additional schema info

I talked to the IBM guys about this problem with JSON-like formats. Their
answer was that if you care enough, then any compression algorithm around
will compress away the type information. So if you have a splittable
compressed format (bz2 works with hadoop), you are set except for the
compression cost. Decompression cost is usually compensated for by the I/O
advantage.

On Wed, Sep 3, 2008 at 3:52 PM, Jay Kreps <[EMAIL PROTECTED]> wrote:
> ...
>
> Thanks for the pointer to jaql, that seems very cool, but I believe
> jaql would have the same problem if they tried to implement any kind
> of compact structured storage. Jaql would return a JArray or JRecord
> which might have a variety of fields and you would want to store the
> data about what kinds of fields separately.
>
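To make Ted's point concrete, here is a small sketch (using gzip from the standard JDK rather than bz2, and a made-up record layout) showing that the field names repeated in every JSON-like record compress down to a small fraction of their raw size:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.GZIPOutputStream;

    // Rough illustration: the per-record field names are highly repetitive,
    // so a general-purpose compressor removes most of the overhead that a
    // schema-based format would avoid up front.
    public class CompressSchemaOverhead {
        public static void main(String[] args) throws IOException {
            StringBuilder records = new StringBuilder();
            for (int i = 0; i < 10000; i++) {
                records.append("{\"member_id\": ").append(i)
                       .append(", \"page\": \"/home\", \"count\": 1}\n");
            }
            byte[] raw = records.toString().getBytes(StandardCharsets.UTF_8);

            ByteArrayOutputStream compressed = new ByteArrayOutputStream();
            try (GZIPOutputStream gzip = new GZIPOutputStream(compressed)) {
                gzip.write(raw);
            }

            System.out.println("raw bytes:        " + raw.length);
            System.out.println("compressed bytes: " + compressed.size());
        }
    }

The remaining trade-off is the CPU cost of compressing and decompressing, which is the point about decompression usually being paid back by the I/O savings.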
