Yes, I mean this is just the trade-off between structured and unstructured data. In my case 99% of my data sources are structured. So if I am expecting List<String> and get List<Integer>, then something is broken and I want to catch the bug before someone writes the bad data. I agree that in principle a compression algorithm should be able to give me comparable compactness with some CPU trade-off.
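As a rough sketch of the kind of check I mean (the names here are made up for illustration, not from any real code), a writer that carries the declared element type can reject a bad record at write time, before it ever reaches storage:

    import java.util.Arrays;
    import java.util.List;

    // Hypothetical writer that validates each element against the
    // declared schema type before anything is written out.
    public class TypedListWriter<T> {
        private final Class<T> elementType;

        public TypedListWriter(Class<T> elementType) {
            this.elementType = elementType;
        }

        public void write(List<?> values) {
            for (Object v : values) {
                if (v != null && !elementType.isInstance(v)) {
                    // Fail here, before the bad data is persisted.
                    throw new IllegalArgumentException(
                        "Expected " + elementType.getSimpleName()
                        + " but got " + v.getClass().getSimpleName());
                }
            }
            // ... serialize the validated values here ...
        }

        public static void main(String[] args) {
            TypedListWriter<String> writer = new TypedListWriter<>(String.class);
            writer.write(Arrays.asList("a", "b"));  // fine
            writer.write(Arrays.asList(1, 2));      // throws: bug caught before the write
        }
    }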
-Jay

---------- Forwarded message ----------
From: "Ted Dunning" <[EMAIL PROTECTED]>
To: [email protected]
Date: Wed, 3 Sep 2008 21:24:00 -0700
Subject: Re: Serialization with additional schema info

I talked to the IBM guys about this problem with JSON-like formats. Their
answer was that if you care enough, then any compression algorithm around
will compress away the type information. So if you have a splittable
compressed format (bz2 works with hadoop), you are set except for the
compression cost. Decompression cost is usually compensated for by the I/O
advantage.

On Wed, Sep 3, 2008 at 3:52 PM, Jay Kreps <[EMAIL PROTECTED]> wrote:
> ...
>
> Thanks for the pointer to jaql, that seems very cool, but I believe
> jaql would have the same problem if they tried to implement any kind
> of compact structured storage. Jaql would return a JArray or JRecord
> which might have a variety of fields and you would want to store the
> data about what kinds of fields separately.
>
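To make Ted's point concrete, here is a small sketch (using gzip from the standard JDK rather than bz2, and a made-up record layout) showing that the field names repeated in every JSON-like record compress down to a small fraction of their raw size:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.GZIPOutputStream;

    // Rough illustration: the per-record field names are highly repetitive,
    // so a general-purpose compressor removes most of the overhead that a
    // schema-based format would avoid up front.
    public class CompressSchemaOverhead {
        public static void main(String[] args) throws IOException {
            StringBuilder records = new StringBuilder();
            for (int i = 0; i < 10000; i++) {
                records.append("{\"member_id\": ").append(i)
                       .append(", \"page\": \"/home\", \"count\": 1}\n");
            }
            byte[] raw = records.toString().getBytes(StandardCharsets.UTF_8);

            ByteArrayOutputStream compressed = new ByteArrayOutputStream();
            try (GZIPOutputStream gzip = new GZIPOutputStream(compressed)) {
                gzip.write(raw);
            }

            System.out.println("raw bytes:        " + raw.length);
            System.out.println("compressed bytes: " + compressed.size());
        }
    }

The remaining trade-off is the CPU cost of compressing and decompressing, which is the point about decompression usually being paid back by the I/O savings.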
