Re: Multi-language serialization discussion

Jeff Hammerbacher Mon, 27 Oct 2008 12:50:19 -0700

Hey Pete,

Can you write up some documentation on DynamicSerDe for the wiki? It's
come up a few times in discussion and I think it would be of general
use for people.


Thanks,
Jeff

On Mon, Oct 27, 2008 at 12:13 PM, Pete Wyckoff <[EMAIL PROTECTED]> wrote:
>
>>   You'd still need to write IDL parsers & processors for each platform.
>
> FYI - Hadoop already has this for Java - in hive/serde/DynamicSerDe. This is
> exactly that and gives one the ability to read and write Thrift and
> non-Thrift data without compilation.
>
> -- pete
>
> On 10/27/08 12:01 PM, "Doug Cutting" <[EMAIL PROTECTED]> wrote:
>
> Ted Dunning wrote:
>> I don't think that it would be a major inconvenience in any of the major
>> scripting languages to change the meaning of "open" to mean that you must
>> read the IDL for a file, generate a reading script, load that and now be
>> ready to read.  This is a scripting language after all.
>
> That sounds like compilation, which isn't very scripty.  It's certainly
> workable, but not optimal.  We want to push this stack all the way up to
> spreadsheet-type programmers, who define new record types interactively.
> Do we really want a GUI to run the Thrift compiler each time a file is
> opened, and to load new code?
>
>> Note that you are saying that the writer should have a schema.  This seems
>> to contradict your previous statement and agree with mine.
>
> We can induce a schema.  If an application doesn't specify an output
> schema then the first instance written might implicitly define the
> schema.  Or you could be more lax and modify the schema as instances are
> written to match all instances, then append it at the end of the file.
> So in the binary format there would always be a schema.  It would be
> used for compaction and available to readers to describe the data.
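
[Illustration of the "lax" option quoted above: a minimal Python sketch, not
code from Hadoop, Hive, or Thrift. The InducingWriter class, the demo helper,
and the use of one JSON text line per record as a stand-in for the binary
format are all assumptions made for the example. The schema is widened as
each instance is written and appended once at the end of the file.]

```python
import io
import json


class InducingWriter:
    """Hypothetical writer that induces a schema from the records it sees."""

    def __init__(self, out):
        self.out = out
        self.schema = {}  # field name -> set of type names seen so far

    def write(self, record):
        # Widen the schema so it still matches every instance written.
        for field, value in record.items():
            self.schema.setdefault(field, set()).add(type(value).__name__)
        self.out.write(json.dumps(record, sort_keys=True) + "\n")

    def close(self):
        # Append the induced schema as the last line of the file.
        trailer = {field: sorted(types) for field, types in self.schema.items()}
        self.out.write(json.dumps({"schema": trailer}, sort_keys=True) + "\n")


def demo():
    buf = io.StringIO()
    writer = InducingWriter(buf)
    writer.write({"id": 1, "name": "a"})
    writer.write({"id": 2, "score": 0.5})
    writer.close()
    return buf.getvalue().splitlines()
```

[A reader of such a file would look at the last line first to learn the
fields and types, then decode the records. The stricter option, where the
first instance written defines the schema, would instead reject any later
record that introduces a new field or type.]
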
>
>>> So, how well does Thrift meet these needs?
>>
>> Very closely, actually, especially if you adjust it to allow the IDL to be
>> inside the file.
>
> Thrift has a lot of the parts, and one could probably define a Thrift
> protocol that does this.  Looking through the Thrift mail archives, it
> seems that TDenseProtocol with an IDL in the file would get you partway.
>  You'd still need to write IDL parsers & processors for each platform.
>  I'm not sure it would be any less work than to build this from
> scratch, but I guess that's up to me to prove!
>
> On one hand, it's good to have an architecture that embraces more
> different data formats.  But, in practice, it's nice to have actual data
> in fewer formats, since otherwise you end up having to support the cross
> product of formats and platforms.
>
>> We should also consider the JAQL work.
>
> Yes.  I've started to look at that more.  Their examples imply a binary
> format for JSON, but I can find no details.
>
> Doug
>
>
>
