With the schema in hand, you don't need to tag data with field numbers or types, since that information is all in the schema. So, having the schema, you can use a simpler data format.
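To make that concrete, here's a toy sketch (my own illustration, not any real wire format) of the same record encoded with per-field tags versus schema-driven, tag-free encoding:

```python
import struct

# Record to encode: {id: 42 (int32), name: "ada" (string)}

def encode_tagged(rec):
    # Each field carries a (field-number, type) tag pair on the wire,
    # so a reader can decode without the schema.
    name = rec["name"].encode()
    out = bytes([1, 0x08]) + struct.pack(">i", rec["id"])          # field 1, type=int32
    out += bytes([2, 0x0B]) + struct.pack(">i", len(name)) + name  # field 2, type=string
    return out

def encode_schema_driven(rec):
    # The reader already has the schema, so fields appear in schema
    # order with no tags at all.
    name = rec["name"].encode()
    return struct.pack(">i", rec["id"]) + struct.pack(">i", len(name)) + name

rec = {"id": 42, "name": "ada"}
tagged = encode_tagged(rec)
bare = encode_schema_driven(rec)
print(len(tagged), len(bare))  # prints "15 11": the tags cost 4 bytes here
```

The saving grows with the number of fields, which is the whole argument for schema-driven formats.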

To a degree, we already have that in Thrift - we call it the DenseProtocol.

Would you write parsers for Thrift's IDL in every language? Or would you use JSON, as Avro does, to avoid that?

When it comes to having a code-usable IDL for the schema, I'm totally pro-JSON.
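For concreteness, Avro expresses its schemas as plain JSON, so any language with a JSON parser can consume them without a custom IDL parser. A record schema looks roughly like this (field names here are just an example):

```json
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id",   "type": "int"},
    {"name": "name", "type": "string"}
  ]
}
```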

Once you're using a different IDL and a different data format, what's shared with Thrift? Fundamentally, those two things define a serialization system, no?

It's not actually a different data format, is it? You're saying that the user wouldn't specify the field IDs, but you'd fundamentally still use field IDs for compactness and the like. You may not use actual Thrift-generated objects, but you could certainly use the Binary or Compact protocol from Thrift to do all the writing to the wire.
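As a sketch of why field IDs buy compactness: Thrift's compact protocol packs the delta from the previous field ID together with a type nibble into one byte, and encodes integers as zigzag varints. The code below is my simplified illustration of that technique, not the exact wire format:

```python
def zigzag(n):
    # Map signed ints to unsigned so small magnitudes stay small:
    # 0, -1, 1, -2 -> 0, 1, 2, 3
    return (n << 1) ^ (n >> 63)

def varint(n):
    # Little-endian base-128 varint, 7 bits per byte.
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if not n:
            out.append(b)
            return bytes(out)
        out.append(b | 0x80)

def field_header(prev_id, field_id, type_id):
    # Pack the field-ID delta (1..15) with the type into a single byte;
    # larger jumps would fall back to an explicit ID (omitted here).
    delta = field_id - prev_id
    assert 1 <= delta <= 15
    return bytes([(delta << 4) | type_id])

# Two i32 fields (type nibble 5): field 1 = -3, field 2 = 300.
buf = field_header(0, 1, 5) + varint(zigzag(-3))
buf += field_header(1, 2, 5) + varint(zigzag(300))
print(buf.hex())  # prints "150515d804" -- five bytes for both fields
```

Small IDs plus small deltas mean most field headers fit in one byte, whether the IDs come from the user or are assigned automatically from the schema.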

You might also be able to use (or contribute to) Thrift's RPC-level stuff like server implementations. We have some respectable Java servers written, and if those aren't enough for your uses, I'd actually be really interested in seeing if we could generalize some of the Hadoop stuff to be useful within Thrift.

The bottom line is that I would love to see greater cooperation between Hadoop and Thrift. Unless it's impossible or impractical for Thrift to be useful here, I think we'd be willing to work towards Hadoop's needs.

-Bryan
