On 20/01/2020 at 17:17, Jacques Nadeau wrote:
>>
>> To be clear, I agree that we need to check that our various validation
>> and integration suites pass properly.  But once that is done and
>> assuming all the metadata variations are properly tested, data
>> variations should not pose any problem.
> 
> Unless I'm misunderstanding your proposal, that doesn't deal with the data
> that has already been produced that may have been written in a way that
> this change finds non-consumable but works today.

Yes, so the question is whether such data can be valid Arrow data at
all, or whether it is semantically invalid even though it has a valid
encoding.  For example, what does a Field mean with a missing datatype?
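
To make that ambiguity concrete, here is a minimal sketch of what the
read side sees today, assuming the C++ accessors that flatc generates
from Schema.fbs (namespace and accessor names approximate):

#include "Schema_generated.h"

namespace fb = org::apache::arrow::flatbuf;

// Nothing in the encoding forbids a Field whose type union is absent,
// so every implementation has to hand-write presence checks like this.
bool FieldHasType(const fb::Field* field) {
  // For an optional union, the generated accessor returns nullptr when
  // the writer omitted the field; the reader must invent a meaning.
  return field->type() != nullptr;
}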

It can probably help to review which fields would be made required; a
sketch after the lists below shows what that would change on the read
side.  For example, here are the potential candidates in Schema.fbs:
- Union.typeIds
- KeyValue.key
- KeyValue.value
- Field.type
- Schema.fields

In Message.fbs:
- RecordBatch.nodes
- RecordBatch.buffers
- DictionaryBatch.data
- Message.header

In File.fbs:
- Footer.recordBatches

In Tensor.fbs:
- Tensor.type
- Tensor.shape
- Tensor.strides
- Tensor.data

In SparseTensor.fbs:
- SparseTensorIndexCOO.indicesStrides
- SparseTensorIndexCOO.indicesBuffer
- SparseMatrixIndexCSX.indptrBuffer
- SparseMatrixIndexCSX.indicesBuffer
- SparseTensor.type
- SparseTensor.shape
- SparseTensor.sparseIndex
- SparseTensor.data
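
For comparison, here is a hedged sketch of what marking such fields
(required) in the .fbs files would buy, assuming flatc's generated C++
verifier (the VerifySchemaBuffer name follows from "root_type Schema;"
but is an assumption here):

#include "flatbuffers/flatbuffers.h"
#include "Schema_generated.h"

bool ValidateSchemaPayload(const uint8_t* data, size_t size) {
  flatbuffers::Verifier verifier(data, size);
  // With required fields, the generated verifier rejects any buffer
  // that omits them, so everything downstream of this one call can
  // dereference Field::type() etc. without hand-written NULL checks.
  return org::apache::arrow::flatbuf::VerifySchemaBuffer(verifier);
}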

>> Of course, we can hand-write all the NULL checks on the read side.  My
>> concern is not the one-time cost of doing so, but the long-term
>> fragility of such a strategy
> 
> I agree with you in principle about using tools rather than humans to
> minimize mistakes. On the flipside, we chose to use optional for the same
> reason that flatbuf defaults to optional, protobuf2 recommended use of
> optional over required and protobuf3 removed the ability to express things
> as required [1].

Right, flatbuffers recommends optional fields for the ease of evolving
the format.  One does not have to agree with that opinion.  Evolving the
encoding flexibly is one thing; keeping all implementations compatible
with each other (and with past encoded data) is another.  Having
multiple optional fields, some of which become deprecated along the way,
creates a combinatorial explosion: with N independently optional fields
there are up to 2^N presence combinations that readers may have to
handle.  That is probably less of a concern for an in-house protocol
than for an open standard like Arrow, where multiple third-party
implementations may be around at some point.

Regards

Antoine.
