Greetings, Avro devs.

We've been using Avro for a short while now and have run into an issue with
validation.  We have a number of schemas that are quite large, and while
working on getting data into the right shape for them, we've found the
error messages produced for these schemas pretty unhelpful.

In version 1.9.2, a validation failure produces an error that shows the
full structure of the expected schema as well as the entire datum provided
at the top level of validation.  For large schemas, this is of little
value, since the actual problem is likely to be a single field somewhere in
that pile of data.

To solve this problem locally, we've created an alternate form of
validation that traverses the schema and datum iteratively and validates
each node in turn.  If any node fails validation, the error raised contains
just that node (datum and schema), which makes the problem much easier to
see.
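
A minimal sketch of the node-by-node idea follows.  It is not the code from
the branch and does not call the avro package at all: the schema is a plain
dict in Avro's JSON form, only a handful of types are handled, and the names
NodeValidationError, PRIMITIVE_CHECKS and validate_iteratively are
hypothetical, chosen for illustration only.

```python
class NodeValidationError(Exception):
    """Hypothetical error type carrying only the failing node."""

    def __init__(self, path, schema, datum):
        self.path = path
        self.schema = schema
        self.datum = datum
        super().__init__(
            f"datum {datum!r} at {path} does not match schema {schema!r}"
        )


# Assumption: a small subset of Avro primitive types, for illustration.
PRIMITIVE_CHECKS = {
    "null": lambda d: d is None,
    "boolean": lambda d: isinstance(d, bool),
    "int": lambda d: isinstance(d, int) and not isinstance(d, bool),
    "string": lambda d: isinstance(d, str),
}


def validate_iteratively(schema, datum):
    """Walk schema and datum together using an explicit stack.

    Raises NodeValidationError for the first node found that fails.
    """
    stack = [("$", schema, datum)]
    while stack:
        path, node_schema, node_datum = stack.pop()

        if isinstance(node_schema, str):
            check = PRIMITIVE_CHECKS.get(node_schema)
            if check is None:
                raise ValueError(f"unsupported type in sketch: {node_schema}")
            if not check(node_datum):
                raise NodeValidationError(path, node_schema, node_datum)

        elif node_schema.get("type") == "record":
            if not isinstance(node_datum, dict):
                raise NodeValidationError(path, node_schema, node_datum)
            # Queue each field as its own node instead of recursing.
            for field in node_schema["fields"]:
                child_path = f"{path}.{field['name']}"
                child_datum = node_datum.get(field["name"])
                stack.append((child_path, field["type"], child_datum))

        elif node_schema.get("type") == "array":
            if not isinstance(node_datum, list):
                raise NodeValidationError(path, node_schema, node_datum)
            for index, item in enumerate(node_datum):
                stack.append((f"{path}[{index}]", node_schema["items"], item))

        else:
            raise ValueError(f"unsupported schema in sketch: {node_schema!r}")


user_schema = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {
            "name": "address",
            "type": {
                "type": "record",
                "name": "Address",
                "fields": [{"name": "zip", "type": "int"}],
            },
        },
    ],
}

bad_user = {"name": "Ada", "address": {"zip": "94107"}}

try:
    validate_iteratively(user_schema, bad_user)
except NodeValidationError as err:
    # The error names only the failing field, not the whole record.
    print(err.path, err.schema, repr(err.datum))
```

The point of the sketch is the shape of the error: it carries the path,
the schema fragment and the datum for the one node that failed, rather than
the whole top-level schema and record.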

I have noticed that 1.10 addresses this to some extent by adding the
module constants _DEBUG_VALIDATE and _DEBUG_VALIDATE_INDENT.  But it seems
pretty clear that these are intended primarily for development.  They
don't really help at runtime.

There's another potential advantage to our approach.  Because it is
iterative rather than recursive, it should place less load on the call
stack, especially when validating schemas with many nested levels.
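
As a rough illustration of the call-stack point (not a measurement of Avro
itself), the sketch below walks the same deeply nested list two ways.  The
function names are hypothetical, and the nesting depth of 10,000 is chosen
only to exceed CPython's default recursion limit.

```python
def max_depth_recursive(node):
    """Depth of nested lists, one interpreter frame per nesting level."""
    if not isinstance(node, list):
        return 0
    deepest = 0
    for child in node:
        deepest = max(deepest, max_depth_recursive(child))
    return 1 + deepest


def max_depth_iterative(node):
    """Same result, but pending nodes live in an ordinary list."""
    deepest = 0
    stack = [(node, 1)]
    while stack:
        current, depth = stack.pop()
        if isinstance(current, list):
            deepest = max(deepest, depth)
            stack.extend((child, depth + 1) for child in current)
    return deepest


# Build a list nested far deeper than the default recursion limit.
nested = []
for _ in range(10_000):
    nested = [nested]

print(max_depth_iterative(nested))

try:
    max_depth_recursive(nested)
except RecursionError:
    print("recursive walk exceeded the interpreter's recursion limit")
```

The explicit stack still uses memory proportional to the pending nodes, so
this is a statement about recursion depth rather than total memory use.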

I wanted to offer this new approach as a potential improvement and open a
discussion of our code.  I've got a working branch and am happy to open a
PR against the Apache GitHub master if anyone is interested.

Thanks very much for reading this far.  I hope you might be interested.

Yours,

Cris Ewing
Coffee Meets Bagel Engineering
