For some array builders, ArrayBuilder::type() will be different from the
type of array produced by ArrayBuilder::Finish(). These are:
- AdaptiveIntBuilder will progress through {int8, int16, int32, int64}
whenever a value is inserted which cannot be stored using the current
integer type.
- DictionaryBuilder will similarly increase the width of its indices if its
memo table grows too large.
- {Dense,Sparse}UnionBuilder may append a new child builder.
- Any nested builder whose child builders include a builder with mutable
type
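
To make the AdaptiveIntBuilder case concrete, here is a minimal sketch of a builder whose reported type widens as values are appended. It is a toy model, not the Arrow class: ToyAdaptiveIntBuilder, IntWidth and WidthFor are hypothetical names, and starting at int8 is an assumption of the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical stand-in for the {int8, int16, int32, int64} progression.
enum class IntWidth { kInt8, kInt16, kInt32, kInt64 };

// Smallest width able to represent `v`.
inline IntWidth WidthFor(int64_t v) {
  if (v >= std::numeric_limits<int8_t>::min() &&
      v <= std::numeric_limits<int8_t>::max()) return IntWidth::kInt8;
  if (v >= std::numeric_limits<int16_t>::min() &&
      v <= std::numeric_limits<int16_t>::max()) return IntWidth::kInt16;
  if (v >= std::numeric_limits<int32_t>::min() &&
      v <= std::numeric_limits<int32_t>::max()) return IntWidth::kInt32;
  return IntWidth::kInt64;
}

// Toy builder (NOT arrow::AdaptiveIntBuilder): the type it would finish to
// depends on the values appended so far.
class ToyAdaptiveIntBuilder {
 public:
  // Width the builder would finish to at this moment.
  IntWidth type() const { return width_; }

  void Append(int64_t v) {
    values_.push_back(v);
    const IntWidth needed = WidthFor(v);
    if (needed > width_) width_ = needed;  // widen only, never narrow
  }

  std::size_t length() const { return values_.size(); }

 private:
  IntWidth width_ = IntWidth::kInt8;  // assumed starting width
  std::vector<int64_t> values_;
};

// Usage: a type() value read early is stale once a wider value arrives.
inline void DemoWidening() {
  ToyAdaptiveIntBuilder builder;
  const IntWidth before = builder.type();
  builder.Append(1000);  // does not fit in int8
  assert(before == IntWidth::kInt8);
  assert(builder.type() == IntWidth::kInt16);
}
```

A caller who reads type() before the last Append sees a value that no longer matches what the builder would finish to, which is the inaccuracy described above.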

IMHO if ArrayBuilder::type() is sporadically inaccurate then it's a
user-hostile API and needs to be fixed.

The current solution is for mutable type to be marked by
ArrayBuilder::type() == null. This results in significant loss of metadata
from nested types; for example StructBuilder::FinishInternal currently sets
all field names to "" if constructed with null type. Null type is
inconsistently applied; a builder of list(dictionary()) will currently
finish to an invalid array if the dictionary builder widens its indices
before finishing.

Options:
- Implement array builders such that ArrayBuilder::type() is always the
type to which the builder would Finish. There is a PR for this,
https://github.com/apache/arrow/pull/4930, but it introduces performance
regressions for the dictionary builders: 5% if the values are integer, 1.8%
if they are strings.
- Provide ArrayBuilder::UpdateType(); type() is not guaranteed to be
accurate unless immediately preceded by UpdateType().
- Remove ArrayBuilder::type() in favor of ArrayBuilder::type_id(), which
will be an immutable property of ArrayBuilders.
- Make ArrayBuilder::type() virtual. This will be much more expensive for
nested builders and for applications which need to branch on
ArrayBuilder::type()->id(), so ArrayBuilder::type_id() should be provided as
well.
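
The last two options can be compared with a small sketch: an immutable type_id() fixed at construction, next to a virtual type() that is computed on demand. All names here (ToyBuilder, ToyIntBuilder, ToyListBuilder, TypeId, Widen) are hypothetical, and types are modelled as strings purely for illustration; this is not the Arrow API.

```cpp
#include <memory>
#include <string>
#include <utility>

// Hypothetical type ids; fixed for the lifetime of a builder.
enum class TypeId { kInt, kList };

class ToyBuilder {
 public:
  explicit ToyBuilder(TypeId id) : id_(id) {}
  virtual ~ToyBuilder() = default;

  // Immutable and non-virtual: cheap to branch on.
  TypeId type_id() const { return id_; }

  // Virtual: recomputed on each call, so it reflects the current state.
  virtual std::string type() const = 0;

 private:
  const TypeId id_;
};

// Toy integer builder whose width can grow, as with adaptive builders.
class ToyIntBuilder : public ToyBuilder {
 public:
  ToyIntBuilder() : ToyBuilder(TypeId::kInt) {}

  void Widen() {
    if (bits_ < 64) bits_ *= 2;
  }

  std::string type() const override { return "int" + std::to_string(bits_); }

 private:
  int bits_ = 8;
};

// Toy nested builder: its type depends on the child's current type.
class ToyListBuilder : public ToyBuilder {
 public:
  explicit ToyListBuilder(std::shared_ptr<ToyBuilder> child)
      : ToyBuilder(TypeId::kList), child_(std::move(child)) {}

  // Walks the child on every call; this recursion is where the extra cost
  // for nested builders comes from.
  std::string type() const override { return "list<" + child_->type() + ">"; }

 private:
  std::shared_ptr<ToyBuilder> child_;
};
```

In this sketch the list builder's type() follows a child that widens after the list builder was constructed, while type_id() stays constant and never requires walking the children.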
