I think there are two useful modes for schema-on-read:

* Unions allowed
* Unions not allowed
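To make the two modes concrete, here is a minimal pure-Python sketch using the two JSON records from Francois's message quoted below. It is not pyarrow code: `infer_field_types` and its `allow_unions` flag are hypothetical names invented for illustration, the type names are Python type names rather than Arrow types, and a sorted tuple of names stands in for a union type.

```python
def infer_field_types(records, allow_unions):
    """Infer one type per field name across a sequence of dict records.

    Hypothetical helper for illustration only. A field whose values
    disagree maps to a sorted tuple of type names (a stand-in for a
    union) when allow_unions is True, and raises TypeError otherwise.
    """
    seen = {}
    for record in records:
        for name, value in record.items():
            seen.setdefault(name, set()).add(type(value).__name__)

    schema = {}
    for name, type_names in seen.items():
        if len(type_names) == 1:
            schema[name] = next(iter(type_names))
        elif allow_unions:
            schema[name] = tuple(sorted(type_names))
        else:
            raise TypeError(
                f"field {name!r} has conflicting types: {sorted(type_names)}"
            )
    return schema


records = [{"a": "herp", "b": 10}, {"a": True, "c": "derp"}]

# "Unions allowed": field `a` becomes a union of bool and str.
union_schema = infer_field_types(records, allow_unions=True)
```

With `allow_unions=False` the same call raises `TypeError` for field `a`, which mirrors the strict behaviour shown in the pyarrow session below.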
We haven't implemented union inference for converting Python sequences
yet. See e.g.

In [1]: import pyarrow as pa

In [2]: pa.array([{'a': 'foo'}, {'a': 'bar'}])
Out[2]:
<pyarrow.lib.StructArray object at 0x7f08c92c8e58>
-- is_valid: all not null
-- child 0 type: string
  [
    "foo",
    "bar"
  ]

In [3]: pa.array([{'a': 'foo'}, {'a': 1}])
---------------------------------------------------------------------------
ArrowTypeError                            Traceback (most recent call last)
<ipython-input-3-60e46588953f> in <module>()
----> 1 pa.array([{'a': 'foo'}, {'a': 1}])

~/code/arrow/python/pyarrow/array.pxi in pyarrow.lib.array()
    173     else:
    174         # ConvertPySequence does strict conversion if type is explicitly passed
--> 175         return _sequence_to_array(obj, mask, size, type, pool, from_pandas)
    176
    177

~/code/arrow/python/pyarrow/array.pxi in pyarrow.lib._sequence_to_array()
     34
     35     with nogil:
---> 36         check_status(ConvertPySequence(sequence, mask, options, &out))
     37
     38     if out.get().num_chunks() == 1:

~/code/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()
     89             raise ArrowNotImplementedError(message)
     90         elif status.IsTypeError():
---> 91             raise ArrowTypeError(message)
     92         elif status.IsCapacityError():
     93             raise ArrowCapacityError(message)

ArrowTypeError: ../src/arrow/python/python_to_arrow.cc:1019 code: converter->AppendMultiple(seq, size)
../src/arrow/python/iterators.h:70 code: func(value, static_cast<int64_t>(i), &keep_going)
../src/arrow/python/python_to_arrow.cc:794 code: value_converters_[i]->AppendSingleVirtual(valueobj ? valueobj : (&_Py_NoneStruct))
../src/arrow/python/python_to_arrow.cc:137 code: internal::CIntFromPython(obj, &value)
../src/arrow/python/helpers.cc:197 code: CheckPyError()
an integer is required (got type str)

On Fri, Nov 30, 2018 at 9:28 AM Francois Saint-Jacques
<fsaintjacq...@networkdump.com> wrote:
>
> Hello,
>
> With JSON and other "typed" formats (msgpack, protobuf, ...) you need to
> take unions into account, e.g.
>
>   {a: "herp", b: 10}
>   {a: true, c: "derp"}
>
> The type for `a` would be union<string, bool>.
>
> I think we should also evaluate investing in ingesting different
> schema DSLs (protobuf IDL, json-schema) to avoid inference entirely.
>
> On Fri, Nov 30, 2018 at 9:43 AM Ben Kietzman <ben.kietz...@rstudio.com>
> wrote:
>
> > Hi Antoine,
> >
> > The conversion of previous blocks is part of the fallback mechanism I'm
> > trying to describe. When type inference fails (even in a different
> > block), conversion of all blocks of the column is attempted to the next
> > type in the fallback graph.
> >
> > If there is no problem with the fallback graph model, the API would
> > probably look like a reusable LoosenType, something which simplifies
> > querying for the loosened type when inference fails.
> >
> > Unrelated: I forgot to include some edges in the JSON graph
> >
> >   NULL -> BOOL
> >   NULL -> INT64 -> DOUBLE
> >   NULL -> TIMESTAMP -> STRING -> BINARY
> >   NULL -> STRUCT
> >   NULL -> LIST
> >
> > On Fri, Nov 30, 2018, 04:52 Antoine Pitrou <anto...@python.org> wrote:
> >
> > >
> > > Hi Ben,
> > >
> > > On 30/11/2018 at 02:19, Ben Kietzman wrote:
> > > > Currently, figuring out which types may be inferred, and under
> > > > which circumstances they will be inferred, involves digging through
> > > > code. I think it would be useful to have an API for expressing type
> > > > inference rules. Ideally this would be provided as utility functions
> > > > alongside StringConverter and used by anything which does type
> > > > inference while parsing/unboxing.
> > >
> > > It may be a bit more complicated. For example, a CSV file is parsed by
> > > blocks, and each block produces an array chunk. But when the type of a
> > > later block changes due to type inference failing on the current type,
> > > all previous blocks must be parsed again.
> > >
> > > So I'm curious what you would make the API look like.
> > >
> > > > By contrast, when reading JSON (which is explicit about numbers vs
> > > > strings), the graph would be:
> > > >
> > > >   NULL -> BOOL
> > > >   NULL -> INT64 -> DOUBLE
> > > >   NULL -> TIMESTAMP -> STRING -> BINARY
> > > >
> > > > Seem reasonable?
> > > > Is there a case which isn't covered by a fallback graph as above?
> > >
> > > I have no idea. Someone else may be able to answer your question.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> >
>
> --
> Sent from my jetpack.
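
For reference, the JSON fallback graph from Ben's message (including the STRUCT and LIST edges he added later) can be written down as a small Python sketch. This is only one possible reading of the proposal, not an existing Arrow API: `FALLBACK_EDGES`, `reachable` and `loosen_type` are hypothetical names, and the rule used here (pick the nearest type reachable from both the column's current type and the newly observed type) is an assumption about what "the next type in the fallback graph" should mean when NULL has several successors.

```python
# Edges of the JSON fallback graph as listed in the thread.
FALLBACK_EDGES = {
    "NULL": ["BOOL", "INT64", "TIMESTAMP", "STRUCT", "LIST"],
    "INT64": ["DOUBLE"],
    "TIMESTAMP": ["STRING"],
    "STRING": ["BINARY"],
}


def reachable(start):
    """Return the types reachable from start (start first), breadth-first."""
    order = [start]
    queue = [start]
    while queue:
        node = queue.pop(0)
        for successor in FALLBACK_EDGES.get(node, []):
            if successor not in order:
                order.append(successor)
                queue.append(successor)
    return order


def loosen_type(current, observed):
    """Return the nearest type that both current and observed can fall
    back to, or None when the graph offers no common fallback."""
    observed_reach = set(reachable(observed))
    for candidate in reachable(current):
        if candidate in observed_reach:
            return candidate
    return None


# A column inferred as INT64 so far meets a DOUBLE value in a later block.
loosened = loosen_type("INT64", "DOUBLE")
```

Under this reading, a `None` result (for example BOOL meeting INT64) is exactly the case where a union type, or an error in the "unions not allowed" mode, would be needed. A changed result means every earlier block of the column has to be converted again to the loosened type, as Antoine points out for CSV.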