I think there are two useful modes for schema-on-read:
* Unions allowed
* Unions not allowed
We haven't implemented union inference for converting Python sequences
yet. See e.g.:
In [1]: import pyarrow as pa
In [2]: pa.array([{'a': 'foo'}, {'a': 'bar'}])
Out[2]:
<pyarrow.lib.StructArray object at 0x7f08c92c8e58>
-- is_valid: all not null
-- child 0 type: string
[
"foo",
"bar"
]
In [3]: pa.array([{'a': 'foo'}, {'a': 1}])
---------------------------------------------------------------------------
ArrowTypeError Traceback (most recent call last)
<ipython-input-3-60e46588953f> in <module>()
----> 1 pa.array([{'a': 'foo'}, {'a': 1}])
~/code/arrow/python/pyarrow/array.pxi in pyarrow.lib.array()
173 else:
174 # ConvertPySequence does strict conversion if type is explicitly passed
--> 175 return _sequence_to_array(obj, mask, size, type, pool, from_pandas)
176
177
~/code/arrow/python/pyarrow/array.pxi in pyarrow.lib._sequence_to_array()
34
35 with nogil:
---> 36 check_status(ConvertPySequence(sequence, mask, options, &out))
37
38 if out.get().num_chunks() == 1:
~/code/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()
89 raise ArrowNotImplementedError(message)
90 elif status.IsTypeError():
---> 91 raise ArrowTypeError(message)
92 elif status.IsCapacityError():
93 raise ArrowCapacityError(message)
ArrowTypeError: ../src/arrow/python/python_to_arrow.cc:1019 code: converter->AppendMultiple(seq, size)
../src/arrow/python/iterators.h:70 code: func(value, static_cast<int64_t>(i), &keep_going)
../src/arrow/python/python_to_arrow.cc:794 code: value_converters_[i]->AppendSingleVirtual(valueobj ? valueobj : (&_Py_NoneStruct))
../src/arrow/python/python_to_arrow.cc:137 code: internal::CIntFromPython(obj, &value)
../src/arrow/python/helpers.cc:197 code: CheckPyError()
an integer is required (got type str)
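
To make the two modes concrete, here is a rough pure-Python sketch of how
inference over a sequence of scalars could behave in each one. This is not
pyarrow code: `infer_type`, the `allow_unions` flag and the type-name strings
are hypothetical names chosen for illustration only.

```python
def infer_type(values, allow_unions=False):
    """Infer an illustrative type name for a sequence of Python scalars.

    Hypothetical helper, not a pyarrow API.
    """
    # Map exact Python types to illustrative type names.
    names = {bool: "bool", int: "int64", float: "double", str: "string"}
    seen = []
    for value in values:
        if value is None:
            continue  # nulls are compatible with every type
        try:
            name = names[type(value)]
        except KeyError:
            raise TypeError(f"unsupported value: {value!r}")
        if name not in seen:
            seen.append(name)
    if not seen:
        return "null"
    if len(seen) == 1:
        return seen[0]
    if allow_unions:
        # "Unions allowed": keep every observed type, in order of appearance.
        return "union<" + ", ".join(seen) + ">"
    # "Unions not allowed": a second type is an error, as in In [3] above.
    raise TypeError("conflicting types: " + ", ".join(seen))
```

With unions allowed, `infer_type(["herp", True], allow_unions=True)` gives
`"union<string, bool>"`; with the default strict mode the same input raises
`TypeError`.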
On Fri, Nov 30, 2018 at 9:28 AM Francois Saint-Jacques wrote:
>
> Hello,
>
> With JSON and other "typed" formats (msgpack, protobuf, ...) you need to
> take unions into account, e.g.
>
> {a: "herp", b: 10}
> {a: true, c: "derp"}
>
> The type for `a` would be union<string, bool>.
>
> I think we should also evaluate investing in ingesting different
> schema DSLs (protobuf IDL, JSON Schema) to avoid inference entirely.
>
> On Fri, Nov 30, 2018 at 9:43 AM Ben Kietzman wrote:
>
> > Hi Antoine,
> >
> > The conversion of previous blocks is part of the fall back mechanism I'm
> > trying to describe. When type inference fails (even in a different block),
> > conversion of all blocks of the column is attempted to the next type in the
> > fallback graph.
> >
> > If there is no problem with the fallback graph model, the API would
> > probably look like a reusable LoosenType: something which simplifies
> > querying for the loosened type when inference fails.
> >
> > Unrelated: I forgot to include some edges in the JSON graph:
> >
> > NULL -> BOOL
> > NULL -> INT64 -> DOUBLE
> > NULL -> TIMESTAMP -> STRING -> BINARY
> > NULL -> STRUCT
> > NULL -> LIST
> >
> > On Fri, Nov 30, 2018, 04:52 Antoine Pitrou wrote:
> >
> > >
> > > Hi Ben,
> > >
> > > On 30/11/2018 at 02:19, Ben Kietzman wrote:
> > > > Currently, figuring out which types may be inferred, and under which
> > > > circumstances, involves digging through code. I think
> > > > it would be useful to have an API for expressing type inference rules.
> > > > Ideally this would be provided as utility functions alongside
> > > > StringConverter and used by anything which does type inference while
> > > > parsing/unboxing.
> > >
> > > It may be a bit more complicated. For example, a CSV file is parsed by
> > > blocks, and each block produces an array chunk. But when the type of a
> > > later block changes due to type inference failing on the current type,
> > > all previous blocks must be parsed again.
> > >
> > > So I'm curious what you would make the API look like.
> > >
> > > > By contrast, when reading JSON (which is explicit about numbers vs
> > > > strings), the graph would be:
> > > >
> > > > NULL -> BOOL
> > > > NULL -> INT64 -> DOUBLE
> > > > NULL -> TIMESTAMP -> STRING -> BINARY
> > > >
> > > > Seem reasonable?
> > > > Is there a case which isn't covered by a fallback graph as above?
> > >
> > > I have no idea. Someone else may be able to answer your question.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> >
>
>
> --
> Sent from my jetpack.
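
For reference, the JSON fallback graph quoted above and the "re-convert
earlier blocks" behaviour can be sketched in a few lines of plain Python.
This is only one possible reading of the proposal, not Arrow code:
`NEXT_TYPE`, `chain`, `loosen_type` and `infer_column` are hypothetical
names, and STRUCT, LIST and BOOL are treated as types with no looser
successor.

```python
# Non-NULL edges of the JSON fallback graph: each type maps to the next,
# looser type to try. NULL can loosen to anything, so it is special-cased.
NEXT_TYPE = {
    "INT64": "DOUBLE",
    "TIMESTAMP": "STRING",
    "STRING": "BINARY",
}


def chain(type_name):
    """Return type_name followed by every looser type reachable from it."""
    out = [type_name]
    while out[-1] in NEXT_TYPE:
        out.append(NEXT_TYPE[out[-1]])
    return out


def loosen_type(current, observed):
    """Return the tightest type that accepts both inputs, or None."""
    if current == "NULL":
        return observed
    if observed == "NULL":
        return current
    reachable = chain(observed)
    for candidate in chain(current):
        if candidate in reachable:
            return candidate
    return None  # no common type in the graph


def infer_column(block_types):
    """Fold per-block types into one column type.

    Also counts how many earlier blocks would have to be converted again
    because a later block forced a looser type.
    """
    column_type = "NULL"
    reconverted = 0
    for index, observed in enumerate(block_types):
        loosened = loosen_type(column_type, observed)
        if loosened is None:
            raise TypeError(f"cannot unify {column_type} with {observed}")
        if loosened != column_type and column_type != "NULL":
            reconverted += index  # every earlier block is converted again
        column_type = loosened
    return column_type, reconverted
```

For example, a column whose blocks infer as INT64, INT64, DOUBLE ends up as
DOUBLE, with the first two blocks converted a second time.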