I think there are two useful modes for schema-on-read:

* Unions allowed
* Unions not allowed
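
To make the two modes concrete, here is a minimal pure-Python sketch. It
is not pyarrow code: the function names `type_name` and
`infer_field_type`, the `allow_unions` flag, and the type labels are all
hypothetical and only illustrate the difference in behavior.

```python
def type_name(value):
    """Map a Python value to a coarse type label (labels are illustrative)."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "bool"
    if isinstance(value, int):
        return "int64"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    raise TypeError(f"unsupported value: {value!r}")


def infer_field_type(values, allow_unions):
    """Infer a single type label for one field from a sequence of values."""
    seen = []  # distinct non-null labels, in order of first appearance
    for value in values:
        name = type_name(value)
        if name != "null" and name not in seen:
            seen.append(name)
    if not seen:
        return "null"
    if len(seen) == 1:
        return seen[0]
    if allow_unions:
        # "Unions allowed" mode: keep every observed type.
        return "union<" + ", ".join(seen) + ">"
    # "Unions not allowed" mode: conflicting types are an error.
    raise TypeError(f"conflicting types without unions: {seen}")
```

With unions allowed, the values "herp" and True for one field give
union<string, bool>; with unions not allowed, the same input raises a
TypeError, which is roughly what pa.array does today.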

We haven't implemented union inference for converting Python sequences
yet. See, e.g.:

In [1]: import pyarrow as pa

In [2]: pa.array([{'a': 'foo'}, {'a': 'bar'}])
Out[2]:
<pyarrow.lib.StructArray object at 0x7f08c92c8e58>
-- is_valid: all not null
-- child 0 type: string
  [
    "foo",
    "bar"
  ]

In [3]: pa.array([{'a': 'foo'}, {'a': 1}])
---------------------------------------------------------------------------
ArrowTypeError                            Traceback (most recent call last)
<ipython-input-3-60e46588953f> in <module>()
----> 1 pa.array([{'a': 'foo'}, {'a': 1}])

~/code/arrow/python/pyarrow/array.pxi in pyarrow.lib.array()
    173     else:
    174         # ConvertPySequence does strict conversion if type is explicitly passed
--> 175         return _sequence_to_array(obj, mask, size, type, pool, from_pandas)
    176
    177

~/code/arrow/python/pyarrow/array.pxi in pyarrow.lib._sequence_to_array()
     34
     35     with nogil:
---> 36         check_status(ConvertPySequence(sequence, mask, options, &out))
     37
     38     if out.get().num_chunks() == 1:

~/code/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()
     89             raise ArrowNotImplementedError(message)
     90         elif status.IsTypeError():
---> 91             raise ArrowTypeError(message)
     92         elif status.IsCapacityError():
     93             raise ArrowCapacityError(message)

ArrowTypeError: ../src/arrow/python/python_to_arrow.cc:1019 code: converter->AppendMultiple(seq, size)
../src/arrow/python/iterators.h:70 code: func(value, static_cast<int64_t>(i), &keep_going)
../src/arrow/python/python_to_arrow.cc:794 code: value_converters_[i]->AppendSingleVirtual(valueobj ? valueobj : (&_Py_NoneStruct))
../src/arrow/python/python_to_arrow.cc:137 code: internal::CIntFromPython(obj, &value)
../src/arrow/python/helpers.cc:197 code: CheckPyError()
an integer is required (got type str)
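
For the "unions not allowed" mode, the JSON fallback graph from the
quoted messages below could be queried along these lines. This is only
a sketch in plain Python: `JSON_FALLBACK_EDGES`, `reachable`, and
`loosen_type` are hypothetical names (the last one borrowed from the
"LoosenType" idea in the thread), not an existing Arrow API, and the
sketch only looks up the loosened type; it does not re-convert earlier
blocks.

```python
from collections import deque

# Edges of the JSON fallback graph as listed in the quoted thread.
JSON_FALLBACK_EDGES = {
    "NULL": ["BOOL", "INT64", "TIMESTAMP", "STRUCT", "LIST"],
    "INT64": ["DOUBLE"],
    "TIMESTAMP": ["STRING"],
    "STRING": ["BINARY"],
}


def reachable(start):
    """Return the set of types reachable from start, including start."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in JSON_FALLBACK_EDGES.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def loosen_type(current, observed):
    """Return the nearest type both inputs can fall back to, or None."""
    targets = reachable(observed)
    seen = {current}
    queue = deque([current])
    # Breadth-first search from current, so the least-loosened type wins.
    while queue:
        node = queue.popleft()
        if node in targets:
            return node
        for nxt in JSON_FALLBACK_EDGES.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return None  # no common fallback; this is where a union would be needed
```

For example, loosen_type("INT64", "DOUBLE") gives "DOUBLE", while
loosen_type("BOOL", "INT64") gives None, which is the case where the
"unions allowed" mode would produce a union instead of an error.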

On Fri, Nov 30, 2018 at 9:28 AM Francois Saint-Jacques
<fsaintjacq...@networkdump.com> wrote:
>
> Hello,
>
> With JSON and other "typed" formats (msgpack, protobuf, ...) you need to
> take unions into account, e.g.
>
> {a: "herp", b: 10}
> {a: true, c: "derp"}
>
> The type for `a` would be union<string, bool>.
>
> I think we should also evaluate investing in ingesting different
> schema DSLs (protobuf IDL, JSON Schema) to avoid inference entirely.
>
> On Fri, Nov 30, 2018 at 9:43 AM Ben Kietzman <ben.kietz...@rstudio.com>
> wrote:
>
> > Hi Antoine,
> >
> > The conversion of previous blocks is part of the fallback mechanism I'm
> > trying to describe. When type inference fails (even in a different block),
> > conversion of all blocks of the column to the next type in the fallback
> > graph is attempted.
> >
> > If there is no problem with the fallback graph model, the API would
> > probably look like a reusable LoosenType, something which simplifies
> > querying for the loosened type when inference fails.
> >
> > Unrelated: I forgot to include some edges in the JSON graph:
> >
> > NULL -> BOOL
> > NULL -> INT64 -> DOUBLE
> > NULL -> TIMESTAMP -> STRING -> BINARY
> > NULL -> STRUCT
> > NULL -> LIST
> >
> > On Fri, Nov 30, 2018, 04:52 Antoine Pitrou <anto...@python.org> wrote:
> >
> > >
> > > Hi Ben,
> > >
> > > On 30/11/2018 at 02:19, Ben Kietzman wrote:
> > > > Currently, figuring out which types may be inferred, and under which
> > > > circumstances they will be inferred, involves digging through code. I
> > > > think
> > > > it would be useful to have an API for expressing type inference rules.
> > > > Ideally this would be provided as utility functions alongside
> > > > StringConverter and used by anything which does type inference while
> > > > parsing/unboxing.
> > >
> > > It may be a bit more complicated.  For example, a CSV file is parsed by
> > > blocks, and each block produces an array chunk.  But when the type of a
> > > later block changes due to type inference failing on the current type,
> > > all previous blocks must be parsed again.
> > >
> > > So I'm curious what you would make the API look like.
> > >
> > > > By contrast, when reading JSON (which is explicit about numbers vs
> > > > strings), the graph would be:
> > > >
> > > >   NULL -> BOOL
> > > >   NULL -> INT64 -> DOUBLE
> > > >   NULL -> TIMESTAMP -> STRING -> BINARY
> > > >
> > > > Seem reasonable?
> > > > Is there a case which isn't covered by a fallback graph as above?
> > >
> > > I have no idea.  Someone else may be able to answer your question.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> >
>
>
> --
> Sent from my jetpack.
