On Fri, Nov 30, 2018 at 10:00 AM Ben Kietzman <ben.kietz...@rstudio.com> wrote:
>
> I think the fallback graph approach might still be useful in the case of
> parsing with unions allowed, albeit with a much broader graph.
>
> For example,
>
> INT64 + STRING -> UNION(INT64, STRING)
> T + UNION(*) -> UNION(T, *)
> # ...
>
> Related: how should ordinarily convertible types be handled in the context
> of unions? For example, if we have a column of mostly INT64 with a few
> DOUBLE mixed in, should the column become DOUBLE or UNION(INT64, DOUBLE)?
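[Editor's note: a minimal Python sketch contrasting the two resolutions asked about above, widening versus forming a union. All names here are illustrative and are not part of any Arrow API.]

```python
# Sketch: two ways to resolve a column observed as mostly INT64 with
# some DOUBLE mixed in. Names are illustrative, not Arrow's API.

def widen(types):
    """Resolve by promotion: INT64 + DOUBLE -> DOUBLE."""
    order = ["NULL", "INT64", "DOUBLE"]
    return max(types, key=order.index)

def unify(types):
    """Resolve by union: keep every distinct observed type."""
    distinct = sorted(set(types))
    if len(distinct) == 1:
        return distinct[0]
    return "UNION(%s)" % ", ".join(distinct)

observed = ["INT64", "INT64", "DOUBLE", "INT64"]
print(widen(observed))   # DOUBLE
print(unify(observed))   # UNION(DOUBLE, INT64)
```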
>
> WRT foregoing type inference altogether, I don't think we can make that
> work; users are still going to expect to be able to just parse a CSV or
> line-delimited JSON file without specifying column types explicitly. With
> that said, the idea of using schemas is very interesting. How would you use
> a protobuf file to express that a string field should be parsed as a
> timestamp? (IIRC, protobuf doesn't have built-in time types)

I didn't quite understand those comments. We have schema metadata
structures in each implementation (e.g. arrow::{DataType, Field,
Schema}). If we want to disable all type inference, we would need to
provide a complete Arrow schema. How that schema is constructed by an
application may vary, but that seems out of scope for this discussion.

>
> On Fri, Nov 30, 2018, 10:28 Francois Saint-Jacques <
> fsaintjacq...@networkdump.com> wrote:
>
> > Hello,
> >
> > With JSON and other "typed" formats (msgpack, protobuf, ...) you need to
> > take unions into account, e.g.
> >
> > {a: "herp", b: 10}
> > {a: true, c: "derp"}
> >
> > The type for `a` would be union<string, bool>.
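[Editor's note: to make the example concrete, a rough Python sketch of per-field inference over such records. The type names and functions are illustrative only, not an Arrow implementation.]

```python
# Sketch: infer a column type per field across JSON-like records.
# A field seen with more than one type becomes a union.

def json_type(value):
    # bool must be checked before int: bool is a subclass of int in Python
    if isinstance(value, bool):
        return "BOOL"
    if isinstance(value, int):
        return "INT64"
    if isinstance(value, float):
        return "DOUBLE"
    if isinstance(value, str):
        return "STRING"
    return "NULL"

def infer_fields(records):
    seen = {}
    for record in records:
        for name, value in record.items():
            seen.setdefault(name, set()).add(json_type(value))
    return {
        name: types.pop() if len(types) == 1
        else "UNION<%s>" % ", ".join(sorted(types))
        for name, types in seen.items()
    }

records = [{"a": "herp", "b": 10}, {"a": True, "c": "derp"}]
print(infer_fields(records))
# {'a': 'UNION<BOOL, STRING>', 'b': 'INT64', 'c': 'STRING'}
```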
> >
> > I think we should also evaluate investing in support for ingesting
> > different schema DSLs (protobuf IDL, JSON Schema) to avoid inference
> > entirely.
> >
> > On Fri, Nov 30, 2018 at 9:43 AM Ben Kietzman <ben.kietz...@rstudio.com>
> > wrote:
> >
> > > Hi Antoine,
> > >
> > > The conversion of previous blocks is part of the fallback mechanism I'm
> > > trying to describe. When type inference fails (even in a different
> > > block), all blocks of the column are re-converted to the next type in
> > > the fallback graph.
> > >
> > > If there is no problem with the fallback graph model, the API would
> > > probably look like a reusable LoosenType utility, something which
> > > simplifies querying for the loosened type when inference fails.
> > >
> > > Unrelated: I forgot to include some edges in the json graph
> > >
> > > NULL -> BOOL
> > > NULL -> INT64 -> DOUBLE
> > > NULL -> TIMESTAMP -> STRING -> BINARY
> > > NULL -> STRUCT
> > > NULL -> LIST
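[Editor's note: the edges above could be represented as a simple adjacency map, with type loosening as a walk along a chain. A sketch only; `FALLBACK` and `loosen` are hypothetical names, not the proposed C++ API.]

```python
# Sketch: the JSON fallback graph as an adjacency map.
# loosen(current, observed) returns the next type to try when conversion
# at `current` fails and `observed` is what was actually seen.

FALLBACK = {
    "NULL": ["BOOL", "INT64", "TIMESTAMP", "STRUCT", "LIST"],
    "INT64": ["DOUBLE"],
    "TIMESTAMP": ["STRING"],
    "STRING": ["BINARY"],
}

def loosen(current, observed):
    """Pick the fallback of `current` from which `observed` is reachable."""
    for candidate in FALLBACK.get(current, []):
        if candidate == observed or loosen(candidate, observed) is not None:
            return candidate
    return None

print(loosen("INT64", "DOUBLE"))   # DOUBLE
print(loosen("NULL", "STRING"))    # TIMESTAMP (first step on the chain)
print(loosen("BOOL", "INT64"))     # None: no edge out of BOOL
```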
> > >
> > > On Fri, Nov 30, 2018, 04:52 Antoine Pitrou <anto...@python.org> wrote:
> > >
> > > >
> > > > Hi Ben,
> > > >
> > > > Le 30/11/2018 à 02:19, Ben Kietzman a écrit :
> > > > > Currently, to figure out which types may be inferred and under which
> > > > > circumstances they will be inferred involves digging through code. I
> > > > think
> > > > > it would be useful to have an API for expressing type inference
> > rules.
> > > > > Ideally this would be provided as utility functions alongside
> > > > > StringConverter and used by anything which does type inference while
> > > > > parsing/unboxing.
> > > >
> > > > It may be a bit more complicated.  For example, a CSV file is parsed by
> > > > blocks, and each block produces an array chunk.  But when the type of a
> > > > later block changes due to type inference failing on the current type,
> > > > all previous blocks must be parsed again.
> > > >
> > > > So I'm curious what you would make the API look like.
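[Editor's note: the re-parse requirement can be sketched as a driver loop over blocks. The shapes below are assumed for illustration and do not match Arrow's CSV reader; a full implementation would also loop if re-conversion of earlier blocks itself fails, loosening further.]

```python
# Sketch: block-wise parsing with whole-column fallback. When a block
# fails to convert at the current type, loosen the type and re-convert
# every previously parsed block from its raw form.

def convert(block, type_name):
    """Try to convert raw string values in `block` to `type_name`."""
    casts = {"INT64": int, "DOUBLE": float, "STRING": str}
    return [casts[type_name](v) for v in block]  # ValueError on failure

def parse_column(blocks):
    fallbacks = {"INT64": "DOUBLE", "DOUBLE": "STRING"}
    current = "INT64"
    done = []    # raw blocks already converted at `current`
    chunks = []  # their converted forms
    for block in blocks:
        while True:
            try:
                chunk = convert(block, current)
                break
            except ValueError:
                # Inference failed mid-stream: loosen the type and
                # re-convert all previous blocks from their raw values.
                current = fallbacks[current]
                chunks = [convert(b, current) for b in done]
        done.append(block)
        chunks.append(chunk)
    return current, chunks

blocks = [["1", "2"], ["3.5", "4"], ["5", "x"]]
print(parse_column(blocks))
# ('STRING', [['1', '2'], ['3.5', '4'], ['5', 'x']])
```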
> > > >
> > > > > By contrast, when reading JSON (which is explicit about numbers vs
> > > > > strings), the graph would be:
> > > > >
> > > > >   NULL -> BOOL
> > > > >   NULL -> INT64 -> DOUBLE
> > > > >   NULL -> TIMESTAMP -> STRING -> BINARY
> > > > >
> > > > > Seem reasonable?
> > > > > Is there a case which isn't covered by a fallback graph as above?
> > > >
> > > > I have no idea.  Someone else may be able to answer your question.
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > >
> >
> >
> > --
> > Sent from my jetpack.
> >
