On Fri, Nov 30, 2018 at 10:00 AM Ben Kietzman <ben.kietz...@rstudio.com> wrote:
>
> I think the fallback graph approach might still be useful in the case of
> parsing with unions allowed, albeit with a much broader graph.
>
> For example,
>
> INT64 + STRING -> UNION(INT64, STRING)
> T + UNION(*) -> UNION(T, *)
> # ...
>
> Related: how should ordinarily convertible types be handled in the context
> of unions? For example, if we have a column of mostly INT64 with a few
> DOUBLE mixed in, should the column become DOUBLE or UNION(INT64, DOUBLE)?
>
> WRT foregoing type inference altogether, I don't think we can make that
> work; users are still going to expect to be able to just parse a CSV or
> line-delimited JSON file without specifying column types explicitly. With
> that said, the idea of using schemas is very interesting. How would you use
> a protobuf file to express that a string field should be parsed as a
> timestamp? (IIRC, protobuf doesn't have built-in time types.)
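[Editorial sketch] The widening rules quoted above can be modeled in a few lines of plain Python. This is an illustration only, not an Arrow API: the `promote` helper and the string type names are hypothetical, and a union is modeled here as a frozenset of member type names.

```python
# Plain-Python sketch of the widening rules above; promote() and the
# string type names are hypothetical, not an Arrow API. A union is
# modeled as a frozenset of member type names.

def promote(a, b):
    """Return the loosened type when values of types a and b coexist."""
    members = set()
    for t in (a, b):
        if isinstance(t, frozenset):  # T + UNION(*) -> UNION(T, *)
            members |= t
        else:
            members.add(t)
    if len(members) == 1:
        return members.pop()          # identical types need no union
    return frozenset(members)         # INT64 + STRING -> UNION(INT64, STRING)
```

Under this model, the question raised in the mail becomes whether `promote("INT64", "DOUBLE")` should return `"DOUBLE"` (ordinary numeric widening) or the union of the two; the sketch above always unions.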
I didn't quite understand those comments. We have metadata data structures
in each implementation (e.g. arrow::{DataType, Field, Schema}). If we want
to disable all type inference, we would need to provide a complete Arrow
schema. How that schema is constructed by an application may vary, but
seems out of scope for this discussion.

> On Fri, Nov 30, 2018, 10:28 Francois Saint-Jacques <
> fsaintjacq...@networkdump.com> wrote:
>
> > Hello,
> >
> > With JSON and other "typed" formats (msgpack, protobuf, ...) you need to
> > take into account unions, e.g.
> >
> > {a: "herp", b: 10}
> > {a: true, c: "derp"}
> >
> > The type for `a` would be union<string, bool>.
> >
> > I think we should also evaluate investing in ingesting different schema
> > DSLs (protobuf IDL, json-schema) to avoid inference entirely.
> >
> > On Fri, Nov 30, 2018 at 9:43 AM Ben Kietzman <ben.kietz...@rstudio.com>
> > wrote:
> >
> > > Hi Antoine,
> > >
> > > The conversion of previous blocks is part of the fallback mechanism I'm
> > > trying to describe. When type inference fails (even in a different
> > > block), conversion of all blocks of the column is attempted to the next
> > > type in the fallback graph.
> > >
> > > If there is no problem with the fallback graph model, the API would
> > > probably look like a reusable LoosenType, something which simplifies
> > > querying for the loosened type when inference fails.
> > >
> > > Unrelated: I forgot to include some edges in the JSON graph
> > >
> > > NULL -> BOOL
> > > NULL -> INT64 -> DOUBLE
> > > NULL -> TIMESTAMP -> STRING -> BINARY
> > > NULL -> STRUCT
> > > NULL -> LIST
> > >
> > > On Fri, Nov 30, 2018, 04:52 Antoine Pitrou <anto...@python.org> wrote:
> > >
> > > >
> > > > Hi Ben,
> > > >
> > > > Le 30/11/2018 à 02:19, Ben Kietzman a écrit :
> > > > > Currently, to figure out which types may be inferred and under which
> > > > > circumstances they will be inferred involves digging through code.
> > > > > I think it would be useful to have an API for expressing type
> > > > > inference rules. Ideally this would be provided as utility functions
> > > > > alongside StringConverter and used by anything which does type
> > > > > inference while parsing/unboxing.
> > > >
> > > > It may be a bit more complicated. For example, a CSV file is parsed by
> > > > blocks, and each block produces an array chunk. But when the type of a
> > > > later block changes due to type inference failing on the current type,
> > > > all previous blocks must be parsed again.
> > > >
> > > > So I'm curious what you would make the API look like.
> > > >
> > > > > By contrast, when reading JSON (which is explicit about numbers vs
> > > > > strings), the graph would be:
> > > > >
> > > > > NULL -> BOOL
> > > > > NULL -> INT64 -> DOUBLE
> > > > > NULL -> TIMESTAMP -> STRING -> BINARY
> > > > >
> > > > > Seem reasonable?
> > > > > Is there a case which isn't covered by a fallback graph as above?
> > > >
> > > > I have no idea. Someone else may be able to answer your question.
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> >
> > --
> > Sent from my jetpack.
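[Editorial sketch] The JSON fallback graph quoted in this thread can be expressed directly as an adjacency list, with the "LoosenType" query implemented as graph reachability. Plain Python, not Arrow code; `JSON_FALLBACK` and `loosened` are hypothetical names:

```python
# Plain-Python sketch (not Arrow code) of the JSON fallback graph quoted
# above: each type maps to the looser types to try next when conversion
# of a block fails under the current type.

JSON_FALLBACK = {
    "NULL": ["BOOL", "INT64", "TIMESTAMP", "STRUCT", "LIST"],
    "INT64": ["DOUBLE"],
    "TIMESTAMP": ["STRING"],
    "STRING": ["BINARY"],
}

def loosened(current, target):
    """Breadth-first search: can `current` be loosened to `target`?"""
    frontier, seen = [current], set()
    while frontier:
        t = frontier.pop(0)
        if t == target:
            return True
        seen.add(t)
        frontier.extend(n for n in JSON_FALLBACK.get(t, ()) if n not in seen)
    return False
```

When conversion of any block fails, a reader using this model would step to the next type along these edges and re-attempt conversion of all of the column's blocks, which matches the reparse-previous-blocks behavior Antoine describes.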