I’ll weigh in here.  IMHO, Drill’s schema discovery features are excellent, but 
it would be nice if Drill could:
1.  Accept hints or directives for the schema
2.  Remember these hints if the file doesn’t change. 
3.  Allow these hints to be applied to a directory of files.  

Therefore, I do think it would be useful for Drill to have some sort of 
metastore which would enable Drill to remember previously defined schemata so 
that you don’t have to define them over and over again.  The logfile (AKA regex) 
file reader that Paul and I are working on does this via its format config, but 
it would be nice for this capability to exist for delimited and other file-based 
data types. 
-C


> On Apr 5, 2018, at 12:02, Aman Sinha <amansi...@apache.org> wrote:
> 
> All good discussions in this thread.  It clearly shows that Drill's
> schema-on-read is not only a nice-to-have; for applications like IoT it is a
> must-have.
> For other types of data that change slowly, and where the user is willing to
> run offline commands to discover schema (as opposed to doing it while
> querying), we should consider sampling the files with different sampling
> percentages to improve the overall user experience.  This would be similar to
> collecting statistics through sampling.  In fact, the two things (schema
> discovery and stats) can be done in a single pass over the data.
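> 
> A purely hypothetical sketch of what such an offline command could look like
> (the SAMPLE clause and the schema-collection behavior are assumptions, and the
> path is made up):
> 
>   ANALYZE TABLE dfs.`/data/clickstream` COMPUTE STATISTICS SAMPLE 10 PERCENT;
> 
> where the same sampled pass would also record the discovered column names and
> types for later queries over the same files.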
> 
> -Aman
> 
> On Thu, Apr 5, 2018 at 7:24 AM, Joel Pfaff <joel.pf...@gmail.com> wrote:
> 
>> Hello,
>> 
>> A lot of versioning problems arise when trying to share data through Kafka
>> between multiple applications with different lifecycles and maintainers,
>> since by default a single message in Kafka is just a blob.
>> One way to solve that is to agree on a single serialization format that is
>> friendly to record-by-record storage (like Avro) and, rather than serializing
>> the schema with every message, to reference an entry in the Avro Schema
>> Registry (this flow is described here:
>> https://medium.com/@stephane.maarek/introduction-to-schemas-in-apache-kafka-with-the-confluent-schema-registry-3bf55e401321
>> ).
>> On top of the schema registry, specific client libraries make it possible to
>> validate the message structure before it is injected into Kafka.
>> So while Comcast mentions using an Avro schema to describe its feeds, it does
>> not directly mention using Avro files (to describe the schema).
>> 
>> Coming back to Drill, I think it makes a nice effort to provide similar
>> features on top of loosely typed datasets. It could probably do better in
>> some cases (handling unknown types as `Unknown` is probably better than
>> `Nullable Int`), but its ability to dynamically merge data with different
>> (but still compatible) schemas is really nice.
>> 
>> When using untyped file formats (JSON, CSV), Drill does its best; it is not
>> perfect, but it is already pretty good.
>> When relying on typed formats like Parquet / ORC / Avro, a lot of problems
>> are solved because each file describes its columns (names / types), even for
>> complex structures.
>> But the usage of CSV/JSON is still problematic. I like the idea of having
>> an optional way to describe the expected types somewhere (either in a
>> central meta-store, or in a structured file next to the dataset).
>> That would make the usage of CTAS much safer/easier (sometimes we have to
>> use Spark to generate the Parquet files because of schema/type problems).
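>> 
>> A minimal sketch of the kind of CTAS we end up writing today (the paths and
>> column names are made up):
>> 
>>   CREATE TABLE dfs.tmp.`events_parquet` AS
>>   SELECT CAST(event_ts  AS TIMESTAMP) AS event_ts,
>>          CAST(device_id AS VARCHAR)   AS device_id,
>>          CAST(reading   AS DOUBLE)    AS reading
>>   FROM dfs.`/raw/events.json`;
>> 
>> The casts pin down the output types, but they have to be repeated (or wrapped
>> in a view) for every dataset, which is where a declared schema would help.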
>> 
>> Independently of the meta-store, it is a bit annoying that Drill needs to
>> `discover` the columns and types at every scan through trial and error, and
>> cannot benefit from previous queries.
>> Extending the `ANALYZE TABLE` command so that metadata could be generated
>> from a JSON/CSV file or folder could improve this situation without
>> introducing a costly/painful ETL process.
>> 
>> Regards, Joel
>> 
>> 
>> On Wed, Apr 4, 2018 at 10:35 PM, Jinfeng Ni <j...@apache.org> wrote:
>> 
>>> I feel it's probably premature to call it the "death of schema-on-read" just
>>> based on one application case. For one product I have been working on
>>> recently, one use case is an IoT-related application where data is sent from
>>> a variety of small devices (sensors, cameras, etc). It would be a hard
>>> requirement to pre-define a schema upfront for each device before writing
>>> data into the system. Further, the value of data is likely to decrease
>>> significantly over time; data from the last hours/days is way more important
>>> than that of weeks/months ago. It's unimaginable to wait for weeks to run a
>>> data cleaning/preparation job before users can query such data. In other
>>> words, for applications that require flexibility and time-sensitivity,
>>> 'schema-on-read' provides a huge benefit compared with the traditional
>>> ETL-then-query approach.
>>> 
>>> Drill's schema-on-read is actually trying to solve a rather hard problem,
>>> in that we deal not only with relational types but also with nested types. In
>>> that sense, Drill is walking in uncharted territory where not many others are
>>> doing similar things.  Dealing with undocumented/unstructured data is a big
>>> challenge. Although Drill's solution is not perfect, IMHO it's still a big
>>> step towards solving such a problem.
>>> 
>>> With that said, I agree with the points people raised earlier. In addition
>>> to "schema-on-read", Drill has to do a better job of handling the traditional
>>> cases where the schema is known beforehand, by introducing a meta-store /
>>> catalog, or by allowing users to declare the schema upfront (I probably would
>>> not call Drill "schema-forbidden"). The restart strategy also seems
>>> interesting for handling failures caused by a missing schema / schema change.
>>> 
>>> 
>>> 
>>> 
>>> On Tue, Apr 3, 2018 at 10:01 PM, Ted Dunning <ted.dunn...@gmail.com>
>>> wrote:
>>> 
>>>> Well, the restart strategy still works for your examples. And you only pay
>>>> once. From then on you look at the cached type information and use an
>>>> upper-bound data type as you read the data. Since it works to read the
>>>> values in the right order, it is obviously possible to push typing
>>>> information down even into the JSON reader.
>>>> 
>>>> 
>>>> 
>>>> On Tue, Apr 3, 2018, 21:42 Paul Rogers <par0...@yahoo.com.invalid> wrote:
>>>> 
>>>>> Subtle point. I can provide a schema with Parquet, as you note. (Actually,
>>>>> for Parquet, Drill is schema-required: I can't not provide a schema due to
>>>>> the nature of Parquet...)
>>>>> 
>>>>> But I can't provide a schema for JSON, CSV, etc. The point is, Drill
>>>>> forbids the user from providing a schema; only the file format itself can
>>>>> provide the schema (or not, in the case of JSON). This is the very heart of
>>>>> the problem.
>>>>> 
>>>>> The root cause of our schema change exception is that vectors are, indeed,
>>>>> strongly typed. But file columns are not. Here is my favorite:
>>>>> 
>>>>> {x: 10} {x: 10.1}
>>>>> 
>>>>> Blam! The query fails because the vector is chosen as BigInt, then we
>>>>> discover it really should have been Float8. (If the answer is: go back and
>>>>> rebuild the vector with the new type, consider the case where 100K records
>>>>> separate the two above, so that the first batch is long gone by the time we
>>>>> see the offending record.) If only I could tell Drill to use Float8 (or
>>>>> Decimal) up front...
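>>>>> 
>>>>> (One blunt workaround today, as a sketch using Drill's session-level JSON
>>>>> options rather than a per-column schema:
>>>>> 
>>>>>   ALTER SESSION SET `store.json.read_numbers_as_double` = true;
>>>>>   -- or, even blunter, read everything as text and cast later:
>>>>>   ALTER SESSION SET `store.json.all_text_mode` = true;
>>>>> 
>>>>> but both apply to every JSON column in the session, which is exactly why a
>>>>> per-column schema hint would be so much nicer.)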
>>>>> 
>>>>> Views won't help here because the failure occurs before a view can kick
>>>>> in. However, presumably, I could write a view to handle a different classic
>>>>> case:
>>>>> 
>>>>> myDir /
>>>>> |- File 1: {a: 10, b: "foo"}
>>>>> |- File 2: {a: 20}
>>>>> 
>>>>> With query: SELECT a, b FROM myDir
>>>>> 
>>>>> For File 2, Drill will guess that b is a Nullable Int, but it is really a
>>>>> VarChar. I think I could write clever SQL that says:
>>>>> 
>>>>> If b is of type Nullable Int, return NULL cast to nullable VarChar, else
>>>>> return b
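>>>>> 
>>>>> A rough sketch of that view (assuming Drill's typeof() function reports the
>>>>> type the reader actually picked for each record; the view and workspace
>>>>> names are just for illustration):
>>>>> 
>>>>>   CREATE VIEW dfs.tmp.`myDirView` AS
>>>>>   SELECT a,
>>>>>          CASE WHEN typeof(b) = 'VARCHAR' THEN CAST(b AS VARCHAR)
>>>>>               ELSE CAST(NULL AS VARCHAR)
>>>>>          END AS b
>>>>>   FROM dfs.`myDir`;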
>>>>> 
>>>>> The irony is that I must write procedural code to declare a static
>>>>> attribute of the data. Yet SQL is otherwise declarative: I state what I
>>>>> want, not how to implement it.
>>>>> 
>>>>> Life would be so much easier if I could just say, "trust me, when you read
>>>>> column b, it is a VarChar."
>>>>> 
>>>>> Thanks,
>>>>> - Paul
>>>>> 
>>>>> 
>>>>> 
>>>>> On Tuesday, April 3, 2018, 10:53:27 AM PDT, Ted Dunning <ted.dunn...@gmail.com> wrote:
>>>>> 
>>>>> I don't see why you say that Drill is schema-forbidden.
>>>>> 
>>>>> The Parquet reader, for instance, makes strong use of the implied schema to
>>>>> facilitate reading of typed data.
>>>>> 
>>>>> Likewise, the vectorized internal format is strongly typed and, as such,
>>>>> uses schema information.
>>>>> 
>>>>> Views are another way to communicate schema information.
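>>>>> 
>>>>> For example (a minimal sketch; the view, workspace, and column names are
>>>>> made up), a view can pin the types that a headerless CSV scan would
>>>>> otherwise leave untyped:
>>>>> 
>>>>>   CREATE VIEW dfs.tmp.`orders_v` AS
>>>>>   SELECT CAST(columns[0] AS INT)    AS order_id,
>>>>>          CAST(columns[1] AS DATE)   AS order_date,
>>>>>          CAST(columns[2] AS DOUBLE) AS amount
>>>>>   FROM dfs.`/data/orders.csv`;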
>>>>> 
>>>>> It is true that you can't, say, view comments on fields from the command
>>>>> line. But I don't understand saying "schema-forbidden".
>>>>> 
>>>>> 
>>>>> On Tue, Apr 3, 2018 at 10:01 AM, Paul Rogers <par0...@yahoo.com.invalid>
>>>>> wrote:
>>>>> 
>>>>>> Here is another way to think about it. Today, Drill is
>>>>>> "schema-forbidden": even if I know the schema, I can't communicate that to
>>>>>> Drill; Drill must figure it out on its own, making the same mistakes every
>>>>>> time on ambiguous schemas.
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
