I'm writing a msgpack reader, and in doing so I noticed that the JSON reader
puts an INT column placeholder when no records match a select statement
like "select str from.." because the str field is not seen in the first batch.

This is problematic because in the first batch it is not known what the
column's data type is, and Drill will throw an error if the column turns
out to be a VARCHAR in the second batch.

To work around this kind of issue I decided to add a "learning schema
mode" to my msgpack reader. Essentially, in learning mode you feed it
records that you know are complete and valid. The reader will accumulate
and merge the observed fields and write the resulting schema to disk.
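To make the accumulate-and-merge step concrete, here is a minimal sketch of the idea in plain Java. It uses hypothetical type-name strings instead of Drill's MaterializedField machinery, and the conflict handling (throwing on two different types for the same field) is my assumption, not necessarily what the real reader does:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of "learning mode": feed in complete records, accumulate the union
// of field names, and refuse conflicting types rather than silently
// overwriting one with the other.
public class SchemaLearner {
    private final Map<String, String> learned = new LinkedHashMap<>();

    // Merge one record's observed field types into the accumulated schema.
    public void learn(Map<String, String> recordTypes) {
        for (Map.Entry<String, String> e : recordTypes.entrySet()) {
            String prev = learned.putIfAbsent(e.getKey(), e.getValue());
            if (prev != null && !prev.equals(e.getValue())) {
                throw new IllegalStateException("Conflicting types for "
                    + e.getKey() + ": " + prev + " vs " + e.getValue());
            }
        }
    }

    public Map<String, String> schema() {
        return learned;
    }

    public static void main(String[] args) {
        SchemaLearner learner = new SchemaLearner();
        Map<String, String> r1 = new LinkedHashMap<>();
        r1.put("id", "BIGINT");
        Map<String, String> r2 = new LinkedHashMap<>();
        r2.put("id", "BIGINT");
        r2.put("str", "VARCHAR"); // only the second record carries "str"
        learner.learn(r1);
        learner.learn(r2);
        // The merged schema now covers both fields.
        System.out.println(learner.schema());
    }
}
```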

Once the schema is learned you can configure the msgpack reader to be in
"use schema mode". In this mode the reader loads the schema and applies
it to the writer after each batch. Any missing column, like the str column
mentioned above, will now be defined in the writer with the proper type,
VARCHAR. Drill will then be able to read files in which the str column is
missing from the first few thousand rows.
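The per-batch fill step can be sketched like this, again with hypothetical type-name strings standing in for Drill's writer/vector machinery: any column the learned schema knows about but the batch did not produce gets declared with its learned type, so a later batch cannot change it:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of "use schema mode": after a batch is read, declare every column
// that is in the learned schema but absent from the batch, using the type
// the schema dictates (e.g. "str" as VARCHAR rather than Drill's default
// INT placeholder).
public class SchemaApplier {
    public static Map<String, String> apply(Map<String, String> batchColumns,
                                            Map<String, String> schema) {
        Map<String, String> result = new LinkedHashMap<>(batchColumns);
        for (Map.Entry<String, String> e : schema.entrySet()) {
            result.putIfAbsent(e.getKey(), e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> batch = new LinkedHashMap<>();
        batch.put("id", "BIGINT"); // this batch never saw "str"
        Map<String, String> schema = new LinkedHashMap<>();
        schema.put("id", "BIGINT");
        schema.put("str", "VARCHAR");
        // After applying the schema, "str" exists with type VARCHAR.
        System.out.println(apply(batch, schema));
    }
}
```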

One interesting thing about my implementation is that it uses Drill's own
schema representation, i.e. MaterializedField. For example, to save the
schema I do:

SerializedField serializedMapField =
    writer.getMapVector().getField().getSerializedField();
String data = TextFormat.printToString(serializedMapField);
IOUtils.write(data, out);

To apply the schema to the writer I walk the schema and create the
corresponding maps, lists, BIGINTs, VARCHARs, etc.:

if (mapWriter != null) {
  mapWriter.varChar(name);
} else {
  listWriter.varChar();
}
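The walk over the saved schema can be sketched as a simple recursion. This is an illustrative stand-in, not the real writer code: a String value represents a scalar type, a nested Map represents a Drill map, and the "writer calls" are collected as strings rather than issued against Drill's ComplexWriter:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of walking a saved schema tree and recording which writer call
// would be issued for each field (stand-ins for mapWriter.varChar(name),
// listWriter.varChar(), etc.).
public class SchemaWalker {
    @SuppressWarnings("unchecked")
    public static void walk(Map<String, Object> schema, String path,
                            List<String> calls) {
        for (Map.Entry<String, Object> e : schema.entrySet()) {
            String name = path.isEmpty() ? e.getKey()
                                         : path + "." + e.getKey();
            if (e.getValue() instanceof Map) {
                // A nested map: descend and declare its children too.
                calls.add("mapWriter.map(" + name + ")");
                walk((Map<String, Object>) e.getValue(), name, calls);
            } else {
                // A scalar: declare it with its saved type.
                calls.add("mapWriter." + e.getValue() + "(" + name + ")");
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Object> schema = new LinkedHashMap<>();
        schema.put("str", "varChar");
        Map<String, Object> rec = new LinkedHashMap<>();
        rec.put("n", "bigInt");
        schema.put("rec", rec);
        List<String> calls = new ArrayList<>();
        walk(schema, "", calls);
        System.out.println(calls);
    }
}
```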

Because I use the writer's MaterializedFields to save the schema, and use
them again to apply the schema back to the writer, I think this solution
could be reused by the JSON reader and potentially other readers.

Another benefit of using a schema is that you can also skip records whose
str column is of type BINARY, FLOAT8 or MAP, because those do not match
the schema, which says str should be of type VARCHAR.
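That filtering step can be sketched as a simple type check against the learned schema (again with hypothetical type-name strings, not Drill's actual vector types):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of schema-based record filtering: a record is kept only if every
// field it carries matches the type the learned schema prescribes. Fields
// the schema does not know about are allowed through.
public class SchemaFilter {
    public static boolean matches(Map<String, String> recordTypes,
                                  Map<String, String> schema) {
        for (Map.Entry<String, String> e : recordTypes.entrySet()) {
            String expected = schema.get(e.getKey());
            if (expected != null && !expected.equals(e.getValue())) {
                return false; // e.g. str is BINARY but schema says VARCHAR
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, String> schema = new LinkedHashMap<>();
        schema.put("str", "VARCHAR");
        Map<String, String> good = new LinkedHashMap<>();
        good.put("str", "VARCHAR");
        Map<String, String> bad = new LinkedHashMap<>();
        bad.put("str", "BINARY");
        System.out.println(matches(good, schema)); // kept
        System.out.println(matches(bad, schema));  // skipped
    }
}
```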

I have a working proof of concept in
https://github.com/jcmcote/drill-1/tree/master/contrib/format-msgpack.
There are many test cases showing how building a schema enables Drill to
read files that it otherwise could not.

I would greatly appreciate some feedback. Is this a good idea? Is it okay
to fill out the writer's schema the way I did? Are there any negative
side effects?

Thank you
jc
