Hi JC,

Very cool indeed. You are the man!

Ted's been advocating for this approach for as long as I can remember (2+ 
years). You're well on your way to solving the JSON problems that I documented 
a while back in DRILL-4710 and summarized as "Drill can't predict the future." 
Basically, without a schema, Drill is forced to make a column schema decision 
(use nullable INT!) before Drill has the full information.

And, because Drill is distributed (have you tested this case yet?), a scan over 
a large file, or multiple files, will be spread across readers, each attempting 
to make independent decisions. These decisions will conflict (reader 1 saw your 
VARCHAR column, reader 2 didn't and guessed nullable INT). We've always hoped 
that Drill operators can sort out the mess and do the right thing -- something 
that is probably impossible in the general case.

So, having a schema available is the simplest possible solution: all readers 
agree on the schema and the rest of the Drill operators are happy because they 
get consistent inputs. Hats off to you for creating this proof of concept to 
move this conversation forward.

Your MaterializedField approach is clever. It is somewhat awkward for this use 
case, but can work. It is awkward because a properly-constructed 
MaterializedField for a MAP, say, includes the Map schema. But, for a LIST, the 
MaterializedField does not include the child types. Ignoring these 
long-standing corner cases, it will get you 90% of where you want to go.
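
To make the asymmetry concrete, here is a rough sketch; the field names and 
exact helper calls here are mine, not from your branch, so treat it as 
illustrative only:

import org.apache.drill.common.types.TypeProtos.MinorType;
import org.apache.drill.common.types.Types;
import org.apache.drill.exec.record.MaterializedField;

// A MAP field carries its children, so serializing it captures the map's
// full schema.
MaterializedField mapField =
    MaterializedField.create("m", Types.required(MinorType.MAP));
mapField.addChild(
    MaterializedField.create("str", Types.optional(MinorType.VARCHAR)));

// A LIST field, by contrast, does not carry its element type, which is the
// long-standing corner case mentioned above.
MaterializedField listField =
    MaterializedField.create("l", Types.optional(MinorType.LIST));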

Just FYI, I had been working on a project to help with this stuff. I've checked 
in a newer schema mechanism that handles all Drill's nuances: unions,  
(non-repeated) LISTs, REPEATED LISTs, etc. Search for "SchemaBuilder" and 
"TupleSchema". (There are two SchemaBuilder classes, you want the one that 
builds a TupleSchema.) Then if you search for references, you'll find lots of 
unit tests that use this to define a schema. There is no serialization format, 
but you could easily add that.
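
For example, to declare your "str" column plus a couple of made-up nested 
columns (the method names are from the current test code, but the package has 
moved around between branches, so treat the imports as approximate):

import org.apache.drill.common.types.TypeProtos.MinorType;
import org.apache.drill.exec.record.metadata.SchemaBuilder;
import org.apache.drill.exec.record.metadata.TupleMetadata;

// Build the schema that every reader should agree on.
TupleMetadata schema = new SchemaBuilder()
    .addNullable("str", MinorType.VARCHAR)     // nullable VARCHAR
    .addArray("ids", MinorType.BIGINT)         // REPEATED BIGINT
    .addMap("payload")                         // a MAP with its own columns
      .addNullable("amount", MinorType.FLOAT8)
      .resumeSchema()
    .buildSchema();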

The project includes a "result set loader" that provides a convenient way to 
convert from Java types to Drill types. It is designed to allow, say, 
converting a String from a CSV file into a VARCHAR, INT, DOUBLE or whatever. 
Each reader would need its own string-to-whatever parser, but these are meant 
to be easy to add.

The TupleSchema provides the input to create a ResultSetLoader for that schema, 
kind of like how you're using a map, but far more simply. There are lots of unit 
tests that illustrate how this works.
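
Roughly, the flow in those tests looks like the sketch below. The option 
builder has been renamed between branches, so the construction lines are 
approximate, and the "str" column and "hello" value are just placeholders:

import org.apache.drill.exec.memory.BufferAllocator;
import org.apache.drill.exec.physical.rowSet.ResultSetLoader;
import org.apache.drill.exec.physical.rowSet.RowSetLoader;
import org.apache.drill.exec.physical.rowSet.impl.OptionBuilder;
import org.apache.drill.exec.physical.rowSet.impl.ResultSetLoaderImpl;
import org.apache.drill.exec.record.VectorContainer;
import org.apache.drill.exec.record.metadata.TupleMetadata;

void writeOneRow(BufferAllocator allocator, TupleMetadata schema) {
  // Hand the agreed-upon schema to the loader up front.
  ResultSetLoader rsLoader = new ResultSetLoaderImpl(allocator,
      new OptionBuilder().setSchema(schema).build());
  RowSetLoader rootWriter = rsLoader.writer();

  rsLoader.startBatch();
  rootWriter.start();
  // The column writer converts the Java String into the VARCHAR declared in
  // the schema; a msgpack or CSV reader would plug its own parsing in here.
  rootWriter.scalar("str").setString("hello");
  rootWriter.save();

  VectorContainer batch = rsLoader.harvest();  // the batch goes downstream
  rsLoader.close();
}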

All this would be wrapped in a new format reader framework that accepts a 
schema. That part has stalled and is not yet committed. If we can push this 
idea forward, perhaps I can get off the dime and do a few more PRs to get the 
rest of the framework into Drill for your use.

The next problem is how to use Drill to build the schema. How does the user 
tell Drill to gather the schema? Would be great to have a DESCRIBE statement that 
outputs the schema you discovered. Then maybe CREATE SCHEMA <file name> AS 
DESCRIBE ...  The CREATE VIEW statement could act as a prototype.

Drill is distributed, and two users could do the CREATE SCHEMA at the same 
time, and in concert with someone doing a query on that same directory. Drill 
has no 
good synchronization solution. Since this seems to not be a problem for views, 
perhaps things will work for schemas. (Both are a single file.) We have had 
problems with metadata because a refresh updates multiple files, and the lack 
of synchronization has been a lingering problem.

Finally, I'd suggest ensuring your format can be created by hand or by other 
tools. Heck, if you want to make a big impact, work with the Arrow project to 
define a cross-project schema format that can be used to describe either Arrow 
or Drill schemas.

Thanks,
- Paul

 

    On Tuesday, October 9, 2018, 8:31:45 PM PDT, Jean-Claude Cote 
<[email protected]> wrote:  
 
I'm writing a msgpack reader and in doing so I noticed that the JSON reader
will put in an INT column placeholder when no records match a select statement
like "select str from..", that is, when the str field is not seen in the first
batch.

However this is problematic because in the first batch it is not known what
data type the column is, and Drill will throw an error if the column turns
out to be a VARCHAR in the second batch.

In order to work around these types of issues I decided to add a "learning
schema mode" to my msgpack reader. Essentially in learning mode you can
feed it records that you know are complete and valid. The reader will
accumulate, merge and write the resulting schema to disk.

Once the schema is learned you can configure the msgpack reader to be in
"use schema mode". In this mode the reader will load the schema and apply
it to the writer after each batch. Any missing column, like the str column
mentioned above, will now be defined in the writer with the proper type,
VARCHAR. Drill will now be able to read files that are missing the str
column in the first few thousand rows.

One interesting thing about the implementation I did is that it uses
Drill's own schema, i.e. MaterializedField. So, for example, to save the
schema I do:

// Grab the map vector's MaterializedField in its serialized (Protobuf) form.
SerializedField serializedMapField =
    writer.getMapVector().getField().getSerializedField();
// Render it as Protobuf text and write it out to the schema file.
String data = TextFormat.printToString(serializedMapField);
IOUtils.write(data, out);

To apply the schema to the writer I walk the schema and create the
corresponding maps, lists, bigints, varchars, etc.:

if (mapWriter != null) {
  // Inside a map: create the VARCHAR child by name.
  mapWriter.varChar(name);
} else {
  // Inside a list: list elements are unnamed.
  listWriter.varChar();
}

Because I use the MaterializedFields in the writer to save the schema, and
use them to apply the schema back to the writer, I think this solution could
be used for the JSON reader and potentially others.

Another benefit I see in using a schema is that you can also skip over
records whose str column is of type BINARY, FLOAT8 or MAP, because those do
not match the schema, which says str should be of type VARCHAR.

I have a working proof of concept in
https://github.com/jcmcote/drill-1/tree/master/contrib/format-msgpack.
There are many test cases showing how building a schema enables Drill to
read files that it would not normally be able to.

I would greatly appreciate some feedback. Is this a good idea? Is it okay
to fill out the schema of the writer like I did? Any negative side effects?

Thank you
jc
  
