This correlates with two projects that are currently being actively investigated / prototyped:
1. Drill metastore (DRILL-6552)
2. Providing schema from the query (casts, hints).

The second one will allow providing a schema using hints, as well as from a file. As for using Drill to propose a schema, there would be an ANALYZE TABLE command which would output the schema.
Anyway, both of them will in the end need to feed a schema to the readers, so at that point Paul's new format reader framework that accepts a schema will become really important. @Paul, do you have an ETA for when you will be able to submit the PRs? Maybe also give a presentation? Could you please share the Jira number(s) as well?

Kind regards,
Arina

On Wed, Oct 10, 2018 at 7:31 AM Paul Rogers <[email protected]> wrote:

> Hi JC,
>
> Very cool indeed. You are the man!
>
> Ted's been advocating for this approach for as long as I can remember (2+ years). You're well on your way to solving the JSON problems that I documented a while back in DRILL-4710 and summarized as "Drill can't predict the future." Basically, without a schema, Drill is forced to make a column schema decision (use nullable INT!) before it has the full information.
>
> And, because Drill is distributed (have you tested this case yet?), a scan over a large file, or multiple files, will be spread across readers, each attempting to make independent decisions. These decisions will conflict (reader 1 saw your VARCHAR column, reader 2 didn't and guessed nullable INT). We've always hoped that Drill operators can sort out the mess and do the right thing -- something that is probably impossible in the general case.
>
> So, having a schema available is the simplest possible solution: all readers agree on the schema, and the rest of the Drill operators are happy because they get consistent inputs. Hats off to you for creating this proof of concept to move this conversation forward.
>
> Your MaterializedField approach is clever. It is somewhat awkward for this use case, but can work. It is awkward because a properly constructed MaterializedField for a MAP, say, includes the MAP's schema, but for a LIST, the MaterializedField does not include the child types. Ignoring these long-standing corner cases, it will get you 90% of where you want to go.
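[Editor's note: a toy, self-contained sketch of the distributed-conflict problem Paul describes above. This is plain Java, not Drill code; the class and method names are invented for illustration. Two scan fragments infer the type of column "str" independently, and the absent-column case falls back to the nullable-INT guess.]

```java
import java.util.List;
import java.util.Map;

public class SchemaConflictDemo {

    /** Guess a column type from the rows one reader happens to see.
     *  Like Drill's JSON reader, an absent column defaults to nullable INT. */
    static String inferType(List<Map<String, Object>> rows, String column) {
        for (Map<String, Object> row : rows) {
            Object v = row.get(column);
            if (v instanceof String) return "VARCHAR";
            if (v instanceof Integer || v instanceof Long) return "BIGINT";
        }
        return "NULLABLE INT";   // the guess Drill is forced to make
    }

    public static void main(String[] args) {
        // Reader 1 sees rows where "str" is present ...
        List<Map<String, Object>> batch1 = List.of(Map.of("str", "hello"));
        // ... reader 2 sees rows where it is missing entirely.
        List<Map<String, Object>> batch2 = List.of(Map.of("other", 42));

        String t1 = inferType(batch1, "str");   // VARCHAR
        String t2 = inferType(batch2, "str");   // NULLABLE INT
        System.out.println("reader1=" + t1 + " reader2=" + t2
                + " conflict=" + !t1.equals(t2));
    }
}
```

With a shared schema, both readers would be handed "VARCHAR" up front and the conflict never arises.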
> Just FYI, I had been working on a project to help with this stuff. I've checked in a newer schema mechanism that handles all of Drill's nuances: unions, (non-repeated) LISTs, REPEATED LISTs, etc. Search for "SchemaBuilder" and "TupleSchema". (There are two SchemaBuilder classes; you want the one that builds a TupleSchema.) Then, if you search for references, you'll find lots of unit tests that use this to define a schema. There is no serialization format, but you could easily add one.
>
> The project includes a "result set loader" that provides a convenient way to convert from Java types to Drill types. It is designed to allow, say, converting a String from a CSV file into a VARCHAR, INT, DOUBLE, or whatever. Each reader would need its own string-to-whatever parser, but these are meant to be easy to add.
>
> The TupleSchema provides the input to create a ResultSetLoader for that schema, kind of like how you're using a map, but far more simply. There are lots of unit tests that illustrate how this works.
>
> All of this would be wrapped in a new format reader framework that accepts a schema. That part has stalled and is not yet committed. If we can push this idea forward, perhaps I can get off the dime and do a few more PRs to get the rest of the framework into Drill for your use.
>
> The next problem is how to use Drill to build the schema. How does the user ask to gather the schema? It would be great to have a DESCRIBE statement that outputs the schema Drill discovered, then maybe CREATE SCHEMA <file name> AS DESCRIBE ... The CREATE VIEW statement could act as a prototype.
>
> Drill is distributed, so two users could do the CREATE SCHEMA at the same time, and in concert with someone doing a query on that same directory. Drill has no good synchronization solution. Since this seems not to be a problem for views, perhaps things will work for schemas. (Both are a single file.)
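[Editor's note: the "result set loader" idea above -- a declared schema driving per-column string-to-type conversion -- can be sketched in plain Java. This is a hand-rolled illustration, not the actual ResultSetLoader API; all names here are invented.]

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

public class TypedLoaderSketch {

    /** Build one parser per column, chosen from the declared schema,
     *  so a raw CSV string lands as an INT, DOUBLE, or VARCHAR. */
    static Map<String, Function<String, Object>> parsersFor(Map<String, String> schema) {
        Map<String, Function<String, Object>> parsers = new LinkedHashMap<>();
        for (Map.Entry<String, String> col : schema.entrySet()) {
            switch (col.getValue()) {
                case "INT":    parsers.put(col.getKey(), Integer::valueOf); break;
                case "DOUBLE": parsers.put(col.getKey(), Double::valueOf);  break;
                default:       parsers.put(col.getKey(), s -> s);           // VARCHAR
            }
        }
        return parsers;
    }

    public static void main(String[] args) {
        Map<String, String> schema = new LinkedHashMap<>();
        schema.put("id", "INT");
        schema.put("score", "DOUBLE");
        schema.put("name", "VARCHAR");

        Map<String, Function<String, Object>> parsers = parsersFor(schema);
        // One CSV row: all strings on input, typed values on output.
        System.out.println(parsers.get("id").apply("42"));       // 42
        System.out.println(parsers.get("score").apply("3.5"));   // 3.5
        System.out.println(parsers.get("name").apply("drill"));  // drill
    }
}
```

Each reader would register its own converters; the schema only says which one to use for each column.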
> We have had problems with metadata because a refresh updates multiple files, and the lack of synchronization has been a lingering problem.
>
> Finally, I'd suggest ensuring your format can be created by hand or by other tools. Heck, if you want to make a big impact, work with the Arrow project to define a cross-project schema format that can describe either Arrow or Drill schemas.
>
> Thanks,
> - Paul
>
>
> On Tuesday, October 9, 2018, 8:31:45 PM PDT, Jean-Claude Cote <[email protected]> wrote:
>
> I'm writing a msgpack reader, and in doing so I noticed that the JSON reader will put in an INT column placeholder when no records match a select statement like "select str from ..." because the str field is not seen in the first batch.
>
> However, this is problematic: in the first batch it is not known what data type the column is, and Drill will throw an error if the column turns out to be a VARCHAR in the second batch.
>
> To work around these types of issues, I decided to add a "learning schema mode" to my msgpack reader. Essentially, in learning mode you can feed it records that you know are complete and valid. The reader will accumulate, merge, and write the resulting schema to disk.
>
> Once the schema is learned, you can configure the msgpack reader to be in "use schema mode". In this mode the reader will load the schema and apply it to the writer after each batch. Any missing column, like the str column mentioned above, will now be defined in the writer, and with the proper type, VARCHAR. Drill will now be able to read files that are missing the str column in the first few thousand rows.
>
> One interesting thing about my implementation is that it uses Drill's own schema, i.e. MaterializedField.
> So, for example, to save the schema I do:
>
>     SerializedField serializedMapField =
>         writer.getMapVector().getField().getSerializedField();
>     String data = TextFormat.printToString(serializedMapField);
>     IOUtils.write(data, out);
>
> To apply the schema to the writer, I walk the schema and create the corresponding maps, lists, bigints, varchars, etc.:
>
>     if (mapWriter != null) {
>         mapWriter.varChar(name);
>     } else {
>         listWriter.varChar();
>     }
>
> Because I use the MaterializedFields in the writer to save the schema, and use them to apply the schema back to the writer, I think this solution could be used for the JSON reader and potentially others.
>
> Another benefit I see in using a schema is that you can also skip over records whose str column is of type BINARY, FLOAT8, or MAP, because those do not match the schema, which says str should be of type VARCHAR.
>
> I have a working proof of concept at
> https://github.com/jcmcote/drill-1/tree/master/contrib/format-msgpack.
> There are many test cases showing how building a schema enables Drill to read files that it would not normally be able to.
>
> I would greatly appreciate some feedback. Is this a good idea? Is it okay to fill out the schema of the writer like I did? Any negative side effects?
>
> Thank you,
> jc
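[Editor's note: the record-skipping benefit JC mentions can be illustrated with a small, self-contained sketch. This is plain Java with invented names, not the msgpack reader's actual code: given a declared schema, rows whose value types disagree with it are dropped instead of failing the query.]

```java
import java.util.Map;

public class SchemaFilterSketch {

    /** True if every present value matches the declared column type.
     *  Missing columns are fine: the schema supplies their type anyway. */
    static boolean conforms(Map<String, Object> record, Map<String, Class<?>> schema) {
        for (Map.Entry<String, Class<?>> col : schema.entrySet()) {
            Object v = record.get(col.getKey());
            if (v != null && !col.getValue().isInstance(v)) {
                return false;   // e.g. schema says VARCHAR but the value is a number
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, Class<?>> schema = Map.of("str", String.class);
        System.out.println(conforms(Map.of("str", "ok"), schema));   // true: matches
        System.out.println(conforms(Map.of("str", 1.5), schema));    // false: skip it
        System.out.println(conforms(Map.of("other", 1), schema));    // true: str missing
    }
}
```

A reader in "use schema mode" could run such a check per record and silently skip (or count) the non-conforming ones.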
