I'm pretty sure I can get the type of the list. Here's an example of the schema I use. The root map corresponds to the row. Notice how arrayOfArray says it is LIST:REPEATED. But if you drill down and ask for the first child of that MaterializedField, you get the inner array, named $data$, which is of type BIGINT:REPEATED, so I know it's another array. Drill down once more and you get to BIGINT:REQUIRED, which is the type of the elements.

  [`` (MAP:REQUIRED), children=(
    [`arrayOfArray` (LIST:REPEATED), children=(
      [`$data$` (BIGINT:REPEATED), children=(
        [`$data$` (BIGINT(0, 0):REQUIRED)])])],
    [`arrayOfMap` (MAP:REPEATED), children=(
      [`raw` (VARCHAR:OPTIONAL)], [`bin` (VARBINARY:OPTIONAL)],
      [`int` (BIGINT:OPTIONAL)], [`big` (BIGINT:OPTIONAL)],
      [`byt` (BIGINT:OPTIONAL)], [`lon` (BIGINT:OPTIONAL)],
      [`sho` (BIGINT:OPTIONAL)], [`dou` (FLOAT8:OPTIONAL)],
      [`flo` (FLOAT8:OPTIONAL)], [`str` (VARCHAR:OPTIONAL)])],
    [`mapOfMap` (MAP:REQUIRED), children=(
      [`aMap` (MAP:REQUIRED), children=(
        [`raw` (VARCHAR:OPTIONAL)], [`bin` (VARBINARY:OPTIONAL)],
        [`int` (BIGINT:OPTIONAL)], [`big` (BIGINT:OPTIONAL)],
        [`byt` (BIGINT:OPTIONAL)], [`lon` (BIGINT:OPTIONAL)],
        [`sho` (BIGINT:OPTIONAL)], [`dou` (FLOAT8:OPTIONAL)],
        [`flo` (FLOAT8:OPTIONAL)], [`str` (VARCHAR:OPTIONAL)])])],
    [`mapWithArray` (MAP:REQUIRED), children=(
      [`anArray` (VARCHAR:REPEATED), children=(
        [`$data$` (VARCHAR(0, 0):REQUIRED)])],
      [`aString` (VARCHAR:OPTIONAL)])],
    ...
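For illustration, a minimal sketch of that drill-down. The helper name is hypothetical; the only API it assumes is MaterializedField.getChildren() and getType(), both of which exist in Drill's record package:

    import org.apache.drill.common.types.TypeProtos.MajorType;
    import org.apache.drill.exec.record.MaterializedField;

    // Hypothetical helper: follow the first child (the $data$ field of each
    // nesting level) until no children remain, then report the leaf type.
    static MajorType leafType(MaterializedField field) {
      MaterializedField current = field;
      while (!current.getChildren().isEmpty()) {
        current = current.getChildren().iterator().next();
      }
      return current.getType(); // e.g. BIGINT:REQUIRED for arrayOfArray
    }

Note this walk only makes sense for list-like fields; for a MAP you would iterate over all children rather than take the first.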
The problem I had with the TupleSchema is that it seems designed to build a row. In my use case I want each cell in the row to have every possible leaf of the structure, including arrays of arrays of arrays, etc. The TupleSchema did not let me drill down into those structures, so I decided to use MaterializedField directly, since it lets me navigate down to the leaves.

Interesting that the Result Set Loader will have a dynamic batch size. I like that. I've been wondering what to set my msgpack batch size to.

On Thu, Oct 11, 2018 at 1:52 AM Paul Rogers <[email protected]> wrote:

> Hi JC,
>
> Drill's complex types can be a bit confusing. Note that, in your example,
> for the REPEATED BIGINT, we know that this is an array (REPEATED) and we
> know the type of each element (BIGINT).
>
> But that REPEATED LIST, it is a list of ... what? The element type is
> missing.
>
> This is not the only hole. The UNION type has a list of child types which
> tells you the types in the UNION. But if the UNION's child type is a MAP,
> that type does not include the full MAP schema. A LIST is a list of a
> single type, or a LIST of UNIONs. It has the same schema ambiguity.
>
> The TupleSchema mechanism fills in this gap. But, for your use,
> MaterializedField should be fine because you probably don't want to use
> the "obscure" types. Frankly, LIST, REPEATED LIST and UNION are still
> pretty broken. I would recommend sticking with the scalar types (REQUIRED,
> OPTIONAL, REPEATED) and MAP (REQUIRED and REPEATED; there is no OPTIONAL).
> I ran into bug after bug when trying to use LIST or UNION. You can
> populate them, but some of the "fancier" operators (sort, hash join,
> aggregation) can't handle them yet.
>
> Can you explain a bit more the problem you ran into with the SchemaBuilder
> (the one that uses TupleSchema)? It is supposed to handle all types. I'd
> like to fix any issues you may have found.
>
> Just to give a bit more background on the tuple schema and related
> classes... The builder creates a schema that can be used with the RowSet
> class to create a record batch that matches the schema. The RowSet
> provides column writers to populate your record batch, and column readers
> to read it. The column accessors convert between Java types and vector
> types and can provide the custom type conversion I mentioned.
>
> For simple cases (working with a few types), the simple mechanism shown in
> the log reader works well. (It is what we explain in the Drill book.) But
> as you add types, especially structured types, things get pretty complex.
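For readers following along, a rough sketch of the SchemaBuilder usage Paul describes, covering a nested map and the repeated-list ("array of array") shape. The fluent method names (add, addNullable, addArray, addMap, addRepeatedList, resumeSchema, buildSchema) and the import paths are assumptions based on Drill's RowSet test framework and may differ slightly by version:

    import org.apache.drill.common.types.TypeProtos.MinorType;
    import org.apache.drill.exec.record.metadata.TupleMetadata;
    import org.apache.drill.test.rowSet.schema.SchemaBuilder;

    // Sketch only: a schema matching a few of the fields in JC's dump.
    TupleMetadata schema = new SchemaBuilder()
        .addArray("anArray", MinorType.VARCHAR)   // VARCHAR:REPEATED
        .addMap("mapWithArray")                   // MAP:REQUIRED
          .addNullable("aString", MinorType.VARCHAR)
          .resumeSchema()
        .addRepeatedList("arrayOfArray")          // LIST:REPEATED
          .addArray(MinorType.BIGINT)             // inner BIGINT array
          .resumeSchema()
        .buildSchema();

The fluent style mirrors the nesting of the schema itself, which is what should make the nested-list cases tractable.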
> The RowSet family handles all that cruft for you.
>
> The part I still need to add is the "result set loader", which goes one
> step further: it can limit the memory taken by a record batch. Most
> readers today use a fixed number, say 4K records. 4K of INTs is pretty
> small; 4K of 1 MB images is pretty big. The Result Set Loader works
> against a memory limit (20 MB, say) and automatically limits records per
> batch to that memory limit.
>
> Thanks for doing the PR. Will be great to see what you've created.
>
> Thanks,
> - Paul
>
> On Wednesday, October 10, 2018, 7:59:06 PM PDT, Jean-Claude Cote
> <[email protected]> wrote:
>
> Hey Paul,
>
> You mentioned that
>
> "But, for a LIST, the Materialized field does not include the child types"
>
> However, MaterializedField does have type information for child types.
> You can see it in this example; I think it has all the relevant
> information. Anyway, all the test cases I've tried so far are working.
>
> child {
>   major_type {
>     minor_type: LIST
>     mode: REPEATED
>   }
>   name_part {
>     name: "arrayOfArray"
>   }
>   child {
>     major_type {
>       minor_type: BIGINT
>       mode: REPEATED
>     }
>     name_part {
>       name: "$data$"
>     }
>     child {
>       major_type {
>         minor_type: BIGINT
>         mode: REQUIRED
>         precision: 0
>         scale: 0
>       }
>       name_part {
>         name: "$data$"
>       }
>     }
>   }
> }
>
> You also mention that I could leverage the "SchemaBuilder" and
> "TupleSchema". I did try to use the TupleSchema but backed out of using
> it. I think my issue was navigating down into lists. Anyway, in the end,
> using a MAP MaterializedField to represent my row seems to do the job.
>
> You also mention:
> "The next problem is how to use Drill to build the schema? How does the
> user specify to gather the schema? Would be great to have a DESCRIBE
> statement that outputs the schema you discovered. Then maybe CREATE SCHEMA
> <file name> AS DESCRIBE ... The CREATE VIEW statement could act as a
> prototype."
>
> Totally agree. I initially thought that it would be great to trigger the
> schema learning like that, but did not really know how it could be done.
> So in the end I just used a format plugin property to toggle it on/off.
>
> Tomorrow I'm going to submit a pull request so you guys can have a better
> look at what I've done. I'd really like my zstandard codec and msgpack
> support to make its way into the Drill code base.
>
> Cheers
> jc
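A sketch of how a schema saved in that protobuf text form could be read back into a MaterializedField. The assumption here is that MaterializedField.create(SerializedField) is the inverse of getSerializedField(); the variable names are illustrative:

    import com.google.protobuf.TextFormat;
    import org.apache.drill.exec.proto.UserBitShared.SerializedField;
    import org.apache.drill.exec.record.MaterializedField;

    // Parse the protobuf text written by TextFormat.printToString(), then
    // rebuild the MaterializedField tree, children included. Note that
    // TextFormat.merge() throws a checked ParseException.
    SerializedField.Builder builder = SerializedField.newBuilder();
    TextFormat.merge(schemaText, builder);  // schemaText: the saved file contents
    MaterializedField root = MaterializedField.create(builder.build());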
> On Wed, Oct 10, 2018 at 9:35 AM Arina Yelchiyeva
> <[email protected]> wrote:
>
> > Somehow this correlates with two projects which are currently being
> > actively investigated / prototyped:
> > 1. Drill metastore (DRILL-6552)
> > 2. Providing schema from the query (casts, hints).
> > The second one will allow providing a schema using hints, as well as
> > from a file.
> > Regarding how to use Drill to propose the schema, there would be an
> > ANALYZE TABLE command which will output the schema.
> >
> > Anyway, both of them in the end will need to feed the schema to the
> > readers, so at this point Paul's new format reader framework that
> > accepts a schema would become really important.
> > @Paul, do you have any ETA for when you will be able to submit the PRs?
> > Maybe also do some presentation? Can you please share the Jira
> > number(s) as well?
> >
> > Kind regards,
> > Arina
> >
> > On Wed, Oct 10, 2018 at 7:31 AM Paul Rogers <[email protected]>
> > wrote:
> >
> > > Hi JC,
> > >
> > > Very cool indeed. You are the man!
> > >
> > > Ted's been advocating for this approach for as long as I can remember
> > > (2+ years). You're well on your way to solving the JSON problems that
> > > I documented a while back in DRILL-4710 and summarized as "Drill
> > > can't predict the future." Basically, without a schema, Drill is
> > > forced to make a column schema decision (use nullable INT!) before
> > > Drill has the full information.
> > >
> > > And, because Drill is distributed (have you tested this case yet?), a
> > > scan over a large file, or multiple files, will be spread across
> > > readers, each attempting to make independent decisions. These
> > > decisions will conflict (reader 1 saw your VARCHAR column, reader 2
> > > didn't and guessed nullable INT). We've always hoped that Drill
> > > operators can sort out the mess and do the right thing, something
> > > that is probably impossible in the general case.
> > >
> > > So, having a schema available is the simplest possible solution: all
> > > readers agree on the schema and the rest of the Drill operators are
> > > happy because they get consistent inputs. Hats off to you for
> > > creating this proof of concept to move this conversation forward.
> > >
> > > Your MaterializedField approach is clever. It is somewhat awkward for
> > > this use case, but can work. It is awkward because a
> > > properly-constructed MaterializedField for a MAP, say, includes the
> > > map schema. But, for a LIST, the MaterializedField does not include
> > > the child types. Ignoring these long-standing corner cases, it will
> > > get you 90% of where you want to go.
> > >
> > > Just FYI, I had been working on a project to help with this stuff.
> > > I've checked in a newer schema mechanism that handles all of Drill's
> > > nuances: unions, (non-repeated) LISTs, REPEATED LISTs, etc. Search
> > > for "SchemaBuilder" and "TupleSchema". (There are two SchemaBuilder
> > > classes; you want the one that builds a TupleSchema.) Then if you
> > > search for references, you'll find lots of unit tests that use this
> > > to define a schema. There is no serialization format, but you could
> > > easily add that.
> > >
> > > The project includes a "result set loader" that provides a convenient
> > > way to convert from Java types to Drill types. It is designed to
> > > allow, say, converting a String from a CSV file into a VARCHAR, INT,
> > > DOUBLE or whatever. Each reader would need its own string-to-whatever
> > > parser, but these are meant to be easy to add.
> > >
> > > The TupleSchema provides the input to create a ResultSetLoader for
> > > that schema, kind of like how you're using a map, but far more
> > > simply. There are lots of unit tests that illustrate how this works.
> > >
> > > All this would be wrapped in a new format reader framework that
> > > accepts a schema. That part has stalled and is not yet committed. If
> > > we can push this idea forward, perhaps I can get off the dime and do
> > > a few more PRs to get the rest of the framework into Drill for your
> > > use.
> > >
> > > The next problem is how to use Drill to build the schema? How does
> > > the user specify to gather the schema? Would be great to have a
> > > DESCRIBE statement that outputs the schema you discovered. Then maybe
> > > CREATE SCHEMA <file name> AS DESCRIBE ... The CREATE VIEW statement
> > > could act as a prototype.
> > >
> > > Drill is distributed, and two users could do the CREATE SCHEMA at the
> > > same time, and in concert with someone doing a query on that same
> > > directory. Drill has no good synchronization solution. Since this
> > > seems to not be a problem for views, perhaps things will work for
> > > schemas. (Both are a single file.) We have had problems with metadata
> > > because refreshing it updates multiple files, and the lack of
> > > synchronization has been a lingering problem.
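To make the result set loader idea Paul describes above concrete, a rough sketch of the load loop. The class and method names (ResultSetLoaderImpl, OptionBuilder, RowSetLoader) are assumptions about that still-evolving framework, and `reader`, `allocator`, and `schema` stand in for a record source, a BufferAllocator, and the TupleMetadata built earlier:

    // Sketch only: fill a batch until the loader says it is full, whether
    // by row count or by the configured memory budget.
    ResultSetLoader loader = new ResultSetLoaderImpl(allocator,
        new OptionBuilder().setSchema(schema).build());
    loader.startBatch();
    RowSetLoader rowWriter = loader.writer();
    while (!rowWriter.isFull() && reader.hasNext()) {  // isFull() enforces the limit
      rowWriter.start();
      rowWriter.scalar("str").setString(reader.nextString());  // hypothetical source
      rowWriter.save();
    }
    VectorContainer batch = loader.harvest();  // the completed record batch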
> > > Finally, I'd suggest ensuring your format can be created by hand or
> > > by other tools. Heck, if you want to make a big impact, work with the
> > > Arrow project to define a cross-project schema format that can be
> > > used to describe either Arrow or Drill schemas.
> > >
> > > Thanks,
> > > - Paul
> > >
> > > On Tuesday, October 9, 2018, 8:31:45 PM PDT, Jean-Claude Cote
> > > <[email protected]> wrote:
> > >
> > > I'm writing a msgpack reader, and in doing so I noticed that the JSON
> > > reader will put an INT column placeholder when no records match a
> > > select statement like "select str from ..." because the str field is
> > > not seen in the first batch.
> > >
> > > However, this is problematic because in the first batch it is not
> > > known what data type the column is, and Drill will throw an error if
> > > the column turns out to be a VARCHAR in the second batch.
> > >
> > > In order to work around these types of issues, I decided to add a
> > > "learning schema mode" to my msgpack reader. Essentially, in learning
> > > mode you can feed it records that you know are complete and valid.
> > > The reader will accumulate, merge and write the resulting schema to
> > > disk.
> > >
> > > Once the schema is learned, you can configure the msgpack reader to
> > > be in "use schema mode". In this mode the reader will load the schema
> > > and apply it to the writer after each batch. Any missing column, like
> > > the str column mentioned above, will now be defined in the writer,
> > > and with the proper type, VARCHAR. Drill will now be able to read
> > > files which have missing str columns in the first few thousand rows.
> > >
> > > One interesting thing about my implementation is that it uses Drill's
> > > own schema representation, i.e. MaterializedField. So, for example,
> > > to save the schema I do:
> > >
> > >   // Serialize the root map's schema to protobuf text format.
> > >   SerializedField serializedMapField =
> > >       writer.getMapVector().getField().getSerializedField();
> > >   String data = TextFormat.printToString(serializedMapField);
> > >   IOUtils.write(data, out);
> > >
> > > To apply the schema to the writer, I walk the schema and create the
> > > corresponding maps, lists, bigints, varchars, etc.:
> > >
> > >   // At each leaf, declare the field on the enclosing map or list writer.
> > >   if (mapWriter != null) {
> > >     mapWriter.varChar(name);
> > >   } else {
> > >     listWriter.varChar();
> > >   }
> > >
> > > Because I use the materialized fields in the writer to save the
> > > schema, and use them to apply the schema back to the writer, I think
> > > this solution could be used for the JSON reader and potentially
> > > others.
> > >
> > > Another benefit I see in using a schema is that you can also skip
> > > over records whose str column is of type BINARY, FLOAT8 or MAP,
> > > because those do not match the schema, which says str should be of
> > > type VARCHAR.
> > >
> > > I have a working proof of concept in
> > > https://github.com/jcmcote/drill-1/tree/master/contrib/format-msgpack.
> > > There are many test cases showing how building a schema enables Drill
> > > to read files that it would not normally be able to.
> > >
> > > I would greatly appreciate some feedback. Is this a good idea? Is it
> > > okay to fill out the schema of the writer like I did? Any negative
> > > side effects?
> > >
> > > Thank you
> > > jc
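Expanding on the fragment above, a hypothetical sketch of the recursive walk that applies a saved MaterializedField schema to Drill's complex writers. Only a few minor types are shown; the method itself and its dispatch are illustrative, while MapWriter.map(name)/list(name) and the per-type field methods are part of Drill's complex writer API:

    import org.apache.drill.exec.record.MaterializedField;
    import org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter;
    import org.apache.drill.exec.vector.complex.writer.BaseWriter.MapWriter;

    // Hypothetical sketch: declare each field of the saved schema on the
    // writer tree so every column exists before the first batch is read.
    static void applySchema(MaterializedField field, MapWriter mapWriter,
                            ListWriter listWriter) {
      String name = field.getName();
      switch (field.getType().getMinorType()) {
      case MAP: {
        MapWriter child = mapWriter != null ? mapWriter.map(name) : listWriter.map();
        for (MaterializedField c : field.getChildren()) {
          applySchema(c, child, null);
        }
        break;
      }
      case LIST: {
        ListWriter child = mapWriter != null ? mapWriter.list(name) : listWriter.list();
        for (MaterializedField c : field.getChildren()) {
          applySchema(c, null, child);  // children are the $data$ fields
        }
        break;
      }
      case BIGINT:
        if (mapWriter != null) { mapWriter.bigInt(name); } else { listWriter.bigInt(); }
        break;
      case VARCHAR:
        if (mapWriter != null) { mapWriter.varChar(name); } else { listWriter.varChar(); }
        break;
      default:
        // The remaining minor types would be handled the same way; repeated
        // scalars (mode REPEATED) also carry a $data$ child to skip over.
        break;
      }
    }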
