Paul, sounds good. I like the idea of the mock scanner being done first since, besides CSV and JSON, other readers would have to be updated as well. Could you please share the Jira number(s), if any, so I can follow them?

Kind regards,
Arina

On Thu, Oct 11, 2018 at 8:52 AM Paul Rogers <[email protected]> wrote:

> Hi JC,
>
> Drill's complex types can be a bit confusing. Note that, in your example,
> for the REPEATED BIGINT, we know that this is an array (REPEATED) and we
> know the type of each element (BIGINT).
>
> But that REPEATED LIST, it is a list of ... what? The element type is
> missing.
>
> This is not the only hole. The UNION type has a list of child types which
> tell you the types in the UNION. But, if the UNION's child type is a MAP,
> that type does not include the full MAP schema. A LIST is a list of a
> single type, or a LIST of UNIONs. It has the same schema ambiguity.
>
> The TupleSchema mechanism fills in this gap. But, for your use,
> MaterializedField should be fine because you probably don't want to use
> the "obscure" types. Frankly, LIST, REPEATED LIST and UNION are still
> pretty broken. I would recommend sticking with the scalar types (REQUIRED,
> OPTIONAL, REPEATED) and MAP (REQUIRED and REPEATED; there is no OPTIONAL).
> I ran into bug after bug when trying to use LIST or UNION. You can
> populate them, but some of the "fancier" operators (sort, hash join,
> aggregation) can't handle them yet.
>
> Can you explain a bit more the problem you ran into with the SchemaBuilder
> (the one that uses TupleSchema)? It is supposed to handle all types. I'd
> like to fix any issues you may have found.
>
> Just to give a bit more background on the tuple schema and related
> classes... The builder creates a schema that can be used with the RowSet
> class to create a record batch that matches the schema. The RowSet
> provides column writers to populate your record batch, and column readers
> to read it. The column accessors convert between Java types and vector
> types and can provide the custom type conversion I mentioned.
>
> For simple cases (working with a few types), the simple mechanism shown in
> the log reader works well. (It is what we explain in the Drill book.) But,
> as you add types, especially structured types, things get pretty complex.
> The RowSet family handles all that cruft for you.
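>
> To make that concrete, here is a minimal sketch of the pattern the unit
> tests use. (A sketch only: the exact package and method names have moved
> between Drill versions, and "allocator" stands for a BufferAllocator
> obtained from the test fixture.)
>
>   import org.apache.drill.common.types.TypeProtos.MinorType;
>   import org.apache.drill.exec.record.metadata.TupleMetadata;
>   import org.apache.drill.test.rowSet.RowSet.SingleRowSet;
>   import org.apache.drill.test.rowSet.RowSetBuilder;
>   import org.apache.drill.test.rowSet.schema.SchemaBuilder;
>
>   // Build a schema: one required, one nullable, one repeated column.
>   TupleMetadata schema = new SchemaBuilder()
>       .add("id", MinorType.BIGINT)            // REQUIRED BIGINT
>       .addNullable("str", MinorType.VARCHAR)  // OPTIONAL VARCHAR
>       .addArray("nums", MinorType.BIGINT)     // REPEATED BIGINT
>       .buildSchema();
>
>   // Use the schema to create and populate a record batch; the column
>   // writers convert the Java values to the proper vector types.
>   SingleRowSet batch = new RowSetBuilder(allocator, schema)
>       .addRow(1L, "hello", new long[] { 10, 20 })
>       .addRow(2L, null, new long[] {})
>       .build();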
>
> The part I still need to add is the "result set loader" which goes one
> step further: it can limit the memory taken by a record batch. Most
> readers today use a fixed number, say 4K records. 4K of INTs is pretty
> small; 4K of 1 MB images is pretty big. The result set loader works
> against a memory limit (20 MB, say) and automatically limits the records
> per batch to that memory limit.
>
> Thanks for doing the PR. Will be great to see what you've created.
>
> Thanks,
> - Paul
>
> On Wednesday, October 10, 2018, 7:59:06 PM PDT, Jean-Claude Cote
> <[email protected]> wrote:
>
> Hey Paul,
>
> You mentioned that
>
> "But, for a LIST, the Materialized field does not include the child types"
>
> However, MaterializedField does have type information for child types. You
> can see it in this example. I think it has all the relevant information.
> Anyways, all the test cases I've tried so far are working.
>
> child {
>   major_type {
>     minor_type: LIST
>     mode: REPEATED
>   }
>   name_part {
>     name: "arrayOfArray"
>   }
>   child {
>     major_type {
>       minor_type: BIGINT
>       mode: REPEATED
>     }
>     name_part {
>       name: "$data$"
>     }
>     child {
>       major_type {
>         minor_type: BIGINT
>         mode: REQUIRED
>         precision: 0
>         scale: 0
>       }
>       name_part {
>         name: "$data$"
>       }
>     }
>   }
> }
>
> You also mention that I could leverage the "SchemaBuilder" and
> "TupleSchema". I did try to use the TupleSchema but backed out of using
> it. I think my issue was navigating down into lists. Anyways, in the end,
> using a MAP MaterializedField to represent my row seems to do the job.
>
> You also mention:
>
> "The next problem is how to use Drill to build the schema? How does the
> user specify to gather the schema? Would be great to have a DESCRIBE
> statement that outputs the schema you discovered. Then maybe CREATE SCHEMA
> <file name> AS DESCRIBE ... The CREATE VIEW statement could act as a
> prototype."
>
> Totally agree. I initially thought that it would be great to trigger the
> schema learning like that but did not really know how it could be done. So
> in the end I just used a format plugin property to toggle it on/off.
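>
> Roughly like the sketch below. (A sketch only: the property name
> "learnSchema" here is just illustrative, not necessarily what my plugin
> uses.) Drill format plugin configs are plain Jackson-serializable beans:
>
>   import com.fasterxml.jackson.annotation.JsonTypeName;
>   import org.apache.drill.common.logical.FormatPluginConfig;
>
>   @JsonTypeName("msgpack")
>   public class MsgpackFormatConfig implements FormatPluginConfig {
>     // When true, the reader accumulates and merges the schema of the
>     // records it sees and writes the result to disk; when false, it
>     // loads that schema and applies it to each batch.
>     public boolean learnSchema;
>
>     // equals() and hashCode() are needed so the plugin registry can
>     // compare configs; omitted here for brevity.
>   }
>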
> Tomorrow I'm going to submit a pull request so you guys can have a better
> look at what I've done. I'd really like my zstandard codec and msgpack
> support to make its way into the Drill code base.
>
> Cheers,
> jc
>
> On Wed, Oct 10, 2018 at 9:35 AM Arina Yelchiyeva
> <[email protected]> wrote:
>
> > This somehow correlates with two projects which are currently actively
> > being investigated / prototyped:
> > 1. The Drill metastore (DRILL-6552).
> > 2. Providing the schema from the query (casts, hints).
> > The second one will allow providing the schema using hints, as well as
> > from a file.
> > Regarding how to use Drill to propose the schema, there would be an
> > ANALYZE TABLE command which will output the schema.
> >
> > Anyway, both of them in the end will need to feed the schema to the
> > readers, so at this point Paul's new format reader framework that
> > accepts a schema would become really important.
> > @Paul, do you have any ETA for when you will be able to submit the PRs?
> > Maybe also do some presentation? Can you please share the Jira number(s)
> > as well?
> >
> > Kind regards,
> > Arina
> >
> > On Wed, Oct 10, 2018 at 7:31 AM Paul Rogers <[email protected]>
> > wrote:
> >
> > > Hi JC,
> > >
> > > Very cool indeed. You are the man!
> > >
> > > Ted's been advocating for this approach for as long as I can remember
> > > (2+ years). You're well on your way to solving the JSON problems that
> > > I documented a while back in DRILL-4710 and summarized as "Drill
> > > can't predict the future." Basically, without a schema, Drill is
> > > forced to make a column schema decision (use nullable INT!) before
> > > Drill has the full information.
> > >
> > > And, because Drill is distributed (have you tested this case yet?), a
> > > scan over a large file, or multiple files, will be spread across
> > > readers, each attempting to make independent decisions. These
> > > decisions will conflict (reader 1 saw your VARCHAR column, reader 2
> > > didn't and guessed nullable INT). We've always hoped that Drill
> > > operators can sort out the mess and do the right thing -- something
> > > that is probably impossible in the general case.
> > >
> > > So, having a schema available is the simplest possible solution: all
> > > readers agree on the schema and the rest of the Drill operators are
> > > happy because they get consistent inputs. Hats off to you for
> > > creating this proof of concept to move this conversation forward.
> > >
> > > Your MaterializedField approach is clever. It is somewhat awkward for
> > > this use case, but it can work. It is awkward because a
> > > properly-constructed MaterializedField for a MAP, say, includes the
> > > MAP schema. But, for a LIST, the MaterializedField does not include
> > > the child types. Ignoring these long-standing corner cases, it will
> > > get you 90% of where you want to go.
> > >
> > > Just FYI, I had been working on a project to help with this stuff.
> > > I've checked in a newer schema mechanism that handles all of Drill's
> > > nuances: unions, (non-repeated) LISTs, REPEATED LISTs, etc. Search
> > > for "SchemaBuilder" and "TupleSchema". (There are two SchemaBuilder
> > > classes; you want the one that builds a TupleSchema.) Then, if you
> > > search for references, you'll find lots of unit tests that use this
> > > to define a schema. There is no serialization format, but you could
> > > easily add that.
> > >
> > > The project includes a "result set loader" that provides a convenient
> > > way to convert from Java types to Drill types. It is designed to
> > > allow, say, converting a String from a CSV file into a VARCHAR, INT,
> > > DOUBLE or whatever. Each reader would need its own string-to-whatever
> > > parser, but these are meant to be easy to add.
> > >
> > > The TupleSchema provides the input to create a ResultSetLoader for
> > > that schema, kind of how you're using a map, but far more simply.
> > > There are lots of unit tests that illustrate how this works.
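> > >
> > > In a reader, the loop looks roughly like the sketch below. (This is
> > > the work-in-progress API, so names may still change; the loader
> > > construction is elided because the scan framework will provide it,
> > > and hasMore()/nextId()/nextStr() stand in for your reader's own
> > > parsing.)
> > >
> > >   ResultSetLoader loader = ...; // supplied by the scan framework
> > >   RowSetLoader writer = loader.writer();
> > >
> > >   loader.startBatch();
> > >   // isFull() becomes true once the batch hits the memory limit.
> > >   while (!writer.isFull() && hasMore()) {
> > >     writer.start();
> > >     writer.scalar("id").setLong(nextId());
> > >     writer.scalar("str").setString(nextStr());
> > >     writer.save();
> > >   }
> > >   VectorContainer batch = loader.harvest();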
> > >
> > > All this would be wrapped in a new format reader framework that
> > > accepts a schema. That part has stalled and is not yet committed. If
> > > we can push this idea forward, perhaps I can get off the dime and do
> > > a few more PRs to get the rest of the framework into Drill for your
> > > use.
> > >
> > > The next problem is how to use Drill to build the schema? How does
> > > the user specify to gather the schema? Would be great to have a
> > > DESCRIBE statement that outputs the schema you discovered. Then maybe
> > > CREATE SCHEMA <file name> AS DESCRIBE ... The CREATE VIEW statement
> > > could act as a prototype.
> > >
> > > Drill is distributed, and two users could do the CREATE SCHEMA at the
> > > same time, and in concert with someone doing a query on that same
> > > directory. Drill has no good synchronization solution. Since this
> > > seems to not be a problem for views, perhaps things will work for
> > > schemas. (Both are a single file.) We have had problems with metadata
> > > because a refresh updates multiple files, and the lack of
> > > synchronization has been a lingering problem.
> > >
> > > Finally, I'd suggest ensuring your format can be created by hand or
> > > by other tools. Heck, if you want to make a big impact, work with the
> > > Arrow project to define a cross-project schema format that can be
> > > used to describe either Arrow or Drill schemas.
> > >
> > > Thanks,
> > > - Paul
> > >
> > > On Tuesday, October 9, 2018, 8:31:45 PM PDT, Jean-Claude Cote
> > > <[email protected]> wrote:
> > >
> > > I'm writing a msgpack reader, and in doing so I noticed that the JSON
> > > reader will put an INT column placeholder when no records match a
> > > select statement like "select str from ..." when the str field is not
> > > seen in the first batch.
> > >
> > > However, this is problematic because in the first batch it is not
> > > known what data type the column is, and Drill will throw an error if
> > > the column turns out to be a VARCHAR in the second batch.
> > >
> > > In order to work around these types of issues, I decided to add a
> > > "learning schema mode" to my msgpack reader. Essentially, in learning
> > > mode you can feed it records that you know are complete and valid.
> > > The reader will accumulate, merge and write the resulting schema to
> > > disk.
> > >
> > > Once the schema is learned, you can configure the msgpack reader to
> > > be in "use schema mode". In this mode the reader will load the schema
> > > and apply it to the writer after each batch. Any missing column, like
> > > the str column mentioned above, will now be defined in the writer
> > > with the proper type, VARCHAR. Drill will now be able to read files
> > > in which the str column is missing from the first few thousand rows.
> > >
> > > One interesting thing about the implementation I did is that it uses
> > > Drill's own schema, i.e., MaterializedField. So, for example, to save
> > > the schema I do:
> > >
> > >   SerializedField serializedMapField =
> > >       writer.getMapVector().getField().getSerializedField();
> > >   String data = TextFormat.printToString(serializedMapField);
> > >   IOUtils.write(data, out);
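> > >
> > > Loading it back is just the reverse; a sketch, assuming the same
> > > protobuf text format and a UTF-8 input stream:
> > >
> > >   // Parse the text-format SerializedField and rebuild the
> > >   // MaterializedField schema from it.
> > >   String text = IOUtils.toString(in, StandardCharsets.UTF_8);
> > >   SerializedField.Builder builder = SerializedField.newBuilder();
> > >   TextFormat.merge(text, builder);
> > >   MaterializedField schema = MaterializedField.create(builder.build());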
> > >
> > > To apply the schema to the writer, I walk the schema and create the
> > > corresponding maps, lists, bigints, varchars, etc.:
> > >
> > >   if (mapWriter != null) {
> > >     // Inside a map: create the VARCHAR child by name.
> > >     mapWriter.varChar(name);
> > >   } else {
> > >     // Inside a list: append a VARCHAR writer.
> > >     listWriter.varChar();
> > >   }
> > >
> > > Because I use the MaterializedFields in the writer to save the
> > > schema, and use them to apply the schema back to the writer, I think
> > > this solution could be used for the JSON reader and potentially
> > > others.
> > >
> > > Another benefit I see in using a schema is that you can also skip
> > > over records whose str column is of type BINARY, FLOAT8 or MAP,
> > > because those do not match the schema, which says str should be of
> > > type VARCHAR.
> > >
> > > I have a working proof of concept in
> > > https://github.com/jcmcote/drill-1/tree/master/contrib/format-msgpack.
> > > There are many test cases showing how building a schema enables Drill
> > > to read files that it would not normally be able to read.
> > >
> > > I would greatly appreciate some feedback. Is this a good idea? Is it
> > > okay to fill out the schema of the writer like I did? Are there any
> > > negative side effects?
> > >
> > > Thank you,
> > > jc