Paul, sounds good. I like the idea of the mock scanner being done first since, besides CSV and JSON, other readers would have to be updated as well. Could you please share the Jira number(s), if any, so I can follow them?

Kind regards,
Arina

On Thu, Oct 11, 2018 at 8:52 AM Paul Rogers <[email protected]> wrote:

> Hi JC,
>
> Drill's complex types can be a bit confusing. Note that, in your example,
> for the REPEATED BIGINT, we know that this is an array (REPEATED) and we
> know the type of each element (BIGINT).
>
> But that REPEATED LIST, it is a list of ... what? The element type is
> missing.
>
> This is not the only hole. The UNION type has a list of child types which
> tell you the types in the UNION. But, if the UNION's child type is a MAP,
> that type does not include the full MAP schema. A LIST is a list of a
> single type, or a LIST of UNIONs. It has the same schema ambiguity.
>
> The TupleSchema mechanism fills in this gap. But, for your use,
> MaterializedField should be fine because you probably don't want to use
> the "obscure" types. Frankly, LIST, REPEATED LIST and UNION are still
> pretty broken. I would recommend sticking with the scalar types (REQUIRED,
> OPTIONAL, REPEATED) and MAP (REQUIRED and REPEATED; there is no OPTIONAL).
> I ran into bug after bug when trying to use LIST or UNION. You can
> populate them, but some of the "fancier" operators (sort, hash join,
> aggregation) can't handle them yet.
>
> Can you explain a bit more the problem you ran into with the SchemaBuilder
> (the one that uses TupleSchema)? It is supposed to handle all types. I'd
> like to fix any issues you may have found.
>
> Just to give a bit more background on the tuple schema and related
> classes... The builder creates a schema that can be used with the RowSet
> class to create a record batch that matches the schema. The RowSet
> provides column writers to populate your record batch, and column readers
> to read it. The column accessors convert between Java types and vector
> types and can provide the custom type conversion I mentioned.
>
> For simple cases (working with a few types), the simple mechanism shown in
> the log reader works well. (It is what we explain in the Drill book.) But,
> as you add types, especially structured types, things get pretty complex.
> The RowSet family handles all that cruft for you.
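>
> To make that concrete, here is a minimal sketch of the pattern the unit
> tests use. (A sketch only: the exact package and method names have moved
> between Drill versions, and "allocator" stands for a BufferAllocator
> obtained from the test fixture.)
>
>   import org.apache.drill.common.types.TypeProtos.MinorType;
>   import org.apache.drill.exec.record.metadata.TupleMetadata;
>   import org.apache.drill.test.rowSet.RowSet.SingleRowSet;
>   import org.apache.drill.test.rowSet.RowSetBuilder;
>   import org.apache.drill.test.rowSet.schema.SchemaBuilder;
>
>   // Build a schema: one required, one nullable, one repeated column.
>   TupleMetadata schema = new SchemaBuilder()
>       .add("id", MinorType.BIGINT)            // REQUIRED BIGINT
>       .addNullable("str", MinorType.VARCHAR)  // OPTIONAL VARCHAR
>       .addArray("nums", MinorType.BIGINT)     // REPEATED BIGINT
>       .buildSchema();
>
>   // Use the schema to create and populate a record batch; the column
>   // writers convert the Java values to the proper vector types.
>   SingleRowSet batch = new RowSetBuilder(allocator, schema)
>       .addRow(1L, "hello", new long[] { 10, 20 })
>       .addRow(2L, null, new long[] {})
>       .build();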
>
> The part I still need to add is the "result set loader" which goes one
> step further: it can limit the memory taken by a record batch. Most
> readers today use a fixed number, say 4K records. 4K of INTs is pretty
> small; 4K of 1 MB images is pretty big. The result set loader works
> against a memory limit (20 MB, say) and automatically limits the records
> per batch to that memory limit.
>
> Thanks for doing the PR. Will be great to see what you've created.
>
> Thanks,
> - Paul
>
> On Wednesday, October 10, 2018, 7:59:06 PM PDT, Jean-Claude Cote
> <[email protected]> wrote:
>
> Hey Paul,
>
> You mentioned that
>
> "But, for a LIST, the Materialized field does not include the child types"
>
> However, MaterializedField does have type information for child types. You
> can see it in this example. I think it has all the relevant information.
> Anyways, all the test cases I've tried so far are working.
>
> child {
>   major_type {
>     minor_type: LIST
>     mode: REPEATED
>   }
>   name_part {
>     name: "arrayOfArray"
>   }
>   child {
>     major_type {
>       minor_type: BIGINT
>       mode: REPEATED
>     }
>     name_part {
>       name: "$data$"
>     }
>     child {
>       major_type {
>         minor_type: BIGINT
>         mode: REQUIRED
>         precision: 0
>         scale: 0
>       }
>       name_part {
>         name: "$data$"
>       }
>     }
>   }
> }
>
> You also mention that I could leverage the "SchemaBuilder" and
> "TupleSchema". I did try to use the TupleSchema but backed out of using
> it. I think my issue was navigating down into lists. Anyways, in the end,
> using a MAP MaterializedField to represent my row seems to do the job.
>
> You also mention:
>
> "The next problem is how to use Drill to build the schema? How does the
> user specify to gather the schema? Would be great to have a DESCRIBE
> statement that outputs the schema you discovered. Then maybe CREATE SCHEMA
> <file name> AS DESCRIBE ... The CREATE VIEW statement could act as a
> prototype."
>
> Totally agree. I initially thought that it would be great to trigger the
> schema learning like that but did not really know how it could be done. So
> in the end I just used a format plugin property to toggle it on/off.
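>
> Roughly like the sketch below. (A sketch only: the property name
> "learnSchema" here is just illustrative, not necessarily what my plugin
> uses.) Drill format plugin configs are plain Jackson-serializable beans:
>
>   import com.fasterxml.jackson.annotation.JsonTypeName;
>   import org.apache.drill.common.logical.FormatPluginConfig;
>
>   @JsonTypeName("msgpack")
>   public class MsgpackFormatConfig implements FormatPluginConfig {
>     // When true, the reader accumulates and merges the schema of the
>     // records it sees and writes the result to disk; when false, it
>     // loads that schema and applies it to each batch.
>     public boolean learnSchema;
>
>     // equals() and hashCode() are needed so the plugin registry can
>     // compare configs; omitted here for brevity.
>   }
>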
> Tomorrow I'm going to submit a pull request so you guys can have a better
> look at what I've done. I'd really like my zstandard codec and msgpack
> support to make its way into the Drill code base.
>
> Cheers,
> jc
>
> On Wed, Oct 10, 2018 at 9:35 AM Arina Yelchiyeva
> <[email protected]> wrote:
>
> > This somehow correlates with two projects which are currently actively
> > being investigated / prototyped:
> > 1. The Drill metastore (DRILL-6552).
> > 2. Providing the schema from the query (casts, hints).
> > The second one will allow providing the schema using hints, as well as
> > from a file.
> > Regarding how to use Drill to propose the schema, there would be an
> > ANALYZE TABLE command which will output the schema.
> >
> > Anyway, both of them in the end will need to feed the schema to the
> > readers, so at this point Paul's new format reader framework that
> > accepts a schema would become really important.
> > @Paul, do you have any ETA for when you will be able to submit the PRs?
> > Maybe also do some presentation? Can you please share the Jira number(s)
> > as well?
> >
> > Kind regards,
> > Arina
> >
> > On Wed, Oct 10, 2018 at 7:31 AM Paul Rogers <[email protected]>
> > wrote:
> >
> > > Hi JC,
> > >
> > > Very cool indeed. You are the man!
> > >
> > > Ted's been advocating for this approach for as long as I can remember
> > > (2+ years). You're well on your way to solving the JSON problems that
> > > I documented a while back in DRILL-4710 and summarized as "Drill
> > > can't predict the future." Basically, without a schema, Drill is
> > > forced to make a column schema decision (use nullable INT!) before
> > > Drill has the full information.
> > >
> > > And, because Drill is distributed (have you tested this case yet?), a
> > > scan over a large file, or multiple files, will be spread across
> > > readers, each attempting to make independent decisions. These
> > > decisions will conflict (reader 1 saw your VARCHAR column, reader 2
> > > didn't and guessed nullable INT). We've always hoped that Drill
> > > operators can sort out the mess and do the right thing -- something
> > > that is probably impossible in the general case.
> > >
> > > So, having a schema available is the simplest possible solution: all
> > > readers agree on the schema and the rest of the Drill operators are
> > > happy because they get consistent inputs. Hats off to you for
> > > creating this proof of concept to move this conversation forward.
> > >
> > > Your MaterializedField approach is clever. It is somewhat awkward for
> > > this use case, but it can work. It is awkward because a
> > > properly-constructed MaterializedField for a MAP, say, includes the
> > > MAP schema. But, for a LIST, the MaterializedField does not include
> > > the child types. Ignoring these long-standing corner cases, it will
> > > get you 90% of where you want to go.
> > >
> > > Just FYI, I had been working on a project to help with this stuff.
> > > I've checked in a newer schema mechanism that handles all of Drill's
> > > nuances: unions, (non-repeated) LISTs, REPEATED LISTs, etc. Search
> > > for "SchemaBuilder" and "TupleSchema". (There are two SchemaBuilder
> > > classes; you want the one that builds a TupleSchema.) Then, if you
> > > search for references, you'll find lots of unit tests that use this
> > > to define a schema. There is no serialization format, but you could
> > > easily add that.
> > >
> > > The project includes a "result set loader" that provides a convenient
> > > way to convert from Java types to Drill types. It is designed to
> > > allow, say, converting a String from a CSV file into a VARCHAR, INT,
> > > DOUBLE or whatever. Each reader would need its own string-to-whatever
> > > parser, but these are meant to be easy to add.
> > >
> > > The TupleSchema provides the input to create a ResultSetLoader for
> > > that schema, kind of how you're using a map, but far more simply.
> > > There are lots of unit tests that illustrate how this works.
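> > >
> > > In a reader, the loop looks roughly like the sketch below. (This is
> > > the work-in-progress API, so names may still change; the loader
> > > construction is elided because the scan framework will provide it,
> > > and hasMore()/nextId()/nextStr() stand in for your reader's own
> > > parsing.)
> > >
> > >   ResultSetLoader loader = ...; // supplied by the scan framework
> > >   RowSetLoader writer = loader.writer();
> > >
> > >   loader.startBatch();
> > >   // isFull() becomes true once the batch hits the memory limit.
> > >   while (!writer.isFull() && hasMore()) {
> > >     writer.start();
> > >     writer.scalar("id").setLong(nextId());
> > >     writer.scalar("str").setString(nextStr());
> > >     writer.save();
> > >   }
> > >   VectorContainer batch = loader.harvest();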
> > >
> > > All this would be wrapped in a new format reader framework that
> > > accepts a schema. That part has stalled and is not yet committed. If
> > > we can push this idea forward, perhaps I can get off the dime and do
> > > a few more PRs to get the rest of the framework into Drill for your
> > > use.
> > >
> > > The next problem is how to use Drill to build the schema? How does
> > > the user specify to gather the schema? Would be great to have a
> > > DESCRIBE statement that outputs the schema you discovered. Then maybe
> > > CREATE SCHEMA <file name> AS DESCRIBE ... The CREATE VIEW statement
> > > could act as a prototype.
> > >
> > > Drill is distributed, and two users could do the CREATE SCHEMA at the
> > > same time, and in concert with someone doing a query on that same
> > > directory. Drill has no good synchronization solution. Since this
> > > seems to not be a problem for views, perhaps things will work for
> > > schemas. (Both are a single file.) We have had problems with metadata
> > > because a refresh updates multiple files, and the lack of
> > > synchronization has been a lingering problem.
> > >
> > > Finally, I'd suggest ensuring your format can be created by hand or
> > > by other tools. Heck, if you want to make a big impact, work with the
> > > Arrow project to define a cross-project schema format that can be
> > > used to describe either Arrow or Drill schemas.
> > >
> > > Thanks,
> > > - Paul
> > >
> > > On Tuesday, October 9, 2018, 8:31:45 PM PDT, Jean-Claude Cote
> > > <[email protected]> wrote:
> > >
> > > I'm writing a msgpack reader, and in doing so I noticed that the JSON
> > > reader will put an INT column placeholder when no records match a
> > > select statement like "select str from ..." when the str field is not
> > > seen in the first batch.
> > >
> > > However, this is problematic because in the first batch it is not
> > > known what data type the column is, and Drill will throw an error if
> > > the column turns out to be a VARCHAR in the second batch.
> > >
> > > In order to work around these types of issues, I decided to add a
> > > "learning schema mode" to my msgpack reader. Essentially, in learning
> > > mode you can feed it records that you know are complete and valid.
> > > The reader will accumulate, merge and write the resulting schema to
> > > disk.
> > >
> > > Once the schema is learned, you can configure the msgpack reader to
> > > be in "use schema mode". In this mode the reader will load the schema
> > > and apply it to the writer after each batch. Any missing column, like
> > > the str column mentioned above, will now be defined in the writer
> > > with the proper type, VARCHAR. Drill will now be able to read files
> > > in which the str column is missing from the first few thousand rows.
> > >
> > > One interesting thing about the implementation I did is that it uses
> > > Drill's own schema, i.e., MaterializedField. So, for example, to save
> > > the schema I do:
> > >
> > >   SerializedField serializedMapField =
> > >       writer.getMapVector().getField().getSerializedField();
> > >   String data = TextFormat.printToString(serializedMapField);
> > >   IOUtils.write(data, out);
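> > >
> > > Loading it back is just the reverse; a sketch, assuming the same
> > > protobuf text format and a UTF-8 input stream:
> > >
> > >   // Parse the text-format SerializedField and rebuild the
> > >   // MaterializedField schema from it.
> > >   String text = IOUtils.toString(in, StandardCharsets.UTF_8);
> > >   SerializedField.Builder builder = SerializedField.newBuilder();
> > >   TextFormat.merge(text, builder);
> > >   MaterializedField schema = MaterializedField.create(builder.build());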
> > >
> > > To apply the schema to the writer, I walk the schema and create the
> > > corresponding maps, lists, bigints, varchars, etc.:
> > >
> > >   if (mapWriter != null) {
> > >     // Inside a map: create the VARCHAR child by name.
> > >     mapWriter.varChar(name);
> > >   } else {
> > >     // Inside a list: append a VARCHAR writer.
> > >     listWriter.varChar();
> > >   }
> > >
> > > Because I use the MaterializedFields in the writer to save the
> > > schema, and use them to apply the schema back to the writer, I think
> > > this solution could be used for the JSON reader and potentially
> > > others.
> > >
> > > Another benefit I see in using a schema is that you can also skip
> > > over records whose str column is of type BINARY, FLOAT8 or MAP,
> > > because those do not match the schema, which says str should be of
> > > type VARCHAR.
> > >
> > > I have a working proof of concept in
> > > https://github.com/jcmcote/drill-1/tree/master/contrib/format-msgpack.
> > > There are many test cases showing how building a schema enables Drill
> > > to read files that it would not normally be able to read.
> > >
> > > I would greatly appreciate some feedback. Is this a good idea? Is it
> > > okay to fill out the schema of the writer like I did? Are there any
> > > negative side effects?
> > >
> > > Thank you,
> > > jc