I made some progress. I looked at the parquet format plugin because it looked the closest to what I want to do. I now have a custom BasicFormatMatcher which overrides isReadable(); isReadable() returns a DrillTable which hard-codes a schema for now.
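
For reference, here is roughly what I have so far (trimmed down; it follows the shape of BasicFormatMatcher in the Drill build I'm working against, so the exact isReadable() signature and constructors may differ in other versions):

    import java.io.IOException;
    import java.util.regex.Pattern;

    import org.apache.drill.exec.planner.logical.DrillTable;
    import org.apache.drill.exec.planner.logical.DynamicDrillTable;
    import org.apache.drill.exec.store.SchemaConfig;
    import org.apache.drill.exec.store.dfs.BasicFormatMatcher;
    import org.apache.drill.exec.store.dfs.DrillFileSystem;
    import org.apache.drill.exec.store.dfs.FileSelection;
    import org.apache.drill.exec.store.dfs.FileSystemPlugin;
    import org.apache.drill.exec.store.dfs.FormatPlugin;
    import org.apache.drill.exec.store.dfs.FormatSelection;

    import com.google.common.collect.ImmutableList;

    public class MsgpackFormatMatcher extends BasicFormatMatcher {

      public MsgpackFormatMatcher(FormatPlugin plugin) {
        // Match on file extension only; no magic-string sniffing yet.
        super(plugin, ImmutableList.of(Pattern.compile(".*\\.mp$")),
            ImmutableList.of());
      }

      @Override
      public DrillTable isReadable(DrillFileSystem fs, FileSelection selection,
          FileSystemPlugin fsPlugin, String storageEngineName,
          SchemaConfig schemaConfig) throws IOException {
        if (isFileReadable(fs, selection.getFirstPath(fs))) {
          // The FormatSelection here is what gets passed as the DrillTable's
          // "Object" selection parameter (more on that below); the schema
          // itself is still hard-coded elsewhere for now.
          return new DynamicDrillTable(fsPlugin, storageEngineName,
              schemaConfig.getUserName(),
              new FormatSelection(getFormatPlugin().getConfig(), selection));
        }
        return null;
      }
    }
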
If I select a column not in the hard-coded schema I get an error. I noticed that when creating the DrillTable I have to pass in a parameter of type "Object". Is that where you want me to "pass the schema to the reader"? Currently I'm passing in a FormatSelection object. Tomorrow I'll try to subclass it and add my schema information so the reader can have access to it.

Thanks again for your help, much appreciated
jc

On Tue, Oct 2, 2018 at 9:51 PM Paul Rogers <[email protected]> wrote:

> The Drill table looks like the way to integrate schema into the planner. As you noted, pcap is hard-coded. You'd want to load your schema from a file. Because pcap is hard-coded, its planner does not need to pass the schema to the reader, but you would need to do so to know the type of missing columns (which is where you started this discussion.)
>
> I don't know how DESCRIBE is connected. Arina or Aman, do you know how to connect up a Drill table with DESCRIBE?
>
> Thanks,
> - Paul
>
> On Tuesday, October 2, 2018, 6:12:30 PM PDT, Jean-Claude Cote <[email protected]> wrote:
>
> I've been looking at the pcap reader
>
> https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/pcap
>
> It has a simple PcapDrillTable (hard-coded). I thought of using it as inspiration, but I wonder whether DESCRIBE would work the way it is implemented. Or, if I want DESCRIBE to work, do I need to implement it at the planner level?
>
> Thanks Paul
> jc
>
> On Tue, Oct 2, 2018 at 12:54 PM Paul Rogers <[email protected]> wrote:
>
> > Hi JC,
> >
> > Now that you have a working reader, it sounds like your next task is to pass the column schema to the reader. There are two ways to do that, and there are sketchy examples of both. First, the reader itself can obtain the schema (as is done in Parquet.) Second, the planner can pass the schema to the reader as part of the physical plan, as is sort of done for the new log reader.
> >
> > Let's start with the reader side first, since that is easiest to test. When scanning a directory, the physical plan includes the path to the base directory. (I'm sure you've seen it; if not, let me know and I'll track it down.) Each reader can use that path to infer the location of the (optional) .schema file, then read the schema and proceed from there. In this model, the reader has to handle columns requested by the query which are not in the schema. Go ahead and create them of type nullable INT.
> >
> > (By the way, you can name the file anything you like. If the file starts with a dot or underscore, Drill will ignore it when looking for data files, which can be a handy way to prevent your schema file from being treated as data.)
> >
> > Having each of possibly hundreds of readers reading the same file with a remote read is less than optimal, however. You probably want to read the file once. You can do that by creating custom plan nodes using Calcite. I'm not an expert in this area, but you can look at the existing code for, say, HBase or other readers for the classes you need to create. Others on this list might be able to provide more detailed hints.
> >
> > A bonus of planner-side metadata is that you can fail queries that request columns which do not exist. (Or, you can keep the usual technique of making up a nullable INT column in the reader.)
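
To check my understanding of the reader-side option above: each reader would do something roughly like this. (The .schema file format here, a flat JSON map of column name to Drill type name, is just something I'm making up for illustration; the fallback to nullable INT at the end is the placeholder type described above.)

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.drill.common.types.TypeProtos.MajorType;
    import org.apache.drill.common.types.TypeProtos.MinorType;
    import org.apache.drill.common.types.Types;
    import org.apache.drill.exec.store.dfs.DrillFileSystem;
    import org.apache.hadoop.fs.Path;

    import com.fasterxml.jackson.core.type.TypeReference;
    import com.fasterxml.jackson.databind.ObjectMapper;

    public class SchemaFileLoader {
      private static final ObjectMapper MAPPER = new ObjectMapper();

      // Loads {"name": "VARCHAR", "age": "INT", ...} from <baseDir>/.schema,
      // or returns an empty map if the file is absent.
      public static Map<String, MinorType> load(DrillFileSystem fs, Path baseDir)
          throws IOException {
        Path schemaFile = new Path(baseDir, ".schema");
        Map<String, MinorType> types = new HashMap<>();
        if (!fs.exists(schemaFile)) {
          return types;
        }
        Map<String, String> raw = MAPPER.readValue(fs.open(schemaFile),
            new TypeReference<Map<String, String>>() {});
        for (Map.Entry<String, String> e : raw.entrySet()) {
          types.put(e.getKey(), MinorType.valueOf(e.getValue()));
        }
        return types;
      }

      // Type for a projected column: from the schema if known, otherwise the
      // usual nullable-INT placeholder.
      public static MajorType typeFor(Map<String, MinorType> schema, String column) {
        return Types.optional(schema.getOrDefault(column, MinorType.INT));
      }
    }

Keeping the file format flat like this avoids pulling in the full JSON Schema spec until it's actually needed.
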
> > Next you need to pass the schema information from the planner to each reader. You do that via the physical plan. When you looked at the EXPLAIN PLAN output, in JSON, you should have seen the JSON form of the "easy" file scan. You'd create your own Jackson-serializable class to use for your reader in place of the "easy" version. You'd then pass the schema in that custom class. (Again, you seem familiar with the code, so I've not spelled out the details. Let me know if you get stuck.)
> >
> > Finally, I'd draw your attention to several features that the log reader illustrates. One is handling of SELECT all/some/none projection. Another is using "column state" classes to perform type-specific handling instead of a big (slow) switch statement. That new JSON reader I mentioned takes the idea a step further and has the column state classes handle data translation, vector writing and parse state handling.
> >
> > I hope this helps.
> >
> > Thanks,
> > - Paul
> >
> > On Monday, October 1, 2018, 7:49:50 PM PDT, Jean-Claude Cote <[email protected]> wrote:
> >
> > I've thought of declaring the schema the way it was done with the pcap reader. It seems to use a DrillTable to define its schema. However, I like the idea of a .schema file at the root of the table. I tried what you suggested (looked at the explain plan), but I failed to see what you were pointing me at.
> >
> > Yes, I realize the JSON reader is complex. I actually used the JSON reader and the BSON (mongo) reader, so all the dfs-related stuff I took from JSON and the actual parsing from BSON. I think I have something fairly complete:
> >
> > https://github.com/jcmcote/drill-1/tree/master/contrib/format-msgpack
> >
> > I'm running it with real data now and I'm uncovering edge cases, like the issue where, as you say, Drill can't see into the future. But as I said, I think having a .schema file would be a very good idea.
> >
> > How would I best leverage such a file?
> > Thank you very much
> > jc
> >
> > On Mon, Oct 1, 2018 at 9:51 PM Paul Rogers <[email protected]> wrote:
> >
> > > Hi JC,
> > >
> > > One of Drill's challenges is that it cannot predict the future: it can't know what type your column will be in later records or in another file. All it knows is the here and now: the type of the columns it actually reads. This is the flip side of "schema less".
> > >
> > > The only real solution to this problem is for the user to tell Drill the column type. The team is looking at several options for doing this. [1]
> > >
> > > If you are creating your own reader, you can perhaps come up with an alternative solution. A quick check of the MessagePack spec [2] reveals that each field is prefixed by a type code. This resolves the INT/DOUBLE problems in JSON. But it does not help if you don't actually see an instance of the field: you still have to get that type information from somewhere.
> > >
> > > There is a JSON Schema spec [3] which could be useful. You could, for example, load it at plan time and pass it to your readers as part of the physical plan. (If you poke around the physical plan, in JSON form, returned from EXPLAIN PLAN, you'll see what I mean.) You might require, say, that the user put a ".schema" file in the root directory of your table.
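
And for the planner-to-reader handoff, something like this small Jackson-serializable class is what I picture carrying the schema through the physical plan (the names are hypothetical; in practice I expect it would hang off the format plugin's scan classes rather than stand alone):

    import java.util.List;

    import com.fasterxml.jackson.annotation.JsonCreator;
    import com.fasterxml.jackson.annotation.JsonProperty;

    public class MsgpackScanSpec {
      private final String tableRoot;
      private final List<ColumnSchema> columns;

      @JsonCreator
      public MsgpackScanSpec(
          @JsonProperty("tableRoot") String tableRoot,
          @JsonProperty("columns") List<ColumnSchema> columns) {
        this.tableRoot = tableRoot;
        this.columns = columns;
      }

      @JsonProperty("tableRoot")
      public String getTableRoot() { return tableRoot; }

      @JsonProperty("columns")
      public List<ColumnSchema> getColumns() { return columns; }

      // One (name, type) pair; serialized into the JSON physical plan
      // that EXPLAIN PLAN shows.
      public static class ColumnSchema {
        private final String name;
        private final String type;

        @JsonCreator
        public ColumnSchema(@JsonProperty("name") String name,
                            @JsonProperty("type") String type) {
          this.name = name;
          this.type = type;
        }

        @JsonProperty("name")
        public String getName() { return name; }

        @JsonProperty("type")
        public String getType() { return type; }
      }
    }

Since the physical plan is serialized to JSON and shipped to each fragment, every reader would get the schema without re-reading the .schema file.
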
> > > While the JSON reader is one way to create a reader, it is probably the most complex example around. I'd suggest you look at the recently-added log reader [4] for a simpler way to get started. We also dedicate a chapter to format plugins in the soon-to-be-published book Learning Apache Drill (available in preview from Safari.)
> > >
> > > There is also a not-yet-committed revised version [5] of the JSON reader that uses state objects instead of complex if-statements to keep track of parse states. This implementation is made more complex, however, because of the need to deal with all the ambiguities that occur in JSON without a schema. Feel free to borrow ideas if helpful.
> > >
> > > Thanks,
> > > - Paul
> > >
> > > [1] https://issues.apache.org/jira/browse/DRILL-6552
> > > [2] https://github.com/msgpack/msgpack/blob/master/spec.md
> > > [3] JSON Schema
> > > [4] https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/log
> > > [5] https://github.com/paul-rogers/drill/tree/RowSetRev4/exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/json
> > >
> > > On Monday, October 1, 2018, 6:03:38 PM PDT, Jean-Claude Cote <[email protected]> wrote:
> > >
> > > I'm implementing a msgpack reader and am using the JSON reader as inspiration. I've noticed that the JSON reader has code to detect whether rows were written with no columns; in that case it actually adds one column of type INT. The comment in the code is:
> > >
> > > "if we had no columns, create one empty one so we can return some data for count purposes"
> > >
> > > Is this what the proposal above would solve? The JSON reader would not have to guess the data type of the selected fields. Right now I get failures in my msgpack reader because the schema is guessed to be INT, but later a value is encountered that is a VARCHAR.
> > >
> > > I'm wondering why the JSON reader even writes a row if no fields were found in the batch. Is there something I can do better than the JSON reader, or is this a limitation of the Drill engine?
> > >
> > > Thanks
> > > jc
> > >
> > > On Mon, Oct 1, 2018 at 1:54 PM Arina Yelchiyeva <[email protected]> wrote:
> > >
> > > > Currently Calcite supports the following syntax, apparently used in Phoenix:
> > > > *select empno + x from EMP_MODIFIABLEVIEW extend (x int not null)*
> > > >
> > > > Another option to consider is hint syntax (many DBs use this one); basically it's a multiline comment followed by a plus:
> > > > *select /*+.....*/ col_name from t*
> > > > This would allow us to pass not only schema but join / index hints etc.
> > > >
> > > > Example:
> > > > *select /*+ SCHEMA(a int not null, b int) */ a from t*
> > > >
> > > > One minus: we would need to implement this first in Calcite, if the Calcite community would be in favor of such changes.
> > > >
> > > > Kind regards,
> > > > Arina
> > > >
> > > > On Mon, Sep 10, 2018 at 7:42 AM Paul Rogers <[email protected]> wrote:
> > > >
> > > > > Hi Weijie,
> > > > >
> > > > > Thanks for the paper pointer. F1 uses the same syntax as Scope (the system cited in my earlier note): data type after the name.
> > > > >
> > > > > Another description is [1]. Neither paper describes how F1 handles arrays.
> > > > > However, this second paper points out that Protobuf is F1's native format, and so F1 has support for nested types. Drill does also, but in Drill, a reference to "customer.phone.cell" causes the nested "cell" column to be projected as a top-level column. And neither paper says whether F1 is used with O/JDBC, and if so, how they handle the mapping from nested types to the flat tuple structure required by xDBC.
> > > > >
> > > > > Have you come across these details?
> > > > >
> > > > > Thanks,
> > > > > - Paul
> > > > >
> > > > > On Thursday, September 6, 2018, 8:43:57 PM PDT, weijie tong <[email protected]> wrote:
> > > > >
> > > > > Google's latest paper about F1 [1] claims to support any data source by using an extension API called TVF (see section 6.3). You also need to declare the column data types before the query.
> > > > >
> > > > > [1] http://www.vldb.org/pvldb/vol11/p1835-samwel.pdf
> > > > >
> > > > > On Fri, Sep 7, 2018 at 9:47 AM Paul Rogers <[email protected]> wrote:
> > > > >
> > > > > > Hi All,
> > > > > >
> > > > > > We've discussed quite a few times whether Drill should or should not support or require schemas, and if so, how the user might express the schema.
> > > > > >
> > > > > > I came across a paper [1] that suggests a simple, elegant SQL extension:
> > > > > >
> > > > > > EXTRACT <column>[:<type>] {,<column>[:<type>]}
> > > > > > FROM <stream_name>
> > > > > >
> > > > > > Paraphrasing into Drill's SQL:
> > > > > >
> > > > > > SELECT <column>[:<type>] [AS <alias>] {,<column>[:<type>] [AS <alias>]}
> > > > > > FROM <table_name>
> > > > > >
> > > > > > Have a collection of JSON files in which string column `foo` appears in only half the files? Don't want to get schema conflicts between VARCHAR and nullable INT? Just do:
> > > > > >
> > > > > > SELECT name:VARCHAR, age:INT, foo:VARCHAR
> > > > > > FROM `my-dir` ...
> > > > > >
> > > > > > Not only can the syntax be used to specify the "natural" type for a column, it might also specify a preferred type. For example, "age:INT" says that "age" is an INT, even though JSON would normally parse it as a BIGINT. Similarly, using this syntax is an easy way to tell Drill how to convert CSV columns from strings to DATE, INT, FLOAT, etc. without the need for CAST functions. (CAST functions read the data in one format, then convert it to another in a Project operator. Using a column type might let the reader do the conversion -- something that is easy to implement if using the "result set loader" mechanism.)
> > > > > >
> > > > > > Plus, the syntax fits nicely into the existing view file structure. If the types appear in views, then client tools can continue to use standard SQL without the type information.
> > > > > >
> > > > > > When this idea came up in the past, someone mentioned the issue of nullable vs. non-nullable. (Let's also include arrays, since Drill supports those.)
> > > > > > Maybe add a suffix to the name:
> > > > > >
> > > > > > SELECT req:VARCHAR NOT NULL, opt:INT NULL, arr:FLOAT[] FROM ...
> > > > > >
> > > > > > Not pretty, but it works with the existing SQL syntax rules.
> > > > > >
> > > > > > Obviously, Drill has much on its plate, so I'm not suggesting that Drill should do this soon. Just passing it along as yet another option to consider.
> > > > > >
> > > > > > Thanks,
> > > > > > - Paul
> > > > > >
> > > > > > [1] http://www.cs.columbia.edu/~jrzhou/pub/Scope-VLDBJ.pdf
