Hi JC,
Now that you have a working reader, it sounds like your next task is to pass
the column schema to the reader. There are two ways to do that, and sketchy
examples of both exist. First, the reader itself can obtain the schema (as is
done in Parquet). Second, the planner can pass the schema to the reader as
part of the physical plan, as is sort of done for the new log reader.
Let's start with the reader side, since that is the easiest to test. When
scanning a directory, the physical plan includes the path to the base
directory. (I'm sure you've seen it; if not, let me know and I'll track it
down.) Each reader can use that path to infer the location of the (optional)
.schema file, then read the schema and proceed from there. In this model, the
reader has to handle columns requested by the query which are not in the
schema. Go ahead and create them as nullable INT.
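Roughly, the reader-side setup might look like the sketch below. Everything
here is illustrative: the method name, and the one-"name type"-per-line format
of the .schema file, are just assumptions; only the Hadoop FileSystem calls
and Drill's MinorType enum are real APIs.

  import java.io.BufferedReader;
  import java.io.IOException;
  import java.io.InputStreamReader;
  import java.util.HashMap;
  import java.util.Map;
  import org.apache.drill.common.types.TypeProtos.MinorType;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  // Load the optional .schema file from the scan's base directory.
  // Returns an empty map when no such file exists.
  private Map<String, MinorType> loadSchema(FileSystem fs, Path baseDir)
      throws IOException {
    Map<String, MinorType> schema = new HashMap<>();
    Path schemaFile = new Path(baseDir, ".schema");
    if (!fs.exists(schemaFile)) {
      return schema;
    }
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(schemaFile)))) {
      String line;
      while ((line = in.readLine()) != null) {
        // Assumed file format: one "name type" pair per line,
        // e.g. "name VARCHAR" or "age INT".
        String[] parts = line.trim().split("\\s+");
        if (parts.length == 2) {
          schema.put(parts[0], MinorType.valueOf(parts[1].toUpperCase()));
        }
      }
    }
    return schema;
  }

When the query projects a column the file does not mention, apply the fallback
rule above:

  // Columns missing from the .schema file default to nullable INT.
  MinorType type = schema.getOrDefault(colName, MinorType.INT);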
(By the way, you can name the file anything you like. If the file starts with a
dot or underscore, Drill will ignore it when looking for data files, which can
be a handy way to prevent your schema file from being treated as data.)
Having each of possibly hundreds of readers reading the same file with a remote
read is less than optimal, however. You probably want to read the file once.
You can do that by creating custom plan nodes using Calcite. I'm not an expert
in this area, but you can look at the existing code for, say, HBase or other
readers for the classes you need to create. Others on this list might be able
to provide more detailed hints.
A bonus of planner-side metadata is that you can fail queries that request
columns which do not exist. (Or, you can keep the usual technique of making up
a nullable INT column in the reader.)
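If you do the check in the planner, the rejection itself is a one-liner using
Drill's UserException; the surrounding method and the MyGroupScan name below
are hypothetical, just to show where such a check could live:

  import java.util.List;
  import java.util.Set;
  import org.apache.drill.common.exceptions.UserException;
  import org.slf4j.Logger;
  import org.slf4j.LoggerFactory;

  private static final Logger logger = LoggerFactory.getLogger(MyGroupScan.class);

  // Hypothetical check inside your group scan's projection handling:
  // schemaColumns holds the column names read from the .schema file.
  void validateProjection(List<String> projected, Set<String> schemaColumns) {
    for (String col : projected) {
      if (!schemaColumns.contains(col)) {
        throw UserException.validationError()
            .message("Column `%s` is not defined in the table schema", col)
            .build(logger);
      }
    }
  }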
Next you need to pass the schema information from the planner to each reader.
You do that via the physical plan. When you looked at the EXPLAIN PLAN, in
JSON, you should have seen the JSON form of the "easy" file scan. You'd create
your own Jackson-serializable class to use for your reader in place of the
"easy" version. You'd then pass the schema in that custom class. (Again, you
sound familiar with the code, so I've not spelled out the details. Let me know
if you get stuck.)
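As a sketch, the schema could ride along in the plan as a list of small
Jackson-serializable objects. The class and property names below are invented;
only the Jackson annotations are the real mechanism Drill uses to serialize
physical-plan nodes:

  import com.fasterxml.jackson.annotation.JsonCreator;
  import com.fasterxml.jackson.annotation.JsonProperty;

  // Hypothetical per-column schema entry carried in the physical plan.
  public class ColumnSchema {
    private final String name;
    private final String type;   // e.g. "VARCHAR", "INT"

    @JsonCreator
    public ColumnSchema(@JsonProperty("name") String name,
                        @JsonProperty("type") String type) {
      this.name = name;
      this.type = type;
    }

    @JsonProperty("name")
    public String getName() { return name; }

    @JsonProperty("type")
    public String getType() { return type; }
  }

Give your group scan and sub-scan classes a List<ColumnSchema> field (exposed
through the constructor and a getter annotated the same way) and the schema
will appear in the EXPLAIN PLAN JSON and be deserialized on each Drillbit that
runs one of your readers.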
Finally, I'd draw your attention to several features that the log reader
illustrates. One is handling of SELECT all/some/none projection. Another is
using "column state" classes to perform type-specific handling instead of a big
(slow) switch statement. That new JSON reader I mentioned takes the idea a step
further and has the column state classes handle data translation, vector
writing and parse state handling.
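To make the "column state" idea concrete, here is its general shape. This is
only a sketch, not the actual log or JSON reader classes; I'm assuming the
ScalarWriter from the result set loader framework for the vector writes, and
the class names are invented:

  import java.util.List;
  import org.apache.drill.exec.vector.accessor.ScalarWriter;

  // One state object per projected column, chosen once at setup time based
  // on the column's type, so the per-value loop needs no switch statement.
  abstract class ColumnState {
    protected final ScalarWriter writer;

    ColumnState(ScalarWriter writer) { this.writer = writer; }

    // Convert one parsed value and write it to this column's vector.
    abstract void writeValue(Object rawValue);
  }

  class VarCharColumnState extends ColumnState {
    VarCharColumnState(ScalarWriter writer) { super(writer); }

    @Override
    void writeValue(Object rawValue) { writer.setString(rawValue.toString()); }
  }

  class IntColumnState extends ColumnState {
    IntColumnState(ScalarWriter writer) { super(writer); }

    @Override
    void writeValue(Object rawValue) { writer.setInt(((Number) rawValue).intValue()); }
  }

  // Per-row loop: just walk the states; no per-value type dispatch.
  void writeRow(List<ColumnState> states, Object[] rowValues) {
    for (int i = 0; i < states.size(); i++) {
      states.get(i).writeValue(rowValues[i]);
    }
  }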
I hope this helps.
Thanks,
- Paul
On Monday, October 1, 2018, 7:49:50 PM PDT, Jean-Claude Cote
<[email protected]> wrote:
I've thought of declaring the schema like what was done with the pcap reader. It
seems to use a DrillTable to define its schema. However, I like the idea of
a .schema file at the root of the table. I tried what you suggested (looking
at the explain plan); however, I failed to see what you were pointing me at.
Yes, I realize the JSON reader is complex. I actually used both the JSON reader
and the BSON (mongo) reader: all the dfs-related stuff I took from JSON and
the actual parsing from BSON. I think I have something fairly complete:
https://github.com/jcmcote/drill-1/tree/master/contrib/format-msgpack.
I'm running it with real data now and I'm uncovering edge cases, like the
issue where, as you say, Drill can't see into the future. But like I said, I
think having a .schema file would be a very good idea.
How would I best leverage such a file?
Thank you very much
jc
On Mon, Oct 1, 2018 at 9:51 PM Paul Rogers <[email protected]>
wrote:
> Hi JC,
>
> One of Drill's challenges is that it cannot predict the future: it can't
> know what type your column will be in later records or in another file. All
> it knows is the here and now: the type of columns it actually reads. This
> is the flip side of "schema less".
>
> The only real solution to this problem is for the user to tell Drill the
> column type. The team is looking at several options for doing this. [1]
>
> If you are creating your own reader, you can perhaps come up with an
> alternative solution. A quick check of the MessagePack spec [2] reveals
> that each field is prefixed by a type code. This resolves the INT/DOUBLE
> problems in JSON. But, it does not help if you don't actually see an
> instance of the field: you still have to get that type information from
> somewhere.
>
> There is a JSON Schema spec [3] which could be useful. You could, for
> example, load it at plan time and pass it to your readers as part of the
> physical plan. (If you poke around the physical plan, in JSON form,
> returned from EXPLAIN PLAN, you'll see what I mean.) You might require,
> say, that the user put a ".schema" file in the root directory of your table.
>
> While the JSON reader is one way to create a reader, it is probably the
> most complex example around. I'd suggest you look at the recently-added log
> reader [4] for a simpler way to get started. We also dedicate a chapter to
> format plugins in the soon-to-be-published book Learning Apache Drill
> (available in preview from Safari.)
>
> There is also a not-yet-committed revised version [5] of the JSON reader
> [4] that uses state objects instead of complex if-statements to keep track
> of parse states. This implementation is made more complex, however, because
> of the need to deal with all the ambiguities that occur in JSON without a
> schema. Feel free to borrow ideas if helpful.
>
> Thanks,
> - Paul
>
> [1] https://issues.apache.org/jira/browse/DRILL-6552
> [2] https://github.com/msgpack/msgpack/blob/master/spec.md
>
> [3] JSON Schema
>
> [4]
> https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/log
>
> [5]
> https://github.com/paul-rogers/drill/tree/RowSetRev4/exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/json
>
>
>
>
> On Monday, October 1, 2018, 6:03:38 PM PDT, Jean-Claude Cote <
> [email protected]> wrote:
>
> I'm implementing a msgpack reader and am using the JSON reader as
> inspiration. I've noticed that in the JSON reader there's code to detect
> when rows were written but with no columns; it will actually add one empty
> column of type INT. The comment in the code is
> "if we had no columns, create one empty one so we can return some data for
> count purposes".
>
> Is this what the proposal above would solve? The JSON reader would not have
> to guess the data type of the selected fields. Right now I get
> failures in my msgpack reader because the column type is guessed to be INT
> but later it encounters a value that is a VARCHAR.
>
> I'm wondering why the JSON reader actually even writes a row if no fields
> were found in the batch. Is there something I can do better than the JSON
> reader, or is this a limitation of the Drill engine?
>
> Thanks
> jc
>
> On Mon, Oct 1, 2018 at 1:54 PM Arina Yelchiyeva <
> [email protected]>
> wrote:
>
> > Currently Calcite supports the following syntax, apparently used in
> > Phoenix:
> > *select empno + x from EMP_MODIFIABLEVIEW extend (x int not null)*
> >
> > Another option to consider is hint syntax (many DBs use this one);
> > basically it's a multiline comment followed by a plus:
> > *select /*+.....*/ col_name from t*
> > This would allow us to pass not only the schema but also join / index
> > hints, etc.
> >
> > Example:
> > *select /*+ SCHEMA(a int not null, b int) */ a from t*
> >
> > One minus is that we would need to implement this first in Calcite, if the
> > Calcite community is in favor of such changes.
> >
> > Kind regards,
> > Arina
> >
> > On Mon, Sep 10, 2018 at 7:42 AM Paul Rogers <[email protected]>
> > wrote:
> >
> > > Hi Weijie,
> > >
> > > Thanks for the paper pointer. F1 uses the same syntax as Scope (the
> > > system cited in my earlier note): data type after the name.
> > >
> > > Another description is [1]. Neither paper describes how F1 handles
> > > arrays. However, this second paper points out that Protobuf is F1's
> > > native format, and so F1 has support for nested types. Drill does also,
> > > but in Drill, a reference to "customer.phone.cell" causes the nested
> > > "cell" column to be projected as a top-level column. And, neither paper
> > > says whether F1 is used with O/JDBC, and if so, how they handle the
> > > mapping from nested types to the flat tuple structure required by xDBC.
> > >
> > > Have you come across these details?
> > >
> > > Thanks,
> > > - Paul
> > >
> > >
> > >
> > > On Thursday, September 6, 2018, 8:43:57 PM PDT, weijie tong <
> > > [email protected]> wrote:
> > >
> > > Google's latest paper about F1 [1] claims to support any data source by
> > > using an extension API called TVF (see section 6.3). You also need to
> > > declare the column datatype before the query.
> > >
> > >
> > > [1] http://www.vldb.org/pvldb/vol11/p1835-samwel.pdf
> > >
> > > On Fri, Sep 7, 2018 at 9:47 AM Paul Rogers <[email protected]>
> > > wrote:
> > >
> > > > Hi All,
> > > >
> > > > We've discussed quite a few times whether Drill should or should not
> > > > support or require schemas, and if so, how the user might express the
> > > > schema.
> > > >
> > > > I came across a paper [1] that suggests a simple, elegant SQL
> > > > extension:
> > > >
> > > > EXTRACT <column>[:<type>] {,<column>[:<type>]}
> > > > FROM <stream_name>
> > > >
> > > > Paraphrasing into Drill's SQL:
> > > >
> > > > SELECT <column>[:<type>][AS <alias>] {,<column>[:<type>][AS <alias>]}
> > > > FROM <table_name>
> > > >
> > > > Have a collection of JSON files in which string column `foo` appears
> > > > in only half the files? Don't want to get schema conflicts with
> > > > VARCHAR and nullable INT? Just do:
> > > >
> > > > SELECT name:VARCHAR, age:INT, foo:VARCHAR
> > > > FROM `my-dir` ...
> > > >
> > > > Not only can the syntax be used to specify the "natural" type for a
> > > > column, it might also specify a preferred type. For example, "age:INT"
> > > > says that "age" is an INT, even though JSON would normally parse it as
> > > > a BIGINT. Similarly, using this syntax is an easy way to tell Drill how
> > > > to convert CSV columns from strings to DATE, INT, FLOAT, etc. without
> > > > the need for CAST functions. (CAST functions read the data in one
> > > > format, then convert it to another in a Project operator. Using a
> > > > column type might let the reader do the conversion -- something that
> > > > is easy to implement if using the "result set loader" mechanism.)
> > > >
> > > > Plus, the syntax fits nicely into the existing view file structure.
> > > > If the types appear in views, then client tools can continue to use
> > > > standard SQL without the type information.
> > > >
> > > > When this idea came up in the past, someone mentioned the issue of
> > > > nullable vs. non-nullable. (Let's also include arrays, since Drill
> > > > supports those.) Maybe add a suffix to the name:
> > > >
> > > > SELECT req:VARCHAR NOT NULL, opt:INT NULL, arr:FLOAT[] FROM ...
> > > >
> > > > Not pretty, but it works with the existing SQL syntax rules.
> > > >
> > > > Obviously, Drill has much on its plate, so I'm not suggesting that
> > > > Drill should do this soon. Just passing it along as yet another option
> > > > to consider.
> > > >
> > > > Thanks,
> > > > - Paul
> > > >
> > > > [1] http://www.cs.columbia.edu/~jrzhou/pub/Scope-VLDBJ.pdf
> > >
> >
>