Hi Weijie,

Thanks for raising this topic. I think you've got a great suggestion.

My two cents: there is no harm in reading all manner of ugly data. But, rather 
than try to process the mess throughout Drill (as we do today with schema 
changes, just-in-time code generation, union vectors and the rest), simply 
require that the user provide disambiguation rules prior to running the query. 
Require that the reader convert input data into a clean relational format.

That is, rather than try to be schema-less, just be schema-on-read, which, as 
you point out, is what most users probably want anyway.


JSON is my favorite example of the current issues. JSON is a general-purpose 
data format; SQL is a relational language. IMHO, if JDBC and ODBC are Drill's 
primary APIs, data must fit those APIs, and that means data must be relational. 
So, the goal when reading JSON for delivery over JDBC is to map the JSON 
structure into a relational structure.

Since Drill now supports arrays of nested tuples via implicit joins, 
"relational format" here means the extended relational format. Perhaps a 
requirement is that arrays MUST be flattened to simple tuples prior to 
returning results to JDBC or ODBC. 


If we provide schema hints ("field x, when it appears, will be a Double"), then 
Drill need not fight with the problem that the data will be Double in some 
files, Nullable Int in others. The schema hint removes the ambiguity.
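The idea can be sketched in a few lines of Java. This is not Drill's actual API; the class, the hint map, and the type names are invented for illustration. The point is simply that a hint fixes the column type up front, while the no-hint path is forced to guess from the first value it happens to see:

```java
import java.util.Map;
import java.util.Optional;

// Illustrative sketch, not Drill code: resolve a column's type from a
// user-provided hint before falling back to inference from data.
class SchemaHints {
    // User-provided disambiguation rules, e.g. "x -> DOUBLE".
    private final Map<String, String> hints;

    SchemaHints(Map<String, String> hints) {
        this.hints = hints;
    }

    // With a hint, the type is fixed before any data is read. Without
    // one, we must guess from the first value, which may later prove
    // wrong when other files disagree.
    public String resolveType(String column, Optional<Object> firstValue) {
        String hinted = hints.get(column);
        if (hinted != null) {
            return hinted;               // no ambiguity: the hint wins
        }
        return firstValue
            .map(v -> v instanceof Double ? "DOUBLE" : "NULLABLE_INT")
            .orElse("NULLABLE_INT");     // the problematic default guess
    }
}
```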

The result is that Drill can read any type of data. But, as you say, the Drill 
internals are simpler, cleaner and faster. There is no ambiguity about types 
that Drill kinda-sorta supports but that ODBC/JDBC don't support. Effort can go 
into new features rather than fighting an unwinnable battle to use 
non-relational data in a relational engine.


In short, as part of the metadata API work, perhaps define how metadata can 
simplify Drill internals. Ensure that users can create simple metadata hint 
files for ad-hoc use, maybe as an extension to view files. Then, push the 
problem of messy schemas into a conversion layer in the reader and out of the 
rest of the Drill execution engine.

The reason I keep jumping into these metadata discussions is that I encountered 
the ambiguity problems first hand in the "result set loader" work. There are 
ambiguities in JSON that simply cannot be resolved until Drill can predict the 
future (by having access to metadata).

Consider the query "SELECT a, x FROM foo.json". Column x is missing from the 
first 1000 records; in the 1001st record, it shows up as a Double. Previously, 
Drill would guess Nullable Int for the missing column, then blow up when x 
appeared as a Double. A revision was to postpone picking a type as late as 
possible so that, if column x does show up in the first batch, the ambiguity is 
avoided. But this trick does not work if the column shows up in the second or a 
later batch: we must materialize a column in the first batch, and we will guess 
Nullable Int. Depending on the query, this results in a schema change elsewhere 
in the DAG or in the client.
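The late-arriving-column problem can be simulated in a few lines. Again, this is an invented sketch, not Drill code: each batch carries the values seen for column x, and the reader must report *some* type for x in every batch, guessing Nullable Int when the column is absent:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: the type reported for column "x" in each batch.
// When x is absent from the first batch, the reader is forced to guess,
// and a later Double value produces a mid-query schema change.
class LateColumn {
    public static List<String> schemaPerBatch(List<List<Object>> batchesOfX) {
        List<String> schemas = new ArrayList<>();
        String current = null;
        for (List<Object> batch : batchesOfX) {
            if (batch.isEmpty()) {
                // Column missing: forced to guess a placeholder type.
                if (current == null) current = "NULLABLE_INT";
            } else if (batch.get(0) instanceof Double) {
                current = "DOUBLE";   // schema change if we guessed Int
            } else if (current == null) {
                current = "INT";      // real value seen first: no guess needed
            }
            schemas.add(current);
        }
        return schemas;
    }
}
```

Running this on two batches, the first without x and the second containing 10.5, yields NULLABLE_INT followed by DOUBLE: exactly the schema change described above.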

Similarly, the new version of the JSON reader can handle data of the form 
{x: 10.1} {x: 10} because it can convert an Int to a Double. Sadly, 
{x: 10} {x: 10.1} still fails because Drill can't convert a Double to an Int 
without losing the fractional part. Sigh...
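The asymmetry comes down to the first value fixing the column type. A minimal sketch (invented names, not Drill's reader) of why order matters:

```java
// Illustrative sketch: the first value seen fixes the column type.
// Int -> Double widens losslessly, so a Double-first column can absorb
// later Ints; but a later Double in an Int column cannot be narrowed
// without loss, so the read fails.
class FirstValueWins {
    public static String readColumn(Number... values) {
        String type = (values[0] instanceof Double) ? "DOUBLE" : "INT";
        for (Number v : values) {
            if (type.equals("INT") && v instanceof Double) {
                throw new IllegalStateException(
                    "schema change: Double seen in Int column");
            }
            // In the DOUBLE case, Int values widen safely and are accepted.
        }
        return type;
    }
}
```

So readColumn(10.1, 10) succeeds as DOUBLE, while readColumn(10, 10.1) throws, mirroring the two JSON sequences above.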


To implement these tricks, the revised reader framework accepts a schema and 
automagically does the work of converting data from the input type to the 
defined type. Once this code is in master (we are still about three PRs 
away), it can be combined with the metadata system to achieve the "apply schema 
on read" idea discussed above. We would then not need horrible hacks like 
those just discussed.
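The shape of such a conversion layer can be sketched as follows. The class, the declared-schema map, and the conversion rule are all illustrative assumptions, not the actual reader framework API; the point is that conversion happens inside the reader, so downstream operators only ever see the declared types:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch of a reader-side conversion layer: a declared
// schema is applied while reading, so the rest of the engine never sees
// ambiguous or shifting types.
class SchemaOnRead {
    private final Map<String, String> declared;  // column -> declared type

    SchemaOnRead(Map<String, String> declared) {
        this.declared = declared;
    }

    public Map<String, Object> convertRow(Map<String, Object> raw) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> col : declared.entrySet()) {
            Object v = raw.get(col.getKey());
            if ("DOUBLE".equals(col.getValue()) && v instanceof Number) {
                // Widen any numeric input to the declared Double type.
                out.put(col.getKey(), ((Number) v).doubleValue());
            } else {
                // Missing columns surface as declared, nullable columns.
                out.put(col.getKey(), v);
            }
        }
        return out;
    }
}
```

With x declared as DOUBLE, the row {x: 10} comes out as {x: 10.0}, and a row missing x still carries a null x of the declared type, so no downstream schema change occurs.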


Thanks,
- Paul

 

    On Wednesday, August 15, 2018, 5:02:08 PM PDT, weijie tong 
<[email protected]> wrote:  
 
 Hi all:
  I hope this statement does not seem too rash to you.
  Drill claims to be a schema-free distributed SQL engine. It takes a lot of
work in the execution engine to support storage formats such as JSON files,
which makes it easier to introduce bugs and makes the code logic ugly. I
wonder whether we should still insist on this, since we are designing the
metadata system in DRILL-6552.
  Traditionally, people design their table schemas before firing a SQL query,
so I don't think being schema-free saves people much time. Other systems,
like Spark, are not popular because they lack schema declarations. I think
we should be brave enough to decide whether to keep this feature, which
seems not so important but is a burden.
    Thanks.
  
