I see what you mean about a schema only getting you so far. Your Fred Flintstone example shows how you almost need the ability to apply a transformation at the reader level (instead of at the projection level) to properly read such data files.
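To make that concrete, here is a rough sketch of what a reader-level transform for your "metadata" format could look like. I'm using Jackson purely for illustration, and the class and helper names are made up; a real reader would push each value through the appropriate Drill column writer, chosen by the "type" code, rather than collect a Map.

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

public class MetadataFormatFlattener {

    // Turns {"field1": {"name": "first", "type": "string", "value": "fred"}, ...}
    // into one flat row: {first=fred, last=flintstone, balance=123.45, ...}
    static Map<String, JsonNode> flattenRecord(String json) throws Exception {
        JsonNode root = new ObjectMapper().readTree(json);
        Map<String, JsonNode> row = new LinkedHashMap<>();
        Iterator<Map.Entry<String, JsonNode>> fields = root.fields();
        while (fields.hasNext()) {
            JsonNode wrapper = fields.next().getValue();
            // The "type" code is what would drive the choice of writer
            // (string vs. money vs. boolean); here we just keep the raw node.
            row.put(wrapper.get("name").asText(), wrapper.get("value"));
        }
        return row;
    }
}

A projection-level cast can't do this, because the column names themselves are buried inside the data.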
I think I agree with Charles Givre. I've always liked the tag line from Drill about using the data "in-situ". I like the fact that you can just write files to disk without an ingestion process and start playing with them.

On Wed, Oct 17, 2018 at 2:02 PM Paul Rogers <[email protected]> wrote:

> Hi JC,
>
> Bingo, you just hit the core problem with schema-on-read: there is no
> "right" rule for how to handle ambiguous or inconsistent schemas. Take your
> string/binary example. You determined that the binary fields were actually
> strings (encoded in what, UTF-8? ASCII? The host's native codeset?) The
> answer could have been the opposite: maybe these are packet sniffs and the
> data is typically binary, except when the analyzer was able to pull out
> strings. The point is, there is no right answer: it depends.
>
> The same is true with heterogeneous lists, inconsistent maps, etc. Do
> these represent a lazy script (writing numbers sometimes as strings,
> sometimes as numbers) or a deeper problem: that you are supposed to look at
> a "type" code in the object to determine the meaning of the other fields?
>
> I wrestled with these issues myself when rewriting the JSON reader to use
> the new result set loader. My notes are in DRILL-4710. I ended up with a
> slogan: "Drill cannot predict the future."
>
> Frankly, this issue has existed as long as Drill has existed. Somehow
> we've muddled through, which might be an indication that this issue is not
> worth fixing. (In the Drill book, for example, I document these issues and
> conclude by telling people to use Spark or Hive to ETL the data into
> Parquet.) Since Parquet is Drill's primary format, odd cases in JSON tend
> not to get much attention.
>
> You are right: the only way I know of to resolve the issue is for the user
> to tell us their intention. We just suggested that one way to express
> intention is to do what Impala does, and what the book documents: have the
> user use Spark or Hive to ETL the data into a clean, unambiguous Parquet
> format. That is, delegate the solution of the problem to other tools in the
> big data stack.
>
> We've also suggested that users solve the problem via very clever views and
> use of all-text mode and numbers-as-double mode, writing lots of CASE
> expressions. But this does not scale (and the options must be set manually
> prior to each query, then reset for the next). There are cases, documented
> in DRILL-4710, where even this does not work. (Column c is sometimes a map,
> sometimes a scalar, say.)
>
> You've invented a mechanism for expressing a schema, and the team appears
> to be working on a Drill metastore. So, that's a third solution.
>
> The fourth solution is to build on what you've done with MsgPack: write a
> custom parser for each odd file format. This might be needed if the format
> is more odd than a schema can fix. Perhaps a custom "meta-parser" on top of
> JSON or MsgPack would be needed to convert data from the odd file format to
> the extended-relational format which Drill uses.
>
> Here are two of the classics that fall into that category. The "tuple as
> an array" format:
>
> {fields: ["fred", "flintstone", 123.45, true, null]}
>
> The "metadata" format:
>
> { field1: { name: "first", type: "string", value: "fred"},
>   field2: { name: "last", type: "string", value: "flintstone"},
>   field3: { name: "balance", type: "money", value: 123.45},
>   field4: { name: "is vip", type: "boolean", value: true}, ...
> }
>
> I'm not making these up; I've seen them used in practice.
> Unless the schema is very expressive, it probably can't handle these, which
> is why some code will be needed (in Spark/Hive or in a Drill plugin of some
> kind).
>
> Charles Givre makes a very good point: he suggests that Drill's unique
> opportunity is to handle such odd files cleanly, avoiding the need for ETL.
> That is, rather than thinking of Drill as a junior version of Impala (read
> one format really, really well), think of it as the open source version of
> Splunk (read all formats via adapters).
>
> Thanks,
> - Paul
>
>
> On Wednesday, October 17, 2018, 6:43:04 AM PDT, Jean-Claude Cote <
> [email protected]> wrote:
>
> I'm writing a msgpack reader and have encountered datasets where an array
> contains different types, for example a VARCHAR and a BINARY. Turns out the
> BINARY is actually a string. I know this is probably just not modeled
> correctly in the first place, but I'm still going to modify the reading of
> lists so that it takes note of the type of the first element in the list
> and tries to coerce subsequent elements that are not of the same type.
>
> {
>   "column": [["name", \\0xAA\\0xBB], ["surname", \\0xAA\\0xBB]]
> }
>
> However, I have another scenario where it's actually a field of a map that
> changes type:
>
> {
>   "column": [
>     {
>       "dataType": 1,
>       "value": 19
>     },
>     {
>       "dataType": 5,
>       "value": "string data"
>     }
>   ]
> }
>
> When reading such a structure, a BigInt writer is used to write out the
> value of the first map, but the same BigInt writer is then used for the
> value field of the second map. I understand that Drill will represent the
> "value" field in a BigInt vector in memory.
>
> My question is how to best address situations like this one. What
> alternatives are there? Read the value type as ANY? This situation is
> deeply nested; should I provide a means to ignore elements at a certain
> depth? Is it even possible to handle these situations gracefully? Is this a
> situation where a schema would be helpful in determining what to do with
> fields that are problematic?
>
> Thank you
> jc
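PS: to make the list-coercion idea from my earlier message concrete, here is the kind of rule I have in mind, sketched in plain Java. The decoded List<Object> shape and the byte[]-to-UTF-8 assumption are only illustrative; the actual msgpack reader would apply this while driving Drill's complex writers.

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class ListCoercionSketch {

    // Coerce every element of a decoded msgpack array to the Java type of
    // the first element. Only the byte[] -> String case is handled here,
    // on the assumption that the binary payload is really UTF-8 text.
    static List<Object> coerceToFirstType(List<Object> in) {
        if (in.isEmpty()) {
            return in;
        }
        Class<?> target = in.get(0).getClass();
        List<Object> out = new ArrayList<>(in.size());
        for (Object element : in) {
            if (target == String.class && element instanceof byte[]) {
                out.add(new String((byte[]) element, StandardCharsets.UTF_8));
            } else if (target.isInstance(element)) {
                out.add(element);
            } else {
                // No safe coercion known: keep the original value and let the
                // writer (or a future schema hint) decide what to do with it.
                out.add(element);
            }
        }
        return out;
    }
}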
