I see what you mean about a schema only getting you so far. Your Fred
Flintstone example shows how you almost need the ability to apply a
transformation at the reader level (instead of at the projection level) to
properly read such data files.

I think I agree with Charles Givre. I've always liked Drill's tag line
about using the data "in situ". I like the fact that you can just write
files to disk without an ingestion process and start playing with them.




On Wed, Oct 17, 2018 at 2:02 PM Paul Rogers <[email protected]>
wrote:

> Hi JC,
>
> Bingo, you just hit the core problem with schema-on-read: there is no
> "right" rule for how to handle ambiguous or inconsistent schemas. Take your
> string/binary example. You determined that the binary fields were actually
> strings (encoded in what, UTF-8? ASCII? Host's native codeset?) The answer
> could have been the opposite: maybe these are packet sniffs and the data is
> typically binary, except when the analyzer was able to pull out strings.
> The point is, there is no right answer: it depends.
>
> The same is true with heterogeneous lists, inconsistent maps, etc. Do
> these represent a lazy script (writing numbers sometimes as strings,
> sometimes as numbers), or a deeper problem: that you are supposed to look
> at a "type" code in the object to determine the meaning of the other fields?
>
> I wrestled with these issues myself when rewriting the JSON reader to use
> the new result set loader. My notes are in DRILL-4710. I ended up with a
> slogan: "Drill cannot predict the future."
>
> Frankly, this issue has existed as long as Drill has existed. Somehow
> we've muddled through, which might be an indication that this issue is not
> worth fixing. (In the Drill book, for example, I document these issues and
> conclude by telling people to use Spark or Hive to ETL the data into
> Parquet.) Since Parquet is Drill's primary format, odd cases in JSON tend
> to not get much attention.
>
> You are right: the only way I know of to resolve the issue is for the user
> to tell us their intention. We just suggested that one way to express
> intention is to do what Impala does, and what the book documents: have the
> user use Spark or Hive to ETL the data into a clean, unambiguous Parquet
> format. That is, delegate the solution of the problem to other tools in the
> big data stack.
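>
> As a rough sketch of what that ETL step looks like (assuming the Spark
> SQL Java API; the paths and class name are made up), something like the
> following is usually enough. Spark infers a single schema across the
> whole file set, so the ambiguity gets resolved once, up front:
>
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.SparkSession;
>
> public class JsonToParquetEtl {
>   public static void main(String[] args) {
>     SparkSession spark = SparkSession.builder()
>         .appName("json-to-parquet").getOrCreate();
>     // Read the raw JSON; Spark infers one schema across all files,
>     // resolving ambiguous types at ETL time rather than query time.
>     Dataset<Row> raw = spark.read().json("/data/raw/*.json");
>     // Write clean, strongly typed Parquet for Drill to query.
>     raw.write().mode("overwrite").parquet("/data/clean/");
>     spark.stop();
>   }
> }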
>
> We've also suggested that users solve the problem via very clever views
> and use of all-text mode and numbers-as-double mode, handling lots of
> cases by hand. But this does not scale (and the options must be set
> manually prior to each query, then reset for the next). There are cases,
> documented in DRILL-4710, where even this does not work. (Column c is
> sometimes a map, sometimes a scalar, say.)
>
> You've invented a mechanism for expressing schema, and the team appears to
> be working on a Drill metastore. So, that's a third solution.
>
> The fourth solution is to build on what you've done with MsgPack: write a
> custom parser for each odd file format. This might be needed if the format
> is more odd than a schema can fix. Perhaps a custom "meta-parser" on top of
> JSON or MsgPack would be needed to convert data from the odd file format to
> the extended-relational format which Drill uses.
>
> Here are two of the classics that fall into that category. The "tuple as
> an array" format:
>
> {fields: ["fred", "flintstone", 123.45, true, null]}
>
> The "metadata" format:
>
> { field1: { name: "first", type: "string", value: "fred"},
>   field2: { name: "last", type: "string", value: "flintstone"},
>   field3: { name: "balance", type: "money", value: 123.45},
>   field4: { name: "is vip", type: "boolean", value: true}, ...
> }
>
> I'm not making these up, I've seen them used in practice. Unless the
> schema is very expressive, it probably can't handle these, which is why
> some code will be needed (in Spark/Hive or in a Drill plugin of some kind.)
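>
> For the record, here's roughly what such a meta-parser could look like
> for the second ("metadata") format, as a minimal Jackson-based sketch.
> The field names match the example above; the class name is made up, and
> a real plugin would feed Drill's vector writers rather than print:
>
> import com.fasterxml.jackson.databind.JsonNode;
> import com.fasterxml.jackson.databind.ObjectMapper;
> import com.fasterxml.jackson.databind.node.ObjectNode;
>
> public class MetadataFlattener {
>   public static void main(String[] args) throws Exception {
>     String record =
>         "{\"field1\":{\"name\":\"first\",\"type\":\"string\",\"value\":\"fred\"},"
>       + "\"field2\":{\"name\":\"balance\",\"type\":\"money\",\"value\":123.45}}";
>     ObjectMapper mapper = new ObjectMapper();
>     JsonNode in = mapper.readTree(record);
>     ObjectNode flat = mapper.createObjectNode();
>     // Promote each {name, type, value} wrapper to a plain name/value
>     // pair, producing JSON that Drill's reader handles without ambiguity.
>     for (JsonNode field : in) {
>       flat.set(field.get("name").asText(), field.get("value"));
>     }
>     System.out.println(mapper.writeValueAsString(flat));
>     // prints: {"first":"fred","balance":123.45}
>   }
> }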
>
> Charles Givre makes a very good point: he suggests that Drill's unique
> opportunity is to handle such odd files clearly, avoiding the need for ETL.
> That is, rather than thinking of Drill as a junior version of Impala (read
> one format really, really well), think of it as the open source version of
> Splunk (read all formats via adapters.)
>
> Thanks,
> - Paul
>
>
>
>     On Wednesday, October 17, 2018, 6:43:04 AM PDT, Jean-Claude Cote <
> [email protected]> wrote:
>
> I'm writing a msgpack reader and have encountered datasets where an array
> contains different types, for example a VARCHAR and a BINARY. It turns out
> the BINARY is actually a string. I know this is probably just not modeled
> correctly in the first place, but I'm still going to modify the reading of
> lists so that it takes note of the first element's type and tries to
> coerce subsequent elements that are not of the same type.
>
> {
> "column": [["name", \\0xAA\\0xBB],["surname", \\0xAA\\0xBB]]
> }
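>
> Roughly, the coercion rule I have in mind looks like this in plain Java
> (just the decision logic, with a made-up class name; the real reader
> works against Drill's vector writers, and the bytes are the ones from
> the example above):
>
> import java.nio.charset.StandardCharsets;
> import java.util.Arrays;
> import java.util.List;
>
> public class ListCoercionSketch {
>   public static void main(String[] args) {
>     List<Object> column =
>         Arrays.asList("name", new byte[] {(byte) 0xAA, (byte) 0xBB});
>     // Remember the type of the first element in the list.
>     Class<?> firstType = column.get(0).getClass();
>     for (Object element : column) {
>       if (firstType == String.class && element instanceof byte[]) {
>         // BINARY element in a list that started as VARCHAR:
>         // decode the bytes as UTF-8 and write it as a string.
>         System.out.println(new String((byte[]) element, StandardCharsets.UTF_8));
>       } else {
>         System.out.println(element);
>       }
>     }
>   }
> }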
>
> However, I have another scenario where it's actually a field of a map
> that changes type:
> {
> "column": [
> {
> "dataType": 1,
> "value": 19
> },
> {
> "dataType": 5,
> "value": "string data"
> }
> ]
> }
>
> When reading such a structure, a BigInt writer is used to write out the
> value of the first map, but that same BigInt writer is then used for the
> value field of the second map. I understand that Drill will represent the
> "value" field in a BigInt vector in memory.
>
> My question is how best to address situations like this one. What
> alternatives are there? Read the value as type ANY? This structure is
> deeply nested; should I provide a means to ignore elements below a certain
> depth? Is it even possible to handle these situations gracefully? Is this
> a situation where a schema would be helpful in determining what to do with
> problematic fields?
>
> Thank you
> jc
>
