Got it. I’ll review today and tomorrow and hopefully we can get you unblocked.

Sent from my iPhone
> On Oct 18, 2023, at 18:01, Mike Beckerle <mbecke...@apache.org> wrote:
>
> I am very much hoping someone will look at my open PR soon.
> https://github.com/apache/drill/pull/2836
>
> I am basically blocked on this effort until you help me with one key area of that.
>
> I expect the part I am puzzling over is routine to you, so it will save me much effort.
>
> This is the key area in the DaffodilBatchReader.java code:
>
>   // FIXME: Next, a MIRACLE occurs.
>   //
>   // We get the dfdlSchemaURI filled in from the query, or a default config location
>   // We get the rootName (or null if not supplied) from the query, or a default config location
>   // We get the rootNamespace (or null if not supplied) from the query, or a default config location
>   // We get the validationMode (true/false) filled in from the query, or a default config location
>   // We get the dataInputURI filled in from the query, or from a default config location
>   //
>   // For a first cut, let's just fake it. :-)
>   boolean validationMode = true;
>   URI dfdlSchemaURI = new URI("schema/complexArray1.dfdl.xsd");
>   String rootName = null;
>   String rootNamespace = null;
>   URI dataInputURI = new URI("data/complexArray1.dat");
>
> I imagine this is just a few lines of code to grab these from the query, and I don't even care about config files for now.
>
> I gave up on trying to figure out how to do this myself. It was actually quite unclear from looking at the other format plugins. The way Drill does configuration is obviously motivated by the distributed architecture combined with pluggability, but all that, combined with the negotiation over schemas which extends into runtime, became quite muddy to me. I think what I need is super straightforward, so I figured I should just ask.
>
> This is just to get enough working (against local files only) that I can be unblocked on creating and testing the rest of the Daffodil-to-Drill metadata bridge and data bridge.
>
> My plan is to get all kinds of data and queries working first, but just against local-only files. Fixing it to work in distributed Drill can come later.
>
> -mikeb
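For the part marked FIXME: those values would normally live in the format plugin config, with the table() function (discussed further down in this thread) overriding them per query. Here is a rough sketch of what that config class could look like, modeled loosely on ExcelFormatConfig; the class name, property names, and the "daffodil" type name are all placeholders, not anything that exists in the PR yet:

    import com.fasterxml.jackson.annotation.JsonCreator;
    import com.fasterxml.jackson.annotation.JsonInclude;
    import com.fasterxml.jackson.annotation.JsonProperty;
    import com.fasterxml.jackson.annotation.JsonTypeName;
    import org.apache.drill.common.logical.FormatPluginConfig;

    // Placeholder sketch of a DFDL format config, modeled on ExcelFormatConfig.
    @JsonTypeName("daffodil")
    @JsonInclude(JsonInclude.Include.NON_DEFAULT)
    public class DaffodilFormatConfig implements FormatPluginConfig {

      // Config variables must be private and final, per the note further down.
      private final String dfdlSchemaURI;
      private final String rootName;
      private final String rootNamespace;
      private final boolean validationMode;

      @JsonCreator
      public DaffodilFormatConfig(
          @JsonProperty("dfdlSchemaURI") String dfdlSchemaURI,
          @JsonProperty("rootName") String rootName,
          @JsonProperty("rootNamespace") String rootNamespace,
          @JsonProperty("validationMode") boolean validationMode) {
        this.dfdlSchemaURI = dfdlSchemaURI;
        this.rootName = rootName;
        this.rootNamespace = rootNamespace;
        this.validationMode = validationMode;
      }

      public String getDfdlSchemaURI() { return dfdlSchemaURI; }
      public String getRootName() { return rootName; }
      public String getRootNamespace() { return rootNamespace; }
      public boolean getValidationMode() { return validationMode; }

      // equals() and hashCode() should also be implemented; Drill compares
      // config instances when managing plugin definitions.
    }

The batch reader would then read getDfdlSchemaURI() and friends from this config instead of hard-coding them, and dataInputURI would presumably come from the file split Drill hands the reader rather than from the config at all.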
>> On Wed, Oct 18, 2023 at 2:11 PM Paul Rogers <par0...@gmail.com> wrote:
>>
>> Hi Charles,
>>
>> The persistent store is just ZooKeeper, and ZK is known to work poorly as a distributed DB. ZK works great for things like tokens, node registrations and the like. But ZK scales very poorly for things like schemas (or query profiles, or a list of active queries).
>>
>> A more scalable approach may be to cache the schemas in each Drillbit, then translate them to Drill's format and include them in each Scan operator definition sent to each execution Drillbit. That solution avoids race conditions when the schemas change while a query is in flight. This is, in fact, the model used for storage plugin definitions. (The storage plugin definitions are, in fact, stored in ZK, but tend to be small and few in number.)
>>
>> - Paul
>>
>>> On Wed, Oct 18, 2023 at 7:51 AM Charles Givre <cgi...@gmail.com> wrote:
>>>
>>> Hi Mike,
>>> I hope all is well. I remembered one other piece which might be useful for you. Drill has an interface called a PersistentStore which is used for storing artifacts such as tokens etc. I've used it on two occasions: in the GoogleSheets plugin and the HTTP plugin. In both cases, I used it to store OAuth user tokens which need to be preserved and shared across drillbits, and also frequently updated. I was thinking that this might be useful for caching the DFDL schemata. If you take a look here:
>>> https://github.com/apache/drill/blob/master/contrib/storage-http/src/main/java/org/apache/drill/exec/store/http/oauth/AccessTokenRepository.java
>>> https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/oauth
>>> and here
>>> https://github.com/apache/drill/blob/master/contrib/storage-http/src/main/java/org/apache/drill/exec/store/http/HttpStoragePlugin.java
>>> you can see how I used that.
>>>
>>> Best,
>>> -- C
>>>
>>>> On Oct 13, 2023, at 1:25 PM, Mike Beckerle <mbecke...@apache.org> wrote:
>>>>
>>>> Very helpful.
>>>>
>>>> Answers to your questions, and comments, are below:
>>>>
>>>> On Thu, Oct 12, 2023 at 5:14 PM Charles Givre <cgi...@gmail.com> wrote:
>>>>>
>>>>> Hi Mike,
>>>>> I hope all is well. I'll take a stab at answering your questions. But I have a few questions as well:
>>>>>
>>>>> 1. Are you writing a storage or format plugin for DFDL? My thinking was that this would be a format plugin, but let me know if you were thinking differently.
>>>>
>>>> Format plugin.
>>>>
>>>>> 2. In traditional deployments, where do people store the DFDL schemata files? Are they local or accessible via URL?
>>>>
>>>> Schemas are stored in files, or in jar files created when packaging a schema project. Hence URI is the preferred identifier for them. They are not retrieved remotely or anything like that. It's a matter of whether they are in jars on the classpath, directories on the classpath, or just a file location.
>>>>
>>>> The source code of DFDL schemas is often assembled from other schemas as components, so a single "DFDL schema" may have parts that come from 5 jar files on the classpath, e.g., 2 different header schemas, a library schema, and the "main" schema that assembles them all. Inside schemas they refer to each other via xs:include or xs:import, and the schemaLocation attribute takes a URI to the location of the included/imported schema. Those URIs are interpreted the same way we would want Drill to identify the location of a schema.
>>>>
>>>> However, people will really want to pre-compile any real (non-toy/test) DFDL schema into a binary ".bin" file for faster loading. Otherwise Daffodil schema compilation time can be excessive (minutes for large DFDL schemas; for example, the DFDL schema for VMF is 180K lines of DFDL). A compiled schema lives in exactly one, relatively small, file (the compiled form of the VMF schema is 8 MB). So the schema path given in a Drill SQL query, or in the config, should be allowed to be either a compiled schema or a source-code schema (.xsd), the latter mostly for test, training, and toy examples that we would compile on the fly.
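On the load side, both forms could funnel through one helper: reload a pre-compiled .bin directly, or compile a .xsd on the fly for the test/toy cases. This is only a sketch going from the Daffodil Java API (org.apache.daffodil.japi), so treat the exact calls as approximate and the class name as a placeholder:

    import java.io.File;
    import java.net.URI;
    import org.apache.daffodil.japi.Compiler;
    import org.apache.daffodil.japi.Daffodil;
    import org.apache.daffodil.japi.DataProcessor;
    import org.apache.daffodil.japi.ProcessorFactory;

    // Sketch: produce a DataProcessor from either a pre-compiled schema or a source .xsd.
    public class DaffodilSchemaLoader {

      public DataProcessor load(URI schemaURI) throws Exception {
        Compiler compiler = Daffodil.compiler();
        if (schemaURI.getPath().endsWith(".bin")) {
          // Pre-compiled schema: cheap reload, no compilation cost.
          // Assumes a file: URI; classpath resources would need a different lookup.
          return compiler.reload(new File(schemaURI));
        }
        // Source schema: compile on the fly. Fine for tests and toy examples,
        // but can take minutes for large schemas, as noted above.
        ProcessorFactory pf = compiler.compileSource(schemaURI);
        if (pf.isError()) {
          throw new IllegalStateException(pf.getDiagnostics().toString());
        }
        return pf.onPath("/");
      }
    }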
>>>>> To get the DFDL schema file or URL we have a few options, all of which revolve around setting a config variable. For now, let's just say that the schema file is contained in the same folder as the data. (We can make this more sophisticated later...)
>>>>
>>>> It would make life difficult if the schemas and test data must be co-resident. Most schema projects have these in entirely separate sub-trees. Schemas will be under src/main/resources/..../xsd, compiled schemas under target/..., and test data under src/test/resources/.../data.
>>>>
>>>> For now I think the easiest thing is just that we get two URIs: one for the data, one for the schema. We access them via getClass().getResource().
>>>>
>>>> We should not worry about caching or anything for now. Once the above works for a decent scope of tests, we can worry about making it more convenient to have a library of schemas at one's disposal.
>>>>
>>>>> Here's what you have to do.
>>>>>
>>>>> 1. In the formatConfig file, define a String called 'dfdlSchema'. Note: config variables must be private and final. If they aren't, it can cause weird errors that are really difficult to debug. For some reference, take a look at the Excel plugin (https://github.com/apache/drill/blob/master/contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelFormatConfig.java).
>>>>>
>>>>> Setting a config variable there will allow a user to set a global schema definition. This can also be configured individually for various workspaces. So let's say you had PCAP files in one workspace: you could globally set the DFDL file for that, and for another workspace which has some other kind of file, you could create another DFDL plugin instance.
>>>>
>>>> Ok, so the above lets me play with Drill and one schema by default. Ok for using Drill to explore data, and useful for testing.
>>>>
>>>>> Now, this is all fine and good, but a user might also want to define the schema file at query time. The good news is that Drill allows you to do that via the table() function.
>>>>
>>>> This would allow real data-integration queries against multiple different DFDL-described data sources. Needed for a compelling demo.
>>>>
>>>>> So let's say that we want to use a different schema file than the default; we could do something like this:
>>>>>
>>>>> SELECT ....
>>>>> FROM table(dfs.dfdl_workspace.`myfile` (type => 'dfdl', dfdlSchema => 'path_to_schema'))
>>>>>
>>>>> Take a look at the Excel docs (https://github.com/apache/drill/blob/master/contrib/format-excel/README.md) which demonstrate how to write queries like that. I believe that the parameters in the table function take higher precedence than the parameters from the config. That would make sense, at least.
>>>>
>>>> Perfect. I'll start with this.
>>>>
>>>>> 2. Now that we have the schema file, the next thing would be to convert that into a Drill schema. Let's say that we have a function called dfdlToDrill that handles the conversion.
>>>>>
>>>>> What you'd have to do is, in the constructor for the BatchReader, set the schema there. So, pseudocode:
>>>>>
>>>>> public DFDLBatchReader(DFDLReaderConfig readerConfig, EasySubScan scan, FileSchemaNegotiator negotiator) {
>>>>>   // Other stuff...
>>>>>
>>>>>   // Get Drill schema from DFDL
>>>>>   TupleMetadata schema = dfdlToDrill(<dfdl schema file>);
>>>>>
>>>>>   // Here's the important part
>>>>>   negotiator.tableSchema(schema, true);
>>>>> }
>>>>>
>>>>> The negotiator.tableSchema() method accepts two args: a TupleMetadata and a boolean indicating whether the schema is final or not. Once this schema has been added to the negotiator object, you can then create the writers.
>>>>
>>>> That negotiator.tableSchema() is ideal. I was hoping that this was going to be the only place the metadata had to be given to Drill. Excellent.
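Filling in that pseudocode a bit, the constructor shape I'd expect looks roughly like this. The dfdlToDrill() body is only a stand-in for the real Daffodil-to-Drill metadata bridge, DaffodilReaderConfig is a placeholder, and the types from the scan framework (EasySubScan, FileSchemaNegotiator) live in different packages depending on which EVF version the plugin targets, so those imports are deliberately left out:

    import org.apache.drill.common.types.TypeProtos.MinorType;
    import org.apache.drill.exec.physical.resultSet.ResultSetLoader;
    import org.apache.drill.exec.physical.resultSet.RowSetLoader;
    import org.apache.drill.exec.record.metadata.SchemaBuilder;
    import org.apache.drill.exec.record.metadata.TupleMetadata;

    public class DaffodilBatchReader {

      private final RowSetLoader rowWriter;

      public DaffodilBatchReader(DaffodilReaderConfig readerConfig,
                                 EasySubScan scan,
                                 FileSchemaNegotiator negotiator) {
        // Walk the compiled DFDL schema and build the equivalent Drill schema.
        TupleMetadata schema = dfdlToDrill(readerConfig);

        // Hand the schema to Drill up front; 'true' marks it as final/complete.
        negotiator.tableSchema(schema, true);

        // Build the loader after the schema is set, then keep the row writer
        // around for use when Daffodil parse events start arriving.
        ResultSetLoader loader = negotiator.build();
        rowWriter = loader.writer();
      }

      // Stand-in for the real metadata bridge: one flat column plus a nested record.
      private TupleMetadata dfdlToDrill(DaffodilReaderConfig readerConfig) {
        return new SchemaBuilder()
            .add("a", MinorType.INT)
            .addMap("record")
              .addNullable("b", MinorType.INT)
              .resumeSchema()
            .buildSchema();
      }
    }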
>>>>> Take a look here:
>>>>> https://github.com/apache/drill/blob/2ab46a9411a52f12a0f9acb1144a318059439bc4/contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java#L199
>>>>>
>>>>> I see Paul just responded, so I'll leave you with this. If you have additional questions, send them our way. Do take a look at the Excel plugin as I think it will be helpful.
>>>>
>>>> Yes, I've found the JsonLoaderImpl.readBatch() method, and Daffodil can work similarly.
>>>>
>>>> This will take me a few more days to get to a pull request. The first one will be for initial review, i.e., not intended to merge without more tests. Probably it will support only integer data fields, but it should support lots of data shapes including vectors, choices, sequences, nested records, etc.
>>>>
>>>> Thanks for the help.
>>>>
>>>>>> On Oct 12, 2023, at 2:58 PM, Mike Beckerle <mbecke...@apache.org> wrote:
>>>>>>
>>>>>> So when a data format is described by a DFDL schema, I can generate an equivalent Drill schema (TupleMetadata). This schema is always complete. I have unit tests working with this.
>>>>>>
>>>>>> To do this for a real SQL query, I need the DFDL schema to be identified on the SQL query by a file path or URI.
>>>>>>
>>>>>> Q: How do I get that DFDL schema File/URI parameter from the SQL query?
>>>>>>
>>>>>> Next, assuming I have the DFDL schema identified, I generate an equivalent Drill TupleMetadata from it. (Or, hopefully, retrieve it from a cache.)
>>>>>>
>>>>>> What objects do I call, or what classes do I have to create, to make this Drill TupleMetadata available to Drill so it uses it in all the ways a static Drill schema can be useful?
>>>>>>
>>>>>> I just need pointers to the code that illustrate how to do this. Thanks
>>>>>>
>>>>>> -Mike Beckerle
>>>>>>
>>>>>> On Thu, Oct 12, 2023 at 12:13 AM Paul Rogers <par0...@gmail.com> wrote:
>>>>>>>
>>>>>>> Mike,
>>>>>>>
>>>>>>> This is a complex question and has two answers.
>>>>>>>
>>>>>>> First, the standard enhanced vector framework (EVF) used by most readers assumes a "pull" model: read each record. This is where the next() comes in: readers just implement this to read the next record. But the code under EVF works with a push model: the readers write to vectors and signal the next record. EVF translates the lower-level push model to the higher-level, easier-to-use pull model. The best example of this is the JSON reader, which uses Jackson to parse JSON and responds to the corresponding events.
>>>>>>> You can thus take over the task of filling a batch of records. I'd have to poke around the code to refresh my memory. Or, you can take a look at the (quite complex) JSON parser, or the EVF itself, to see what it does. There are many unit tests that show this at various levels of abstraction.
>>>>>>>
>>>>>>> Basically, you have to:
>>>>>>>
>>>>>>> * Start a batch
>>>>>>> * Ask if you can start the next record (which might be declined if the batch is full)
>>>>>>> * Write each field. For complex fields, such as records, recursively do the start/end record work.
>>>>>>> * Mark the record as complete.
>>>>>>>
>>>>>>> You should be able to map event handlers to EVF actions as a result. Even though DFDL wants to "drive", it still has to give up control once the batch is full. EVF will then handle the (surprisingly complex) task of finishing up the batch and returning it as the output of the Scan operator.
>>>>>>>
>>>>>>> - Paul
>>>>>>>
>>>>>>>> On Wed, Oct 11, 2023 at 6:30 PM Mike Beckerle <mbecke...@apache.org> wrote:
>>>>>>>>
>>>>>>>> Daffodil parsing generates event callbacks to an InfosetOutputter, which is analogous to a SAX event handler.
>>>>>>>>
>>>>>>>> Drill is expecting an iterator style of calling next() to advance through the input, i.e., Drill has the control thread and expects to do pull parsing. At least, that's what I saw in the code I studied in the format-xml contrib.
>>>>>>>>
>>>>>>>> Is there any alternative, before I dig into creating another one of these co-routine-style control inversions (which have proven to be problematic for performance)?
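To make Paul's four steps concrete on the EVF side, here is how I picture the writer half that Daffodil's callbacks would drive. The on*() method names are stand-ins for whatever the InfosetOutputter overrides end up being (they are not Daffodil's actual API), and rowWriter is the RowSetLoader obtained from negotiator.build() in the reader's constructor:

    import org.apache.drill.exec.physical.resultSet.RowSetLoader;
    import org.apache.drill.exec.vector.accessor.ScalarWriter;

    // Bridge sketch: Daffodil event callbacks write rows through EVF.
    public class DrillInfosetWriter {

      private final RowSetLoader rowWriter;

      public DrillInfosetWriter(RowSetLoader rowWriter) {
        this.rowWriter = rowWriter;
      }

      // "Ask if you can start the next record": declined when the batch is full.
      // Returning false is the point where Daffodil has to give up control so
      // EVF can finish the batch and return it from the Scan operator.
      public boolean onStartRecord() {
        if (rowWriter.isFull()) {
          return false;
        }
        rowWriter.start();
        return true;
      }

      // "Write each field": a simple int field here. Nested records would
      // recurse into rowWriter.tuple(name) and do the same start/end work.
      public void onIntField(String name, int value) {
        ScalarWriter writer = rowWriter.scalar(name);
        writer.setInt(value);
      }

      // "Mark the record as complete."
      public void onEndRecord() {
        rowWriter.save();
      }
    }

The control inversion then only has to surface at the batch boundary: the reader's next() drives the Daffodil parse until onStartRecord() reports a full batch or the input is exhausted, which is exactly where the co-routine question above bites.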