Got it. I’ll review today and tomorrow and hopefully we can get you unblocked.

Sent from my iPhone
> On Oct 18, 2023, at 18:01, Mike Beckerle <mbecke...@apache.org> wrote:
>
> I am very much hoping someone will look at my open PR soon.
> https://github.com/apache/drill/pull/2836
>
> I am basically blocked on this effort until you help me with one key area of that.
>
> I expect the part I am puzzling over is routine to you, so it will save me much effort.
>
> This is the key area in the DaffodilBatchReader.java code:
>
>   // FIXME: Next, a MIRACLE occurs.
>   //
>   // We get the dfdlSchemaURI filled in from the query, or a default config location
>   // We get the rootName (or null if not supplied) from the query, or a default config location
>   // We get the rootNamespace (or null if not supplied) from the query, or a default config location
>   // We get the validationMode (true/false) filled in from the query, or a default config location
>   // We get the dataInputURI filled in from the query, or from a default config location
>   //
>   // For a first cut, let's just fake it. :-)
>   boolean validationMode = true;
>   URI dfdlSchemaURI = new URI("schema/complexArray1.dfdl.xsd");
>   String rootName = null;
>   String rootNamespace = null;
>   URI dataInputURI = new URI("data/complexArray1.dat");
>
> I imagine this is just a few lines of code to grab these from the query, and I don't even care about config files for now.
>
> I gave up on trying to figure out how to do this myself. It was actually quite unclear from looking at the other format plugins. The way Drill does configuration is obviously motivated by the distributed architecture combined with pluggability, but all that, combined with the negotiation over schemas which extends into runtime, became quite muddy to me. I think what I need is super straightforward, so I figured I should just ask.
>
> This is just to get enough working (against local files only) that I can be unblocked on creating and testing the rest of the Daffodil-to-Drill metadata bridge and data bridge.
>
> My plan is to get all kinds of data and queries working first, but just against local-only files. Fixing it to work in distributed Drill can come later.
>
> -mikeb
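For the part marked FIXME: those values would normally live in the format plugin config, with the table() function (discussed further down in this thread) overriding them per query. Here is a rough sketch of what that config class could look like, modeled loosely on ExcelFormatConfig; the class name, property names, and the "daffodil" type name are all placeholders, not anything that exists in the PR yet:

    import com.fasterxml.jackson.annotation.JsonCreator;
    import com.fasterxml.jackson.annotation.JsonInclude;
    import com.fasterxml.jackson.annotation.JsonProperty;
    import com.fasterxml.jackson.annotation.JsonTypeName;
    import org.apache.drill.common.logical.FormatPluginConfig;

    // Placeholder sketch of a DFDL format config, modeled on ExcelFormatConfig.
    @JsonTypeName("daffodil")
    @JsonInclude(JsonInclude.Include.NON_DEFAULT)
    public class DaffodilFormatConfig implements FormatPluginConfig {

      // Config variables must be private and final, per the note further down.
      private final String dfdlSchemaURI;
      private final String rootName;
      private final String rootNamespace;
      private final boolean validationMode;

      @JsonCreator
      public DaffodilFormatConfig(
          @JsonProperty("dfdlSchemaURI") String dfdlSchemaURI,
          @JsonProperty("rootName") String rootName,
          @JsonProperty("rootNamespace") String rootNamespace,
          @JsonProperty("validationMode") boolean validationMode) {
        this.dfdlSchemaURI = dfdlSchemaURI;
        this.rootName = rootName;
        this.rootNamespace = rootNamespace;
        this.validationMode = validationMode;
      }

      public String getDfdlSchemaURI() { return dfdlSchemaURI; }
      public String getRootName() { return rootName; }
      public String getRootNamespace() { return rootNamespace; }
      public boolean getValidationMode() { return validationMode; }

      // equals() and hashCode() should also be implemented; Drill compares
      // config instances when managing plugin definitions.
    }

The batch reader would then read getDfdlSchemaURI() and friends from this config instead of hard-coding them, and dataInputURI would presumably come from the file split Drill hands the reader rather than from the config at all.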
>> On Wed, Oct 18, 2023 at 2:11 PM Paul Rogers <par0...@gmail.com> wrote:
>>
>> Hi Charles,
>>
>> The persistent store is just ZooKeeper, and ZK is known to work poorly as a distributed DB. ZK works great for things like tokens, node registrations and the like. But ZK scales very poorly for things like schemas (or query profiles, or a list of active queries).
>>
>> A more scalable approach may be to cache the schemas in each Drillbit, then translate them to Drill's format and include them in each Scan operator definition sent to each execution Drillbit. That solution avoids race conditions when the schemas change while a query is in flight. This is, in fact, the model used for storage plugin definitions. (The storage plugin definitions are, in fact, stored in ZK, but tend to be small and few in number.)
>>
>> - Paul
>>
>>> On Wed, Oct 18, 2023 at 7:51 AM Charles Givre <cgi...@gmail.com> wrote:
>>>
>>> Hi Mike,
>>> I hope all is well. I remembered one other piece which might be useful for you. Drill has an interface called a PersistentStore which is used for storing artifacts such as tokens etc. I've used it on two occasions: in the GoogleSheets plugin and the HTTP plugin. In both cases, I used it to store OAuth user tokens which need to be preserved and shared across drillbits, and also frequently updated. I was thinking that this might be useful for caching the DFDL schemata. If you take a look here:
>>> https://github.com/apache/drill/blob/master/contrib/storage-http/src/main/java/org/apache/drill/exec/store/http/oauth/AccessTokenRepository.java
>>> https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/oauth
>>> and here
>>> https://github.com/apache/drill/blob/master/contrib/storage-http/src/main/java/org/apache/drill/exec/store/http/HttpStoragePlugin.java
>>> you can see how I used that.
>>>
>>> Best,
>>> -- C
>>>
>>>> On Oct 13, 2023, at 1:25 PM, Mike Beckerle <mbecke...@apache.org> wrote:
>>>>
>>>> Very helpful.
>>>>
>>>> Answers to your questions, and comments, are below:
>>>>
>>>> On Thu, Oct 12, 2023 at 5:14 PM Charles Givre <cgi...@gmail.com> wrote:
>>>>>
>>>>> Hi Mike,
>>>>> I hope all is well. I'll take a stab at answering your questions. But I have a few questions as well:
>>>>>
>>>>> 1. Are you writing a storage or format plugin for DFDL? My thinking was that this would be a format plugin, but let me know if you were thinking differently.
>>>>
>>>> Format plugin.
>>>>
>>>>> 2. In traditional deployments, where do people store the DFDL schemata files? Are they local or accessible via URL?
>>>>
>>>> Schemas are stored in files, or in jar files created when packaging a schema project. Hence URI is the preferred identifier for them. They are not retrieved remotely or anything like that. It's a matter of whether they are in jars on the classpath, directories on the classpath, or just a file location.
>>>>
>>>> The source code of DFDL schemas is often assembled from other schemas as components, so a single "DFDL schema" may have parts that come from 5 jar files on the classpath, e.g., 2 different header schemas, a library schema, and the "main" schema that assembles them all. Inside schemas they refer to each other via xs:include or xs:import, and the schemaLocation attribute takes a URI to the location of the included/imported schema. Those URIs are interpreted the same way we would want Drill to identify the location of a schema.
>>>>
>>>> However, people will really want to pre-compile any real (non-toy/test) DFDL schema into a binary ".bin" file for faster loading. Otherwise Daffodil schema compilation time can be excessive (minutes for large DFDL schemas; for example, the DFDL schema for VMF is 180K lines of DFDL). A compiled schema lives in exactly one, relatively small, file (the compiled form of the VMF schema is 8 MB). So the schema path given in a Drill SQL query, or in the config, should be allowed to be either a compiled schema or a source-code schema (.xsd), the latter mostly for test, training, and toy examples that we would compile on the fly.
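On the load side, both forms could funnel through one helper: reload a pre-compiled .bin directly, or compile a .xsd on the fly for the test/toy cases. This is only a sketch going from the Daffodil Java API (org.apache.daffodil.japi), so treat the exact calls as approximate and the class name as a placeholder:

    import java.io.File;
    import java.net.URI;
    import org.apache.daffodil.japi.Compiler;
    import org.apache.daffodil.japi.Daffodil;
    import org.apache.daffodil.japi.DataProcessor;
    import org.apache.daffodil.japi.ProcessorFactory;

    // Sketch: produce a DataProcessor from either a pre-compiled schema or a source .xsd.
    public class DaffodilSchemaLoader {

      public DataProcessor load(URI schemaURI) throws Exception {
        Compiler compiler = Daffodil.compiler();
        if (schemaURI.getPath().endsWith(".bin")) {
          // Pre-compiled schema: cheap reload, no compilation cost.
          // Assumes a file: URI; classpath resources would need a different lookup.
          return compiler.reload(new File(schemaURI));
        }
        // Source schema: compile on the fly. Fine for tests and toy examples,
        // but can take minutes for large schemas, as noted above.
        ProcessorFactory pf = compiler.compileSource(schemaURI);
        if (pf.isError()) {
          throw new IllegalStateException(pf.getDiagnostics().toString());
        }
        return pf.onPath("/");
      }
    }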
>>>>> To get the DFDL schema file or URL we have a few options, all of which revolve around setting a config variable. For now, let's just say that the schema file is contained in the same folder as the data. (We can make this more sophisticated later...)
>>>>
>>>> It would make life difficult if the schemas and test data must be co-resident. Most schema projects have these in entirely separate sub-trees. Schemas will be under src/main/resources/..../xsd, compiled schemas under target/..., and test data under src/test/resources/.../data.
>>>>
>>>> For now I think the easiest thing is just that we get two URIs: one for the data, one for the schema. We access them via getClass().getResource().
>>>>
>>>> We should not worry about caching or anything for now. Once the above works for a decent scope of tests, we can worry about making it more convenient to have a library of schemas at one's disposal.
>>>>
>>>>> Here's what you have to do.
>>>>>
>>>>> 1. In the formatConfig file, define a String called 'dfdlSchema'. Note: config variables must be private and final. If they aren't, it can cause weird errors that are really difficult to debug. For some reference, take a look at the Excel plugin (https://github.com/apache/drill/blob/master/contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelFormatConfig.java).
>>>>>
>>>>> Setting a config variable there will allow a user to set a global schema definition. This can also be configured individually for various workspaces. So let's say you had PCAP files in one workspace: you could globally set the DFDL file for that, and for another workspace which has some other kind of file, you could create another DFDL plugin instance.
>>>>
>>>> Ok, so the above lets me play with Drill and one schema by default. Ok for using Drill to explore data, and useful for testing.
>>>>
>>>>> Now, this is all fine and good, but a user might also want to define the schema file at query time. The good news is that Drill allows you to do that via the table() function.
>>>>
>>>> This would allow real data-integration queries against multiple different DFDL-described data sources. Needed for a compelling demo.
>>>>
>>>>> So let's say that we want to use a different schema file than the default; we could do something like this:
>>>>>
>>>>> SELECT ....
>>>>> FROM table(dfs.dfdl_workspace.`myfile` (type => 'dfdl', dfdlSchema => 'path_to_schema'))
>>>>>
>>>>> Take a look at the Excel docs (https://github.com/apache/drill/blob/master/contrib/format-excel/README.md) which demonstrate how to write queries like that. I believe that the parameters in the table function take higher precedence than the parameters from the config. That would make sense, at least.
>>>>
>>>> Perfect. I'll start with this.
>>>>
>>>>> 2. Now that we have the schema file, the next thing would be to convert that into a Drill schema. Let's say that we have a function called dfdlToDrill that handles the conversion.
>>>>>
>>>>> What you'd have to do is, in the constructor for the BatchReader, set the schema there. So, pseudocode:
>>>>>
>>>>> public DFDLBatchReader(DFDLReaderConfig readerConfig, EasySubScan scan, FileSchemaNegotiator negotiator) {
>>>>>   // Other stuff...
>>>>>
>>>>>   // Get Drill schema from DFDL
>>>>>   TupleMetadata schema = dfdlToDrill(<dfdl schema file>);
>>>>>
>>>>>   // Here's the important part
>>>>>   negotiator.tableSchema(schema, true);
>>>>> }
>>>>>
>>>>> The negotiator.tableSchema() method accepts two args: a TupleMetadata and a boolean indicating whether the schema is final or not. Once this schema has been added to the negotiator object, you can then create the writers.
>>>>
>>>> That negotiator.tableSchema() is ideal. I was hoping that this was going to be the only place the metadata had to be given to Drill. Excellent.
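Filling in that pseudocode a bit, the constructor shape I'd expect looks roughly like this. The dfdlToDrill() body is only a stand-in for the real Daffodil-to-Drill metadata bridge, DaffodilReaderConfig is a placeholder, and the types from the scan framework (EasySubScan, FileSchemaNegotiator) live in different packages depending on which EVF version the plugin targets, so those imports are deliberately left out:

    import org.apache.drill.common.types.TypeProtos.MinorType;
    import org.apache.drill.exec.physical.resultSet.ResultSetLoader;
    import org.apache.drill.exec.physical.resultSet.RowSetLoader;
    import org.apache.drill.exec.record.metadata.SchemaBuilder;
    import org.apache.drill.exec.record.metadata.TupleMetadata;

    public class DaffodilBatchReader {

      private final RowSetLoader rowWriter;

      public DaffodilBatchReader(DaffodilReaderConfig readerConfig,
                                 EasySubScan scan,
                                 FileSchemaNegotiator negotiator) {
        // Walk the compiled DFDL schema and build the equivalent Drill schema.
        TupleMetadata schema = dfdlToDrill(readerConfig);

        // Hand the schema to Drill up front; 'true' marks it as final/complete.
        negotiator.tableSchema(schema, true);

        // Build the loader after the schema is set, then keep the row writer
        // around for use when Daffodil parse events start arriving.
        ResultSetLoader loader = negotiator.build();
        rowWriter = loader.writer();
      }

      // Stand-in for the real metadata bridge: one flat column plus a nested record.
      private TupleMetadata dfdlToDrill(DaffodilReaderConfig readerConfig) {
        return new SchemaBuilder()
            .add("a", MinorType.INT)
            .addMap("record")
              .addNullable("b", MinorType.INT)
              .resumeSchema()
            .buildSchema();
      }
    }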
>>>>> Take a look here:
>>>>> https://github.com/apache/drill/blob/2ab46a9411a52f12a0f9acb1144a318059439bc4/contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java#L199
>>>>>
>>>>> I see Paul just responded, so I'll leave you with this. If you have additional questions, send them our way. Do take a look at the Excel plugin as I think it will be helpful.
>>>>
>>>> Yes, I've found the JsonLoaderImpl.readBatch() method, and Daffodil can work similarly.
>>>>
>>>> This will take me a few more days to get to a pull request. The first one will be for initial review, i.e., not intended to merge without more tests. Probably it will support only integer data fields, but it should support lots of data shapes including vectors, choices, sequences, nested records, etc.
>>>>
>>>> Thanks for the help.
>>>>
>>>>>> On Oct 12, 2023, at 2:58 PM, Mike Beckerle <mbecke...@apache.org> wrote:
>>>>>>
>>>>>> So when a data format is described by a DFDL schema, I can generate an equivalent Drill schema (TupleMetadata). This schema is always complete. I have unit tests working with this.
>>>>>>
>>>>>> To do this for a real SQL query, I need the DFDL schema to be identified on the SQL query by a file path or URI.
>>>>>>
>>>>>> Q: How do I get that DFDL schema File/URI parameter from the SQL query?
>>>>>>
>>>>>> Next, assuming I have the DFDL schema identified, I generate an equivalent Drill TupleMetadata from it. (Or, hopefully, retrieve it from a cache.)
>>>>>>
>>>>>> What objects do I call, or what classes do I have to create, to make this Drill TupleMetadata available to Drill so it uses it in all the ways a static Drill schema can be useful?
>>>>>>
>>>>>> I just need pointers to the code that illustrate how to do this. Thanks
>>>>>>
>>>>>> -Mike Beckerle
>>>>>>
>>>>>> On Thu, Oct 12, 2023 at 12:13 AM Paul Rogers <par0...@gmail.com> wrote:
>>>>>>>
>>>>>>> Mike,
>>>>>>>
>>>>>>> This is a complex question and has two answers.
>>>>>>>
>>>>>>> First, the standard enhanced vector framework (EVF) used by most readers assumes a "pull" model: read each record. This is where the next() comes in: readers just implement this to read the next record. But the code under EVF works with a push model: the readers write to vectors and signal the next record. EVF translates the lower-level push model to the higher-level, easier-to-use pull model. The best example of this is the JSON reader, which uses Jackson to parse JSON and responds to the corresponding events.
>>>>>>> You can thus take over the task of filling a batch of records. I'd have to poke around the code to refresh my memory. Or, you can take a look at the (quite complex) JSON parser, or the EVF itself, to see what it does. There are many unit tests that show this at various levels of abstraction.
>>>>>>>
>>>>>>> Basically, you have to:
>>>>>>>
>>>>>>> * Start a batch
>>>>>>> * Ask if you can start the next record (which might be declined if the batch is full)
>>>>>>> * Write each field. For complex fields, such as records, recursively do the start/end record work.
>>>>>>> * Mark the record as complete.
>>>>>>>
>>>>>>> You should be able to map event handlers to EVF actions as a result. Even though DFDL wants to "drive", it still has to give up control once the batch is full. EVF will then handle the (surprisingly complex) task of finishing up the batch and returning it as the output of the Scan operator.
>>>>>>>
>>>>>>> - Paul
>>>>>>>
>>>>>>>> On Wed, Oct 11, 2023 at 6:30 PM Mike Beckerle <mbecke...@apache.org> wrote:
>>>>>>>>
>>>>>>>> Daffodil parsing generates event callbacks to an InfosetOutputter, which is analogous to a SAX event handler.
>>>>>>>>
>>>>>>>> Drill is expecting an iterator style of calling next() to advance through the input, i.e., Drill has the control thread and expects to do pull parsing. At least, that's what I saw in the code I studied in the format-xml contrib.
>>>>>>>>
>>>>>>>> Is there any alternative, before I dig into creating another one of these co-routine-style control inversions (which have proven to be problematic for performance)?
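To make Paul's four steps concrete on the EVF side, here is how I picture the writer half that Daffodil's callbacks would drive. The on*() method names are stand-ins for whatever the InfosetOutputter overrides end up being (they are not Daffodil's actual API), and rowWriter is the RowSetLoader obtained from negotiator.build() in the reader's constructor:

    import org.apache.drill.exec.physical.resultSet.RowSetLoader;
    import org.apache.drill.exec.vector.accessor.ScalarWriter;

    // Bridge sketch: Daffodil event callbacks write rows through EVF.
    public class DrillInfosetWriter {

      private final RowSetLoader rowWriter;

      public DrillInfosetWriter(RowSetLoader rowWriter) {
        this.rowWriter = rowWriter;
      }

      // "Ask if you can start the next record": declined when the batch is full.
      // Returning false is the point where Daffodil has to give up control so
      // EVF can finish the batch and return it from the Scan operator.
      public boolean onStartRecord() {
        if (rowWriter.isFull()) {
          return false;
        }
        rowWriter.start();
        return true;
      }

      // "Write each field": a simple int field here. Nested records would
      // recurse into rowWriter.tuple(name) and do the same start/end work.
      public void onIntField(String name, int value) {
        ScalarWriter writer = rowWriter.scalar(name);
        writer.setInt(value);
      }

      // "Mark the record as complete."
      public void onEndRecord() {
        rowWriter.save();
      }
    }

The control inversion then only has to surface at the batch boundary: the reader's next() drives the Daffodil parse until onStartRecord() reports a full batch or the input is exhausted, which is exactly where the co-routine question above bites.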