cgivre commented on PR #2989: URL: https://github.com/apache/drill/pull/2989#issuecomment-3824439816
Hi Mike,

How’s it going? I hope all is well. It’s been a while since we spoke, and I’d really like to wrap up the Drill / Daffodil work and get it merged, as I’d like to cut a new release in the next month or so.

As I understand it, Daffodil needs various files in order to understand the schema. For Drill’s purposes, the type of file isn’t really important, but all of these files need to be on every Drill node and on the classpath. Ideally, we need a way for a user to upload these files so that they are distributed to all the nodes in the Drill cluster.

My thinking here was that Drill UDFs do basically the same thing with the CREATE FUNCTION USING JAR… syntax. When a user executes a query like this, the JAR file is moved from a staging folder on a single Drillbit to the appropriate location on all the Drillbits in the cluster.

I was thinking that we do the same thing for Daffodil. The syntax might be wonky, but if Daffodil requires a bin file, all the user would have to do is upload it to the staging directory on a single node and execute a query:

    CREATE DAFFODIL SCHEMA USING JAR 'my-file.bin';

If there are additional files, the user could do the same:

    CREATE DAFFODIL SCHEMA USING JAR 'my-other-file.jar';

In the current implementation, Drill doesn’t check the file types, so all that is happening here is that the user uploads a file and Drill distributes it to the cluster. So whether a Daffodil user creates a compiled BIN or uses a collection of JAR files, they can use the same mechanism to get them onto the Drill cluster. We could also modify the query to allow:

    CREATE DAFFODIL SCHEMA USING BIN 'my bin.bin';

In any event, all this does is upload a file and distribute it to the cluster.

Would this work for you?

Thanks!
— C

> On Nov 5, 2025, at 16:41, Mike Beckerle wrote:
>
> mbeckerle left a comment (apache/drill#2989)
> <https://github.com/apache/drill/pull/2989#issuecomment-3493623186>
>
> Ok, If I specify an actual jar file containing some compiled java code, will that be put onto the java classpath in the drill bits?
>
> The issue I'm seeing is that schemas are normally pre-compiled into a ".bin" file which is fast to load, but in addition to this file, the schema may have a dependency on certain Daffodil plug in code, which is compiled java in jar files. This dependency can be on multiple different jar files. All these dependency jar files need to be on the classpath.
>
> The daffodil plugins are of 3 kinds. UDFs, "layers" (which compute checksums or decompress zip files, etc. ), and charset definitions. All are dynamically loaded into the JVM when the DFDL schema requests them. They are found using the
>
> All these different jar files need to be on the Java classpath so that their metadata allows dynamic loading.
>
> So while a simple DFDL schema might be contained in one jar file, in general there can be a dependency on multiple jar files which must be placed onto the Java classpath in a specific order. The schema may be needed in source form also for validation of data.
>
> As a case in point, on github there are DFDL schema projects named:
>
> - envelope-payload
> - tcpMessage
> - mil-std-2045
> - PCAP
> - ethernetIP
>
> These are separate component DFDL schemas that are assembled to form an assembly schema by way of schema composition.
> The only jar file that needs to be on the classpath is the one from ethernetIP, since that defines a layer algorithm for computing IPv4 checksums.
>
> The DFDL schema that combines all these components can be pre-compiled into an envelope-payload.bin file.
>
> So in this case I need this ".bin" file to be distributed across the cluster and loaded by Daffodil in each drill bit, and with the ethernetIP.jar file distributed across the drill cluster and the ethernetIP.jar needs to be on the classpath of the drill bit java process.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
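
For illustration only, the staging-to-cluster distribution step discussed in this thread could be sketched roughly as below. This is not Drill's actual implementation: the class name SchemaDistributor, the method distribute, and the use of local directories to stand in for the Drillbits are all assumptions made for the example. Like the mechanism described above, it does not inspect the file type, so a .bin and a .jar are handled the same way.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Drill: copies one staged file to every node directory.
final class SchemaDistributor {

    // stagingDir: where the user uploaded the file (assumed to be a local directory)
    // fileName:   e.g. "envelope-payload.bin" or "ethernetIP.jar"; the type is not checked
    // nodeDirs:   one target directory per node (stand-ins for the Drillbits)
    static List<Path> distribute(Path stagingDir, String fileName, List<Path> nodeDirs)
            throws IOException {
        Path source = stagingDir.resolve(fileName);
        if (!Files.isRegularFile(source)) {
            throw new IOException("File not found in staging directory: " + source);
        }
        List<Path> copies = new ArrayList<>();
        for (Path nodeDir : nodeDirs) {
            // Create the target directory if this node does not have it yet.
            Files.createDirectories(nodeDir);
            Path target = nodeDir.resolve(fileName);
            // Overwrite an older copy so re-registering a file refreshes every node.
            Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
            copies.add(target);
        }
        return copies;
    }
}
```

Note that copying the files is only half of what the quoted message asks for: any dependency jars would still have to be placed on the classpath of each Drillbit process, which this sketch does not attempt.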
