[
https://issues.apache.org/jira/browse/DRILL-8474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17803592#comment-17803592
]
ASF GitHub Bot commented on DRILL-8474:
---------------------------------------
mbeckerle commented on PR #2836:
URL: https://github.com/apache/drill/pull/2836#issuecomment-1878896878
> @mbeckerle I had a thought about your TODO list. See inline.
>
> > This is ready for a next review. All the scalar types are now implemented with typed setter calls.
> > The prior review comments have all been addressed, I believe.
> > Remaining things to do include:
> >
> > 1. How to get the compiled DFDL schema object so it can be loaded by Daffodil out at the distributed Drill nodes.
>
> I was thinking about this and I remembered something that might be useful. Drill has support for User Defined Functions (UDFs) which are written in Java. To add a UDF to Drill, you also have to write some Java classes in a particular way, and include the JARs. Much like the DFDL class files, the UDF JARs must be accessible to all nodes of a Drill cluster.
>
> Additionally, Drill has the capability of adding UDFs dynamically. This feature was added here: #574. Anyway, I wonder if we could use a similar mechanism to load and store the DFDL files so that they are accessible to all Drill nodes. What do you think?
Excellent: so Drill has all the machinery; it's just a question of repackaging it so it's available for this usage pattern, which is a bit different from Drill's UDFs but also very similar.
There are two user scenarios, which we can call production and test.
1. Production: a binary compiled DFDL schema file plus code jars for Daffodil's own UDFs and "layers" plugins. Ideally this should cache the compiled schema rather than reloading it for every query (at every node), keeping the same loaded instance in memory in a persistent JVM image on each node. For large production DFDL schemas this is the only sensible mechanism, since compiling a large DFDL schema can take minutes.
2. Test: on-the-fly centralized compilation of the DFDL schema (from a combination of jars and files) to create and cache (to avoid recompiling) the binary compiled DFDL schema file. That compiled binary file is then used as in item 1 (see the sketch after this list). For small DFDL schemas this can be fast enough for production use. Ideally, if the DFDL schema is unchanged this would reuse the compiled binary file, but that's an optimization that may not matter much.
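To make the two scenarios concrete, here is a minimal sketch using Daffodil's Java API (org.apache.daffodil.japi). The cache keying, file paths, and error handling are illustrative only, not a proposal for the final plugin code:

```java
import java.io.File;
import java.nio.channels.FileChannel;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.daffodil.japi.Compiler;
import org.apache.daffodil.japi.Daffodil;
import org.apache.daffodil.japi.DataProcessor;
import org.apache.daffodil.japi.ProcessorFactory;

public class DaffodilSchemaCache {
  // One reloaded DataProcessor per compiled-schema path, kept for the
  // life of the JVM so repeated queries don't pay the reload cost.
  private static final ConcurrentHashMap<String, DataProcessor> CACHE =
      new ConcurrentHashMap<>();

  // Test scenario: compile a DFDL schema source file centrally and save
  // the binary compiled form for distribution to the nodes.
  public static void compileAndSave(File schemaFile, File binFile) throws Exception {
    Compiler c = Daffodil.compiler();
    ProcessorFactory pf = c.compileFile(schemaFile);
    if (pf.isError()) {
      throw new IllegalStateException(pf.getDiagnostics().toString());
    }
    DataProcessor dp = pf.onPath("/");
    try (FileChannel out = FileChannel.open(binFile.toPath(),
        StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
      dp.save(out);
    }
  }

  // Production scenario: reload the binary compiled schema once per JVM
  // and cache the resulting (read-only, thread-safe) DataProcessor.
  public static DataProcessor get(File binFile) {
    return CACHE.computeIfAbsent(binFile.getPath(), path -> {
      try {
        return Daffodil.compiler().reload(new File(path));
      } catch (Exception e) {
        throw new RuntimeException("Daffodil reload failed: " + path, e);
      }
    });
  }
}
```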
The kinds of objects involved are:
- Daffodil plugin code jars
- DFDL schema jars
- DFDL schema files (just not packaged into a jar)
- Daffodil compiled schema binary file
- Daffodil config file - parameters, tunables, and options needed at compile
time and/or runtime
Code jars: Daffodil provides two extension features for DFDL users: DFDL UDFs and DFDL 'layers' (e.g., plug-ins for uudecode or gunzip algorithms used in part of the data format). These are ordinary compiled class files in jars, so in all scenarios those jars are needed on the node classpath if the DFDL schema uses them. Daffodil finds and loads them dynamically from the classpath via the regular Java Service Provider Interface (SPI) mechanism.
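For illustration, this is the generic ServiceLoader mechanism that discovery rides on; the MyPlugin interface below is hypothetical, standing in for Daffodil's real UDF/layer SPI types:

```java
import java.util.ServiceLoader;

// Hypothetical plugin interface, standing in for Daffodil's real
// UDF/layer SPI types; the point is only the discovery mechanism.
interface MyPlugin {
  String name();
}

public class SpiDemo {
  public static void main(String[] args) {
    // ServiceLoader scans every jar on the classpath for a
    // META-INF/services/<interface-FQN> file and instantiates the
    // implementations listed there. This is why the code jars merely
    // need to be on the classpath of each Drill node.
    for (MyPlugin p : ServiceLoader.load(MyPlugin.class)) {
      System.out.println("found plugin: " + p.name());
    }
  }
}
```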
Schema jars: Daffodil packages DFDL schema files (source files, i.e., mySchema.dfdl.xsd) into jar files so that inter-schema dependencies can be managed using ordinary jar/Java-style managed dependencies. Tools like sbt and maven can express the dependency of one schema on another, fetch and pull them together, etc. Daffodil has a resolver, so when one schema file references another with include/import, it searches the classpath directories and jars for the files.
Schema jars are only needed centrally, when compiling the schema to a binary file. All inter-schema file references are resolved at schema compile time and their content is incorporated into the compiled binary file, so the schema jars are not needed out at the nodes.
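A minimal sketch of that central compilation step when the root schema file lives inside a schema jar; the resource path is illustrative, and Compiler.compileSource takes a URI, so a classpath resource works directly:

```java
import java.net.URI;

import org.apache.daffodil.japi.Daffodil;
import org.apache.daffodil.japi.ProcessorFactory;

public class CompileFromSchemaJar {
  public static ProcessorFactory compile() throws Exception {
    // Look up the root schema as a classpath resource, so it may live
    // inside a schema jar; include/import references within it are then
    // resolved against the same classpath by Daffodil's resolver.
    URI root = CompileFromSchemaJar.class
        .getResource("/com/example/mySchema.dfdl.xsd") // illustrative path
        .toURI();
    return Daffodil.compiler().compileSource(root);
  }
}
```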
It is possible for one DFDL schema 'project' to define a DFDL schema along with the code for a plugin like a Daffodil UDF or layer. In that case the one jar created is both a code jar and a schema jar. The schema-jar aspects are used when the schema is compiled and ignored at Daffodil runtime; the code-jar aspects are used at Daffodil runtime and ignored at schema compilation time. So a jar that is both code and schema jar needs to be on the classpath in both places, but there is no interaction between the two roles.
Binary Compiled Schema File: Centrally, DFDL schemas in files and/or jars are compiled to create a single binary object which can then be reloaded in order to actually use the schema to parse/unparse data.
- These binary files are tied to a specific version+build of Daffodil. (They are just a Java object serialization of the runtime data structures used by Daffodil.)
- Once reloaded into a JVM to create a Daffodil DataProcessor object, that object is read-only and therefore thread-safe, and can be shared by parse calls happening on many threads (see the sketch below).
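A sketch of that sharing pattern: the DataProcessor is reloaded once and shared, while each parse call gets its own input stream and infoset outputter (the file path is illustrative):

```java
import java.io.ByteArrayInputStream;
import java.io.File;

import org.apache.daffodil.japi.Daffodil;
import org.apache.daffodil.japi.DataProcessor;
import org.apache.daffodil.japi.ParseResult;
import org.apache.daffodil.japi.infoset.XMLTextInfosetOutputter;
import org.apache.daffodil.japi.io.InputSourceDataInputStream;

public class SharedProcessorDemo {
  // Reloaded once; read-only, so safe to share across all worker threads.
  static final DataProcessor DP;
  static {
    try {
      DP = Daffodil.compiler().reload(new File("schema.bin")); // illustrative
    } catch (Exception e) {
      throw new ExceptionInInitializerError(e);
    }
  }

  // Streams and outputters are per-call state, so each thread makes its
  // own; only the DataProcessor itself is shared.
  static void parseOnWorkerThread(byte[] data) {
    InputSourceDataInputStream in =
        new InputSourceDataInputStream(new ByteArrayInputStream(data));
    XMLTextInfosetOutputter out = new XMLTextInfosetOutputter(System.out, true);
    ParseResult res = DP.parse(in, out);
    if (res.isError()) {
      res.getDiagnostics().forEach(d -> System.err.println(d.getMessage()));
    }
  }
}
```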
Daffodil Config File: This contains settings such as which warnings to suppress when compiling and/or at runtime, and tunables such as how large a regex match attempt to allow, the maximum parsed data size limit, etc. It is needed both at schema compile time and at runtime, since the same file contains parameters for both.
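For reference, the same kinds of settings can also be applied programmatically through the japi Compiler; the tunable names below are illustrative and have to match whatever the Daffodil release in use actually defines:

```java
import org.apache.daffodil.japi.Compiler;
import org.apache.daffodil.japi.Daffodil;

public class TunablesDemo {
  public static Compiler configuredCompiler() {
    // withTunable returns a new Compiler, so the result must be kept.
    // Tunable names are illustrative; real names are defined by the
    // Daffodil release in use.
    return Daffodil.compiler()
        .withTunable("maxOccursBounds", "2048")
        .withTunable("maximumRegexMatchLengthInCharacters", "1048576");
  }
}
```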
> Add Daffodil Format Plugin
> --------------------------
>
> Key: DRILL-8474
> URL: https://issues.apache.org/jira/browse/DRILL-8474
> Project: Apache Drill
> Issue Type: New Feature
> Affects Versions: 1.21.1
> Reporter: Charles Givre
> Priority: Major
> Fix For: 1.22.0
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)