cgivre commented on PR #2989: URL: https://github.com/apache/drill/pull/2989#issuecomment-3897574658
Thanks Steve! We’d really welcome feedback once this is merged. Could I send something to the Daffodil list with demo and/or instructions so people could try it out? Regarding the review, unfortunately, you’re not a committer to Drill, so we need Mike or someone else to approve. But my goal is to merge this ASAP. Best, — C > On Feb 13, 2026, at 09:34, Steve Lawrence ***@***.***> wrote: > > > @stevedlawrence approved this pull request. > > Looks great to me! Just a couple tiny comments. > > I agree with Mike it would be useful to test some more complex schemas that are built with multiple component and plugins jars, like the PCAP one mentioned. I think having users modify the classpath would work if Drill doesn't have a built-in way to do that. Though maybe the jar registry does that? It's not clear to me if that automatically puts things on the classpath or just some files available somehow else. > > But even without that, it seems like the core functionality is here. And although many of the more interesting schemas require multiple jars or plugins, there are plenty of schemas that don't, so I think there is still lots of value even without it. > > I would recommend merging this, and later versions of Drill can work to add and or test Daffodil component/plugin support if needed. > > In contrib/format-daffodil/src/main/java/org/apache/drill/exec/store/daffodil/schema/DaffodilDataProcessorFactory.java <https://github.com/apache/drill/pull/2989#discussion_r2804134130>: > > > + dmp.setupDP(validationMode, null); > + } else { > + List<Diagnostic> pfDiags; > + try { > + pfDiags = dmp.compileSchema(schemaFileURI, rootName, rootNS); > + } catch (URISyntaxException | IOException e) { > + throw new CompileFailure(e); > + } > + dmp.setupDP(validationMode, pfDiags); > + } > + return dmp.dp; > + } > + > + private void loadSchema(URI schemaFileURI) throws IOException, InvalidParserException { > + Compiler c = Daffodil.compiler(); > + dp = c.reload(Channels.newChannel(schemaFileURI.toURL().openStream())); > Just to be clear, since I don't think Daffodil documents this very well, the way save/reload works in Daffodil is save uses Java serialization to serialize a DataProcessor, and reload just deserializes it. In all uses that I'm aware of, these .bin files are trusted and so are safe to reload/deserialize. But I know there are a number of security concerns with deserializing untrusted data. If the Drill security model doesn't necessarily trust users with files provided to CREATE DAFFODIL SCHEMA or SELECT * from ... type=daffodil, then it might be worth considering if reload is something you want to disallow. It's a potential loss since saved parsers are easier to distribute and can avoid long schema compilation times, but it does come with potential risk. > > In contrib/format-daffodil/src/main/java/org/apache/drill/exec/store/daffodil/DaffodilBatchReader.java <https://github.com/apache/drill/pull/2989#discussion_r2804303273>: > > > + * > + * @return true if there are rows retrieved, false if no rows were retrieved, which means no more > + * will ever be retrieved (end of data). > + * @throws RuntimeException > + * on parse errors. > + */ > + @Override > + public boolean next() { > + // Check assumed invariants > + // We don't know if there is data or not. This could be called on an empty data file. > + // We DO know that this won't be called if there is no space in the batch for even 1 > + // row. > + if (dafParser.isEOF()) { > + return false; // return without even checking for more rows or trying to parse. > + } > + while (rowSetLoader.start() && !dafParser.isEOF()) { // we never zero-trip this loop. > One thing to consider is that it is possible for Daffodil to consume zero bits of data and it be considered a success. Schemas that allow that are kindof a pathological case and don't really exist in the real world, but might be something to consider since it is technically possible. > > The way to detect this is to access ParseResult.location().bitPos1b() at the end of a parse. If that bit position is the same as before the parse then it means the parse successfully consumed zero bits, and it will continue to do so in an infinite loop if you keep attempting to parse. For example, this usually looks something like this: > > int lastBitPosition = 1; > while (...) > ParseResult pr = dp.parse(); > ... // check isError > int currBitPosition = pr.location().bitPos1b(); > if (currBitPosition == lastBitPosition)) { > // consumed no data, exit parsing to avoid infinite loop > break; > } > lastBitPosition = currBitPosition; > } > Note that even if Daffodil consumes zero bits, it is still possible to create infoset events, so you really do need to check the bit position instead checking if infoset events were created. > > — > Reply to this email directly, view it on GitHub <https://github.com/apache/drill/pull/2989#pullrequestreview-3797200685>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABKB7PTGDQYOJRHOBQWSSZ34LXOILAVCNFSM6AAAAAB4XWHMGWVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZTOOJXGIYDANRYGU>. > You are receiving this because you were assigned. > -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
