Re: [PR] mmap files when possible to improve CLI parse performance [daffodil]

via GitHub Wed, 21 Aug 2024 08:37:30 -0700


stevedlawrence commented on code in PR #1274:
URL: https://github.com/apache/daffodil/pull/1274#discussion_r1725314517



##########
daffodil-cli/src/main/scala/org/apache/daffodil/cli/Main.scala:
##########
@@ -1165,13 +1168,24 @@ class Main(
           case Some(processor) => {
             Assert.invariant(!processor.isError)
             val input = parseOpts.infile.toOption match {
-              case Some("-") | None => STDIN
+              case Some("-") | None => InputSourceDataInputStream(STDIN)
               case Some(file) => {
-                val f = new File(file)
-                new FileInputStream(f)
+                // for files <= 2GB, use a mapped byte buffer to avoid the overhead related to
+                // the BucketingInputSource. Larger files cannot be mapped so we cannot avoid it
+                val path = Paths.get(file)
+                val size = Files.size(path)
+                if (size <= Int.MaxValue) {

Review Comment:
   The nightlies don't use the `parse` command, so they won't see any change. 
They use the `performance` command, which reads test files into a byte array 
before testing to avoid overhead related to disk I/O.
   
   We could create some patches to run on the nightlies, one that changes the 
performance command to use a FileInputStream and one that uses a 
MappedByteBuffer, which would give us an idea of mmap vs. file input stream 
performance. But that feels like a decent amount of work just to find the file 
size below which mmap overhead exceeds bucketing overhead. Also, based on my 
bucketing vs. non-bucketing tests, I suspect bucketing overhead is greater than 
mmap overhead even for small files, so we should always avoid bucketing when 
possible.
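
For reference, here is a minimal sketch of the two input strategies being compared, written against plain `java.nio` rather than Daffodil's own classes. The object name `InputChoice` and the method `open` are hypothetical and exist only for illustration. The sketch assumes the only criterion is whether the file fits in a single mapping (at most `Int.MaxValue` bytes), as in the diff above.

```scala
import java.io.{ FileInputStream, InputStream }
import java.nio.ByteBuffer
import java.nio.channels.FileChannel
import java.nio.file.{ Files, Path, StandardOpenOption }

// Hypothetical helper, not part of Daffodil.
object InputChoice {
  // Returns a read-only mapped buffer when the file fits in one mapping,
  // otherwise falls back to a plain stream (which would need bucketing).
  def open(path: Path): Either[InputStream, ByteBuffer] = {
    val size = Files.size(path)
    if (size <= Int.MaxValue) {
      val channel = FileChannel.open(path, StandardOpenOption.READ)
      try {
        // A mapping remains valid after its channel is closed.
        Right(channel.map(FileChannel.MapMode.READ_ONLY, 0, size))
      } finally {
        channel.close()
      }
    } else {
      Left(new FileInputStream(path.toFile))
    }
  }
}
```

A caller would pattern match on the result and hand either the `ByteBuffer` or the `InputStream` to whatever consumes the data. Timing the two branches on the same set of files is roughly what the nightly patches described above would have to do.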



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
