Re: [PR] mmap files when possible to improve CLI parse performance [daffodil]

via GitHub Wed, 21 Aug 2024 08:02:01 -0700


stevedlawrence commented on code in PR #1274:
URL: https://github.com/apache/daffodil/pull/1274#discussion_r1725232905



##########
daffodil-cli/src/main/scala/org/apache/daffodil/cli/Main.scala:
##########
@@ -1165,13 +1168,24 @@ class Main(
           case Some(processor) => {
             Assert.invariant(!processor.isError)
             val input = parseOpts.infile.toOption match {
-              case Some("-") | None => STDIN
+              case Some("-") | None => InputSourceDataInputStream(STDIN)
               case Some(file) => {
-                val f = new File(file)
-                new FileInputStream(f)
+                // for files <= 2GB, use a mapped byte buffer to avoid the overhead related to
+                // the BucketingInputSource. Larger files cannot be mapped so we cannot avoid it
+                val path = Paths.get(file)
+                val size = Files.size(path)
+                if (size <= Int.MaxValue) {
+                  val fc = FileChannel.open(path, StandardOpenOption.READ)
+                  val bb = fc.map(FileChannel.MapMode.READ_ONLY, 0, size)

Review Comment:
   We could, but I'm a little hesitant to force something on an API user if we 
can't say for sure it will be faster in 100% of cases, especially since there are 
cases where it could be slower (e.g. the small files you mentioned).
   
   An alternative might be to just provide better API 
documentation, something like:
   
   > The InputStream variant has potential overhead due to its streaming 
capabilities and support for unlimited data sizes. In some cases, better 
performance might come from using the ByteBuffer variant instead. For example, 
if your data is already in a byte array, use the Array[Byte] or 
ByteBuffer variants instead of wrapping it in a ByteArrayInputStream. As 
another example, instead of using a FileInputStream, consider mapping 
the File to a MappedByteBuffer, keeping in mind that MappedByteBuffers might 
have different performance characteristics depending on the file size and 
system.
   
   And then we leave it up to the API users to figure out what works best for 
their system/environment?
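
   To illustrate the choice such documentation would leave to the caller, here 
is a minimal sketch that uses only JDK classes. The object name `MapOrStream` 
and its `open` helper are hypothetical and not part of Daffodil. It mirrors the 
size check in the diff above and returns either a mapped buffer or a plain 
stream; the caller would then hand the result to whichever input variant 
(ByteBuffer or InputStream) it prefers:

```scala
import java.io.InputStream
import java.nio.ByteBuffer
import java.nio.channels.FileChannel
import java.nio.file.{ Files, Path, StandardOpenOption }

// Hypothetical helper, not a Daffodil API.
object MapOrStream {
  // Left: the file fits in a single mapping (at most Int.MaxValue bytes).
  // Right: the file is too large to map in one buffer, so fall back to a stream.
  def open(path: Path): Either[ByteBuffer, InputStream] = {
    val size = Files.size(path)
    if (size <= Int.MaxValue) {
      val fc = FileChannel.open(path, StandardOpenOption.READ)
      // A mapping stays valid after its channel is closed.
      try Left(fc.map(FileChannel.MapMode.READ_ONLY, 0, size))
      finally fc.close()
    } else {
      Right(Files.newInputStream(path))
    }
  }
}
```

   Whether the mapped path is actually faster depends on file size and system, 
so a caller would likely want to measure both branches before committing to one.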



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
