stevedlawrence commented on a change in pull request #539:
URL: https://github.com/apache/daffodil/pull/539#discussion_r620276245
##########
File path: daffodil-io/src/main/scala/org/apache/daffodil/io/InputSource.scala
##########
@@ -255,18 +292,27 @@ class BucketingInputSource(
*/
private def fillBucketsToIndex(goalBucketIndex: Long, bytesNeededInBucket: Long): Boolean = {
var lastBucketIndex = buckets.length - 1
- var hasMoreData = true
+
var needsMoreData = goalBucketIndex > lastBucketIndex || (goalBucketIndex == lastBucketIndex && bytesNeededInBucket > bytesFilledInLastBucket)
while (needsMoreData && hasMoreData) {
// Try to fill the rest of this bucket, regardless of how many bytes are
// actually needed
val emptyBytesInLastBucket = bucketSize - bytesFilledInLastBucket
+ Assert.invariant(emptyBytesInLastBucket >= 0)
Review comment:
Are there any errors if we change this assert to > instead of >=? I
don't think we should ever be calling read with a len of zero bytes.
##########
File path: daffodil-io/src/main/scala/org/apache/daffodil/io/InputSource.scala
##########
@@ -255,18 +292,27 @@ class BucketingInputSource(
*/
private def fillBucketsToIndex(goalBucketIndex: Long, bytesNeededInBucket: Long): Boolean = {
var lastBucketIndex = buckets.length - 1
- var hasMoreData = true
+
var needsMoreData = goalBucketIndex > lastBucketIndex || (goalBucketIndex == lastBucketIndex && bytesNeededInBucket > bytesFilledInLastBucket)
while (needsMoreData && hasMoreData) {
// Try to fill the rest of this bucket, regardless of how many bytes are
// actually needed
val emptyBytesInLastBucket = bucketSize - bytesFilledInLastBucket
+ Assert.invariant(emptyBytesInLastBucket >= 0)
// Try to read enough bytes to fill the rest of this bucket. Note that
// the .read() function could hit EOF (returns -1) or could return
- // anywhere from zero to emptyBytesInLastBucket bytes
- val bytesRead = inputStream.read(buckets(lastBucketIndex).bytes, bytesFilledInLastBucket, emptyBytesInLastBucket)
+ // anywhere from 0 to emptyBytesInLastBucket bytes
+ val bytesRead =
+ inputStream.read(
+ buckets(lastBucketIndex).bytes,
+ bytesFilledInLastBucket,
+ emptyBytesInLastBucket)
+
+ // check for bad inputStream behavior. It's not our fault!
+ if (bytesRead == 0 && emptyBytesInLastBucket > 0)
Review comment:
If we can change the above assert to > 0, this can just become
``bytesRead == 0``
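A minimal sketch (illustrative only, not Daffodil's actual code; `fillOnce` is a made-up name) of why the guard reduces to ``bytesRead == 0`` once the assert guarantees a nonzero-length request:

```scala
// Sketch: if we assert the request length is strictly positive before
// calling read, then a return of 0 can only mean the stream misbehaved,
// so the extra `emptyBytesInLastBucket > 0` test becomes redundant.
def fillOnce(in: java.io.InputStream, bucket: Array[Byte], filled: Int): Int = {
  val empty = bucket.length - filled
  assert(empty > 0, "never call read with a zero-length request")
  // read returns -1 at EOF, or 1 to `empty` bytes for a well-behaved stream
  val bytesRead = in.read(bucket, filled, empty)
  if (bytesRead == 0) // no need to also test `empty > 0` here
    throw new java.io.IOException("input stream misbehaved: read returned 0")
  bytesRead
}
```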
##########
File path: daffodil-io/src/main/scala/org/apache/daffodil/io/InputSourceDataInputStream.scala
##########
@@ -117,6 +118,57 @@ final class InputSourceDataInputStream private (val inputSource: InputSource)
@inline
override final def bitLimit0b: MaybeULong = cst.bitLimit0b
+ /**
+ * Tells us if the underlying input source has detected end-of-data
+ * (the read(...) call returned -1).
+ *
+ * But this does NOT tell us we are positioned at the end, only whether
+ * in the course of reading, we encountered the end of data. If we
+ * backtracked we could have seen the end of data, but backed up in
+ * the data to an earlier position.
+ */
+ def hasReachedEndOfData: Boolean = inputSource.hasReachedEndOfData
+
+ /**
+ * Determine if we're positioned at the end of data without
+ * doing any additional blocking operation such as reading more
+ * data to test if there is any.
+ *
+ * This depends on the underlying inputSource keeping track of
+ * whether it has previously hit the end of data or not.
+ *
+ * @return
+ */
+ final def isAtEnd(): Boolean = {
+ if (bitLimit0b.isDefined) {
+ bitPos0b == bitLimit0b.get
+ } else {
+ // There is no bit limit, so whether we're at the end
+ // requires that we've hit the end-of-data on a read
+ // operation, AND that we've not backtracked since then,
+ // meaning we're positioned at the end of the last byte
+ // read.
+ hasReachedEndOfData && {
+ //
+ // No real input source can end in the middle of a byte.
+ // The only way that can happen is with the bitLimit having been
+ // set, and we've already dealt with that.
+ //
+ // So we just have to know that we're positioned at the
+ // end of the data.
+ //
+ val inputStreamBytePos0b = inputSource.position()
+ val isAtFinalByte = bytePos0b == inputStreamBytePos0b
Review comment:
I thought bytePos0b and inputSource.position() were basically always
kept in sync? When we change bytePos0b, don't we also change the InputSource
byte position, even when we backtrack? Seems like this is always going to be
true, even if we've backtracked. Does the InputSource need an isAtEndOfData,
since only it really knows where the last byte in the InputSource is?
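A hedged sketch of what such an isAtEndOfData might look like (hypothetical class and names, not the Daffodil InputSource): only the source itself knows both whether EOD was ever observed and whether the current position sits at the final byte, so the check stays correct across backtracking.

```scala
// Hypothetical toy InputSource: tracks position plus whether a read has
// ever hit EOD. isAtEndOfData is true only when EOD has actually been
// observed AND the current position is past the last byte.
class SketchInputSource(data: Array[Byte]) {
  private var pos: Long = 0
  private var sawEod: Boolean = false

  def position(): Long = pos
  def setPosition(p: Long): Unit = { pos = p } // models backtracking

  def read(): Int =
    if (pos >= data.length) { sawEod = true; -1 }
    else { val b = data(pos.toInt) & 0xff; pos += 1; b }

  def hasReachedEndOfData: Boolean = sawEod

  // Correct even after backtracking: position must equal the data length.
  def isAtEndOfData: Boolean = sawEod && pos == data.length
}
```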
##########
File path: daffodil-runtime1/src/main/scala/org/apache/daffodil/layers/LayerTransformer.scala
##########
@@ -216,7 +216,7 @@ class JavaIOInputStream(s: InputSourceDataInputStream, finfo: FormatInfo)
}
}
- override def available(): Int = 1
+ override def available(): Int = 0
Review comment:
Does nothing rely on this value? Surprised this didn't cause anything to
fail.
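For context on why a constant 1 is risky (a toy illustration, not Daffodil code): callers commonly treat `available()` as a "can I read without blocking?" hint, and the java.io default for exhausted streams is 0.

```scala
import java.io.InputStream

// Illustrative caller pattern: drain while the stream claims bytes are
// available without blocking. If available() unconditionally returned 1,
// this loop would only be saved from spinning past EOF by the -1 guard;
// returning 0 (the java.io convention) terminates it cleanly.
def drainNonBlocking(in: InputStream): Int = {
  var count = 0
  while (in.available() > 0) {
    if (in.read() == -1) return count // guard against a lying available()
    count += 1
  }
  count
}
```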
##########
File path: daffodil-io/src/main/scala/org/apache/daffodil/io/InputSource.scala
##########
@@ -255,18 +292,27 @@ class BucketingInputSource(
*/
private def fillBucketsToIndex(goalBucketIndex: Long, bytesNeededInBucket: Long): Boolean = {
var lastBucketIndex = buckets.length - 1
- var hasMoreData = true
+
var needsMoreData = goalBucketIndex > lastBucketIndex || (goalBucketIndex == lastBucketIndex && bytesNeededInBucket > bytesFilledInLastBucket)
while (needsMoreData && hasMoreData) {
// Try to fill the rest of this bucket, regardless of how many bytes are
// actually needed
val emptyBytesInLastBucket = bucketSize - bytesFilledInLastBucket
+ Assert.invariant(emptyBytesInLastBucket >= 0)
// Try to read enough bytes to fill the rest of this bucket. Note that
// the .read() function could hit EOF (returns -1) or could return
- // anywhere from zero to emptyBytesInLastBucket bytes
- val bytesRead = inputStream.read(buckets(lastBucketIndex).bytes, bytesFilledInLastBucket, emptyBytesInLastBucket)
+ // anywhere from 0 to emptyBytesInLastBucket bytes
Review comment:
Should we update this comment to say 1 to emptyBytesInLastBucket bytes?
``read()`` should never actually return 0.
##########
File path: daffodil-japi/src/main/scala/org/apache/daffodil/japi/Daffodil.scala
##########
@@ -490,7 +490,11 @@ class DataLocation private[japi] (dl: SDataLocation) {
override def toString() = dl.toString
/**
- * Determine if this data location is at the end of the input data
+ * Determine if this data location is known to be at the end of the data stream.
+ *
+ * Note that this can be false even when all bits have been consumed. This
+ * occurs if no subsequent read of the input data has yet caused the input
+ * stream to detect end-of-data/EOF.
Review comment:
This is an API change. I can imagine cases where people relied on this
behavior for checking left-over data, and I feel like it's very easy for binary
formats to never have this set to true. Previously, unless you were using a
streaming thing, we always knew for sure if we hit EOD.
Maybe we need a tunable to switch between the "read one byte to figure out
EOD" vs "must have hit EOD before, and we aren't actually sure if we are at EOD
yet"? We can default to the old behavior, but people streaming or doing TCP
stuff can switch to the new behavior, with the understanding that isAtEnd might
not be accurate?
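A rough sketch of the proposed tunable (all names are made up for illustration; the real isDefinedForLength and EOD bookkeeping live in the actual stream classes):

```scala
// Hypothetical policy switch between the two end-of-data semantics.
sealed trait EodPolicy
case object ProbeOneByte extends EodPolicy    // old: may read ahead (can block)
case object RequirePriorEod extends EodPolicy // new: never reads, may under-report

def isAtEnd(
    policy: EodPolicy,
    isDefinedForLength: Long => Boolean, // probes the stream; may block
    hasReachedEndOfData: Boolean,        // some prior read returned -1
    atFinalBytePosition: Boolean): Boolean =
  policy match {
    case ProbeOneByte    => !isDefinedForLength(1)
    case RequirePriorEod => hasReachedEndOfData && atFinalBytePosition
  }
```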
##########
File path: daffodil-io/src/main/scala/org/apache/daffodil/io/InputSourceDataInputStream.scala
##########
@@ -117,6 +118,57 @@ final class InputSourceDataInputStream private (val inputSource: InputSource)
@inline
override final def bitLimit0b: MaybeULong = cst.bitLimit0b
+ /**
+ * Tells us if the underlying input source has detected end-of-data
+ * (the read(...) call returned -1).
+ *
+ * But this does NOT tell us we are positioned at the end, only whether
+ * in the course of reading, we encountered the end of data. If we
+ * backtracked we could have seen the end of data, but backed up in
+ * the data to an earlier position.
+ */
+ def hasReachedEndOfData: Boolean = inputSource.hasReachedEndOfData
+
+ /**
+ * Determine if we're positioned at the end of data without
+ * doing any additional blocking operation such as reading more
+ * data to test if there is any.
+ *
+ * This depends on the underlying inputSource keeping track of
+ * whether it has previously hit the end of data or not.
+ *
+ * @return
+ */
+ final def isAtEnd(): Boolean = {
+ if (bitLimit0b.isDefined) {
+ bitPos0b == bitLimit0b.get
Review comment:
I'm not sure it's safe to use bitLimit here? I believe we change
``bitLimit0b`` when complex types have specified lengths to ensure children do
not parse beyond those specified lengths. If we have reached the end of the
complex type specified length, we shouldn't be considered isAtEnd, but I think
this will do that?
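The concern can be shown with made-up numbers (purely illustrative; the real bitLimit0b management is in the parser state):

```scala
// Comparing position against the *current* limit, which may have been
// narrowed for a specified-length complex type.
def limitBasedIsAtEnd(bitPos0b: Long, bitLimit0b: Long): Boolean =
  bitPos0b == bitLimit0b

// Illustrative numbers: 128 bits of total data, limit narrowed to 64 bits
// for a specified-length complex type, and we just finished parsing it.
val totalDataBits = 128L
val narrowedLimit = 64L
val pos = 64L
// limitBasedIsAtEnd(pos, narrowedLimit) is true, yet 64 bits of data remain;
// only comparing against totalDataBits reflects the true end of data.
```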
##########
File path: daffodil-io/src/main/scala/org/apache/daffodil/io/InputSourceDataInputStream.scala
##########
@@ -117,6 +118,57 @@ final class InputSourceDataInputStream private (val inputSource: InputSource)
@inline
override final def bitLimit0b: MaybeULong = cst.bitLimit0b
+ /**
+ * Tells us if the underlying input source has detected end-of-data
+ * (the read(...) call returned -1).
+ *
+ * But this does NOT tell us we are positioned at the end, only whether
+ * in the course of reading, we encountered the end of data. If we
+ * backtracked we could have seen the end of data, but backed up in
+ * the data to an earlier position.
+ */
+ def hasReachedEndOfData: Boolean = inputSource.hasReachedEndOfData
+
+ /**
+ * Determine if we're positioned at the end of data without
+ * doing any additional blocking operation such as reading more
+ * data to test if there is any.
+ *
+ * This depends on the underlying inputSource keeping track of
+ * whether it has previously hit the end of data or not.
+ *
+ * @return
Review comment:
Missing scaladoc for ``@return``
##########
File path: daffodil-japi/src/main/scala/org/apache/daffodil/japi/io/InputSourceDataInputStream.scala
##########
@@ -43,5 +43,13 @@ class InputSourceDataInputStream private[japi] (private[japi] val dis: SInputSo
/**
* Create an InputSourceDataInputStream from a byte array
*/
- def this(arr: Array[Byte]) = this(SInputSourceDataInputStream(arr))
+ def this(arr: Array[Byte]) = this(SInputSourceDataInputStream(arr))
+
+ /**
+ * Used to see if there is more data or not after a parse operation.
+ * Whether more data is a left-over-data error, or just more data to
+ * parse in a subsequent parse call depends on the intention of the
+ * code calling the parse operation.
+ */
Review comment:
I think it's worth adding a comment saying that this will block until
either at least n bytes are read or EOD is hit.
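A sketch of the blocking behavior being described (hypothetical standalone function, not the actual API; the real method would presumably buffer what it reads rather than consume it):

```scala
import java.io.InputStream

// The check cannot answer "is there more data?" without reading, so it
// blocks until either n bytes have been seen or EOD (-1) is hit.
def hasAtLeastNBytes(in: InputStream, n: Long): Boolean = {
  var seen = 0L
  while (seen < n) {
    val b = in.read() // blocks until a byte is available or EOD
    if (b == -1) return false
    seen += 1
  }
  true
}
```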
##########
File path: daffodil-tdml-processor/src/main/scala/org/apache/daffodil/tdml/processor/DaffodilTDMLDFDLProcessor.scala
##########
@@ -298,6 +298,13 @@ class DaffodilTDMLDFDLProcessor private (private var dp: DataProcessor) extends
def doParseWithBothApis(dpInputStream: java.io.InputStream, saxInputStream: java.io.InputStream, lengthLimitInBits: Long): TDMLParseResult = {
+ //
+ // TDML Tests MUST have a length limit. Otherwise they cannot determine if
+ // there is left-over-data or not without doing more reading from the input stream
+ // so as to be sure to hit end-of-data.
+ //
+ Assert.usage(lengthLimitInBits >= 0)
Review comment:
Can't TDML tests use the isDefinedForLength(1) thing? Then we don't
necessarily have to rely on lengthLimit?
##########
File path: daffodil-io/src/test/scala/org/apache/daffodil/layers/TestBase64.scala
##########
@@ -66,7 +66,7 @@
W1vdXMgcXVvdGVzIG9yIHNvbmcgbHlyaWNzIG9yIGFueXRoaW5nIGxpa2UgdGhhdCBpbnRyb2R1Y
pairs.foreach {
case (exp, act) => {
if (exp != act) {
- println("differ at character %s (0-based). Expected '%s' got '%s'.".format(i, exp, act))
+ // println("differ at character %s (0-based). Expected '%s' got '%s'.".format(i, exp, act))
Review comment:
Should we just delete these printlns?
##########
File path: daffodil-runtime1/src/main/scala/org/apache/daffodil/processors/parsers/PState.scala
##########
@@ -288,7 +285,8 @@ final class PState private (
}
def currentLocation: DataLocation = {
- val isAtEnd = !dataInputStream.isDefinedForLength(1)
+ // val isAtEnd = !dataInputStream.isDefinedForLength(1)
+ val isAtEnd = dataInputStream.isAtEnd()
Review comment:
Agreed, but need to consider backwards compatibility.
##########
File path: daffodil-test/src/test/resources/org/apache/daffodil/section12/lengthKind/ExplicitTests.tdml
##########
@@ -598,7 +598,7 @@
<tdml:documentPart type="text"><![CDATA[000118Ridgewood Circle
Rochester NY123]]></tdml:documentPart>
</tdml:document>
<tdml:errors>
- <tdml:error>6467715096</tdml:error>
+ <tdml:error>6467715464</tdml:error>
Review comment:
Any idea why this number changed?
##########
File path: daffodil-io/src/main/scala/org/apache/daffodil/io/InputSourceDataInputStream.scala
##########
@@ -117,6 +118,57 @@ final class InputSourceDataInputStream private (val inputSource: InputSource)
@inline
override final def bitLimit0b: MaybeULong = cst.bitLimit0b
+ /**
+ * Tells us if the underlying input source has detected end-of-data
+ * (the read(...) call returned -1).
+ *
+ * But this does NOT tell us we are positioned at the end, only whether
+ * in the course of reading, we encountered the end of data. If we
+ * backtracked we could have seen the end of data, but backed up in
+ * the data to an earlier position.
+ */
+ def hasReachedEndOfData: Boolean = inputSource.hasReachedEndOfData
+
+ /**
+ * Determine if we're positioned at the end of data without
+ * doing any additional blocking operation such as reading more
+ * data to test if there is any.
+ *
+ * This depends on the underlying inputSource keeping track of
+ * whether it has previously hit the end of data or not.
+ *
+ * @return
+ */
+ final def isAtEnd(): Boolean = {
+ if (bitLimit0b.isDefined) {
+ bitPos0b == bitLimit0b.get
+ } else {
+ // There is no bit limit, so whether we're at the end
+ // requires that we've hit the end-of-data on a read
+ // operation, AND that we've not backtracked since then,
+ // meaning we're positioned at the end of the last byte
+ // read.
+ hasReachedEndOfData && {
Review comment:
I think this is now a non-backwards-compatible change. isAtEnd can
only be true if we've gotten an EOD. And I imagine in a lot of binary formats
we'll never try that extra read to get the EOD because we don't ever need to
scan.
##########
File path: daffodil-sapi/src/main/scala/org/apache/daffodil/sapi/io/InputSourceDataInputStream.scala
##########
@@ -43,5 +43,14 @@ class InputSourceDataInputStream private[sapi] (private[sapi] val dis: SInputSo
/**
* Create an InputSourceDataInputStream from a byte array
*/
- def this(arr: Array[Byte]) = this(SInputSourceDataInputStream(arr))
+ def this(arr: Array[Byte]) = this(SInputSourceDataInputStream(arr))
+
+
+ /**
+ * Used to see if there is more data or not after a parse operation.
+ * Whether more data is a left-over-data error, or just more data to
+ * parse in a subsequent parse call depends on the intention of the
+ * code calling the parse operation.
+ */
Review comment:
Same comment here as in the JAPI, might want to mention the blocking
aspect of this.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]