[jira] [Commented] (DAFFODIL-2502) Parse must behave properly for reading data from TCP sockets

Mike Beckerle (Jira) Fri, 23 Apr 2021 10:48:07 -0700


    [ 
https://issues.apache.org/jira/browse/DAFFODIL-2502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330957#comment-17330957
 ]


Mike Beckerle commented on DAFFODIL-2502:
-----------------------------------------

Consider two kinds of data:

Format 1:

It begins with a 2-byte binary length giving the length of the data in bytes, 
exclusive of the 2 length bytes themselves.
Then comes the data.

Format 2:

The data is delimited by a terminator consisting of 3 to 5 $ characters: 
$$$, $$$$, or $$$$$.

Consider reading streams of Format 1 vs. streams of Format 2 from a network 
connection.

The sender on the other end of the connection has sent exactly 1 message.

Consider the behavior of the I/O Layer.

The I/O layer is going to attempt to fill some buckets with data. This is done 
in a way that is ignorant of the data format's requirements.

For Format 1, suppose the length is 6 and the data is 123456. We need a total 
of 8 bytes: 2 for the length and 6 for the data.

But the I/O layer is going to do blocking reads to try to fill a bucket. 
Daffodil will hang until enough messages arrive to fill a bucket, or until 
end-of-data arrives and a short bucket is created.

If the sender doesn't send a second message for a long time, Daffodil will hang 
that entire time.
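
To make the hang concrete, here is a rough sketch of a format-ignorant bucket fill. It is not Daffodil's actual I/O code. The names BucketFillDemo, SimulatedSocketStream, fillBucket and the flag wouldHaveBlocked are all made up for illustration, and the 16-byte bucket size is an arbitrary assumption. The simulated stream records the point where a real socket would block, and returns end-of-data there only so the demo terminates.

```java
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical illustration of a format-ignorant bucket fill; not Daffodil's code. */
final class BucketFillDemo {

    /** Simulated socket: serves the bytes sent so far, then notes that a real read would block. */
    static final class SimulatedSocketStream extends InputStream {
        private final byte[] sent;
        private int pos = 0;
        boolean wouldHaveBlocked = false;

        SimulatedSocketStream(byte[] sent) { this.sent = sent; }

        @Override public int read(byte[] b, int off, int len) {
            if (len == 0) return 0;
            int avail = sent.length - pos;
            if (avail == 0) {
                wouldHaveBlocked = true; // a real socket would block here
                return -1;               // simulation only: report end-of-data instead
            }
            int n = Math.min(len, avail);
            System.arraycopy(sent, pos, b, off, n);
            pos += n;
            return n;
        }

        @Override public int read() {
            byte[] one = new byte[1];
            return read(one, 0, 1) < 0 ? -1 : one[0] & 0xFF;
        }
    }

    /** Reads until the bucket is full or end-of-data, knowing nothing about the format. */
    static int fillBucket(InputStream in, byte[] bucket) throws IOException {
        int filled = 0;
        while (filled < bucket.length) {
            int n = in.read(bucket, filled, bucket.length - filled);
            if (n < 0) break; // end-of-data: a short bucket results
            filled += n;
        }
        return filled;
    }
}
```

With the 8-byte Format 1 message as the only data sent, filling a 16-byte bucket gets 8 bytes and then issues another read, which is where a real connection would hang. Filling exactly 8 bytes never issues that extra read.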

For Format 2, suppose the data sent is 12345$$$.

Daffodil cannot convert this into a message, because until the next byte 
arrives, we don't know if it will be a $ making the terminator longer. If it 
does arrive and is a $, then we must wait for yet another character to know if 
the terminator is the maximum length of $$$$$.

Until the next message is sent by the sender, we don't know if we've finished 
parsing the first.

This is a standard networking problem, and is why messages generally have a 
length header. There's nothing we can do about this particular Format 2 
problem. It's inherent.
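
The ambiguity can be shown with a small scanner. This is only a sketch under my own assumptions: DollarTerminatorScan and completeDataLength are hypothetical names, not anything in Daffodil, and the terminator is taken to be the longest run of 3 to 5 $ bytes. The scanner returns the length of the data before the terminator only when it can be sure the terminator is complete, and -1 when it cannot decide yet.

```java
/** Hypothetical helper, not part of Daffodil: scans for a terminator of 3 to 5 '$' bytes. */
final class DollarTerminatorScan {
    static final int MIN_RUN = 3;
    static final int MAX_RUN = 5;

    /**
     * Returns the length of the data preceding a terminator that is known to be
     * complete, or -1 if no complete message can be identified yet.
     * endOfData means no further bytes will ever arrive.
     */
    static int completeDataLength(byte[] buf, int len, boolean endOfData) {
        int i = 0;
        while (i < len) {
            if (buf[i] != '$') { i++; continue; }
            int start = i;
            // Consume a run of '$', but never more than the maximum terminator length.
            while (i < len && buf[i] == '$' && i - start < MAX_RUN) i++;
            int run = i - start;
            // The run cannot grow if it is already maximal, is followed by
            // another byte, or the stream has ended.
            boolean runIsClosed = run == MAX_RUN || i < len || endOfData;
            if (run >= MIN_RUN && runIsClosed) return start;
            if (!runIsClosed) return -1; // run touches the end of the buffer and may still grow
            // Otherwise: 1 or 2 '$' inside the data; keep scanning.
        }
        return -1;
    }
}
```

For 12345$$$ with the connection still open the result is -1, since a fourth $ might still arrive. The same bytes yield 5 once end-of-data is known, and 12345$$$$$ yields 5 immediately because the terminator is already at its maximum length.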

But for Format 1, it is problematic to wait and be unable to return a parsed 
message just because the bucket algorithm is trying to fill buckets. Even with 
a length header, Daffodil is still going to block waiting for more data than 
is needed to succeed with a parse.

It seems to me when the data is coming from a network connection, we want 
Daffodil to be able to inform the I/O layer about the required length, and not 
end up issuing blocking reads beyond that length.
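
As a rough sketch of what a length-aware read looks like for Format 1, using only standard java.io calls: this is not a proposal for the actual Daffodil API. LengthPrefixedReader and readMessage are hypothetical names, and the length is assumed to be big-endian and unsigned, which the format description above does not specify.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical sketch, not Daffodil's I/O layer: reads one Format 1 message. */
final class LengthPrefixedReader {

    /**
     * Reads the 2-byte length (assumed big-endian, unsigned), then exactly that
     * many bytes of data. Never requests bytes beyond the end of the message.
     */
    static byte[] readMessage(InputStream in) throws IOException {
        DataInputStream din = new DataInputStream(in); // unbuffered wrapper
        int length = din.readUnsignedShort();          // blocks for exactly 2 bytes
        byte[] data = new byte[length];
        din.readFully(data);                           // blocks for exactly `length` bytes
        return data;
    }
}
```

For the 8-byte example, this returns 123456 after consuming exactly 8 bytes from the stream, so it does not wait on a second message.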

 

 

> Parse must behave properly for reading data from TCP sockets
> ------------------------------------------------------------
>
>                 Key: DAFFODIL-2502
>                 URL: https://issues.apache.org/jira/browse/DAFFODIL-2502
>             Project: Daffodil
>          Issue Type: Bug
>          Components: API, Back End
>    Affects Versions: 3.0.0
>            Reporter: Mike Beckerle
>            Assignee: Mike Beckerle
>            Priority: Major
>
> Daffodil assumes the input streams are like files - reads are always blocking 
> for either 1 or more bytes of data, or End-of-data.
> People want to use Daffodil to read data from TCP/IP sockets. These can 
> return 0 bytes from a read because there is no data available, but that does 
> NOT mean the end of data. It's just a temporary condition. More data may come 
> along.
> Daffodil's InputSourceDataInputStream is wrapped around a regular Java input 
> stream, and enables us to support incoming messages which do not conform to 
> byte-boundaries.
> The problem is that there's no way for users to wrap an 
> InputSourceDataInputStream around a TCP/IP socket, and have it behave 
> properly when a read() call temporarily says 0 bytes available.
> Obviously we don't want to sit in a tight loop just retrying the read until 
> we get either some bytes or end-of-data.
> The right API here is that if the read() of the underlying Java stream 
> returns 0 bytes, a hook function supplied by the API user is called.
> One obvious thing a user can do is put a call to Thread.yield() in the hook. 
> (That might even want to be the default behavior if they supply no hook.) 
> Then if they have a separate thread parsing the data with daffodil, that 
> thread will at least yield the CPU, i.e., behave politely in a multi-threaded 
> world.
> More advanced usage could start a Daffodil parse using co-routines, returning 
> control to the caller when the parse must pause due to read() of the Java 
> input stream returning 0 bytes.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
