[jira] [Commented] (DAFFODIL-2502) Parse must behave properly for reading data from TCP sockets

Steve Lawrence (Jira) Fri, 23 Apr 2021 11:20:06 -0700


    [ 
https://issues.apache.org/jira/browse/DAFFODIL-2502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330974#comment-17330974
 ]


Steve Lawrence commented on DAFFODIL-2502:
------------------------------------------

{quote}But the I/O layer is going to do blocking reads to try to fill a bucket. 
Daffodil will hang until enough messages arrive to fill a bucket, or an 
end-of-data arrives so that a short-bucket will be created.
{quote}
This isn't quite accurate. The I/O layer blocks only until it has read the 
bytes it needs to continue parsing whatever simple element it's currently 
trying to parse. For example, in data format 1, Daffodil first needs only 2 
bytes to figure out the length field, so the I/O layer reads and blocks until 
it has bucketed at least 2 bytes of data or gets an EOF. Once it has at least 
2 bytes, it unblocks and the parse continues. Based on that parsed field, it 
then determines that it needs 6 bytes of data. If it hasn't already bucketed 
those 6 bytes, it again reads and blocks until it gets them, or EOF. So 
Daffodil doesn't need a full bucket to unblock.
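
To make the blocking behavior above concrete, here is a minimal sketch in 
plain Java. It is not Daffodil's actual I/O layer: the class name BucketFill, 
the method names, and the big-endian 2-byte length encoding are all 
assumptions made for illustration. Each read() asks for all the free space in 
the bucket, but the loop stops as soon as the needed byte count is buffered.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical illustration only; not Daffodil's real implementation.
final class BucketFill {

    /**
     * Blocks until at least `needed` bytes are buffered in `bucket`, or EOF.
     * `have` is the number of bytes already buffered; the new count is returned.
     */
    static int fillAtLeast(InputStream in, byte[] bucket, int have, int needed)
            throws IOException {
        if (needed > bucket.length) {
            throw new IllegalArgumentException("bucket too small for " + needed + " bytes");
        }
        while (have < needed) {
            // Ask for up to the rest of the bucket; the stream may return fewer bytes.
            int n = in.read(bucket, have, bucket.length - have);
            if (n < 0) {
                throw new EOFException("EOF after " + have + " of " + needed + " bytes");
            }
            have += n;
        }
        return have;
    }

    /** Decodes the 2-byte length field (big-endian is an assumption). */
    static int payloadLength(byte[] bucket) {
        return ((bucket[0] & 0xFF) << 8) | (bucket[1] & 0xFF);
    }
}
```

A caller would first call fillAtLeast(in, bucket, 0, 2), decode the length, 
and then call fillAtLeast again with 2 plus that length. If the first read 
already bucketed the whole message, the second call returns without reading.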

Note, however, that when we do read, we specify that the InputStream can 
return up to a full bucket, so that we try to get big chunks of data and 
aren't constantly calling read(). But the InputStream doesn't need to do that. 
The Javadoc for read says:
{quote}An attempt is made to read as many as {{len}} bytes, but a smaller 
number may be read
{quote}
So if the InputStream is implemented so that read won't return until it gets 
all {{len}} bytes, there's not much we can do about that; we're at the mercy 
of the InputStream. I would hope that InputStreams related to TCP connections 
wouldn't do that and would return fewer than {{len}} bytes rather than waiting 
for more data to come over the wire. If that's not the case, then we are in 
trouble, and we either need to configure Daffodil to ask for smaller buckets, 
or use a different InputStream that will stream data better.
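
As one possible shape for such a "different InputStream", here is a sketch of 
a wrapper that never asks the underlying stream for more than it reports as 
available. The class name ShortReadInputStream is hypothetical, and this is 
not something Daffodil provides. It also assumes the wrapped stream's 
available() is meaningful; where available() returns 0, the wrapper falls 
back to requesting a single byte, which is slow but cannot wait on a full 
bucket.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical wrapper; caps each read at what the source says is ready now.
final class ShortReadInputStream extends FilterInputStream {

    ShortReadInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        // available() is only an estimate; always request at least 1 byte so the
        // call still blocks for data (or reports EOF) when nothing is ready yet.
        int capped = Math.min(len, Math.max(1, in.available()));
        return in.read(b, off, capped);
    }
}
```

The wrapper would be placed between the socket's stream and whatever is 
handed to Daffodil, so a request for a full bucket turns into a request for 
only the bytes that have already arrived.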

Regarding format #2, you say
{quote}Until the next message is sent by the sender, we don't know if we've 
finished parsing the first.
{quote}
Yep, Daffodil needs either end of data or the following message. It will 
block until it gets at least one byte/character or an EOF and can determine 
that the terminator is complete. Hopefully there aren't too many formats like 
this though, especially ones that are intended to go across a slow, 
unpredictable wire.
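
To illustrate why one more byte (or EOF) is needed, here is a sketch using a 
made-up format in which a record is terminated by CR, optionally followed by 
LF. This is an assumption for illustration and is not necessarily format #2 
from the discussion; the class and method names are hypothetical as well. 
After seeing CR, the reader cannot tell whether the terminator is complete 
until the next byte arrives or the stream ends.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.PushbackInputStream;

// Hypothetical format: records end with CR, optionally followed by LF.
final class TerminatorPeek {

    /** Returns the next record body, or null if the stream is already at EOF. */
    static byte[] readRecord(PushbackInputStream in) throws IOException {
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1) {
            if (b == '\r') {
                // Blocks here until the sender transmits one more byte or closes.
                int next = in.read();
                if (next != '\n' && next != -1) {
                    in.unread(next); // first byte of the following record
                }
                return body.toByteArray();
            }
            body.write(b);
        }
        // EOF without a terminator: return what was read, or null if nothing.
        return body.size() == 0 ? null : body.toByteArray();
    }
}
```

On a live socket, the read after CR is exactly where the parse would sit 
waiting until the sender transmits the next message or closes the connection.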

> Parse must behave properly for reading data from TCP sockets
> ------------------------------------------------------------
>
>                 Key: DAFFODIL-2502
>                 URL: https://issues.apache.org/jira/browse/DAFFODIL-2502
>             Project: Daffodil
>          Issue Type: Bug
>          Components: API, Back End
>    Affects Versions: 3.0.0
>            Reporter: Mike Beckerle
>            Assignee: Mike Beckerle
>            Priority: Major
>
> Daffodil assumes the input streams are like files - reads are always blocking 
> for either 1 or more bytes of data, or End-of-data.
> People want to use Daffodil to read data from TCP/IP sockets. These can 
> return 0 bytes from a read because there is no data available, but that does 
> NOT mean the end of data. It's just a temporary condition. More data may come 
> along.
> Daffodil's InputSourceDataInputStream is wrapped around a regular Java input 
> stream, and enables us to support incoming messages which do not conform to 
> byte-boundaries.
> The problem is that there's no way for users to wrap an 
> InputSourceDataInputStream around a TCP/IP socket, and have it behave 
> properly when a read() call temporarily says 0 bytes available.
> Obviously we don't want to sit in a tight loop just retrying the read until 
> we get either some bytes or end-of-data.
> The right API here is that if the read() of the underlying java stream 
> returns 0 bytes, that a hook function supplied by the API user is called.
> One obvious thing a user can do is put a call to Thread.yield() in the hook. 
> (That might even want to be the default behavior if they supply no hook.) 
> Then if they have a separate thread parsing the data with daffodil, that 
> thread will at least yield the CPU, i.e., behave politely in a multi-threaded 
> world.
> More advanced usage could start a Daffodil parse using co-routines, returning 
> control to the caller when the parse must pause due to read() of the Java 
> input stream returning 0 bytes.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
