[
https://issues.apache.org/jira/browse/DAFFODIL-2502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330997#comment-17330997
]
Mike Beckerle commented on DAFFODIL-2502:
-----------------------------------------
Ok, so there are numerous things in DFDL that likely request big chunks of data
from the I/O layer.
Ex: regex pattern match - I believe we fill a buffer of some adapted size, then
try to match. But the match may turn out to need less than the buffer, so by
requesting the larger buffer we potentially blocked waiting for data to fill it,
when the parse could have succeeded with less.
So if I have a dfdl:assert or discriminator with testPattern="." that's only
looking for 1 byte, but I believe we're going to ask the I/O layer for
whatever the current "regex match buffer" has adapted itself to being. This is
hard to solve without a new regex library that sources data directly from an
input stream.
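For context, the JDK regex API that this buffering serves only matches against an in-memory CharSequence, which is why some buffer must be prefetched even for a one-character testPattern. A minimal illustration (my own sketch, not Daffodil code):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexBufferDemo {
    public static void main(String[] args) {
        // java.util.regex can only match in-memory CharSequences, so an I/O
        // layer must fill *some* buffer before any match is attempted --
        // potentially far more data than the pattern actually needs.
        Pattern dot = Pattern.compile(".", Pattern.DOTALL);
        CharSequence buffered = "x"; // pretend this came from a (blocking) buffer fill
        Matcher m = dot.matcher(buffered);
        System.out.println(m.lookingAt()); // prints "true": one char was enough
    }
}
```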
But "messages" aren't typically using that sort of regex thing to determine
their length.
If we have a complex type element with specified length, the combinator for
that could, right at the top, ask the I/O layer to ensure that much data is
available. That would block until that much is made available from the
network input stream. After that the parse operates within the already-fetched
data, so nothing should block after that so long as it is not doing things like
regex matches etc.
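That up-front "ensure N bytes" request amounts to an ordinary read-fully loop over the underlying stream. A minimal sketch (hypothetical helper, not Daffodil API):

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper: block until 'length' bytes have been buffered, the
// way a specified-length combinator could prefetch its whole region before
// parsing within it.
public final class Prefetch {
    public static byte[] readExactly(InputStream in, int length) throws IOException {
        byte[] buf = new byte[length];
        int off = 0;
        while (off < length) {
            // read() blocks until at least 1 byte is available, or EOF
            int n = in.read(buf, off, length - off);
            if (n < 0) throw new EOFException("stream ended before " + length + " bytes");
            off += n;
        }
        return buf;
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new java.io.ByteArrayInputStream("header+payload".getBytes());
        System.out.println(new String(readExactly(in, 6))); // prints "header"
    }
}
```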
This only works if, as you said:
{quote}I would hope that InputStreams related to TCP connections wouldn't do
that and would return less than len bytes rather than hoping for more data to
come over the wire. If that's not the case, then we are in trouble
{quote}
I will investigate source code I can find online to see if this sort of logic
is there, and maybe rig up an experiment to test it.
> Parse must behave properly for reading data from TCP sockets
> ------------------------------------------------------------
>
> Key: DAFFODIL-2502
> URL: https://issues.apache.org/jira/browse/DAFFODIL-2502
> Project: Daffodil
> Issue Type: Bug
> Components: API, Back End
> Affects Versions: 3.0.0
> Reporter: Mike Beckerle
> Assignee: Mike Beckerle
> Priority: Major
>
> Daffodil assumes the input streams are like files - reads always block until
> either 1 or more bytes of data are available, or End-of-data.
> People want to use Daffodil to read data from TCP/IP sockets. These can
> return 0 bytes from a read because there is no data available, but that does
> NOT mean the end of data. It's just a temporary condition. More data may come
> along.
> Daffodil's InputSourceDataInputStream is wrapped around a regular Java input
> stream, and enables us to support incoming messages which do not conform to
> byte-boundaries.
> The problem is that there's no way for users to wrap an
> InputSourceDataInputStream around a TCP/IP socket, and have it behave
> properly when a read() call temporarily says 0 bytes available.
> Obviously we don't want to sit in a tight loop just retrying the read until
> we get either some bytes or end-of-data.
> The right API here is that if the read() of the underlying Java stream
> returns 0 bytes, a hook function supplied by the API user is called.
> One obvious thing a user can do is put a call to Thread.yield() in the hook.
> (That might even want to be the default behavior if they supply no hook.)
> Then if they have a separate thread parsing the data with daffodil, that
> thread will at least yield the CPU, i.e., behave politely in a multi-threaded
> world.
> More advanced usage could start a Daffodil parse using co-routines, returning
> control to the caller when the parse must pause due to read() of the Java
> input stream returning 0 bytes.
>
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)