[
https://issues.apache.org/jira/browse/DAFFODIL-2502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17331679#comment-17331679
]
Mike Beckerle commented on DAFFODIL-2502:
-----------------------------------------
The DFA that recognizes delimiters calls Registers.advance(), which peeks 2
characters ahead in the data stream.
Those peek operations fetch characters into a decodedChars CharBuffer, which
defaults to size 8.
(These magic-number 8s are in class
InputSourceDataInputStreamCharIteratorState.)
Why 8? Probably some efficiency argument. This is not data being cached by the
Bucket algorithm, which happens underneath this; it is an attempt to call
decode less often when decoding bytes to characters.
Well, the smallest this allocation can be is 2, because we peek ahead 2
characters. I set it to 2, and voila, the number of characters needed to get
the read to unblock decreased from 7 more characters to just 2 more.
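To illustrate the effect, here is a minimal sketch (not Daffodil's actual code) of a reader that fills a fixed-capacity CharBuffer before it can answer a 2-character peek. The class PeekBufferSketch, the CountingStream helper, and the one-byte-per-character ASCII decoding are all assumptions made for illustration. The sketch only shows the direction of the effect (a larger buffer demands more input before the peek returns); it does not reproduce the exact count of 7 observed above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; names and structure are NOT taken from Daffodil.
class PeekBufferSketch {

    /** Wraps a stream and counts how many bytes were actually consumed from it. */
    static final class CountingStream extends InputStream {
        private final InputStream in;
        int bytesRead = 0;

        CountingStream(InputStream in) { this.in = in; }

        @Override
        public int read() throws IOException {
            int b = in.read();
            if (b >= 0) bytesRead++;
            return b;
        }
    }

    /**
     * Fills a decoded-character buffer of the given capacity (assuming one
     * ASCII byte per character) before a 2-character peek can be answered.
     * Returns the number of bytes that had to be consumed from the source.
     */
    static int bytesConsumedForPeek(int capacity, byte[] data) throws IOException {
        CountingStream src = new CountingStream(new ByteArrayInputStream(data));
        CharBuffer decodedChars = CharBuffer.allocate(capacity);
        while (decodedChars.hasRemaining()) {
            int b = src.read();          // on a socket, this is where we would block
            if (b < 0) break;            // end of data
            decodedChars.put((char) b);
        }
        decodedChars.flip();             // the peek would now be served from here
        return src.bytesRead;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "abcdefghij".getBytes(StandardCharsets.US_ASCII);
        System.out.println(bytesConsumedForPeek(8, data)); // prints 8
        System.out.println(bytesConsumedForPeek(2, data)); // prints 2
    }
}
```

With a capacity of 8 the fill loop wants 8 bytes before the peek is served, even though only 2 characters are needed; with a capacity of 2 it wants exactly 2.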
This whole algorithm should be re-examined to see if we in fact really need to
peek ahead 2 characters or not. It seems to me we're peeking ahead before we
need to.
In any case, this issue is a separate bug ticket (DAFFODIL-2504) because the
concern is only about simple types with representation text WITHOUT specified
length.
Similar issues will arise for lengthKind 'pattern', because regex matching
uses buffers of decoded characters of some adaptive size and then tries to
fill them from the data stream. That also uses text decoding, so it should run
into similar difficulties.
> Parse must behave properly for reading data from TCP sockets
> ------------------------------------------------------------
>
> Key: DAFFODIL-2502
> URL: https://issues.apache.org/jira/browse/DAFFODIL-2502
> Project: Daffodil
> Issue Type: Bug
> Components: API, Back End
> Affects Versions: 3.0.0
> Reporter: Mike Beckerle
> Assignee: Mike Beckerle
> Priority: Major
> Time Spent: 1h 10m
> Remaining Estimate: 0h
>
> Daffodil assumes the input streams are like files - reads are always blocking
> for either 1 or more bytes of data, or End-of-data.
> People want to use Daffodil to read data from TCP/IP sockets. These can
> return 0 bytes from a read because there is no data available, but that does
> NOT mean the end of data. It's just a temporary condition. More data may come
> along.
> Daffodil's InputSourceDataInputStream is wrapped around a regular Java input
> stream, and enables us to support incoming messages which do not conform to
> byte-boundaries.
> The problem is that there's no way for users to wrap an
> InputSourceDataInputStream around a TCP/IP socket, and have it behave
> properly when a read() call temporarily says 0 bytes available.
> Obviously we don't want to sit in a tight loop just retrying the read until
> we get either some bytes or end-of-data.
> The right API here is that if the read() of the underlying Java stream
> returns 0 bytes, a hook function supplied by the API user is called.
> One obvious thing a user can do is put a call to Thread.yield() in the hook.
> (That might even want to be the default behavior if they supply no hook.)
> Then if they have a separate thread parsing the data with Daffodil, that
> thread will at least yield the CPU, i.e., behave politely in a multi-threaded
> world.
> More advanced usage could start a Daffodil parse using co-routines, returning
> control to the caller when the parse must pause due to read() of the Java
> input stream returning 0 bytes.
>
>
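The hook idea in the issue description could look roughly like the following sketch. This is not an existing Daffodil API: HookedInputStream, its Runnable hook parameter, and the StallingStream test double are hypothetical names introduced here. The sketch assumes the underlying stream really can return 0 from read(byte[], int, int) as a "no data yet" signal, as the description says, and it defaults to Thread.yield() when no hook is supplied.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch; none of these names exist in Daffodil.
class ZeroReadHookSketch {

    /** Calls a user-supplied hook whenever the underlying read returns 0 bytes. */
    static final class HookedInputStream extends InputStream {
        private final InputStream in;
        private final Runnable onZeroBytes;

        HookedInputStream(InputStream in, Runnable onZeroBytes) {
            this.in = in;
            // Default behavior when no hook is given: politely yield the CPU.
            this.onZeroBytes = (onZeroBytes != null) ? onZeroBytes : () -> Thread.yield();
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            if (len == 0) return 0;
            while (true) {
                int n = in.read(buf, off, len);
                if (n != 0) return n;    // some bytes, or -1 for end of data
                onZeroBytes.run();       // temporary condition: let the user decide
            }
        }

        @Override
        public int read() throws IOException {
            byte[] one = new byte[1];
            int n = read(one, 0, 1);
            return (n < 0) ? -1 : (one[0] & 0xFF);
        }
    }

    /** Test double: reports 0 bytes for the first few reads, then data, then end. */
    static final class StallingStream extends InputStream {
        private int stalls;
        private final byte[] data;
        private int pos = 0;

        StallingStream(int stalls, byte[] data) {
            this.stalls = stalls;
            this.data = data;
        }

        @Override
        public int read(byte[] buf, int off, int len) {
            if (len == 0) return 0;
            if (stalls > 0) { stalls--; return 0; }   // "no data yet"
            if (pos >= data.length) return -1;        // end of data
            int n = Math.min(len, data.length - pos);
            System.arraycopy(data, pos, buf, off, n);
            pos += n;
            return n;
        }

        @Override
        public int read() {
            byte[] one = new byte[1];
            int n;
            do { n = read(one, 0, 1); } while (n == 0);
            return (n < 0) ? -1 : (one[0] & 0xFF);
        }
    }

    public static void main(String[] args) throws IOException {
        int[] hookCalls = {0};
        InputStream s = new HookedInputStream(
            new StallingStream(3, new byte[] {10, 20}), () -> hookCalls[0]++);
        byte[] buf = new byte[4];
        System.out.println(s.read(buf, 0, 4)); // prints 2
        System.out.println(hookCalls[0]);      // prints 3
    }
}
```

A hook that simply returns gives the tight retry loop the description warns against, so a real hook would yield, sleep, or hand control back to the caller.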