[ 
https://issues.apache.org/jira/browse/DAFFODIL-2502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17331679#comment-17331679
 ] 

Mike Beckerle commented on DAFFODIL-2502:
-----------------------------------------

The DFA that recognizes delimiters calls Registers.advance(), which peeks 2 
characters forward into the data stream.

Those peek operations fetch characters into a decodedChars CharBuffer, which 
defaults to size 8.

(This magic number 8 lives in class InputSourceDataInputStreamCharIteratorState.)

Why 8? Likely an efficiency argument. This is not data being cached by the 
Bucket algorithm, which happens underneath this layer; rather, it is an attempt 
to call decode less often when converting bytes to characters. 

The smallest this allocation can be is 2, because we peek ahead 2 characters. 
I set it to 2, and the number of additional characters needed to unblock the 
read dropped from 7 to just 2. 
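A hedged sketch of the effect described above (the class and method names here are illustrative, not Daffodil's): the capacity of the CharBuffer caps how many characters one decode call can produce, so a capacity of 2 is enough for the 2-character peek, while a capacity of 8 invites filling 8.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

// Illustrative stand-in for the decodedChars buffer: decode whatever bytes
// are available into a fixed-capacity CharBuffer. The buffer's capacity,
// not the data, limits how many characters one decode pass yields.
public class DecodedCharsSketch {
    static int decodedCount(byte[] availableBytes, int capacity) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        CharBuffer decodedChars = CharBuffer.allocate(capacity); // 8 today; 2 suffices for the peek
        decoder.decode(ByteBuffer.wrap(availableBytes), decodedChars, false);
        decodedChars.flip();
        return decodedChars.remaining();
    }
}
```

With 8 bytes of ASCII available, capacity 2 yields 2 characters and capacity 8 yields 8; a fill loop sized at 8 therefore waits for more input than the 2-character peek actually needs.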

This whole algorithm should be re-examined to determine whether we really need 
to peek ahead 2 characters. It seems to me we're peeking ahead before we 
need to. 

In any case, this is tracked as a separate bug ticket (DAFFODIL-2504), because 
the concern here is only simple types with representation text WITHOUT a 
specified length.

Similar issues will arise for lengthKind 'pattern', because regex matching uses 
buffers of decoded characters of some adaptive size and tries to fill them from 
the data stream. That path also uses text decoding, so it should run into 
similar difficulties.
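The adaptive-size buffer behavior mentioned above could be sketched like this (a hypothetical helper, not Daffodil code): grow the match buffer geometrically, preserving already-decoded characters, before refilling from the stream. Each growth step implies another round of decoding, which is where the same blocking concern appears.

```java
import java.nio.CharBuffer;

// Hypothetical sketch of an "adapting size" match buffer: double the
// capacity and carry over the characters decoded so far, so the regex
// can be retried against a longer window after the next stream fill.
public class AdaptiveMatchBuffer {
    static CharBuffer grow(CharBuffer current) {
        CharBuffer bigger = CharBuffer.allocate(current.capacity() * 2);
        current.flip();       // switch the old buffer to read mode
        bigger.put(current);  // copy the already-decoded characters
        return bigger;        // positioned ready for more decoded input
    }
}
```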

 

> Parse must behave properly for reading data from TCP sockets
> ------------------------------------------------------------
>
>                 Key: DAFFODIL-2502
>                 URL: https://issues.apache.org/jira/browse/DAFFODIL-2502
>             Project: Daffodil
>          Issue Type: Bug
>          Components: API, Back End
>    Affects Versions: 3.0.0
>            Reporter: Mike Beckerle
>            Assignee: Mike Beckerle
>            Priority: Major
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Daffodil assumes input streams behave like files: a read always blocks until 
> it gets 1 or more bytes of data, or End-of-data.
> People want to use Daffodil to read data from TCP/IP sockets. These can 
> return 0 bytes from a read because there is no data available, but that does 
> NOT mean the end of data. It's just a temporary condition. More data may come 
> along.
> Daffodil's InputSourceDataInputStream is wrapped around a regular Java input 
> stream, and enables us to support incoming messages which do not conform to 
> byte-boundaries.
> The problem is that there's no way for users to wrap an 
> InputSourceDataInputStream around a TCP/IP socket, and have it behave 
> properly when a read() call temporarily says 0 bytes available.
> Obviously we don't want to sit in a tight loop just retrying the read until 
> we get either some bytes or end-of-data.
> The right API here: if the read() of the underlying Java stream returns 0 
> bytes, a hook function supplied by the API user is called.
> One obvious thing a user can do is put a call to Thread.yield() in the hook. 
> (That might even want to be the default behavior if they supply no hook.) 
> Then if they have a separate thread parsing the data with daffodil, that 
> thread will at least yield the CPU, i.e., behave politely in a multi-threaded 
> world.
> More advanced usage could start a Daffodil parse using co-routines, returning 
> control to the caller when the parse must pause due to read() of the Java 
> input stream returning 0 bytes.
>  
>  
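The hook described in the issue above could be sketched as follows (a hypothetical wrapper, not the proposed Daffodil API): an InputStream that invokes a user-supplied callback whenever the underlying read() reports 0 bytes, defaulting to Thread.yield() when no hook is given.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch: wrap a stream whose read() may return 0 bytes
// (e.g. one backed by a TCP socket) and call a user hook on each
// zero-byte read, instead of spinning in a tight retry loop.
public class ZeroReadHookInputStream extends InputStream {
    private final InputStream underlying;
    private final Runnable zeroReadHook; // e.g. Thread::yield, as the issue suggests

    public ZeroReadHookInputStream(InputStream underlying, Runnable zeroReadHook) {
        this.underlying = underlying;
        this.zeroReadHook = (zeroReadHook != null) ? zeroReadHook : Thread::yield;
    }

    @Override
    public int read() throws IOException {
        byte[] one = new byte[1];
        int n = read(one, 0, 1);
        return (n < 0) ? -1 : (one[0] & 0xff);
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        int n;
        // 0 bytes is a temporary condition, not end-of-data: invoke the
        // hook and retry. -1 (end-of-data) passes through unchanged.
        while ((n = underlying.read(b, off, len)) == 0) {
            zeroReadHook.run(); // let the caller decide how to wait
        }
        return n;
    }
}
```

A coroutine-based variant would replace the hook with a suspension point that returns control to the caller until more bytes arrive.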



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
