Does that solve the issue? Not all data sources are files. A file-centric path 
could be used as an optimization, but we also need to behave correctly on 
non-file data sources.

Also, if we go this route, we need to be careful about what we consider to be a 
"file". Anonymous and named pipes are commonly used kinds of files that do not 
provide a file length.
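
A minimal sketch of the kind of guard a file-centric path would need (the helper 
name is hypothetical, not existing Daffodil API): treat a source as 
"length-known" only when it is a regular file, since a named pipe passes a naive 
"is this a file?" check but reports a meaningless size.

  import java.nio.file.{Files, Path}

  // Hypothetical helper: Some(length) only for regular, seekable files.
  // Named pipes, devices, and process streams return None and must use
  // the streaming (length-unknown) code path instead.
  def knownFileLength(path: Path): Option[Long] =
    if (Files.isRegularFile(path)) Some(Files.size(path))
    else None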
________________________________
From: Beckerle, Mike <mbecke...@owlcyberdefense.com>
Sent: Monday, September 14, 2020 10:40 AM
To: dev@daffodil.apache.org <dev@daffodil.apache.org>
Subject: Re: Large dfdl:length values on complex types

Is it possible to create a separate i/o path that is file centric, so that 
isDefinedForLength can answer without actually accessing any data at all?

The notion that there is no tunable access suggests we do need an API change to 
a factory pattern so that the I/O abstractions that need them have access to 
tunables. Efficient I/O is just one of those areas where tuning is generally 
needed.
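
A rough sketch of the factory shape this suggests (all names hypothetical, not 
the current Daffodil API): the factory is constructed with the tunables, so 
every input source it hands out can size its cache from them rather than from a 
hard-coded constant.

  // Hypothetical stand-ins for the tunables and input-source abstractions.
  final case class Tunables(maxCacheSizeInBytes: Long = 256L * 1024 * 1024)

  trait InputSource { def maxCacheSizeInBytes: Long }

  final class InputSourceFactory(tunables: Tunables) {
    def fromStream(in: java.io.InputStream): InputSource =
      new InputSource {
        // The cache bound comes from the tunables, not a compile-time constant.
        val maxCacheSizeInBytes: Long = tunables.maxCacheSizeInBytes
      }
  }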

...mikeb


________________________________
From: Steve Lawrence <slawre...@apache.org>
Sent: Monday, September 14, 2020 8:32:34 AM
To: dev@daffodil.apache.org <dev@daffodil.apache.org>
Subject: Re: Large dfdl:length values on complex types

> The issue isn't so much backtracking, as fortracking. At the moment,
> we need to read in all of dfdl:length before we begin parsing the
> first time, so we get a problem even if we never attempt to
> backtrack.

This is correct. I think this is the core issue. As best I can tell, the
right fix is the removal (or very selective use) of isDefinedForLength, along
with parsers handling a new end-of-stream error. I'm not sure how big a change
that is, or whether there's another approach.

I've created DAFFODIL-2395 to track this issue. I think this is probably
a blocker for the next release.

Note that this isn't actually an issue for blobs. The BlobLengthParser reads
blobs in chunks with a default size of 4KB, and it only calls
isDefinedForLength for individual chunks, so it never comes close to the 256MB
limit. Large blobs are never an issue. The only problem is with complex or
simple types that have a specified length larger than 256MB. A simple type
that large seems unlikely, but it is very possible for a complex type.
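
For contrast, a simplified sketch of the chunk-at-a-time shape the blob path
uses (not the real BlobLengthParser, just the idea): each iteration only asks
the input layer about one small chunk, so the 256MB cache bound is never
stressed.

  // Hypothetical sketch of chunked blob copying, roughly the BlobLengthParser shape.
  def copyBlob(in: java.io.InputStream, out: java.io.OutputStream,
               totalBytes: Long, chunkSize: Int = 4096): Unit = {
    var remaining = totalBytes
    val buf = new Array[Byte](chunkSize)
    while (remaining > 0) {
      val want = math.min(chunkSize.toLong, remaining).toInt
      val got = in.read(buf, 0, want) // only one chunk's worth is ever requested
      if (got < 0) throw new java.io.EOFException("end of data inside blob")
      out.write(buf, 0, got)
      remaining -= got
    }
  }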

Also note that the upcoming streaming changes won't help with this
issue. That is really just about outputting the infoset while we parse
rather than waiting until the end of parse, as well as the use of a SAX
API. It won't affect how we handle the data at all.

On 9/11/20 3:04 PM, Sloane, Brandon wrote:
> We should be careful about making the cache too big due to the memory 
> overhead. I'm not sure where we are with support for streaming parses (beyond 
> the --stream option in the CLI, which assumes multiple distinct documents 
> that have been concatenated), so in most cases, a multiple of this overhead 
> would be needed by the infoset itself anyway.
>
> The reason I bring up blobs is that they bypass the requirement to have a 
> large infoset. Imagine a 4GB document, consisting of a length prefix, 3.99GB 
> worth of binary blobs, and metadata. There is no good reason parsing this 
> document should require anywhere near 4GB of memory overhead.
>
>> The DFDL spec allows implementations to have limits on backtracking 
>> capability, it just requires us to document them.
>
> The issue isn't so much backtracking, as fortracking. At the moment, we need 
> to read in all of dfdl:length before we begin parsing the first time, so we 
> get a problem even if we never attempt to backtrack.
>
>
>
> ________________________________
> From: Beckerle, Mike <mbecke...@owlcyberdefense.com>
> Sent: Friday, September 11, 2020 2:51 PM
> To: dev@daffodil.apache.org <dev@daffodil.apache.org>
> Subject: Re: Large dfdl:length values on complex types
>
> We have tunable limits people have to sometimes enlarge. E.g. max size of a 
> regex match is limited.
>
> The DFDL spec allows implementations to have limits on backtracking 
> capability, it just requires us to document them.
>
> That said. We really don't want to have this sort of limit hard coded.
>
> A non-streamable file format that just stores the total length in a header 
> record is pretty common, and people will use those for really big pieces of 
> data these days.
>
> Can this 256MB just be a tunable constant that people can enlarge?
>
> From a security perspective, a stored length field is always something that 
> should be sanity checked - if clobbered, it could contain -1 and that could 
> be interpreted as maxInt or something, resulting in a denial-of-service 
> attack. It would be better to get a processing error in that case.
>
> Can we just check complex type lengths against a tunable limit, and error out 
> before we even try to read the data in?
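>
> A minimal sketch of that sanity check (hypothetical names; the limit would
> come from a tunable): validate the stored length before any read is
> attempted, and turn an out-of-range value into a processing error.
>
>   // Hypothetical: maxSpecifiedLengthInBytes would be a tunable.
>   def checkStoredLength(length: Long, maxSpecifiedLengthInBytes: Long): Either[String, Long] =
>     if (length < 0)
>       Left(s"stored length $length is negative; data is likely corrupt")
>     else if (length > maxSpecifiedLengthInBytes)
>       Left(s"stored length $length exceeds the tunable limit of $maxSpecifiedLengthInBytes")
>     else
>       Right(length)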
>
> ________________________________
> From: Sloane, Brandon <bslo...@owlcyberdefense.com>
> Sent: Friday, September 11, 2020 1:53 PM
> To: dev@daffodil.apache.org <dev@daffodil.apache.org>
> Subject: Re: Large dfdl:length values on complex types
>
> 256MB doesn't strike me as that big. We haven't had filesize limits measured 
> in GB since FAT32 and ext1 (depending on settings, even ext4 can have a limit 
> as low as 16GB). All it takes is for such a file to have a length prefix, and 
> we can very easily run into the limit. Combined with our support for large 
> binary blobs, it is not unreasonable that someone would want to use DFDL on 
> such a format.
>
> ________________________________
> From: Beckerle, Mike <mbecke...@owlcyberdefense.com>
> Sent: Friday, September 11, 2020 1:24 PM
> To: dev@daffodil.apache.org <dev@daffodil.apache.org>
> Subject: Re: Large dfdl:length values on complex types
>
> Maybe a silly question, but why don't we just hit a tunable size limit 
> immediately before we "try to read" that data? 256MB is very big.
>
> Is this a real format, or a test case designed to push the boundaries?
>
>
> ________________________________
> From: Steve Lawrence <slawre...@apache.org>
> Sent: Friday, September 11, 2020 1:14 PM
> To: dev@daffodil.apache.org <dev@daffodil.apache.org>
> Subject: Large dfdl:length values on complex types
>
> I recently came across an issue where we have something like this:
>
>   <xs:element name="length" type="xs:int" ... />
>   <xs:element name="data"
>     dfdl:lengthKind="explicit" dfdl:length="{ ../length }">
>     <xs:complexType>
>       <xs:sequence>
>         <xs:element name="field1" ... />
>         <xs:element name="field2" ... />
>         ...
>         <xs:element name="fieldN" ... />
>       </xs:sequence>
>     </xs:complexType>
>   </xs:element>
>
> So we have a length element and a complex data field that uses this
> length, and the data field is made up of a bunch of fields.
>
> The issue I came across is related to how we cache bytes in buckets for
> backtracking. As we fill up buckets, we currently limit the total cache size
> of the buckets to 256MB. So if someone ever parses more than 256MB of data
> and then tries to backtrack past that, we error. The idea is that we don't
> want to keep an unbounded cache for potential backtracking, and people should
> have realized that they went down the wrong branch much earlier.
>
> However, a problem occurs with complex types that have a large specified
> length like the one above. When we have a complex type whose length is the
> expression ../length, then before trying to parse any of its fields we read
> that many bytes into our cache buckets to confirm that that number of bytes
> exists. The problem occurs when length is more than 256MB. In that case, we
> read length bytes and start evicting buckets from the cache once we pass
> 256MB.
>
> But once that succeeds and we have read length bytes, we then try to start
> parsing the fields within the complex type. By then we've already removed
> those early cached bytes, and so we fail with an unhelpful backtracking
> exception.
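>
> A toy illustration of that failure mode (not the real bucket code): a bounded
> cache that drops its oldest buckets once the cap is hit makes any later seek
> back into those dropped buckets impossible.
>
>   // Hypothetical, simplified bucket cache with a byte cap.
>   import scala.collection.mutable
>
>   final class BucketCache(maxBytes: Long) {
>     private val buckets = mutable.Queue[Array[Byte]]()
>     private var cachedBytes = 0L
>     private var firstRetainedOffset = 0L
>
>     def append(bucket: Array[Byte]): Unit = {
>       buckets.enqueue(bucket)
>       cachedBytes += bucket.length
>       while (cachedBytes > maxBytes) {
>         val dropped = buckets.dequeue() // oldest data is gone for good
>         cachedBytes -= dropped.length
>         firstRetainedOffset += dropped.length
>       }
>     }
>
>     // Backtracking to anything before firstRetainedOffset now fails.
>     def canSeekTo(offset: Long): Boolean = offset >= firstRetainedOffset
>   }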
>
> I'm not sure of the right solution here.
>
> Perhaps we shouldn't be throwing away these bytes when dealing with
> complex lengths?
>
> Or perhaps we shouldn't even be trying to determine whether that many bytes
> are available when we have a specified length. Instead, maybe we should just
> set the bit limit to make sure we don't parse more than that? Then, if
> something eventually tries to read a byte, there aren't enough, and we hit
> that limit, only then do we fail. This feels like the right solution, but I
> wanted to start a discussion to see if maybe there's a reason we try to read
> the full length, or maybe there's another alternative.
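>
> A sketch of that lazier approach (hypothetical names, not actual parser code,
> assuming Java 11+ for readNBytes): bound reads by the specified length, and
> only raise a processing error when a read actually runs past the end of the
> data.
>
>   // Hypothetical: fail only when a read crosses the limit or hits real EOF.
>   final class BoundedReader(in: java.io.InputStream, limitInBytes: Long) {
>     private var consumed = 0L
>     def read(n: Int): Either[String, Array[Byte]] =
>       if (consumed + n > limitInBytes)
>         Left(s"read of $n bytes would pass the specified length of $limitInBytes bytes")
>       else {
>         val buf = in.readNBytes(n) // Java 11+
>         consumed += buf.length
>         if (buf.length < n) Left("end of data reached before the specified length")
>         else Right(buf)
>       }
>   }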
>
