[
https://issues.apache.org/jira/browse/DAFFODIL-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16917989#comment-16917989
]
Steve Lawrence commented on DAFFODIL-2194:
------------------------------------------
Yep. And as I think about it more, I don't think the blob pull request will
support parsing blobs larger than the heap size either.
This is because the blob parser doesn't read directly from the underlying
stream, but instead reads from an InputSourceDataInputStream. This abstraction
supports streaming by caching data in case Daffodil might need to backtrack
and reparse the same data. The abstraction periodically discards cached
buckets of data when it determines Daffodil won't backtrack and need those
buckets again. But right now, at best, those buckets won't get discarded until
after the entire blob is read. And at worst, they won't get discarded until
some PoU (point of uncertainty) established before the blob is resolved.
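To make the caching behavior concrete, here is a minimal sketch of the general
bucketing idea. It is illustrative only and does not reflect Daffodil's actual
InputSourceDataInputStream API; the class and method names are hypothetical:
{code:scala}
import java.io.InputStream
import scala.collection.mutable

// Illustrative sketch only: a simplified bucketing reader in the spirit of
// InputSourceDataInputStream. Bucket size and all names are hypothetical.
class BucketingReader(in: InputStream, bucketSize: Int = 1 << 13) {
  private val buckets = mutable.Map[Long, Array[Byte]]() // bucket index -> data
  private var earliestMark: Option[Long] = None          // oldest outstanding PoU
  private var pos: Long = 0

  // Record a point of uncertainty we may need to backtrack to.
  def mark(): Long = { earliestMark = earliestMark.orElse(Some(pos)); pos }

  def readByte(): Int = {
    val idx = pos / bucketSize
    val bucket = buckets.getOrElseUpdate(idx, {
      val b = new Array[Byte](bucketSize)
      in.read(b) // simplified: ignores short reads and EOF
      b
    })
    val byte = bucket((pos % bucketSize).toInt) & 0xff
    pos += 1
    byte
  }

  // Discard buckets that no outstanding mark can ever reach again.
  def releaseUnreachableBuckets(): Unit = {
    val floor = earliestMark.getOrElse(pos) / bucketSize
    buckets.keys.toSeq.filter(_ < floor).foreach(buckets.remove)
  }
}
{code}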
One thing we could do is change the blob parser so that, as the blob is being
read, we attempt to release old cached buckets that can't be backtracked to.
This should help support reading blobs larger than the heap size.
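A rough sketch of that change, building on the hypothetical BucketingReader
above (the byte-at-a-time copy and release interval are arbitrary assumptions):
{code:scala}
import java.io.OutputStream

// Hypothetical sketch: stream the blob out as it is read and periodically
// release cached buckets behind the read position, instead of waiting until
// the entire blob has been consumed.
def parseBlob(reader: BucketingReader, lengthInBytes: Long, out: OutputStream): Unit = {
  var remaining = lengthInBytes
  while (remaining > 0) {
    out.write(reader.readByte()) // simplified: real code would copy chunk-wise
    remaining -= 1
    if (remaining % (1L << 20) == 0)
      reader.releaseUnreachableBuckets() // free buckets no PoU can backtrack to
  }
}
{code}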
However, this doesn't solve the problem of not being able to free cached
buckets due to a PoU before the blob. So maybe a better solution is to set some
limit on the number of bytes/buckets we will keep around, and thus how far
back one can backtrack. As long as the limit is reasonably large, this seems
like an okay restriction--it would be odd to parse many GB of data and only
then realize you took the wrong branch of a choice and need to start over near
the beginning.
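For example, a cap on the cache could look something like this (purely
illustrative; maxCachedBuckets is an assumed tunable, not an existing Daffodil
setting):
{code:scala}
import scala.collection.mutable

// Hypothetical: keep at most maxCachedBuckets buckets cached. Evicting the
// oldest bucket means any PoU that points before it can no longer be honored,
// so a backtrack that far would have to fail with a clear diagnostic.
def enforceBucketLimit(buckets: mutable.SortedMap[Long, Array[Byte]],
                       maxCachedBuckets: Int): Unit = {
  while (buckets.size > maxCachedBuckets) {
    val oldest = buckets.firstKey
    buckets.remove(oldest) // backtracking to data before this bucket is now impossible
  }
}
{code}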
So I think the changes here are:
# Remove 2GB limit on BufferedDataOutputStream. Maybe we insert splits when it
gets full, or maybe we just implement a bucketing output stream, similar to our
bucketing input stream.
# Add a new Blob DataOutputStream and split to one of those when we hit blobs,
and only read the blob data when it is ready to be delivered to
DirectDataOutputStream (sketched after this list). Blob data never ends up in
memory. This removes the heap size limit for unparse.
# Set some maximum number of cached buckets allowed on the
BucketingInputSource. This removes the heap size limit for parse, but does
limit how far one can backtrack.
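For item 2, a rough sketch of what a blob-aware output stream could look like
(names are illustrative, not the actual Daffodil DataOutputStream hierarchy):
{code:scala}
import java.io.{FileInputStream, OutputStream}
import java.nio.file.Path

// Hypothetical sketch: at unparse time we record only the path and length of
// the blob file. The bytes are streamed from disk straight into the direct
// output stream when this buffer is delivered, so blob data never sits in the
// heap.
final class BlobDataOutputStream(blobPath: Path, lengthInBytes: Long) {
  def deliverTo(direct: OutputStream): Unit = {
    val in = new FileInputStream(blobPath.toFile)
    try {
      val chunk = new Array[Byte](64 * 1024)
      var remaining = lengthInBytes
      while (remaining > 0) {
        val n = in.read(chunk, 0, math.min(chunk.length.toLong, remaining).toInt)
        if (n < 0) return // simplified EOF handling
        direct.write(chunk, 0, n)
        remaining -= n
      }
    } finally in.close()
  }
}
{code}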
> buffered data output stream has a chunk limit of 2GB
> ----------------------------------------------------
>
> Key: DAFFODIL-2194
> URL: https://issues.apache.org/jira/browse/DAFFODIL-2194
> Project: Daffodil
> Issue Type: Bug
> Components: Back End
> Reporter: Steve Lawrence
> Assignee: Steve Lawrence
> Priority: Major
> Fix For: 2.5.0
>
>
> A buffered data output stream is backed by a growable ByteArrayOutputStream,
> which can only grow to 2GB in size. So if we ever try to write more than 2GB
> to a buffered output stream during unparse (very possible with large blobs),
> we'll get an OutOfMemoryError.
> One potential solution is to be aware of the size of a ByteArrayOutputStream
> when buffering output and automatically create a split when it gets to 2GB in
> size. This will still require a ton of memory since we're buffering these in
> memory, but we'll at least be able to unparse more than 2GB of continuous
> data.
> Note that we should still be able to unparse more than 2GB of data total, as
> long as there is no single buffer that's more than 2GB.
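The split-when-full idea from the description could look roughly like this
(illustrative only; not the actual BufferedDataOutputStream implementation):
{code:scala}
import java.io.ByteArrayOutputStream
import scala.collection.mutable.ArrayBuffer

// Hypothetical sketch of the split-when-full idea: a single ByteArrayOutputStream
// tops out around 2GB because it is backed by an Int-indexed array, so start a
// new chunk before the current one would overflow. Data is still buffered in
// memory, but no single buffer hits the 2GB ceiling.
final class SplittingBufferedOutput(maxChunkSize: Int = Int.MaxValue - 8) {
  private val chunks = ArrayBuffer(new ByteArrayOutputStream())

  def write(bytes: Array[Byte]): Unit = {
    if (chunks.last.size().toLong + bytes.length > maxChunkSize)
      chunks += new ByteArrayOutputStream() // split before the chunk overflows
    chunks.last.write(bytes)
  }
}
{code}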
--
This message was sent by Atlassian Jira
(v8.3.2#803003)