[
https://issues.apache.org/jira/browse/DAFFODIL-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16917989#comment-16917989
]
Steve Lawrence commented on DAFFODIL-2194:
------------------------------------------
Yep. And as I think about it more, I don't think the blob pull request will
support parsing blobs larger than the heap size either.
This is because the blob parser doesn't read directly from the underlying
stream, but instead reads from an InputSourceDataInputStream. This abstraction
supports streaming by caching data in case Daffodil might need to backtrack
and reparse the same data. The abstraction periodically discards cached
buckets of data when it determines Daffodil won't backtrack and need those
buckets again. But right now, at best, those buckets won't get discarded until
after the entire blob is read. And at worst, they won't get discarded until
some PoU (point of uncertainty) established before the blob is resolved.
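To make the caching behavior concrete, here is a minimal sketch of the general
bucketing idea. It is illustrative only and does not reflect Daffodil's actual
InputSourceDataInputStream API; the class and method names are hypothetical:
{code:scala}
import java.io.InputStream
import scala.collection.mutable

// Illustrative sketch only: a simplified bucketing reader in the spirit of
// InputSourceDataInputStream. Bucket size and all names are hypothetical.
class BucketingReader(in: InputStream, bucketSize: Int = 1 << 13) {
  private val buckets = mutable.Map[Long, Array[Byte]]() // bucket index -> data
  private var earliestMark: Option[Long] = None          // oldest outstanding PoU
  private var pos: Long = 0

  // Record a point of uncertainty we may need to backtrack to.
  def mark(): Long = { earliestMark = earliestMark.orElse(Some(pos)); pos }

  def readByte(): Int = {
    val idx = pos / bucketSize
    val bucket = buckets.getOrElseUpdate(idx, {
      val b = new Array[Byte](bucketSize)
      in.read(b) // simplified: ignores short reads and EOF
      b
    })
    val byte = bucket((pos % bucketSize).toInt) & 0xff
    pos += 1
    byte
  }

  // Discard buckets that no outstanding mark can ever reach again.
  def releaseUnreachableBuckets(): Unit = {
    val floor = earliestMark.getOrElse(pos) / bucketSize
    buckets.keys.toSeq.filter(_ < floor).foreach(buckets.remove)
  }
}
{code}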
One thing we could do is change the blob parser so that, as the blob is being
read, we attempt to release old cached buckets that can't be backtracked to.
This should help support reading blobs larger than the heap size.
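A rough sketch of that change, building on the hypothetical BucketingReader
above (the byte-at-a-time copy and release interval are arbitrary assumptions):
{code:scala}
import java.io.OutputStream

// Hypothetical sketch: stream the blob out as it is read and periodically
// release cached buckets behind the read position, instead of waiting until
// the entire blob has been consumed.
def parseBlob(reader: BucketingReader, lengthInBytes: Long, out: OutputStream): Unit = {
  var remaining = lengthInBytes
  while (remaining > 0) {
    out.write(reader.readByte()) // simplified: real code would copy chunk-wise
    remaining -= 1
    if (remaining % (1L << 20) == 0)
      reader.releaseUnreachableBuckets() // free buckets no PoU can backtrack to
  }
}
{code}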
However, this doesn't solve the problem of not being able to free cached
buckets due to a PoU before the blob. So maybe a better solution is to set some
limit on the number of bytes/buckets we will keep around, and thus how far
back one can backtrack. As long as the limit is reasonably large, this seems
like an okay restriction--it would be odd to parse many GB of data and only
then realize you took the wrong branch of a choice and need to start over near
the beginning.
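For example, a cap on the cache could look something like this (purely
illustrative; maxCachedBuckets is an assumed tunable, not an existing Daffodil
setting):
{code:scala}
import scala.collection.mutable

// Hypothetical: keep at most maxCachedBuckets buckets cached. Evicting the
// oldest bucket means any PoU that points before it can no longer be honored,
// so a backtrack that far would have to fail with a clear diagnostic.
def enforceBucketLimit(buckets: mutable.SortedMap[Long, Array[Byte]],
                       maxCachedBuckets: Int): Unit = {
  while (buckets.size > maxCachedBuckets) {
    val oldest = buckets.firstKey
    buckets.remove(oldest) // backtracking to data before this bucket is now impossible
  }
}
{code}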
So I think the changes here are:
# Remove 2GB limit on BufferedDataOutputStream. Maybe we insert splits when it
gets full, or maybe we just implement a bucketing output stream, similar to our
bucketing input stream.
# Add a new Blob DataOutputStream and split to one of those when we hit blobs,
and only read the blob data when it is ready to be delivered to
DirectDataOutputStream (sketched after this list). Blob data never ends up in
memory. This removes the heap size limit for unparse.
# Set some maximum number of cached buckets allowed on the
BucketingInputSource. This removes the heap size limit for parse, but does
limit how far one can backtrack.
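For item 2, a rough sketch of what a blob-aware output stream could look like
(names are illustrative, not the actual Daffodil DataOutputStream hierarchy):
{code:scala}
import java.io.{FileInputStream, OutputStream}
import java.nio.file.Path

// Hypothetical sketch: at unparse time we record only the path and length of
// the blob file. The bytes are streamed from disk straight into the direct
// output stream when this buffer is delivered, so blob data never sits in the
// heap.
final class BlobDataOutputStream(blobPath: Path, lengthInBytes: Long) {
  def deliverTo(direct: OutputStream): Unit = {
    val in = new FileInputStream(blobPath.toFile)
    try {
      val chunk = new Array[Byte](64 * 1024)
      var remaining = lengthInBytes
      while (remaining > 0) {
        val n = in.read(chunk, 0, math.min(chunk.length.toLong, remaining).toInt)
        if (n < 0) return // simplified EOF handling
        direct.write(chunk, 0, n)
        remaining -= n
      }
    } finally in.close()
  }
}
{code}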
> buffered data output stream has a chunk limit of 2GB
> ----------------------------------------------------
>
> Key: DAFFODIL-2194
> URL: https://issues.apache.org/jira/browse/DAFFODIL-2194
> Project: Daffodil
> Issue Type: Bug
> Components: Back End
> Reporter: Steve Lawrence
> Assignee: Steve Lawrence
> Priority: Major
> Fix For: 2.5.0
>
>
> A buffered data output stream is backed by a growable ByteArrayOutputStream,
> which can only grow to 2GB in size. So if we ever try to write more than 2GB
> to a buffered output stream during unparse (very possible with large blobs),
> we'll get an OutOfMemoryError.
> One potential solution is to be aware of the size of a ByteArrayOutputStream
> when buffering output and automatically create a split when it gets to 2GB in
> size. This will still require a ton of memory since we're buffering these in
> memory, but we'll at least be able to unparse more than 2GB of continuous
> data.
> Note that we should still be able to unparse more than 2GB of data total, as
> long as there is no single buffer that's more than 2GB.
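The split-when-full idea from the description could look roughly like this
(illustrative only; not the actual BufferedDataOutputStream implementation):
{code:scala}
import java.io.ByteArrayOutputStream
import scala.collection.mutable.ArrayBuffer

// Hypothetical sketch of the split-when-full idea: a single ByteArrayOutputStream
// tops out around 2GB because it is backed by an Int-indexed array, so start a
// new chunk before the current one would overflow. Data is still buffered in
// memory, but no single buffer hits the 2GB ceiling.
final class SplittingBufferedOutput(maxChunkSize: Int = Int.MaxValue - 8) {
  private val chunks = ArrayBuffer(new ByteArrayOutputStream())

  def write(bytes: Array[Byte]): Unit = {
    if (chunks.last.size().toLong + bytes.length > maxChunkSize)
      chunks += new ByteArrayOutputStream() // split before the chunk overflows
    chunks.last.write(bytes)
  }
}
{code}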
--
This message was sent by Atlassian Jira
(v8.3.2#803003)