GitHub user paul-rogers commented on a diff in the pull request:
https://github.com/apache/drill/pull/750#discussion_r102650409
--- Diff:
exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/text/compliant/CompliantTextRecordReader.java
---
@@ -118,12 +118,21 @@ public boolean apply(@Nullable SchemaPath path) {
* @param outputMutator Used to create the schema in the output record batch
* @throws ExecutionSetupException
*/
+ @SuppressWarnings("resource")
@Override
public void setup(OperatorContext context, OutputMutator outputMutator)
throws ExecutionSetupException {
oContext = context;
- readBuffer = context.getManagedBuffer(READ_BUFFER);
- whitespaceBuffer = context.getManagedBuffer(WHITE_SPACE_BUFFER);
+ // Note: DO NOT use managed buffers here. They remain in existence
+ // until the fragment is shut down. The buffers here are large.
--- End diff --
The reason is a bit different. The original call allocates a managed
buffer, which is freed only when the fragment context shuts down at the end
of query execution. So if we read many files (5000 in one test case), we
leave 5000 buffers in existence for the whole query.
Instead, we want to take control over buffer lifetime. We allocate a
regular (not managed) buffer ourselves, and then release it when this reader
closes.
That way, instead of accumulating 5000 buffers of 1 MB each, we have only
one 1 MB buffer in existence at any one time.
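
To make the difference concrete, here is a minimal sketch that only counts
bytes and allocates no real memory. It does not use Drill's API:
TrackingAllocator, Buffer, and both peak* methods are hypothetical names
made up for illustration, and the 1 MB size and 5000-file count are taken
from the numbers above.

```java
import java.util.ArrayList;
import java.util.List;

public class BufferLifetimeSketch {

    /** Hypothetical buffer handle; records a size but holds no memory. */
    private static final class Buffer {
        final long size;
        Buffer(long size) { this.size = size; }
    }

    /** Hypothetical allocator that tracks live bytes and the peak seen. */
    private static final class TrackingAllocator {
        private long liveBytes;
        private long peakBytes;

        Buffer allocate(long size) {
            liveBytes += size;
            peakBytes = Math.max(peakBytes, liveBytes);
            return new Buffer(size);
        }

        void release(Buffer buffer) {
            liveBytes -= buffer.size;
        }

        long peakBytes() { return peakBytes; }
    }

    /** Managed style: every reader's buffer lives until the fragment ends. */
    public static long peakWithManagedBuffers(int files, long bufferSize) {
        TrackingAllocator allocator = new TrackingAllocator();
        List<Buffer> heldByFragment = new ArrayList<>();
        for (int i = 0; i < files; i++) {
            // Reader setup: the buffer is registered with the fragment.
            heldByFragment.add(allocator.allocate(bufferSize));
            // Reader close: nothing is released here.
        }
        // Fragment shutdown: all buffers are released at once.
        for (Buffer buffer : heldByFragment) {
            allocator.release(buffer);
        }
        return allocator.peakBytes();
    }

    /** Reader-owned style: each reader releases its buffer when it closes. */
    public static long peakWithReaderOwnedBuffers(int files, long bufferSize) {
        TrackingAllocator allocator = new TrackingAllocator();
        for (int i = 0; i < files; i++) {
            Buffer buffer = allocator.allocate(bufferSize); // reader setup
            allocator.release(buffer);                      // reader close
        }
        return allocator.peakBytes();
    }

    public static void main(String[] args) {
        long oneMb = 1L << 20;
        int files = 5000;
        System.out.println("managed peak MB: "
            + peakWithManagedBuffers(files, oneMb) / oneMb);       // 5000
        System.out.println("reader-owned peak MB: "
            + peakWithReaderOwnedBuffers(files, oneMb) / oneMb);   // 1
    }
}
```

The sketch assumes readers run one after another, as in a scan over many
files. With overlapping readers the reader-owned peak would be one buffer
per open reader.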
Of course, a further refinement would be to allocate the buffer on the
ScanBatch and have all 5000 readers share that same buffer sequentially. But
I was not sure that any performance benefit was worth the cost in extra code
complexity...