[
https://issues.apache.org/jira/browse/DRILL-5272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15871199#comment-15871199
]
Paul Rogers commented on DRILL-5272:
------------------------------------
The text reader allocates new vectors multiple times. While doing so, it calls
allocateNew on each vector. As it turns out, for one use case, it allocates a
new vector the same size as the existing one. The vector allocation code should
handle this: if we ask to reallocate a vector with a size the same as the
current size, just skip the reallocation and reuse the existing memory.
Better, of course, is for the scan batch to not reallocate vectors that are
already allocated.
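
To make the suggestion concrete, here is a minimal sketch of the "skip the reallocation when the size is unchanged" check. It is not Drill's vector code: ReusableBuffer, its allocationCount counter, and the use of java.nio.ByteBuffer as backing memory are all assumptions made for illustration.

```java
import java.nio.ByteBuffer;

// Hypothetical stand-in for a vector's backing buffer; not Drill's actual class.
final class ReusableBuffer {
    private ByteBuffer data;          // null until the first allocation
    private int allocationCount = 0;  // counts real allocations, for illustration only

    /** Provides a buffer of sizeInBytes, reusing the current memory if the size matches. */
    void allocateNew(int sizeInBytes) {
        if (sizeInBytes < 0) {
            throw new IllegalArgumentException("size must be non-negative");
        }
        if (data != null && data.capacity() == sizeInBytes) {
            // Same size as before: reset position and limit, keep the memory.
            // Note that clear() does not zero the old contents.
            data.clear();
            return;
        }
        data = ByteBuffer.allocate(sizeInBytes);
        allocationCount++;
    }

    int capacity() {
        return data == null ? 0 : data.capacity();
    }

    int allocationCount() {
        return allocationCount;
    }
}
```

In this sketch, calling allocateNew(16) twice in a row performs one real allocation; only a request for a different size triggers a second one.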
> Text file reader is inefficient
> -------------------------------
>
> Key: DRILL-5272
> URL: https://issues.apache.org/jira/browse/DRILL-5272
> Project: Apache Drill
> Issue Type: Bug
> Affects Versions: 1.10
> Reporter: Paul Rogers
> Priority: Minor
>
> Observations from inspecting ScanBatch and CompliantTextReader:
> Every batch holds about five implicit vectors. These are repeated for every
> row, which can greatly increase incoming data size.
> When populating the vectors, the allocation starts at 8 bytes and grows to 16
> bytes, causing a (slow) memory reallocation for every vector:
> {code}
> [org.apache.drill.exec.vector.UInt4Vector] -
> Reallocating vector [$offsets$(UINT4:REQUIRED)]. # of bytes: [8] -> [16]
> {code}
> Whether due to the above or to a different issue, memory grows in the scan
> batch:
> {code}
> Entry Memory: 6,456,448
> Exit Memory: 7,636,312
> Entry Memory: 7570560
> Exit Memory: 8750424
> ...
> {code}
> Evidently the implicit vectors are added in response to a "SELECT *" query.
> Perhaps provide them only if actually requested.
> The vectors are populated for every row, making a copy of a potentially long
> file name and path for every record. Since the values are common to every
> record, perhaps we can use the same data copy for each, but have the offset
> vector for each record just point to the single copy.