[ https://issues.apache.org/jira/browse/DRILL-5272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Paul Rogers resolved DRILL-5272.
--------------------------------
Resolution: Fixed
This issue was fixed when converting the text readers to use the result set
loader framework.
> Text file reader is inefficient
> -------------------------------
>
> Key: DRILL-5272
> URL: https://issues.apache.org/jira/browse/DRILL-5272
> Project: Apache Drill
> Issue Type: Bug
> Affects Versions: 1.10.0
> Reporter: Paul Rogers
> Assignee: Paul Rogers
> Priority: Minor
>
> The following observations come from inspecting ScanBatch and
> CompliantTextReader.
> Every batch holds about five implicit vectors. These are repeated for every
> row, which can greatly increase incoming data size.
> When populating the vectors, the allocation starts at 8 bytes and grows to 16
> bytes, causing a (slow) memory reallocation for every vector:
> {code}
> [org.apache.drill.exec.vector.UInt4Vector] -
> Reallocating vector [$offsets$(UINT4:REQUIRED)]. # of bytes: [8] -> [16]
> {code}
> Whether due to the above or to a different issue, memory grows in
> the scan batch:
> {code}
> Entry Memory: 6,456,448
> Exit Memory: 7,636,312
> Entry Memory: 7570560
> Exit Memory: 8750424
> ...
> {code}
> Evidently the implicit vectors are added in response to a "SELECT *" query.
> Perhaps they should be provided only when actually requested.
> The vectors are populated for every row, making a copy of a potentially long
> file name and path for every record. Since the values are common to every
> record, perhaps we can use the same data copy for each, but have the offset
> vector for each record just point to the single copy.
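
To illustrate the single-copy idea from the description above, here is a minimal standalone sketch in plain Java. It does not use Drill's vector classes, and it is not how the fix was implemented. The class SharedValueColumn, its fields, and the copiedBytes helper are hypothetical names invented for this example. The layout is also an assumption: as far as I understand, Drill's variable-width vectors delimit value i with two consecutive offset entries, so the values must sit back to back, whereas this sketch keeps one start offset per row plus a single shared length so that every row can refer to the same bytes.

```java
import java.nio.charset.StandardCharsets;

public class SharedValueColumnSketch {

    /**
     * Hypothetical column in which every row refers to one shared copy of
     * the value. This is NOT Drill's VarChar vector layout.
     */
    static final class SharedValueColumn {
        private final byte[] data;   // the single copy of the value
        private final int[] starts;  // per-row start offset into data
        private final int length;    // length shared by every row

        SharedValueColumn(String value, int rowCount) {
            this.data = value.getBytes(StandardCharsets.UTF_8);
            this.length = data.length;
            // A new int[] is zero-filled, so every row starts at offset 0,
            // that is, at the one shared copy.
            this.starts = new int[rowCount];
        }

        /** Decodes the value for one row from the shared bytes. */
        String get(int row) {
            return new String(data, starts[row], length, StandardCharsets.UTF_8);
        }

        int rowCount() {
            return starts.length;
        }

        /** Bytes of value data held, independent of the row count. */
        long dataBytes() {
            return data.length;
        }
    }

    /** Value bytes needed if the value were copied once per row instead. */
    static long copiedBytes(String value, int rowCount) {
        return (long) value.getBytes(StandardCharsets.UTF_8).length * rowCount;
    }

    public static void main(String[] args) {
        String path = "/data/logs/part-0001.csv"; // made-up file path
        int rows = 1000;

        SharedValueColumn column = new SharedValueColumn(path, rows);
        System.out.println("value at last row: " + column.get(rows - 1));
        System.out.println("shared value bytes: " + column.dataBytes());
        System.out.println("per-row copy bytes: " + copiedBytes(path, rows));
    }
}
```

The sketch still spends one offset entry per row. A column known to be constant for the whole batch could drop the offsets as well and store only the value and the row count.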