[ https://issues.apache.org/jira/browse/DRILL-5272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Paul Rogers resolved DRILL-5272.
--------------------------------
    Resolution: Fixed

This issue was fixed when converting the text readers to use the result set loader framework.

> Text file reader is inefficient
> -------------------------------
>
>                 Key: DRILL-5272
>                 URL: https://issues.apache.org/jira/browse/DRILL-5272
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.10.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>            Priority: Minor
>
> From inspection of the ScanBatch and CompliantTextReader.
> Every batch holds about five implicit vectors. These are repeated for every
> row, which can greatly increase incoming data size.
> When populating the vectors, the allocation starts at 8 bytes and grows to 16
> bytes, causing a (slow) memory reallocation for every vector:
> {code}
> [org.apache.drill.exec.vector.UInt4Vector] - Reallocating vector [$offsets$(UINT4:REQUIRED)]. # of bytes: [8] -> [16]
> {code}
> Whether due to the above or to a different issue, memory grows in the scan
> batch:
> {code}
> Entry Memory: 6,456,448
> Exit Memory: 7,636,312
> Entry Memory: 7570560
> Exit Memory: 8750424
> ...
> {code}
> Evidently the implicit vectors are added in response to a "SELECT *" query.
> Perhaps provide them only if actually requested.
> The vectors are populated for every row, making a copy of a potentially long
> file name and path for every record. Since the values are common to every
> record, perhaps we can use the same data copy for each, but have the offset
> vector for each record just point to the single copy.
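
For illustration only, here is a minimal Java sketch of the two costs described in the issue: repeated doubling from a tiny initial allocation, and copying the same file name into every row. It does not use Drill's vector classes and does not show how the result set loader fixed the problem. Every name in it (ImplicitColumnSketch, ConstantColumn, doublingsNeeded, the sample path and batch size) is hypothetical and chosen only for this example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: none of these names are Drill APIs.
final class ImplicitColumnSketch {

    /** Counts the doubling reallocations needed to grow a buffer from
     *  initialBytes until it holds at least requiredBytes. */
    static int doublingsNeeded(long initialBytes, long requiredBytes) {
        if (initialBytes <= 0) {
            throw new IllegalArgumentException("initialBytes must be positive");
        }
        int count = 0;
        long capacity = initialBytes;
        while (capacity < requiredBytes) {
            capacity *= 2;   // each doubling stands for one reallocation
            count++;
        }
        return count;
    }

    /** Value bytes consumed when the constant is copied into every row. */
    static long perRowCopyBytes(String value, int rowCount) {
        return (long) value.getBytes(StandardCharsets.UTF_8).length * rowCount;
    }

    /** A constant column: the bytes are stored once and every row reads
     *  that single copy, as the issue suggests. */
    static final class ConstantColumn {
        private final byte[] data;
        private final int rowCount;

        ConstantColumn(String value, int rowCount) {
            this.data = value.getBytes(StandardCharsets.UTF_8);
            this.rowCount = rowCount;
        }

        String get(int row) {
            if (row < 0 || row >= rowCount) {
                throw new IndexOutOfBoundsException("row " + row);
            }
            return new String(data, StandardCharsets.UTF_8);
        }

        int dataBytes() {
            return data.length;
        }
    }

    public static void main(String[] args) {
        String path = "/data/logs/2017/02/example.csv"; // made-up file path
        int rows = 4096;                                // made-up batch size

        // Offsets for a variable-width column: assume 4 bytes per entry,
        // with one more entry than rows.
        long offsetBytes = 4L * (rows + 1);
        System.out.println("Doublings from 8 bytes: "
            + doublingsNeeded(8, offsetBytes));

        ConstantColumn shared = new ConstantColumn(path, rows);
        System.out.println("Per-row copies, value bytes: "
            + perRowCopyBytes(path, rows));
        System.out.println("Single shared copy, value bytes: "
            + shared.dataBytes());
    }
}
```

The sketch suggests two mitigations: size the allocation for the expected row count up front rather than starting at 8 bytes, and keep one copy of a value that is identical for every record. Whether a shared copy can be expressed in a given vector format depends on how that format defines its offsets, so treat ConstantColumn as a conceptual layout only.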