[jira] [Commented] (DRILL-5272) Text file reader is inefficient

Paul Rogers (JIRA) Thu, 16 Feb 2017 21:33:06 -0800

    [ https://issues.apache.org/jira/browse/DRILL-5272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15871207#comment-15871207 ]


Paul Rogers commented on DRILL-5272:
------------------------------------

Implicit vectors have more issues. When transitioning from one record reader to 
the next, we do:

{code}
currentReader = readers.next();
...
currentReader.setup(oContext, mutator);
...
currentReader.allocate(fieldVectorMap);
...
addImplicitVectors();
{code}

Since this is the second pass, the implicit vectors have already been added to 
the container, so we allocate them in the call to {{allocate}} above. Then, in 
{{addImplicitVectors}}, we clear them, discard the map of vectors, build a new 
map, and allocate them all again. The net result is a double allocation of the 
implicit vectors.

Further, if the implicit vectors are the same for all readers, there is no 
reason to keep adding and deleting exactly the same vectors.
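
To make the cost concrete, here is a minimal sketch that only counts allocations. It is a toy model of the flow described above, not Drill code: {{CountingVector}}, {{rebuildPerReader}} and {{reuseAcrossReaders}} are hypothetical names, and the sketch assumes every reader exposes the same implicit columns.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ImplicitVectorReuse {
  /** Hypothetical stand-in for a value vector; it only counts allocations. */
  static final class CountingVector {
    int allocations;
    void allocate() { allocations++; }
    void clear() { /* would release the buffer in a real vector */ }
  }

  /** Models the described flow: allocate, then clear, rebuild the map, allocate again. */
  static int rebuildPerReader(int readerCount, List<String> implicitColumns) {
    int total = 0;
    Map<String, CountingVector> implicit = new LinkedHashMap<>();
    for (int reader = 0; reader < readerCount; reader++) {
      // Stands in for allocate(fieldVectorMap): vectors left over from the
      // previous reader are still in the container and get allocated.
      for (CountingVector v : implicit.values()) {
        v.allocate();
        total++;
      }
      // Stands in for addImplicitVectors(): clear, discard the map,
      // build a new map, and allocate every vector again.
      for (CountingVector v : implicit.values()) {
        v.clear();
      }
      implicit = new LinkedHashMap<>();
      for (String name : implicitColumns) {
        CountingVector v = new CountingVector();
        v.allocate();
        total++;
        implicit.put(name, v);
      }
    }
    return total;
  }

  /** Alternative: create the implicit vectors once and allocate once per reader. */
  static int reuseAcrossReaders(int readerCount, List<String> implicitColumns) {
    int total = 0;
    Map<String, CountingVector> implicit = new LinkedHashMap<>();
    for (int reader = 0; reader < readerCount; reader++) {
      if (implicit.isEmpty()) {
        for (String name : implicitColumns) {
          implicit.put(name, new CountingVector());
        }
      }
      for (CountingVector v : implicit.values()) {
        v.allocate();
        total++;
      }
    }
    return total;
  }
}
```

In this model, every reader after the first pays for two allocations per implicit vector in {{rebuildPerReader}}, while {{reuseAcrossReaders}} pays for one. Whether the vectors can really be kept across readers depends on the assumption that the implicit columns do not change between readers.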

> Text file reader is inefficient
> -------------------------------
>
>                 Key: DRILL-5272
>                 URL: https://issues.apache.org/jira/browse/DRILL-5272
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.10
>            Reporter: Paul Rogers
>            Priority: Minor
>
> Findings from inspection of the ScanBatch and CompliantTextReader:
> Every batch holds about five implicit vectors. These are repeated for every 
> row, which can greatly increase incoming data size.
> When populating the vectors, the allocation starts at 8 bytes and grows to 16 
> bytes, causing a (slow) memory reallocation for every vector:
> {code}
> [org.apache.drill.exec.vector.UInt4Vector] - 
> Reallocating vector [$offsets$(UINT4:REQUIRED)]. # of bytes: [8] -> [16]
> {code}
> Whether due to the above or to a different issue, memory grows in the scan 
> batch:
> {code}
> Entry Memory: 6,456,448
> Exit Memory: 7,636,312
> Entry Memory: 7570560
> Exit Memory: 8750424
> ...
> {code}
> Evidently the implicit vectors are added in response to a "SELECT *" query. 
> Perhaps provide them only if actually requested.
> The vectors are populated for every row, making a copy of a potentially long 
> file name and path for every record. Since the values are common to every 
> record, perhaps we can use the same data copy for each, but have the offset 
> vector for each record just point to the single copy.
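
The single-copy idea in the last paragraph of the description can be sketched as follows. This is a hypothetical layout, not Drill's vector API: {{SharedValueColumn}} and its methods are invented for illustration. It assumes a layout in which each row stores only a start offset and all rows share one value length, which differs from a layout where each value's length is derived from consecutive offsets.

```java
import java.nio.charset.StandardCharsets;

final class SharedValueColumn {
  private final byte[] data;   // the single shared copy of the value
  private final int[] starts;  // per-row start offsets, all pointing at offset 0
  private final int length;    // length shared by every row

  SharedValueColumn(String value, int rowCount) {
    this.data = value.getBytes(StandardCharsets.UTF_8);
    this.starts = new int[rowCount]; // zero-filled: every row points at the same bytes
    this.length = data.length;
  }

  /** Reads the value for a row by following its offset into the shared buffer. */
  String get(int row) {
    return new String(data, starts[row], length, StandardCharsets.UTF_8);
  }

  /** Bytes used by this layout: one data copy plus a 4-byte offset per row. */
  long sharedBytes() {
    return data.length + 4L * starts.length;
  }

  /** Bytes used when the value is copied per row, with rowCount + 1 4-byte offsets. */
  static long replicatedBytes(String value, int rowCount) {
    long valueBytes = value.getBytes(StandardCharsets.UTF_8).length;
    return rowCount * valueBytes + 4L * (rowCount + 1);
  }
}
```

Under these assumptions, the data portion of the shared layout does not grow with the row count, while the replicated layout grows by the full value length for every row. The per-row offsets remain in both cases, so the saving is largest for long file names and paths.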



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
