Tim Armstrong created IMPALA-5307:
-------------------------------------

             Summary: Consider always copying-out Disk I/O buffers instead of attaching to RowBatches
                 Key: IMPALA-5307
                 URL: https://issues.apache.org/jira/browse/IMPALA-5307
             Project: IMPALA
          Issue Type: Sub-task
          Components: Backend
            Reporter: Tim Armstrong
            Assignee: Tim Armstrong


IMPALA-4835 would be greatly simplified if we did not have to attach disk I/O 
buffers to RowBatches and handle the resulting complexity.

Disk I/O buffers currently need to be attached to RowBatches if the row batches 
directly reference var-len data in the buffer. This can occur only when all of 
the following hold:
* The column being read contains strings
* The string data is not dictionary encoded in Parquet (since we copy out the 
dictionary data in Parquet)
* The string data is not compressed with a general-purpose compression 
algorithm (gzip, Snappy, etc.).

This includes the following cases: plain-encoded strings in uncompressed 
Parquet; any strings in uncompressed text, RCFile, Avro, or sequence file.
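
The conditions above can be sketched as a single predicate. This is only an 
illustration: the enums, the ColumnScanInfo struct, and ReferencesIoBuffer() 
are hypothetical names invented for this sketch and are not part of Impala's 
scanner code.

```cpp
// Hypothetical types for illustration only; not Impala's actual classes.
enum class FileFormat { kParquet, kText, kRcFile, kAvro, kSequenceFile };
enum class Encoding { kPlain, kDictionary };

struct ColumnScanInfo {
  FileFormat format;
  bool is_string_column;  // column holds var-len string data
  Encoding encoding;      // only meaningful when format == kParquet
  bool block_compressed;  // gzip, Snappy, or another general-purpose codec
};

// Returns true when tuples may point directly into the disk I/O buffer,
// which is the situation that forces the buffer to be attached to the
// RowBatch.
inline bool ReferencesIoBuffer(const ColumnScanInfo& c) {
  // Fixed-length values are materialized into the tuple itself.
  if (!c.is_string_column) return false;
  // Compressed data is decompressed into a separate buffer first.
  if (c.block_compressed) return false;
  // Parquet dictionary data is copied out of the I/O buffer.
  if (c.format == FileFormat::kParquet &&
      c.encoding == Encoding::kDictionary) {
    return false;
  }
  return true;
}
```

Under these assumptions, plain-encoded strings in uncompressed Parquet and 
strings in uncompressed text both satisfy the predicate, while any compressed 
or dictionary-encoded column does not.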

In those cases the copy avoidance could provide some performance benefits. 
However, it is unclear that any of those file formats are, or should be, used 
in performance-critical use cases, because the storage density of uncompressed 
strings is almost always terrible.
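
For reference, a minimal sketch of what "copying out" means here, assuming a 
simplified memory pool owned by the row batch. StringRef and BatchPool are 
hypothetical stand-ins invented for this example, not Impala's actual classes.

```cpp
#include <cstddef>
#include <cstring>
#include <memory>
#include <vector>

// Hypothetical stand-in for a var-len string slot in a tuple.
struct StringRef {
  const char* ptr;
  std::size_t len;
};

// Hypothetical stand-in for memory owned by a row batch.
class BatchPool {
 public:
  // Copies len bytes from src into pool-owned memory and returns a
  // reference to the copy. After this call the caller no longer depends
  // on the lifetime of the source (e.g. a disk I/O buffer).
  StringRef CopyOut(const char* src, std::size_t len) {
    chunks_.push_back(std::unique_ptr<char[]>(new char[len]));
    char* dst = chunks_.back().get();
    std::memcpy(dst, src, len);
    return StringRef{dst, len};
  }

 private:
  std::vector<std::unique_ptr<char[]>> chunks_;
};
```

With the copy in place, the I/O buffer can be returned to the I/O manager as 
soon as the scanner has finished parsing it, at the cost of one extra memcpy 
per string value.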

We should evaluate the performance impact of the additional copies, but I 
suspect that it is not severe and does not impact any important use cases.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
