mapleFU commented on issue #37559: URL: https://github.com/apache/arrow/issues/37559#issuecomment-1711023671
Yeah, I think this happens during reads. The original logic is:

* For columns c1...cn, read them all; this might include decompression, decoding, memcpy, ...
* Build an arrow `RecordBatch` on top of the decoded columns

When it comes to filter pushdown, we might need some late-materialization techniques. That would change the procedure to something like the following (see the sketch at the end of this comment):

* Read column c1 and evaluate the filter on c1
* Use the resulting selection to read the remaining columns
* Build an arrow `RecordBatch` on top of the decoded columns

[1] https://issues.apache.org/jira/browse/SPARK-36527
[2] https://docs.cloudera.com/cdw-runtime/cloud/impala-reference/topics/impala-lazy-materialization.html

The links above use this technique. Note that I guess it does not always improve CPU performance. E.g. for a filter output like `0 1 0 1 ...`, it's not easy to use the filter to save CPU time.
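A minimal sketch of the read-filter-then-materialize ordering described above, using only the public pyarrow API. The column names (`c1`, `c2`, `c3`), the predicate, and the file name are hypothetical; a real in-reader implementation would push the selection down to the page/value level, whereas this can only skip decoding at whole row-group granularity.

```python
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

# Hypothetical file and schema: columns c1, c2, c3.
pf = pq.ParquetFile("example.parquet")
pieces = []

for rg in range(pf.num_row_groups):
    # Step 1: decode only the filter column for this row group.
    c1 = pf.read_row_group(rg, columns=["c1"]).column("c1")

    # Step 2: evaluate the predicate to get a selection mask.
    mask = pc.greater(c1, 10)

    # If nothing in this row group passes, skip decoding the other columns.
    if not pc.any(mask).as_py():
        continue

    # Step 3: materialize the remaining columns and apply the selection.
    rest = pf.read_row_group(rg, columns=["c2", "c3"])
    pieces.append(rest.filter(mask))

result = pa.concat_tables(pieces) if pieces else None
```

This also illustrates the caveat about selectivity: with an alternating `0 1 0 1 ...` mask, no row group is skipped and every column is still decoded, so the extra filtering pass adds work rather than saving it.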
