pavibhai commented on a change in pull request #635:
URL: https://github.com/apache/orc/pull/635#discussion_r578772205
##########
File path: site/develop/design/lazy_filter.md
##########
@@ -0,0 +1,352 @@
+* [Lazy Filter](#LazyFilter)
+ * [Background](#Background)
+ * [Design](#Design)
+ * [SArg to Filter](#SArgtoFilter)
+ * [Read](#Read)
+ * [Configuration](#Configuration)
+ * [Tests](#Tests)
+ * [Appendix](#Appendix)
+ * [Benchmarks](#Benchmarks)
+ * [Row vs Vector](#RowvsVector)
+ * [Filter](#Filter)
+
+# Lazy Filter <a id="LazyFilter"></a>
+
+## Background <a id="Background"></a>
+
+This feature request started as a result of a large search with the following characteristics:
+
+* The search fields are not among the partition, bucket, or sort fields.
+* The table is very large.
+* The predicates result in very few rows compared to the scan size.
+* The search columns are a significant subset of the selection columns in the query.
+
+Initial analysis showed that we could gain a significant benefit by lazily reading the non-search columns only when the filter finds a match. We explore the design and some benchmarks in the subsequent sections.
+
+## Design <a id="Design"></a>
+
+This builds further on [ORC-577][ORC-577], which currently only restricts deserialization for some selected data types but does not improve IO.
+
+On a high level the design includes the following components:
+
+```text
+┌──────────────┐ ┌────────────────────────┐
+│ │ │ Read │
+│ │ │ │
+│ │ │ ┌────────────┐ │
+│SArg to Filter│─────────▶│ │Read Filter │ │
+│ │ │ │ Columns │ │
+│ │ │ └────────────┘ │
+│ │ │ │ │
+└──────────────┘ │ ▼ │
+ │ ┌────────────┐ │
+ │ │Apply Filter│ │
+ │ └────────────┘ │
+ │ │ │
+ │ ▼ │
+ │ ┌────────────┐ │
+ │ │Read Select │ │
+ │ │ Columns │ │
+ │ └────────────┘ │
+ │ │
+ │ │
+ └────────────────────────┘
+```
+
+* **SArg to Filter**: Converts the Search Arguments passed down into filters for efficient application during scans.
+* **Read**: Performs the lazy read using the filters.
+  * **Read Filter Columns**: Read the filter columns from the file.
+  * **Apply Filter**: Apply the filter on the read filter columns.
+  * **Read Select Columns**: If the filter selects at least one row, then read the remaining columns.
+
+### SArg to Filter <a id="SArgtoFilter"></a>
+
+SArg to Filter converts the passed SArg into a filter. This enables automatic compatibility with both Spark and Hive, as they already push Search Arguments down to ORC.
+
+The SArg is automatically converted into a [Vector Filter][vfilter], which is applied during the read process. Two filter types were evaluated:
+
+* A [Row Filter][rfilter] that evaluates each row against all the predicates once.
+* A [Vector Filter][vfilter] that evaluates each predicate across the entire vector and adjusts the selection for the subsequent evaluations.
+
+While a row-based filter is easier to code, it is much [slower][rowvvector] to process. We also see a significant [performance gain][rowvvector] in the absence of normalization.
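+
+To make the contrast concrete, here is a minimal, illustrative sketch of the two filter shapes in Java; the interface and method names are hypothetical and do not represent the actual ORC filter API:
+
+```java
+import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
+
+// Hypothetical row filter: every row pays the cost of walking all predicates.
+interface RowFilter {
+  boolean accept(VectorizedRowBatch batch, int rowIdx);
+}
+
+// Hypothetical vector filter: a single predicate is evaluated across the whole
+// batch and shrinks the selection, so later predicates only see the survivors.
+interface VectorFilter {
+  // sel holds the currently selected row indices; returns the new selected size.
+  int filter(VectorizedRowBatch batch, int[] sel, int selSize);
+}
+
+// An AND over vector filters narrows the selection progressively and can stop
+// early once nothing remains selected.
+final class AndVectorFilter implements VectorFilter {
+  private final VectorFilter[] children;
+
+  AndVectorFilter(VectorFilter... children) {
+    this.children = children;
+  }
+
+  @Override
+  public int filter(VectorizedRowBatch batch, int[] sel, int selSize) {
+    for (VectorFilter child : children) {
+      selSize = child.filter(batch, sel, selSize);
+      if (selSize == 0) {
+        break; // no surviving rows, skip the remaining predicates
+      }
+    }
+    return selSize;
+  }
+}
+```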
+
+The Search Argument builder should allow skipping normalization during the [build][build]. This has already been proposed as part of [HIVE-24458][HIVE-24458].
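+
+As an illustration of why skipping normalization matters, consider a Search Argument built with the existing `SearchArgumentFactory` builder (the column names and literals below are made up for the example). Normalization to CNF rewrites `(a AND b) OR (c AND d)` as `(a OR c) AND (a OR d) AND (b OR c) AND (b OR d)`, doubling the leaf evaluations in this small case and growing much faster with deeper nesting:
+
+```java
+import org.apache.hadoop.hive.ql.io.sarg.PredicateLeaf;
+import org.apache.hadoop.hive.ql.io.sarg.SearchArgument;
+import org.apache.hadoop.hive.ql.io.sarg.SearchArgumentFactory;
+
+public class SargExample {
+  public static SearchArgument build() {
+    // (a = 1 AND b = 2) OR (c = 3 AND d = 4)
+    return SearchArgumentFactory.newBuilder()
+        .startOr()
+          .startAnd()
+            .equals("a", PredicateLeaf.Type.LONG, 1L)
+            .equals("b", PredicateLeaf.Type.LONG, 2L)
+          .end()
+          .startAnd()
+            .equals("c", PredicateLeaf.Type.LONG, 3L)
+            .equals("d", PredicateLeaf.Type.LONG, 4L)
+          .end()
+        .end()
+        .build();
+  }
+}
+```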
+
+### Read <a id="Read"></a>
+
+The read process has the following changes:
+
+```text
+ │
+ │
+ │
+┌────────────────────────▼────────────────────────┐
+│ ┏━━━━━━━━━━━━━━━━┓ │
+│ ┃Plan ++Search++ ┃ │
+│ ┃ Columns ┃ │
+│ ┗━━━━━━━━━━━━━━━━┛ │
+│ Read │Stripe │
+└────────────────────────┼────────────────────────┘
+ │
+ ▼
+
+
+ │
+ │
+┌────────────────────────▼────────────────────────┐
+│ ┏━━━━━━━━━━━━━━━━┓ │
+│ ┃Read ++Search++ ┃ │
+│ ┃ Columns ┃◀─────────┐ │
+│ ┗━━━━━━━━━━━━━━━━┛ │ │
+│ │ Size = 0 │
+│ ▼ │ │
+│ ┏━━━━━━━━━━━━━━━━┓ │ │
+│ ┃ Apply Filter ┃──────────┘ │
+│ ┗━━━━━━━━━━━━━━━━┛ │
+│ Size > 0 │
+│ │ │
+│ ▼ │
+│ ┏━━━━━━━━━━━━━━━━┓ │
+│ ┃ Plan Select ┃ │
+│ ┃ Columns ┃ │
+│ ┗━━━━━━━━━━━━━━━━┛ │
+│ │ │
+│ ▼ │
+│ ┏━━━━━━━━━━━━━━━━┓ │
+│ ┃ Read Select ┃ │
+│ ┃ Columns ┃ │
+│ ┗━━━━━━━━━━━━━━━━┛ │
+│ Next │Batch │
+└────────────────────────┼────────────────────────┘
+ │
+ ▼
+```
+
+The read process changes (a sketch of the resulting batch loop follows this list):
+
+* **Read Stripe** used to plan the read of all (search + select) columns. This is enhanced to plan and fetch only the search columns. The rest of the stripe planning optimizations remain unchanged, e.g. partial read planning of the stripe based on RowGroup statistics.
+* **Next Batch** identifies the processing that takes place when `RecordReader.nextBatch` is invoked.
+  * **Read Search Columns** takes place instead of reading all the selected columns. This is in sync with the planning done during **Read Stripe**, where only the search columns were planned.
+  * **Apply Filter** on the batch, which at this point only includes the search columns. Evaluate the result of the filter:
+    * **Size = 0** indicates that all records have been filtered out, so we proceed to the next batch of search columns.
+    * **Size > 0** indicates that at least one record was accepted by the filter. This record needs to be substantiated with the other columns.
+  * **Plan Select Columns** is invoked to plan the read of the select columns. The planning happens as follows:
+    * Determine the current position of the read within the stripe and plan the read of the select columns from this point forward to the end of the stripe.
+    * The read planning of the select columns respects the row groups filtered out as a result of the stripe planning.
+    * Fetch the select columns using the above plan.
+  * **Read Select Columns** into the vectorized row batch.
+  * Return this batch.
+
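+The loop described above can be sketched roughly as follows; the helper methods are placeholders and do not correspond one-to-one to the actual `RecordReaderImpl` code:
+
+```java
+import java.io.IOException;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
+
+// Rough sketch of the lazy-filter batch loop (illustrative names only).
+abstract class LazyFilterBatchLoop {
+  abstract boolean readSearchColumns(VectorizedRowBatch batch) throws IOException;
+  abstract void applyFilter(VectorizedRowBatch batch);
+  abstract void planSelectColumns() throws IOException;
+  abstract void readSelectColumns(VectorizedRowBatch batch) throws IOException;
+
+  public boolean nextBatch(VectorizedRowBatch batch) throws IOException {
+    while (true) {
+      // Only the search columns were planned and fetched for the stripe.
+      if (!readSearchColumns(batch)) {
+        return false;                 // end of stripe / file
+      }
+      applyFilter(batch);             // updates batch.selected and batch.size
+      if (batch.size > 0) {
+        planSelectColumns();          // from the current position to the stripe end
+        readSelectColumns(batch);     // substantiate the surviving rows
+        return true;
+      }
+      // size == 0: every row was filtered out; continue with the next batch of
+      // search columns without touching the select columns.
+    }
+  }
+}
+```
+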
+The current implementation performs a single read for the select columns in a stripe.
+
+```text
+┌──────────────────────────────────────────────────┐
+│ ┌────┐ ┌────┐ ┌────┐ ┌────┐ ┌────┐ ┌────┐ ┌────┐ │
+│ │RG0 │ │RG1 │ │RG2■│ │RG3 │ │RG4 │ │RG5■│ │RG6 │ │
+│ └────┘ └────┘ └────┘ └────┘ └────┘ └────┘ └────┘ │
+│ Stripe │
+└──────────────────────────────────────────────────┘
+```
+
+The above diagram depicts a stripe with 7 Row Groups, out of which **RG2** and **RG5** are selected by the filter. The current implementation does the following:
+
+* Start the read planning process from the first match, RG2.
+* Read to the end of the stripe, which includes RG6.
+* The resulting fetch skips RG0 and RG1, subject to compression block boundaries.
+
+The above logic could be enhanced to perform, say, **2 or n** reads before reading to the end of the stripe. The current implementation allows 0 reads before reading to the end of the stripe. The value of **n** could be made configurable, but should avoid too many short reads (see the sketch below).
+
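+A rough sketch of how a configurable partial-read budget might drive the planning of the select-column reads follows. This is a proposal only (it is not implemented) and all names are hypothetical:
+
+```java
+import java.util.ArrayList;
+import java.util.List;
+
+// Hypothetical planner: emit contiguous byte ranges for the selected row groups,
+// but once the partial-read budget is exhausted, finish with a single read to the
+// end of the stripe. allowedPartialReads = 0 reproduces the current behavior of
+// one read from the first selected row group to the stripe end.
+final class SelectReadPlanner {
+  static final class Range {
+    final long start;
+    final long end;
+    Range(long start, long end) { this.start = start; this.end = end; }
+  }
+
+  static List<Range> plan(boolean[] rgSelected, long[] rgOffset, long stripeEnd,
+                          int allowedPartialReads) {
+    List<Range> ranges = new ArrayList<>();
+    int rg = 0;
+    while (rg < rgSelected.length) {
+      // skip over row groups that the filter did not select
+      while (rg < rgSelected.length && !rgSelected[rg]) {
+        rg++;
+      }
+      if (rg == rgSelected.length) {
+        break;
+      }
+      long start = rgOffset[rg];
+      if (ranges.size() == allowedPartialReads) {
+        // budget exhausted: read through to the end of the stripe
+        ranges.add(new Range(start, stripeEnd));
+        break;
+      }
+      // extend this range over the contiguous run of selected row groups
+      while (rg < rgSelected.length && rgSelected[rg]) {
+        rg++;
+      }
+      long end = (rg < rgSelected.length) ? rgOffset[rg] : stripeEnd;
+      ranges.add(new Range(start, end));
+    }
+    return ranges;
+  }
+}
+```
+
+With the stripe above, `allowedPartialReads = 1` would yield two reads, one covering RG2 and one from RG5 to the end of the stripe, matching the second diagram below.
+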
+The read behavior changes as follows when multiple reads are allowed within a stripe for the select columns:
+
+```text
+┌──────────────────────────────────────────────────┐
+│ ┌────┐ ┌────┐ ┌────┐ ┌────┐ ┌────┐ ┌────┐ ┌────┐ │
+│ │ │ │ │ │■■■■│ │■■■■│ │■■■■│ │■■■■│ │■■■■│ │
+│ └────┘ └────┘ └────┘ └────┘ └────┘ └────┘ └────┘ │
+│ Current implementation │
+└──────────────────────────────────────────────────┘
+┌──────────────────────────────────────────────────┐
+│ ┌────┐ ┌────┐ ┌────┐ ┌────┐ ┌────┐ ┌────┐ ┌────┐ │
+│ │ │ │ │ │■■■■│ │ │ │ │ │■■■■│ │■■■■│ │
+│ └────┘ └────┘ └────┘ └────┘ └────┘ └────┘ └────┘ │
+│ Allow 1 partial read │
+└──────────────────────────────────────────────────┘
+```
+
+The figure shows that we could read significantly fewer bytes by performing an
additional read before reading to the end
Review comment:
No, partial read configuration is not implemented.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]