This is an automated email from the ASF dual-hosted git repository. planka pushed a commit to branch main in repository https://gitbox.apache.org/repos/asf/orc.git
The following commit(s) were added to refs/heads/main by this push: new 9e903c2b6 ORC-1140: Documented background and benchmarks for ORC-1136 (#1149) 9e903c2b6 is described below commit 9e903c2b6f0b426472a0a6dece013b02d76ee034 Author: Pavan Lanka <pla...@apple.com> AuthorDate: Tue Jun 14 16:02:00 2022 -0700 ORC-1140: Documented background and benchmarks for ORC-1136 (#1149) ### What changes were proposed in this pull request? This includes benchmark results and documentation for ORC-1136 ### Why are the changes needed? For documentation ### How was this patch tested? Documentation related changes only --- site/develop/design/index.md | 3 +- site/develop/design/io.md | 172 +++++++++++++++++++++++++++++++++++++++++++ site/img/seekvsread.png | Bin 0 -> 32924 bytes 3 files changed, 174 insertions(+), 1 deletion(-) diff --git a/site/develop/design/index.md b/site/develop/design/index.md index 3112255f9..c3d6cad08 100644 --- a/site/develop/design/index.md +++ b/site/develop/design/index.md @@ -3,4 +3,5 @@ layout: page title: Design --- -* [Lazy Filters](lazy_filter) \ No newline at end of file +* [Lazy Filters](lazy_filter) +* [IO](io) \ No newline at end of file diff --git a/site/develop/design/io.md b/site/develop/design/io.md new file mode 100644 index 000000000..95e9985fa --- /dev/null +++ b/site/develop/design/io.md @@ -0,0 +1,172 @@ +--- +layout: page +title: IO +--- + +* [Background](#Background) + * [Seek vs Read](#SeekvsRead) + * [ORC Read](#ORCRead) +* [Read Optimization](#ReadOptimization) + * [Approach](#Approach) + * [Scope](#Scope) + * [Benchmarks](#Benchmarks) + * [Local FS](#LocalFS) + * [AWS S3](#AWSS3) + * [Summary](#Summary) + +## Background <a id="Background"></a> + +We are moving our workloads from HDFS to AWS S3. As part of this activity we wanted to understand the performance +characteristics and costs of using S3. + +### Seek vs Read <a id="SeekvsRead"></a> + +One particular scenario that stood out in our performance testing was Seek vs Read when dealing with S3. + +In this test we are trying to read through a file + +* Seek to Point A in the file read X bytes +* Move to Point B in the file that is A + X + Y + * This is accomplished as another seek or as a read + * We will leave Y variable to determine when this is best +* Read X bytes + + + +Observations: + +* We could clearly see that a read is more performant than seek when dealing with steps/gaps smaller than 4 MB. + * At 4 MB read is faster by ~ 11% + * At 1 MB read is faster by ~ 20% +* Reads are also cheaper as we perform a single GET instead of multiple GETs from [AWS S3 Pricing][s3_pricing] + * Cost for GET: $0.0004 + * Cost for Data Retrieval to the same region AWS EKS: $0.0000 + +### ORC Read <a id="ORCRead"></a> + +Based on the above performance penalty when dealing with multiple seeks over small gaps, we measured the performance of +ORC read on a file. + +File details: + +* Size ~ 21 MB +* Column Count: ~ 400 +* Row Count: ~ 65K + +|Read Type |Duration|Unit| +|:--- | ---:|:---| +|All Columns | 1.075|s | +|Alternate Columns| 6.489|s | + +Observations: + +* We can clearly see that we pay a significant penalty when reading alternate columns, which in the current + implementation of ORC translates to multiple GET calls on AWS S3 +* While the impact of penalty will be less significant in large reads, it will incur overheads both in terms of time and + cost + +## Read Optimization <a id="ReadOptimization"></a> + +### Approach <a id="Approach"></a> + +The following optimizations are planned: + +* **orc.min.disk.seek.size** is a value in bytes: When trying to determine a single read, if the gap between two reads + is smaller than this then it is combined into a single read. +* **orc.min.disk.seek.size.tolerance** is a fractional input: If the extra bytes read is greater than this fraction of + the required bytes, then we drop the extra bytes from memory. +* We can further consider adding an optimization for the complete stripe in case the stripe size is smaller than + `orc.min.disk.seek.size` + +### Scope <a id="Scope"></a> + +Different types of IO takes place in ORC today. + +* Reading of File Footer: Unchanged +* Reading of Stripe Footer: Unchanged +* Reading of Stripe Index information: Optimized +* Reading of Stripe Data: Optimized + +Each of the above happens at different stages of the read. The current implementation optimizes reads that happen using +the [DataReader][dr] interface. + +This does not: + +* Optimize the read of the file/stripe footer +* Reads across multiple stripes + +### Benchmarks <a id="Benchmarks"></a> + +#### Local FS <a id="LocalFS"></a> + +This benchmark is run on the local filesystem with NVMe SSD, so it has very different performance characteristics to AWS +S3. + +The purpose of this benchmark is to ascertain if we have added any significant penalties in the ORC code by adding +`minSeekSize` and `extraByteTolerance`. + +```bash +java -jar java/bench/core/target/orc-benchmarks-core-*-uber.jar chunk_read +``` + +|(alt)|(cols)| (byteTol) |(minSeek)|Mode| Cnt|Score|Sign|Error|Units| +|:--- | ---:|----------:| ---:|:---|---:| ---:|:---| ---:|:--- | +|true | 128| 0.0 | 0|avgt| 20|0.352|± |0.006|s/op | +|true | 128| 0.0 | 4194304|avgt| 20|0.357|± |0.002|s/op | +|true | 128| 10.0 | 4194304|avgt| 20|0.349|± |0.002|s/op | +|false| 128| 0.0 | 0|avgt| 20|0.667|± |0.007|s/op | +|false| 128| 0.0 | 4194304|avgt| 20|0.673|± |0.004|s/op | +|false| 128| 10.0 | 4194304|avgt| 20|0.671|± |0.005|s/op | + +Observations/Details: + +* **Input File details**: + * Rows: 65536 + * Columns: 128 + * FileSize: ~ 72 MB +* Full Read (alternate = false) + * No significant difference between the options as expected +* Alternate Read (alternate = true) + * No significant difference between the options given the small file size and performance of local disk + * This also calls out that the recommended minSeekSize should be determined for each platform e.g. HDFS, S3, etc + +#### AWS S3 <a id="AWSS3"></a> + +In this benchmark we brought up an EKS Container in the same region as the AWS S3 bucket to test the performance of the +patch. + +|(alternate)| (byteTol) |(minSeekSize)|Mode| Cnt|Score|Sign|Error|Units| +|:--- |----------:| ---:|:---|---:| ---:|:---| ---:|:--- | +|FALSE | 0.0 | 0|avgt| 5|1.837|± |0.089|s/op | +|FALSE | 0.0 | 4194304|avgt| 5|1.919|± | 0.11|s/op | +|FALSE | 10.0 | 4194304|avgt| 5|1.895|± |0.191|s/op | +|TRUE | 0.0 | 0|avgt| 5| 5.8|± |1.132|s/op | +|TRUE | 0.0 | 4194304|avgt| 5|1.479|± |0.197|s/op | +|TRUE | 10.0 | 4194304|avgt| 5|1.435|± |0.176|s/op | + +Observations/Details: + +* **Input File details**: + * Rows: 65536 + * Columns: 128 + * FileSize: ~ 72 MB +* Full Read (alternate = false) + * No significant difference between the options as expected +* Alternate Read (alternate = true) + * We get a significant boost in performance 5.8s without optimization to 1.5s with optimization giving us a time + reduction of ~ 75 % + * This also gives us a cost saving as 64 GET one for each column per stripe have been replaced with a single GET + * We can see a marginal improvement ~ 3% when choosing to retain extra bytes (extraByteTolerance=10.0) as compared to + (extraByteTolerance=0.0) which performs additional work of dropping the extra bytes from memory. + +### Summary <a id="Summary"></a> + +Based on the benchmarks the following is recommended for ORC in AWS S3: + +* `orc.min.disk.seek.size` is set to `4194304` (4 MB) +* `orc.min.disk.seek.size.tolerance` is set to value that is acceptable based on the memory usage constraints. When set + to `0.0` it will always do the extra work of dropping the extra bytes. + +[s3_pricing]: https://aws.amazon.com/s3/pricing/ + +[dr]: {{ site.repository }}/tree/main/java/core/src/java/org/apache/orc/DataReader.java \ No newline at end of file diff --git a/site/img/seekvsread.png b/site/img/seekvsread.png new file mode 100644 index 000000000..6491a5492 Binary files /dev/null and b/site/img/seekvsread.png differ