cnauroth opened a new pull request, #7291: URL: https://github.com/apache/hadoop/pull/7291
### Description of PR

`hadoop fs -text` reads Avro files and sequence files by internally wrapping the stream in [`AvroFileInputStream`](https://github.com/apache/hadoop/blob/rel/release-3.4.1/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/shell/Display.java#L270) or [`TextRecordInputStream`](https://github.com/apache/hadoop/blob/rel/release-3.4.1/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/shell/Display.java#L217). These classes implement the required single-byte [`read()`](https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/io/InputStream.html#read()), but not the optional multi-byte buffered [`read(byte[], int, int)`](https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/io/InputStream.html#read(byte%5B%5D,int,int)). The default implementation in the JDK is a [loop over single-byte reads](https://github.com/openjdk/jdk11u-dev/blob/a47c72fad455bfdf9053cb8e94c99e73965ab50d/src/java.base/share/classes/java/io/InputStream.java#L280), which causes sub-optimal I/O and method call overhead. We can optimize this by overriding the multi-byte read method.

### How was this patch tested?

Multiple new unit tests cover expectations for both single-byte and multi-byte read. I ran these tests both before and after the code change to make sure there is no unexpected behavioral change.

Here is some benchmarking I ran to test the effects of this patch. It shows a ~1.7x throughput improvement for sequence files.

```
# Generate random text data with block Snappy compression (sequence file format framing).
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar randomtextwriter \
    -D mapreduce.output.fileoutputformat.compress=true \
    -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec \
    -D mapreduce.output.fileoutputformat.compress.type=BLOCK \
    random-snappy

# Install pv utility for benchmarking throughput of data through a command pipeline.
sudo apt-get install pv

# Benchmark before and after the patch.
# Also calculate checksums to confirm returned data is identical.

# Before:
hadoop fs -text 'random-snappy/*' | pv | sha256sum
20.1GiB 0:04:29 [76.2MiB/s] [ <=> ]
9ec540014192ba9522c58a6fa8e26c912707493566a42e8a22de9edce474a6f4  -

# After:
hadoop fs -text 'random-snappy/*' | pv | sha256sum
20.1GiB 0:02:36 [ 131MiB/s] [ <=> ]
9ec540014192ba9522c58a6fa8e26c912707493566a42e8a22de9edce474a6f4  -
```

The improvement is more modest for Avro, where there are unrelated overheads from deserializing and reserializing to JSON:

```
# Generate random Avro data. This yields a ~1 GB file, and I staged 10 copies into a directory.
java -jar avro-tools-1.11.4.jar random users-4000000.avro \
    --schema-file user.asvc \
    --count 40000000

# Before:
hadoop fs -text 'random-avro/*' | pv | sha256sum
36.4GiB 0:15:38 [39.7MiB/s] [ <=> ]
41d4ed96ef1fc3f74c0a6faf3ad3e94202ed4863a7d405a6a555aae2fe89df65  -

# After:
hadoop fs -text 'random-avro/*' | pv | sha256sum
36.4GiB 0:13:36 [45.7MiB/s] [ <=> ]
41d4ed96ef1fc3f74c0a6faf3ad3e94202ed4863a7d405a6a555aae2fe89df65  -
```

### For code changes:

- [X] Does the title or this PR starts with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
- [ ] Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
- [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?
- [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, `NOTICE-binary` files?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
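For illustration, the technique the PR describes can be sketched as follows. This is a hypothetical simplification, not the actual Hadoop patch: the class names (`BulkReadSketch`, `RecordStream`, `drain`) are invented for the example, and the stream serves a fixed in-memory buffer rather than decoding records. The point it shows is that a wrapper stream that only implements the single-byte `read()` inherits the JDK's byte-at-a-time loop for `read(byte[], int, int)`, and overriding the multi-byte method with a bulk copy removes that per-byte call overhead while returning identical bytes.

```java
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BulkReadSketch {

  /**
   * Toy record stream: serves bytes from an in-memory buffer, loosely
   * analogous to a stream that renders records into text. Without the
   * read(byte[], int, int) override below, InputStream's default would
   * satisfy every bulk read by looping over single-byte read() calls.
   */
  static class RecordStream extends InputStream {
    private final byte[] buf; // pretend this holds rendered record text
    private int pos;

    RecordStream(byte[] buf) {
      this.buf = buf;
    }

    @Override
    public int read() {
      return pos < buf.length ? (buf[pos++] & 0xff) : -1;
    }

    // The optimization: serve many bytes per call with one bulk copy.
    @Override
    public int read(byte[] b, int off, int len) {
      if (len == 0) {
        return 0;
      }
      if (pos >= buf.length) {
        return -1; // end of stream
      }
      int n = Math.min(len, buf.length - pos);
      System.arraycopy(buf, pos, b, off, n);
      pos += n;
      return n;
    }
  }

  /** Drains a stream with 8 KB bulk reads and returns all bytes read. */
  static byte[] drain(RecordStream in) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] chunk = new byte[8192];
    int n;
    while ((n = in.read(chunk, 0, chunk.length)) != -1) {
      out.write(chunk, 0, n);
    }
    return out.toByteArray();
  }

  public static void main(String[] args) {
    byte[] data = "key1\tvalue1\nkey2\tvalue2\n".getBytes(StandardCharsets.UTF_8);
    byte[] copied = drain(new RecordStream(data));
    // Bulk reads must return exactly the same bytes as single-byte reads,
    // which is what the checksum comparison in the benchmarks verifies.
    if (!Arrays.equals(data, copied)) {
      throw new AssertionError("bulk read mismatch");
    }
    System.out.println("copied " + copied.length + " bytes");
  }
}
```

Note the override still honors the `InputStream` contract: it may return fewer than `len` bytes, returns 0 for a zero-length request, and returns -1 only at end of stream, so callers that already worked against the default implementation see no behavioral change.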
