cnauroth opened a new pull request, #7291: URL: https://github.com/apache/hadoop/pull/7291
### Description of PR

`hadoop fs -text` reads Avro files and sequence files by internally wrapping the stream in [`AvroFileInputStream`](https://github.com/apache/hadoop/blob/rel/release-3.4.1/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/shell/Display.java#L270) or [`TextRecordInputStream`](https://github.com/apache/hadoop/blob/rel/release-3.4.1/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/shell/Display.java#L217). These classes implement the required single-byte [`read()`](https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/io/InputStream.html#read()), but not the optional multi-byte buffered [`read(byte[], int, int)`](https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/io/InputStream.html#read(byte%5B%5D,int,int)). The default implementation in the JDK is a [loop over single-byte reads](https://github.com/openjdk/jdk11u-dev/blob/a47c72fad455bfdf9053cb8e94c99e73965ab50d/src/java.base/share/classes/java/io/InputStream.java#L280), which causes sub-optimal I/O and method call overhead. We can optimize this by overriding the multi-byte read method.

### How was this patch tested?

Multiple new unit tests cover expectations for both single-byte and multi-byte read. I ran these tests both before and after the code change to make sure there is no unexpected behavioral change.

Here is some benchmarking I ran to test the effects of this patch. It shows a ~1.7x throughput improvement for sequence files.

```
# Generate random text data with block Snappy compression (sequence file format framing).
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar randomtextwriter \
    -D mapreduce.output.fileoutputformat.compress=true \
    -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec \
    -D mapreduce.output.fileoutputformat.compress.type=BLOCK \
    random-snappy

# Install pv utility for benchmarking throughput of data through a command pipeline.
sudo apt-get install pv

# Benchmark before and after the patch.
# Also calculate checksums to confirm returned data is identical.

# Before:
hadoop fs -text 'random-snappy/*' | pv | sha256sum
20.1GiB 0:04:29 [76.2MiB/s] [ <=> ]
9ec540014192ba9522c58a6fa8e26c912707493566a42e8a22de9edce474a6f4  -

# After:
hadoop fs -text 'random-snappy/*' | pv | sha256sum
20.1GiB 0:02:36 [ 131MiB/s] [ <=> ]
9ec540014192ba9522c58a6fa8e26c912707493566a42e8a22de9edce474a6f4  -
```

The improvement is more modest for Avro, where there are unrelated overheads from deserializing and reserializing to JSON:

```
# Generate random Avro data. This yields a ~1 GB file, and I staged 10 copies into a directory.
java -jar avro-tools-1.11.4.jar random users-4000000.avro \
    --schema-file user.asvc \
    --count 40000000

# Before:
hadoop fs -text 'random-avro/*' | pv | sha256sum
36.4GiB 0:15:38 [39.7MiB/s] [ <=> ]
41d4ed96ef1fc3f74c0a6faf3ad3e94202ed4863a7d405a6a555aae2fe89df65  -

# After:
hadoop fs -text 'random-avro/*' | pv | sha256sum
36.4GiB 0:13:36 [45.7MiB/s] [ <=> ]
41d4ed96ef1fc3f74c0a6faf3ad3e94202ed4863a7d405a6a555aae2fe89df65  -
```

### For code changes:

- [X] Does the title or this PR starts with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
- [ ] Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
- [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?
- [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, `NOTICE-binary` files?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
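For illustration, the technique the PR describes can be sketched as follows. This is a hypothetical simplification, not the actual Hadoop patch: the class names (`BulkReadSketch`, `RecordStream`, `drain`) are invented for the example, and the stream serves a fixed in-memory buffer rather than decoding records. The point it shows is that a wrapper stream that only implements the single-byte `read()` inherits the JDK's byte-at-a-time loop for `read(byte[], int, int)`, and overriding the multi-byte method with a bulk copy removes that per-byte call overhead while returning identical bytes.

```java
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BulkReadSketch {

  /**
   * Toy record stream: serves bytes from an in-memory buffer, loosely
   * analogous to a stream that renders records into text. Without the
   * read(byte[], int, int) override below, InputStream's default would
   * satisfy every bulk read by looping over single-byte read() calls.
   */
  static class RecordStream extends InputStream {
    private final byte[] buf; // pretend this holds rendered record text
    private int pos;

    RecordStream(byte[] buf) {
      this.buf = buf;
    }

    @Override
    public int read() {
      return pos < buf.length ? (buf[pos++] & 0xff) : -1;
    }

    // The optimization: serve many bytes per call with one bulk copy.
    @Override
    public int read(byte[] b, int off, int len) {
      if (len == 0) {
        return 0;
      }
      if (pos >= buf.length) {
        return -1; // end of stream
      }
      int n = Math.min(len, buf.length - pos);
      System.arraycopy(buf, pos, b, off, n);
      pos += n;
      return n;
    }
  }

  /** Drains a stream with 8 KB bulk reads and returns all bytes read. */
  static byte[] drain(RecordStream in) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] chunk = new byte[8192];
    int n;
    while ((n = in.read(chunk, 0, chunk.length)) != -1) {
      out.write(chunk, 0, n);
    }
    return out.toByteArray();
  }

  public static void main(String[] args) {
    byte[] data = "key1\tvalue1\nkey2\tvalue2\n".getBytes(StandardCharsets.UTF_8);
    byte[] copied = drain(new RecordStream(data));
    // Bulk reads must return exactly the same bytes as single-byte reads,
    // which is what the checksum comparison in the benchmarks verifies.
    if (!Arrays.equals(data, copied)) {
      throw new AssertionError("bulk read mismatch");
    }
    System.out.println("copied " + copied.length + " bytes");
  }
}
```

Note the override still honors the `InputStream` contract: it may return fewer than `len` bytes, returns 0 for a zero-length request, and returns -1 only at end of stream, so callers that already worked against the default implementation see no behavioral change.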
