[ https://issues.apache.org/jira/browse/HADOOP-19389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17913470#comment-17913470 ]
ASF GitHub Bot commented on HADOOP-19389:
-----------------------------------------
cnauroth opened a new pull request, #7291:
URL: https://github.com/apache/hadoop/pull/7291
### Description of PR
`hadoop fs -text` reads Avro files and sequence files by internally wrapping the stream in
[`AvroFileInputStream`](https://github.com/apache/hadoop/blob/rel/release-3.4.1/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/shell/Display.java#L270)
or
[`TextRecordInputStream`](https://github.com/apache/hadoop/blob/rel/release-3.4.1/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/shell/Display.java#L217).
These classes implement the required single-byte
[`read()`](https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/io/InputStream.html#read()),
but not the optional multi-byte buffered
[`read(byte[], int, int)`](https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/io/InputStream.html#read(byte%5B%5D,int,int)).
The default implementation in the JDK is a
[loop over single-byte read](https://github.com/openjdk/jdk11u-dev/blob/a47c72fad455bfdf9053cb8e94c99e73965ab50d/src/java.base/share/classes/java/io/InputStream.java#L280),
which causes suboptimal I/O and method call overhead. We can optimize this by overriding the multi-byte read method.
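For illustration, here is a minimal sketch of the shape of the fix. The class, field, and `nextRecord()` hook below are hypothetical stand-ins, not the actual patch: the real streams in `Display.java` pull their bytes from Avro and sequence file readers rather than from a generic record hook.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch of a record-backed stream. Without the
// read(byte[], int, int) override, callers fall back to InputStream's
// default, which loops over the single-byte read() one byte at a time.
public abstract class RecordBackedInputStream extends InputStream {

  private ByteArrayInputStream inbuf; // bytes of the current record

  // Hypothetical hook: produce the next record's bytes, or null at EOF.
  protected abstract byte[] nextRecord() throws IOException;

  private boolean refill() throws IOException {
    byte[] record = nextRecord();
    if (record == null) {
      return false; // end of input
    }
    inbuf = new ByteArrayInputStream(record);
    return true;
  }

  @Override
  public int read() throws IOException {
    int b;
    while (inbuf == null || (b = inbuf.read()) == -1) {
      if (!refill()) {
        return -1;
      }
    }
    return b;
  }

  // The optimization: serve a whole chunk from the current record buffer
  // in one call instead of one byte per method invocation.
  @Override
  public int read(byte[] b, int off, int len) throws IOException {
    int n;
    while (inbuf == null || (n = inbuf.read(b, off, len)) == -1) {
      if (!refill()) {
        return -1;
      }
    }
    return n;
  }
}
```

With an override like this in place, buffered consumers such as `IOUtils.copyBytes` move one buffer per method call rather than one byte per call.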
### How was this patch tested?
Multiple new unit tests cover expectations for both single-byte and
multi-byte read. I ran these tests both before and after the code change to
make sure there is no unexpected behavioral change.
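As a sketch of the equivalence those tests need to pin down (the stand-in stream and fixture data here are hypothetical; the real tests exercise the wrapped Avro and sequence file streams):

```java
import static org.junit.Assert.assertArrayEquals;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import org.junit.Test;

// Illustrative test shape, not the actual patch's tests: whatever the
// multi-byte override does, it must return exactly the bytes that a
// single-byte read() loop would have produced.
public class TestReadEquivalence {

  // Stand-in for the stream under test; the real tests would construct
  // AvroFileInputStream / TextRecordInputStream over fixture files.
  private InputStream newStreamUnderTest() {
    return new ByteArrayInputStream(
        "sample record data".getBytes(StandardCharsets.UTF_8));
  }

  private byte[] drainSingleByte(InputStream in) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    int b;
    while ((b = in.read()) != -1) {
      out.write(b);
    }
    return out.toByteArray();
  }

  private byte[] drainMultiByte(InputStream in) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buf = new byte[4096];
    int n;
    while ((n = in.read(buf, 0, buf.length)) != -1) {
      out.write(buf, 0, n);
    }
    return out.toByteArray();
  }

  @Test
  public void testMultiByteMatchesSingleByte() throws IOException {
    byte[] expected = drainSingleByte(newStreamUnderTest());
    byte[] actual = drainMultiByte(newStreamUnderTest());
    assertArrayEquals(expected, actual);
  }
}
```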
Here is some benchmarking I ran to test the effects of this patch. This
shows a ~1.7x throughput improvement for sequence files.
```
# Generate random text data with block Snappy compression (sequence file
# format framing).
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar randomtextwriter \
    -D mapreduce.output.fileoutputformat.compress=true \
    -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec \
    -D mapreduce.output.fileoutputformat.compress.type=BLOCK \
    random-snappy

# Install the pv utility for benchmarking throughput of data through a
# command pipeline.
sudo apt-get install pv

# Benchmark before and after the patch.
# Also calculate checksums to confirm returned data is identical.

# Before:
hadoop fs -text 'random-snappy/*' | pv | sha256sum
20.1GiB 0:04:29 [76.2MiB/s] [ <=> ]
9ec540014192ba9522c58a6fa8e26c912707493566a42e8a22de9edce474a6f4  -

# After:
hadoop fs -text 'random-snappy/*' | pv | sha256sum
20.1GiB 0:02:36 [ 131MiB/s] [ <=> ]
9ec540014192ba9522c58a6fa8e26c912707493566a42e8a22de9edce474a6f4  -
```
The improvement is more modest for Avro, roughly 1.15x, where there are unrelated overheads
from deserializing and reserializing to JSON:
```
# Generate random Avro data. This yields a ~1 GB file, and I staged 10
# copies into a directory.
java -jar avro-tools-1.11.4.jar random users-4000000.avro \
    --schema-file user.asvc \
    --count 40000000

# Before:
hadoop fs -text 'random-avro/*' | pv | sha256sum
36.4GiB 0:15:38 [39.7MiB/s] [ <=> ]
41d4ed96ef1fc3f74c0a6faf3ad3e94202ed4863a7d405a6a555aae2fe89df65  -

# After:
hadoop fs -text 'random-avro/*' | pv | sha256sum
36.4GiB 0:13:36 [45.7MiB/s] [ <=> ]
41d4ed96ef1fc3f74c0a6faf3ad3e94202ed4863a7d405a6a555aae2fe89df65  -
```
### For code changes:
- [X] Does the title of this PR start with the corresponding JIRA issue id
(e.g. 'HADOOP-17799. Your PR title ...')?
- [ ] Object storage: have the integration tests been executed and the
endpoint declared according to the connector-specific documentation?
- [ ] If adding new dependencies to the code, are these dependencies
licensed in a way that is compatible for inclusion under [ASF
2.0](http://www.apache.org/legal/resolved.html#category-a)?
- [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`,
`NOTICE-binary` files?
> Optimize shell -text command I/O with multi-byte read.
> ------------------------------------------------------
>
> Key: HADOOP-19389
> URL: https://issues.apache.org/jira/browse/HADOOP-19389
> Project: Hadoop Common
> Issue Type: Improvement
> Components: fs
> Reporter: Chris Nauroth
> Assignee: Chris Nauroth
> Priority: Minor