[
https://issues.apache.org/jira/browse/HADOOP-19389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17914189#comment-17914189
]
ASF GitHub Bot commented on HADOOP-19389:
-----------------------------------------
cnauroth commented on code in PR #7291:
URL: https://github.com/apache/hadoop/pull/7291#discussion_r1920495312
##########
hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/fs/shell/TestTextCommand.java:
##########
@@ -67,39 +75,194 @@ public void testDisplayForAvroFiles() throws Exception {
assertEquals(expectedOutput, output);
}
+ @Test
+ public void testEmptyAvroFile() throws Exception {
Review Comment:
I'm unclear about bringing this into a contract test. These tests are
covering semantics very specific to `hadoop fs -text`. The expectation is that
the command can decode Avro and sequence files and print them in JSON/text
representation. At the `FileSystem` layer, it would just be returning the raw
bytes. Existing contract tests cover "input bytes == output bytes". If
hdfs/s3a/azure/gcs get that part right, then the `hadoop fs -text` behavior
falls into place regardless of the FS.
However, maybe something we can do is expand contract coverage of reads in a
more generic way. For example, we could cover things like the bad argument
checks and "same data returned from read() vs. read(byte[]...)" during
`AbstractContractOpenTest`. Maybe some updates to the `FSDataInputStream`
specification docs too.
I'd recommend driving all this in a separate patch though.
> Optimize shell -text command I/O with multi-byte read.
> ------------------------------------------------------
>
> Key: HADOOP-19389
> URL: https://issues.apache.org/jira/browse/HADOOP-19389
> Project: Hadoop Common
> Issue Type: Improvement
> Components: command, fs, fs/azure, fs/gcs, fs/s3
> Affects Versions: 3.4.1
> Reporter: Chris Nauroth
> Assignee: Chris Nauroth
> Priority: Minor
> Labels: pull-request-available
>
> {{hadoop fs -text}} reads Avro files and sequence files by internally
> wrapping the stream in
> [{{AvroFileInputStream}}|https://github.com/apache/hadoop/blob/rel/release-3.4.1/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/shell/Display.java#L270]
> or
> [{{TextRecordInputStream}}|https://github.com/apache/hadoop/blob/rel/release-3.4.1/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/shell/Display.java#L217].
> These classes implement the required single-byte
> [{{read()}}|https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/io/InputStream.html#read()],
> but not the optional multi-byte buffered [{{read(byte[], int,
> int)}}|https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/io/InputStream.html#read(byte%5B%5D,int,int)].
> The default implementation in the JDK is a [loop over single-byte
> read|https://github.com/openjdk/jdk11u-dev/blob/a47c72fad455bfdf9053cb8e94c99e73965ab50d/src/java.base/share/classes/java/io/InputStream.java#L280],
> which causes sub-optimal I/O and method call overhead. We can optimize this
> by overriding the multi-byte read method.
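> A minimal sketch of the idea, using an illustrative stream class (not
> Hadoop's actual `AvroFileInputStream`): a stream that holds decoded bytes
> in an internal buffer can serve `read(byte[], int, int)` with a single
> `System.arraycopy` per call instead of the JDK default loop over `read()`.

```java
import java.io.InputStream;

// Illustrative only: a stream over pre-decoded bytes, standing in for the
// decoded Avro/JSON output that the shell command streams to stdout.
class BufferedDecodeStream extends InputStream {
    private final byte[] decoded;
    private int pos;

    BufferedDecodeStream(byte[] decoded) {
        this.decoded = decoded;
    }

    // Required single-byte read: one byte (as 0-255) or -1 at end of stream.
    @Override
    public int read() {
        return pos < decoded.length ? decoded[pos++] & 0xFF : -1;
    }

    // Optional multi-byte override: copies a whole chunk at once rather than
    // invoking read() len times, cutting per-byte method call overhead.
    @Override
    public int read(byte[] b, int off, int len) {
        if (b == null) {
            throw new NullPointerException();
        }
        if (off < 0 || len < 0 || len > b.length - off) {
            throw new IndexOutOfBoundsException();
        }
        if (len == 0) {
            return 0;
        }
        if (pos >= decoded.length) {
            return -1;
        }
        int n = Math.min(len, decoded.length - pos);
        System.arraycopy(decoded, pos, b, off, n);
        pos += n;
        return n;
    }
}
```

> The argument checks mirror the `InputStream.read(byte[], int, int)`
> contract; in the real patch the buffered data would come from the Avro or
> sequence-file decoder rather than a fixed array.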
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]