[ https://issues.apache.org/jira/browse/HADOOP-14965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17512529#comment-17512529 ]
Steve Loughran commented on HADOOP-14965:
-----------------------------------------

[~yzhangal] what is the application reading the data doing?
* is it doing random IO, sequential, or forward-with
* which APIs is it using (seek and read, readFully, etc)

"normal" is sequential, except on a backwards seek, when it switches to random. So the only way this would be worse is if you read something at the end of a file, then went back and did a large read to the end of the file. Is this what is happening?

Also, what is the value of "fs.s3a.readahead.range"? Try a larger number (say 512k) and see what it does.

> s3a input stream "normal" fadvise mode to be adaptive
> -----------------------------------------------------
>
>                 Key: HADOOP-14965
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14965
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 2.8.1
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Major
>             Fix For: 3.1.0, 3.0.1
>
>         Attachments: HADOOP-14965-001.patch, HADOOP-14965-002.patch,
> HADOOP-14965-003.patch, HADOOP-14965-004.patch
>
>
> HADOOP-14535 added seek optimisation to wasb, but rather than require the
> caller to declare sequential vs random, it works it out for itself.
> # defaults to sequential, lazy seek
> # if the caller ever seeks backwards, switches to random IO.
> This means that on the use pattern of columnar stores (go to end of file,
> read summary, then go to columns and work forwards), the stream will switch
> to random IO after that first seek back (cost: one aborted HTTP connection).
> Where this should benefit the most is in downstream apps where you are
> working with different data sources in the same object store/running off the
> same app config, but have different read patterns.
> I'm seeing exactly this in some of my spark tests, where it's near impossible
> to set things up so that .gz files are read sequentially, but ORC data is
> read in random IO.
> I propose the "normal" fadvise => adaptive, sequential => sequential always,
> random => random from the outset.
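
The proposed switching rule can be sketched as a toy model. This is only an illustration of the behaviour described in the issue, not the S3A input stream code; the names `Policy` and `AdaptiveStream` are hypothetical, and the model tracks nothing but the read position and the current mode.

```python
from enum import Enum


class Policy(Enum):
    # The three fadvise settings named in the issue description.
    NORMAL = "normal"
    SEQUENTIAL = "sequential"
    RANDOM = "random"


class AdaptiveStream:
    """Toy model of the proposed fadvise behaviour (hypothetical, not S3A code)."""

    def __init__(self, policy: Policy):
        self.policy = policy
        # "random" is random from the outset; "normal" and "sequential"
        # both start out sequential.
        self.mode = Policy.RANDOM if policy is Policy.RANDOM else Policy.SEQUENTIAL
        self.pos = 0

    def seek(self, target: int) -> None:
        # Only "normal" adapts, and only when the caller seeks backwards.
        # Once switched, it stays in random mode.
        if self.policy is Policy.NORMAL and target < self.pos:
            self.mode = Policy.RANDOM
        self.pos = target


# Columnar-store pattern: read the footer first, then go back to the columns.
stream = AdaptiveStream(Policy.NORMAL)
stream.seek(1_000_000)    # forward seek to the footer: still sequential
stream.seek(4_096)        # backward seek to the first column: switches
print(stream.mode.value)  # prints "random"
```

A stream created with `Policy.SEQUENTIAL` never leaves sequential mode in this model, whatever the seek pattern, which matches the "sequential => sequential always" part of the proposal.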