[jira] [Commented] (HADOOP-14535) Support for random access and seek of block blobs

2017-07-10 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-14535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16081133#comment-16081133
 ] 

Steve Loughran commented on HADOOP-14535:
-

Patch 006. This is patch 005 with all the changes I suggested, particularly the 
tests.

The original test suite has a couple of operational flaws:
# it's slow
# it leaves 128MB files around. This can be expensive.

I've reworked it to use the same style as {{AbstractSTestS3AHugeFiles}}: 
ordered names guarantee the test cases are run in sequence, and the final test 
deletes the file. I've also downsized the file. 
This is lined up for HADOOP-14553, which ports a copy of the same test into 
Azure and runs tests in parallel. The tests in this method should be something 
which can be merged into that test, making it a {{scale}} test with a 
configurable dataset size.

Tested: new suite, yes. Remainder: in progress

{code}
---
Running org.apache.hadoop.fs.azure.TestBlockBlobInputStream
Tests run: 19, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 212.423 sec - 
in org.apache.hadoop.fs.azure.TestBlockBlobInputStream

Results :

Tests run: 19, Failures: 0, Errors: 0, Skipped: 0

[INFO] 
[INFO] BUILD SUCCESS
[INFO] 
[INFO] Total time: 03:37 min (Wall Clock)
[INFO] Finished at: 2017-07-10T21:46:59+01:00
[INFO] Final Memory: 46M/820M
[INFO] 
{code}


> Support for random access and seek of block blobs
> -
>
> Key: HADOOP-14535
> URL: https://issues.apache.org/jira/browse/HADOOP-14535
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: fs/azure
>Reporter: Thomas
>Assignee: Thomas
> Attachments: 
> 0001-Random-access-and-seek-imporvements-to-azure-file-system.patch, 
> 0003-Random-access-and-seek-imporvements-to-azure-file-system.patch, 
> 0004-Random-access-and-seek-imporvements-to-azure-file-system.patch, 
> 0005-Random-access-and-seek-imporvements-to-azure-file-system.patch, 
> HADOOP-14535-006.patch
>
>
> This change adds a seek-able stream for reading block blobs to the wasb:// 
> file system.
> If seek() is not used or if only forward seek() is used, the behavior of 
> read() is unchanged.
> That is, the stream is optimized for sequential reads by reading chunks (over 
> the network) in
> the size specified by "fs.azure.read.request.size" (default is 4 megabytes).
> If reverse seek() is used, the behavior of read() changes in favor of reading 
> the actual number
> of bytes requested in the call to read(), with some constraints.  If the size 
> requested is smaller
> than 16 kilobytes and cannot be satisfied by the internal buffer, the network 
> read will be 16
> kilobytes.  If the size requested is greater than 4 megabytes, it will be 
> satisfied by sequential
> 4 megabyte reads over the network.
> This change improves the performance of FSInputStream.seek() by not closing 
> and re-opening the
> stream, which for block blobs also involves a network operation to read the 
> blob metadata. Now
> NativeAzureFsInputStream.seek() checks if the stream is seek-able and moves 
> the read position.
> [^attachment-name.zip]
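
The read-size rules in the description above can be sketched roughly as follows. This is only an illustration, not the patch's code: the class, method, and constant names are hypothetical, and clamping the request to the range [16 KB, request size] is an assumption about how the stated constraints combine.

```java
// Hypothetical sketch of the read-size policy in the issue description.
// Names are illustrative; this is not the code in the patch.
final class ReadSizePolicySketch {
  /** Assumed floor for a network read after a reverse seek (16 KB). */
  static final int MIN_RANDOM_READ = 16 * 1024;
  /** Default of "fs.azure.read.request.size" as stated in the description (4 MB). */
  static final int DEFAULT_REQUEST_SIZE = 4 * 1024 * 1024;

  private ReadSizePolicySketch() {
  }

  /**
   * Size of the next network read.
   *
   * @param reverseSeekSeen whether a reverse seek() has happened on the stream
   * @param requested bytes requested by read() that the buffer cannot satisfy
   * @param requestSize configured value of fs.azure.read.request.size
   */
  static int nextNetworkReadSize(boolean reverseSeekSeen, int requested,
      int requestSize) {
    if (!reverseSeekSeen) {
      // Sequential or forward-only access: keep reading full chunks.
      return requestSize;
    }
    // After a reverse seek: read what was asked for, but no less than 16 KB
    // and no more than one request-sized chunk per network call.
    return Math.min(Math.max(requested, MIN_RANDOM_READ), requestSize);
  }
}
```

Under these assumptions, a request larger than 4 MB after a reverse seek would be served by repeated calls, each capped at the request size.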



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-14535) Support for random access and seek of block blobs

2017-07-10 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-14535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16080668#comment-16080668
 ] 

Steve Loughran commented on HADOOP-14535:
-

This is getting pretty close to going in; most of my feedback is test related. 
1+ iteration should be enough. Once this and HADOOP-14598 are in, I can do some 
downstream testing with real data.

h3. {{AzureNativeFileSystemStore}}

* How about changing the key of {{KEY_INPUT_STREAM_VERSION}} to 
"fs.azure.experimental.stream.version"? That'd be consistent with the 
"fs.s3a.experimental" term. 
* log @ debug choice of algorithm, to aid diagnostics
* {{retrieve()}} L2066: the {{PageBlobInputStream}} constructor already wraps 
StorageException with IOE. Retrieve doesn't need to catch and translate them, 
so should catch and then rethrow IOEs as is.

h3. {{BlockBlobInputStream}}

* a seek to the current position can be downgraded to a no-op; no need to close 
& reopen the stream
* you don't need to go {{this.}} when referencing fields. We expect our IDEs to 
colour code fields these days.
* can you have the {{else}} and the {{catch}} statements on the same line as 
the previous clauses closing "}".
* {{read(byte[] buffer, ..)}}. Use {{FSInputStream.validatePositionedReadArgs}} 
for validation, or at least as much of it as is relevant. FWIW, the order of 
checks matches that in InputStream.
* {{closeBlobInputStream}}: should {{blobInputStream=null}} be done in a 
{{finally}} clause so that it is guaranteed to be set (so making the call 
non-reentrant)?
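
To make two of the points above concrete (the no-op seek and clearing the field in a {{finally}} clause), here is a small sketch. It uses an in-memory stand-in rather than the real {{BlockBlobInputStream}}; the class name and the {{reopenCount}} counter are hypothetical and exist only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical in-memory stand-in for the stream under review.
class SeekSketchStream {
  private final byte[] data;
  private InputStream blobInputStream;
  private long pos;
  /** Counts simulated close-and-reopen operations, for illustration only. */
  int reopenCount;

  SeekSketchStream(byte[] data) {
    this.data = data;
    this.blobInputStream = new ByteArrayInputStream(data);
  }

  long getPos() {
    return pos;
  }

  boolean isOpen() {
    return blobInputStream != null;
  }

  void seek(long target) throws IOException {
    if (target < 0 || target > data.length) {
      throw new IOException("Cannot seek to " + target);
    }
    if (target == pos) {
      // Seek to the current position: nothing to do, keep the stream open.
      return;
    }
    closeBlobInputStream();
    int offset = (int) target;
    blobInputStream = new ByteArrayInputStream(data, offset, data.length - offset);
    reopenCount++;
    pos = target;
  }

  void closeBlobInputStream() throws IOException {
    if (blobInputStream != null) {
      try {
        blobInputStream.close();
      } finally {
        // Cleared even if close() throws, so a second call is harmless.
        blobInputStream = null;
      }
    }
  }
}
```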

h3. {{NativeAzureFileSystem}}

* L625: accidental typo in comment

h3. {{ContractTestUtils.java}}

revert move of {{elapsedTime()}} to a single line method, use multiline style 
for the new {{elapsedTimeMs()}}. 


h3. {{TestBlockBlobInputStream}}

# I like the idea of using the ratio as a way of comparing performance; it 
makes it independent of bandwidth.
# And I agree, you can't reliably assess real-world perf. But it would seem 
faster.
# Once HADOOP-14553 is in, this test would be uprated to a scale test; only 
executed with the -Dscale option, 
and configurable for larger sizes of data. No need to worry about it. I think 
the tests could perhaps even be moved into the 
[ITestAzureHugeFiles|https://github.com/steveloughran/hadoop/blob/azure/HADOOP-14553-testing/hadoop-tools/hadoop-azure/src/test/java/org/apache/hadoop/fs/azure/integration/ITestAzureHugeFiles.java]
 test, which forces a strict ordering of tests in junit, so can have one test 
to upload a file, one to delete, and some in between to play with reading and 
seeking.

for now

* {{TestBlockBlobInputStream}} to extend {{AbstractWasbTestBase}}. This will 
aid migration to the parallel test runner of HADOOP-14553
* {{TestBlockBlobInputStream}} teardown only closes one of the input streams.
* {{toMbps()}}:  would it be better or worse to do the *8 before the / 1000.0? 
Or, given these are floating point, moot?
* split {{testMarkSupported()}} into a separate test for each separate stream; 
assertion in {{validateMarkSupported}} to include some text.
* same for {{testSkipAndAvailableAndPosition}}
* {{testSequentialReadPerformance}}: are we confident that the {{v2ElapsedMs}} 
read time will always be >0? Otherwise that division will fail.
* {{testRandomRead}} and {{testSequentialRead}} to always close the output 
stream. Or save a reference to the stream into a field and have the @After 
teardown close it (quietly)
* {{validateMarkAndReset, validateSkipBounds}} to use 
{{GenericTestUtils.assertExceptionContains}} to validate caught exception, or
 {{LambdaTestUtils.intercept}} to structure expected failure. Have a look at 
other uses in the code for details. +Same in other tests.

{code}
try {
  seekCheck(in, dataLen + 3);
  Assert.fail("Seek after EOF should fail.");
} catch (IOException e) {
  GenericTestUtils.assertExceptionContains("Cannot seek after EOF", e);
}
{code}

LambdaTestUtils may seem a bit more convoluted

{code}
intercept(IOException.class, expected,
    new Callable<S3AEncryptionMethods>() {
      @Override
      public S3AEncryptionMethods call() throws Exception {
        return getAlgorithm(alg, key);
      }
    });
{code}

But it really comes out to play in Java 8:

{code}
intercept(IOException.class, expected,
    () -> getAlgorithm(alg, key));
{code}


That's why I'd recommend adopting it now.
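
Going back to the {{toMbps()}} and {{v2ElapsedMs}} points in the list above, a rough sketch of what I mean. The method bodies are assumptions, not the test's actual code; the names only echo the ones in the patch. Multiplying by 8 is a power-of-two scaling, so for realistic byte counts the order of operations should not change the result, and clamping the divisor to at least 1 ms is one possible way to keep the ratio finite.

```java
// Hypothetical helpers; bodies are assumed, only the names echo the test.
final class PerfMathSketch {
  private PerfMathSketch() {
  }

  /** Megabits per second, dividing by 1000.0 before multiplying by 8. */
  static double toMbpsDivideFirst(long bytes, long elapsedMs) {
    return bytes / 1000.0 * 8 / elapsedMs;
  }

  /** Megabits per second, multiplying by 8 before dividing by 1000.0. */
  static double toMbpsMultiplyFirst(long bytes, long elapsedMs) {
    return bytes * 8 / 1000.0 / elapsedMs;
  }

  /** Ratio of elapsed times, with the divisor clamped to at least 1 ms. */
  static double ratio(long v1ElapsedMs, long v2ElapsedMs) {
    return v1ElapsedMs / (double) Math.max(1L, v2ElapsedMs);
  }
}
```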





Other

h3. {{AzureBlobStorageTestAccount}}

* L96; think some tabs have snuck in.
* I have a problem in that every test run leaks wasb containers. Does this patch 
continue or even worsen the tradition?




[jira] [Commented] (HADOOP-14535) Support for random access and seek of block blobs

2017-07-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-14535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16071171#comment-16071171
 ] 

Hadoop QA commented on HADOOP-14535:


| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 15s{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m  0s{color} | {color:green} The patch appears to include 5 new or modified test files. {color} |
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  1m 26s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 15m 34s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 17m 21s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 56s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 38s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 17s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m  6s{color} | {color:green} trunk passed {color} |
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 15s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m  0s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 10m 56s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 10m 56s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  1m 54s{color} | {color:orange} root: The patch generated 28 new + 135 unchanged - 7 fixed = 163 total (was 142) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 34s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m  0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  5m 37s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 11s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  7m 50s{color} | {color:green} hadoop-common in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  1m 38s{color} | {color:green} hadoop-azure in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 32s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 95m 27s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker |  Image:yetus/hadoop:14b5c93 |
| JIRA Issue | HADOOP-14535 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12875336/0005-Random-access-and-seek-imporvements-to-azure-file-system.patch |
| Optional Tests |  asflicense  compile  javac  javadoc  mvninstall  mvnsite  unit  findbugs  checkstyle  |
| uname | Linux edf1f3040ee2 4.4.0-43-generic #63-Ubuntu SMP Wed Oct 12 13:48:03 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / fa1aaee |
| Default Java | 1.8.0_131 |
| findbugs | v3.1.0-RC1 |
| checkstyle | https://builds.apache.org/job/PreCommit-HADOOP-Build/12696/artifact/patchprocess/diff-checkstyle-root.txt |
|  Test Results | https://builds.apache.org/job/PreCommit-HADOOP-Build/12696/testReport/ |
| modules | C: hadoop-common-project/hadoop-common hadoop-tools/hadoop-azure U: . |
| Console output | https://builds.apache.org/job/PreCommit-HADOOP-Build/12696/console |
| Powered by | Apache Yetus 0.5.0-SNAPSHOT   http://yetus.apache.org |


This message was automatically generated.




[jira] [Commented] (HADOOP-14535) Support for random access and seek of block blobs

2017-06-26 Thread Thomas (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-14535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16063246#comment-16063246
 ] 

Thomas commented on HADOOP-14535:
-

Thanks for the feedback.  I will do the following and resubmit the patch in a 
few days:

1) Add logic so reverse seek followed by sequential read performs very well.  I 
will add internal buffering and remove the dependency on BufferedFSInputStream 
so the read buffer can grow back to the "fs.azure.read.request.size" (default 4 
MB) after a reverse seek.

2) Add tests to ensure functional and performance coverage of seek, read, and 
skip.

3) If required for testing, I will add metrics/instrumentation.




[jira] [Commented] (HADOOP-14535) Support for random access and seek of block blobs

2017-06-26 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-14535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16063006#comment-16063006
 ] 

Steve Loughran commented on HADOOP-14535:
-

Having delved into the Azure codebase, I think a test could be fitted into 
{{TestReadAndSeekPageBlobAfterWrite}}, hopefully just by re-using the file 
generated. Is that the same kind of blob you want to work with?

I don't see any uses of readFully() in that test BTW. Rather than seek/read 
sequences, a sequence of readFully() operations is more representative of 
column store access. Doing something there to mimic seek-near-end and then some 
near start would match that and line up for any other optimisations of readFully().

FWIW, here's a trace of some TPC-DS benchmark IO:

https://raw.githubusercontent.com/rajeshbalamohan/hadoop-aws-wrapper/master/stream_access_query_27_tpcds_200gb.log

A line like
{code}
.../98_0,readFully,17113131,0,0,17111727,342,44181435
{code}
means "in file 98_0 17113131 readFully(offset=17111727, bytes=342) duration 
= 44,181,435 nS".
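
For anyone wanting to crunch that log, a minimal parser sketch follows. The field positions are inferred from the single sample line above, so treat the layout as an assumption; the class name is hypothetical, and the three fields between the operation and the offset are left uninterpreted.

```java
// Hypothetical parser for one trace line; the layout is inferred from the
// sample above, and fields 2-4 are deliberately not interpreted.
final class TraceLineSketch {
  final String path;
  final String operation;
  final long offset;
  final int length;
  final long durationNanos;

  private TraceLineSketch(String path, String operation, long offset,
      int length, long durationNanos) {
    this.path = path;
    this.operation = operation;
    this.offset = offset;
    this.length = length;
    this.durationNanos = durationNanos;
  }

  static TraceLineSketch parse(String line) {
    String[] fields = line.split(",");
    if (fields.length != 8) {
      throw new IllegalArgumentException("Expected 8 fields: " + line);
    }
    return new TraceLineSketch(
        fields[0],                      // file path
        fields[1],                      // operation, e.g. readFully
        Long.parseLong(fields[5]),      // offset passed to the operation
        Integer.parseInt(fields[6]),    // number of bytes requested
        Long.parseLong(fields[7]));     // duration in nanoseconds
  }
}
```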

That's the seek pattern that this optimisation is clearly targeting; the 
regression we need to avoid is "byte 0 to EOF", which is what .gz processing 
involves.

I'll set up some of my downstream tests in 
https://github.com/hortonworks-spark/cloud-integration to do this in spark & 
going from .gz to ORC & parquet and then scanning; as this uses the actual 
libraries, it's a full integration test of the seek() code




[jira] [Commented] (HADOOP-14535) Support for random access and seek of block blobs

2017-06-26 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-14535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16062840#comment-16062840
 ] 

Steve Loughran commented on HADOOP-14535:
-

I've cut the HADOOP-14553 dependency as getting that test running is harder 
than expected




[jira] [Commented] (HADOOP-14535) Support for random access and seek of block blobs

2017-06-23 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-14535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16060926#comment-16060926
 ] 

Steve Loughran commented on HADOOP-14535:
-

In HADOOP-14553 I've added a new test which can create a large file and then do 
seeks around it. I'd like that test to be used as the regression test here.





[jira] [Commented] (HADOOP-14535) Support for random access and seek of block blobs

2017-06-19 Thread Thomas (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-14535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054670#comment-16054670
 ] 

Thomas commented on HADOOP-14535:
-

Thanks for the feedback.  I looked at ITestS3AInputStreamPerformance and will 
do something similar.  I do not have an Azure account with which I can share a 
file publicly, but I can write a test to generate the source for the test. I am 
currently working on a few other things, so won't be able to jump on this 
immediately.  Would you like to hold off on this change until the 
instrumentation and unit test are complete, or would end-to-end test results be 
sufficient motivation to move forward on this task while I continue to work on 
the other tasks?  

By the way, this work was done to address 
https://issues.apache.org/jira/browse/HADOOP-14478, which has a dependency on a 
change in the Azure Storage SDK for Java.  The ask was for the SDK to use 
InputStream.mark(readLimit) as a hint to disregard the default network read 
size and use readLimit instead.  Since this is not the intended use of mark, 
rather than pursue unusual dependencies between these two projects I provided 
the implementation in the patch as a solution.




[jira] [Commented] (HADOOP-14535) Support for random access and seek of block blobs

2017-06-19 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-14535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054280#comment-16054280
 ] 

Steve Loughran commented on HADOOP-14535:
-

Seek optimisation is the key way to speed up data input, especially for 
columnar storage formats.


The lessons from our work on S3AInputStream (which I recommend you look at) were:

* ignore no-ops

* lazy-seek, where you don't seek() until a read() call, really boosts 
performance when the IO is a series of readFully(pos, ) calls.

* any forward seeks you can do by discarding data are cost effective for tens 
to hundreds of KB (> 512KB long haul). It looks like you do this.

* optimisations you may do for random IO can be pathologically bad for full 
document reads (e.g. scanning an entire .csv.gz file). You have to measure 
performance on these formats as well as random IO. In S3 we rely on Amazon 
serving some 20MB public CSV.gz files, which we can abuse for this.

* it's really useful to instrument your streams to count how many bytes you 
discard in closing streams, skip in seeks, how many forward & backward seeks, 
etc. In S3A we track this in the stream (and {{toString()}} prints it), then we 
pull it back to the FS instrumentation (which is actually copied out of the 
Azure code originally). This helps us diagnose our tests, and make some assertions. 
Have a look at {{ITestS3AInputStreamPerformance}} for this.
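
To illustrate the lazy-seek, skip-forward and instrumentation points above, here is a toy sketch over an in-memory byte array. It is not S3AInputStream's code: the class name, counter names, and the 512 KB skip limit used as the threshold are all assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch: lazy seek with simple counters over in-memory data.
class LazySeekSketch {
  /** Assumed limit below which a forward seek is done by discarding data. */
  static final long FORWARD_SKIP_LIMIT = 512 * 1024;

  private final byte[] data;
  private InputStream in;
  private long streamPos;    // position of the currently open stream
  private long nextReadPos;  // position requested by seek(), applied lazily

  long forwardSeeks;
  long backwardSeeks;
  long bytesSkipped;
  long opens;

  LazySeekSketch(byte[] data) {
    this.data = data;
  }

  /** Records the target only; no I/O happens until the next read(). */
  void seek(long target) throws IOException {
    if (target < 0 || target > data.length) {
      throw new IOException("Cannot seek to " + target);
    }
    nextReadPos = target;
  }

  int read() throws IOException {
    if (nextReadPos >= data.length) {
      return -1;
    }
    reposition();
    int b = in.read();
    if (b >= 0) {
      streamPos++;
      nextReadPos++;
    }
    return b;
  }

  private void reposition() throws IOException {
    long diff = nextReadPos - streamPos;
    if (in != null && diff == 0) {
      return; // already in the right place: no-op
    }
    if (in != null && diff > 0 && diff <= FORWARD_SKIP_LIMIT) {
      // Short forward seek: discard bytes rather than reopen.
      long remaining = diff;
      while (remaining > 0) {
        long n = in.skip(remaining);
        if (n <= 0) {
          throw new IOException("Unable to skip forward");
        }
        remaining -= n;
      }
      forwardSeeks++;
      bytesSkipped += diff;
    } else {
      if (in != null) {
        if (diff < 0) {
          backwardSeeks++;
        } else {
          forwardSeeks++;
        }
        in.close();
      }
      int offset = (int) nextReadPos;
      in = new ByteArrayInputStream(data, offset, data.length - offset);
      opens++;
    }
    streamPos = nextReadPos;
  }

  @Override
  public String toString() {
    return "LazySeekSketch{opens=" + opens + ", forwardSeeks=" + forwardSeeks
        + ", backwardSeeks=" + backwardSeeks + ", bytesSkipped=" + bytesSkipped
        + "}";
  }
}
```

With counters like these, a test can assert on the number of reopens rather than on wall-clock time.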


At a quick glance of this patch, I can see you are starting on this. I'd 
recommend starting off with that instrumentation and setting up a test which 
anyone can use to test performance by working with a public (free, read-only) 
file. Why? Saves setup time, eliminates cost, very useful later on. With the 
measurements, then we can look at what's best to target in improving seek.

How about you create/own an uber JIRA on this topic, say "speed up Azure 
input", and add the tasks underneath?




[jira] [Commented] (HADOOP-14535) Support for random access and seek of block blobs

2017-06-16 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-14535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16052700#comment-16052700
 ] 

Hadoop QA commented on HADOOP-14535:


| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 13s{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m  0s{color} | {color:green} The patch appears to include 2 new or modified test files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 13m  9s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 19s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 14s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 21s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 27s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 14s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 17s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 16s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 16s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  0m 12s{color} | {color:orange} hadoop-tools/hadoop-azure: The patch generated 11 new + 107 unchanged - 7 fixed = 118 total (was 114) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 18s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m  0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 32s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 11s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  1m 19s{color} | {color:green} hadoop-azure in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
17s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 19m 31s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker |  Image:yetus/hadoop:14b5c93 |
| JIRA Issue | HADOOP-14535 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12873365/0004-Random-access-and-seek-imporvements-to-azure-file-system.patch |
| Optional Tests |  asflicense  compile  javac  javadoc  mvninstall  mvnsite  unit  findbugs  checkstyle  |
| uname | Linux ea4b5e638758 3.13.0-116-generic #163-Ubuntu SMP Fri Mar 31 14:13:22 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 6460df2 |
| Default Java | 1.8.0_131 |
| findbugs | v3.1.0-RC1 |
| checkstyle | https://builds.apache.org/job/PreCommit-HADOOP-Build/12554/artifact/patchprocess/diff-checkstyle-hadoop-tools_hadoop-azure.txt |
|  Test Results | https://builds.apache.org/job/PreCommit-HADOOP-Build/12554/testReport/ |
| modules | C: hadoop-tools/hadoop-azure U: hadoop-tools/hadoop-azure |
| Console output | https://builds.apache.org/job/PreCommit-HADOOP-Build/12554/console |
| Powered by | Apache Yetus 0.5.0-SNAPSHOT   http://yetus.apache.org |


This message was automatically generated.




[jira] [Commented] (HADOOP-14535) Support for random access and seek of block blobs

2017-06-16 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-14535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16052682#comment-16052682
 ] 

Hadoop QA commented on HADOOP-14535:


| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 19s{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m  0s{color} | {color:green} The patch appears to include 2 new or modified test files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 15m 55s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 18s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 15s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 21s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 28s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 14s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 17s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 16s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 16s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  0m 12s{color} | {color:orange} hadoop-tools/hadoop-azure: The patch generated 11 new + 107 unchanged - 7 fixed = 118 total (was 114) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 18s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red}  0m  0s{color} | {color:red} The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 32s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 11s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  1m 19s{color} | {color:green} hadoop-azure in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 19s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 22m 30s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker |  Image:yetus/hadoop:14b5c93 |
| JIRA Issue | HADOOP-14535 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12873363/0003-Random-access-and-seek-imporvements-to-azure-file-system.patch |
| Optional Tests |  asflicense  compile  javac  javadoc  mvninstall  mvnsite  unit  findbugs  checkstyle  |
| uname | Linux 18f4aaf639cf 3.13.0-116-generic #163-Ubuntu SMP Fri Mar 31 14:13:22 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 6460df2 |
| Default Java | 1.8.0_131 |
| findbugs | v3.1.0-RC1 |
| checkstyle | https://builds.apache.org/job/PreCommit-HADOOP-Build/12553/artifact/patchprocess/diff-checkstyle-hadoop-tools_hadoop-azure.txt |
| whitespace | https://builds.apache.org/job/PreCommit-HADOOP-Build/12553/artifact/patchprocess/whitespace-eol.txt |
|  Test Results | https://builds.apache.org/job/PreCommit-HADOOP-Build/12553/testReport/ |
| modules | C: hadoop-tools/hadoop-azure U: hadoop-tools/hadoop-azure |
| Console output | https://builds.apache.org/job/PreCommit-HADOOP-Build/12553/console |
| Powered by | Apache Yetus 0.5.0-SNAPSHOT   http://yetus.apache.org |


This message was automatically generated.




[jira] [Commented] (HADOOP-14535) Support for random access and seek of block blobs

2017-06-16 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-14535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16052596#comment-16052596
 ] 

Hadoop QA commented on HADOOP-14535:


| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 13s{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m  0s{color} | {color:green} The patch appears to include 2 new or modified test files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 13m 49s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 19s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 15s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 21s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 30s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 15s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 18s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 18s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 18s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  0m 11s{color} | {color:orange} hadoop-tools/hadoop-azure: The patch generated 11 new + 107 unchanged - 7 fixed = 118 total (was 114) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 19s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red}  0m  0s{color} | {color:red} The patch has 2 line(s) that end in whitespace. Use git apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red}  0m 36s{color} | {color:red} hadoop-tools/hadoop-azure generated 3 new + 0 unchanged - 0 fixed = 3 total (was 0) {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 12s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  1m 21s{color} | {color:green} hadoop-azure in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 17s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 20m 28s{color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| FindBugs | module:hadoop-tools/hadoop-azure |
|  |  Dead store to numberOfBytesRead in org.apache.hadoop.fs.azure.BlockBlobInputStream.read()  At BlockBlobInputStream.java:[line 212] |
|  |  Inconsistent synchronization of org.apache.hadoop.fs.azure.BlockBlobInputStream$MemoryOutputStream.writePosition; locked 83% of time  Unsynchronized access at BlockBlobInputStream.java:[line 268] |
|  |  Should org.apache.hadoop.fs.azure.BlockBlobInputStream$MemoryOutputStream be a _static_ inner class?  At BlockBlobInputStream.java:[lines 251-299] |
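
Two of the three FindBugs warnings above concern the shape of the {{MemoryOutputStream}} helper. The following Java sketch shows one common way to address such warnings. It is an illustration only, not the fix applied in later patches: the outer class {{FindBugsSketch}}, the fixed-capacity buffer and the {{size()}} accessor are assumptions made for this example.

```java
import java.io.OutputStream;

// Hypothetical outer class used only for this illustration.
final class FindBugsSketch {

  // Declared static, so instances hold no hidden reference to the
  // enclosing object (the "should be a static inner class" warning).
  static final class MemoryOutputStream extends OutputStream {
    private final byte[] buffer;
    // Every access to writePosition below happens in a synchronized
    // method, so the field is guarded consistently by the same lock
    // (the "inconsistent synchronization" warning).
    private int writePosition;

    MemoryOutputStream(int capacity) {
      this.buffer = new byte[capacity];
    }

    @Override
    public synchronized void write(int b) {
      if (writePosition >= buffer.length) {
        throw new IndexOutOfBoundsException("buffer full");
      }
      buffer[writePosition++] = (byte) b;
    }

    // Number of bytes written so far.
    synchronized int size() {
      return writePosition;
    }
  }
}
```

The dead-store warning is usually resolved by removing the assignment whose value is never read, or by actually using the value.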
\\
\\
|| Subsystem || Report/Notes ||
| Docker |  Image:yetus/hadoop:14b5c93 |
| JIRA Issue | HADOOP-14535 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12873347/0001-Random-access-and-seek-imporvements-to-azure-file-system.patch |
| Optional Tests |  asflicense  compile  javac  javadoc  mvninstall  mvnsite  unit  findbugs  checkstyle  |
| uname | Linux b71679b993f2 3.13.0-116-generic #163-Ubuntu SMP Fri Mar 31 14:13:22 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 82bbcbf |
| Default Java | 1.8.0_131 |
| findbugs | v3.1.0-RC1 |
| checkstyle | https://builds.apache.org/job/PreCommit-HADOOP-Build/12551/artifact/patchprocess/diff-checkstyle-hadoop-tools_hadoop-azure.txt |
| whitespace |