[ https://issues.apache.org/jira/browse/HADOOP-15229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16725355#comment-16725355 ]

Steve Loughran commented on HADOOP-15229:
-----------------------------------------

* The CLI adds {{-inputformat}} and {{-outputformat}} options, both of which *only* accept {{csv}} as a valid argument. This sets things up for future formats.

 * SelectInputStream uses {{Invoker.once()}} to translate exceptions raised in the {{read()}} operation of the wrapped stream. It's no longer just an HTTP connection: it's a chain of paged results from the S3 Select call, and it can fail in new ways. Even socket exceptions get translated into AWS SDK exceptions; these are now converted back to some form of IOException.
 * {{SelectInputStream.seek()}} supports forward seeks of arbitrary distance by 
calling read().

The reason for supporting forward seeks: if the results of a select are split up (e.g. TextInputFormat splits the source), the standard file IO path is {{open(file).seek(start)}}. Implementing seek(), however inefficiently, stops that code from breaking. And it is inefficient: it just reads through the data, so it is O(data), and for remote access someone gets to pay for those bytes. But at least things now work. The count of bytes skipped is collected in the stream's output statistics, and hence in the FS statistics themselves.
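The forward-seek-by-read approach can be sketched roughly as follows. This is a simplified standalone model, not the actual SelectInputStream code; the class, field, and method names are invented for illustration:

{code:java}
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

/** Simplified model of a read-only stream implementing forward seek by reading. */
class ForwardSeekStream {
    private final InputStream in;
    private long pos = 0;
    private long bytesSkipped = 0;   // would feed the stream statistics

    ForwardSeekStream(InputStream in) { this.in = in; }

    int read() throws IOException {
        int b = in.read();
        if (b >= 0) pos++;
        return b;
    }

    /** Forward-only seek: O(distance), because every skipped byte is actually read. */
    void seek(long target) throws IOException {
        if (target < pos) throw new IOException("backwards seek unsupported");
        while (pos < target) {
            if (in.read() < 0) throw new EOFException("seek past end of stream");
            pos++;
            bytesSkipped++;
        }
    }

    long getPos() { return pos; }
    long getBytesSkipped() { return bytesSkipped; }
}
{code}

The key point is the loop in {{seek()}}: every skipped byte is a real read, which is why the byte count is worth surfacing in the statistics.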
 * The (private/unstable) FutureIOSupport code adds a {{WrappedIOException extends RuntimeException}} class, so that code inside a Java 8 lambda expression can catch an IOException and rethrow it; the FutureIOSupport unwrap code knows that a WrappedIOException is always to be unwrapped, and does so. This makes calling IO operations from the new Java futures APIs slightly easier, though still ugly.
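The wrap-in-lambda/unwrap-on-get pattern looks roughly like this. A standalone sketch of the idea only: the real class and its helper methods live in Hadoop and their signatures may differ:

{code:java}
import java.io.IOException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

/** Sketch of the wrap-in-lambda / unwrap-on-get pattern. */
class WrappedIOExceptionDemo {

    /** RuntimeException carrier for an IOException raised inside a lambda. */
    static class WrappedIOException extends RuntimeException {
        WrappedIOException(IOException cause) { super(cause); }
        IOException getIOException() { return (IOException) getCause(); }
    }

    /** Unwrap: if the future failed with a WrappedIOException, rethrow the original IOE. */
    static <T> T awaitFuture(CompletableFuture<T> future) throws IOException {
        try {
            return future.join();
        } catch (CompletionException e) {
            if (e.getCause() instanceof WrappedIOException) {
                throw ((WrappedIOException) e.getCause()).getIOException();
            }
            throw new IOException(e.getCause());
        }
    }
}
{code}

A caller would wrap inside the lambda, e.g. {{CompletableFuture.supplyAsync(() -> { try { return doIO(); } catch (IOException e) { throw new WrappedIOException(e); } })}}, then call {{awaitFuture()}} to get the original IOException back.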

 * fsdatainputstream.md: states that after a failed attempt to seek somewhere, the current position in the file may have changed. This was never written down before, though I'm sure the behaviour isn't new. I also called out that lazy seek is permitted: the new position must be compared against the currently known file limits, but there's no need to check that the file still exists or that its length is unchanged.

h2. Test enhancements

ITestS3SelectCLI: tests the new command arguments.

ITestS3Select: adds a test for seeking around and past the end of a file.
h3. ITestS3SelectLandsat

Adds a new test which is only run in a {{-Dscale}} run: it seeks through the landsat file. The first few MB are read into a byte array as reference data, then a new read is executed which compares values at offsets to verify that seek() really does read from the right place. It then performs a series of 1MB forward seeks until the original file length is reached, at which point it stops: the unzipped file is much bigger than the .gz one, so giving up early saves test time.
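The verification pattern in that test is essentially the following. A standalone sketch against an in-memory stream, not the actual ITestS3SelectLandsat code; names are invented:

{code:java}
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Sketch: check that seek-then-read agrees with reference data read up front. */
class SeekCheck {
    /**
     * Re-open the source, forward-seek by skipping, and compare the byte at each
     * offset against the reference copy.
     */
    static void checkOffsets(byte[] source, byte[] reference, long[] offsets)
            throws IOException {
        for (long offset : offsets) {
            InputStream in = new ByteArrayInputStream(source);  // fresh "open()"
            long skipped = in.skip(offset);                     // stands in for seek()
            if (skipped != offset) throw new IOException("short skip at " + offset);
            int actual = in.read();
            int expected = reference[(int) offset] & 0xff;
            if (actual != expected) {
                throw new IOException("mismatch at offset " + offset);
            }
        }
    }
}
{code}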
h3. ITestS3SelectMRJob

This is a new test which uses the landsat data as the input for an MR job. It takes about 5-6s over a long-haul link, a lot less than it would take if you had to download and expand all the data locally. It projects only one column, then, to count the instances of the different column values, just uses the normal wordcount tokenizer. All the data is written back to S3 via the S3A staging committer, just to round everything out.
 * SQL: {{SELECT s.processingLevel from S3OBJECT}}
 * mapper: WC tokenizer
 * reducer: WC tokenizer
 * committer: s3a staging

Output
{code:java}
L1GT 370231
L1T  689526
{code}
Takes under 10 seconds, BTW, over a long-haul link; it shows that the slower the link, the more benefit S3 Select offers.
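Conceptually, the counting side of the job reduces to a wordcount-style aggregation of the projected column values. A minimal standalone sketch of that aggregation, not the actual MR job code:

{code:java}
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

/** Sketch: count occurrences of projected column values, wordcount-style. */
class ColumnValueCount {
    static Map<String, Integer> count(Iterable<String> rows) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String row : rows) {
            // the WC tokenizer just splits each record on whitespace
            StringTokenizer tok = new StringTokenizer(row);
            while (tok.hasMoreTokens()) {
                counts.merge(tok.nextToken(), 1, Integer::sum);
            }
        }
        return counts;
    }
}
{code}

In the real job the map side emits each token with a count of 1 and the reduce side sums them; the sketch collapses both into one in-memory pass.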

h2. Other
 * Fix up the TestHDFSContractOpen test javadoc header; this ensures that Jenkins will test it.

Tested: S3 Ireland. This patch contains HADOOP-16015, so the S3 MR tests work again; all is well.

> Add FileSystem builder-based openFile() API to match createFile()
> -----------------------------------------------------------------
>
>                 Key: HADOOP-15229
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15229
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: fs, fs/azure, fs/s3
>    Affects Versions: 3.0.0
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Major
>         Attachments: HADOOP-15229-001.patch, HADOOP-15229-002.patch, 
> HADOOP-15229-003.patch, HADOOP-15229-004.patch, HADOOP-15229-004.patch, 
> HADOOP-15229-005.patch, HADOOP-15229-006.patch, HADOOP-15229-007.patch, 
> HADOOP-15229-009.patch, HADOOP-15229-010.patch, HADOOP-15229-011.patch, 
> HADOOP-15229-012.patch, HADOOP-15229-013.patch
>
>
> Replicate HDFS-1170 and HADOOP-14365 with an API to open files.
> A key requirement of this is not HDFS, it's to put in the fadvise policy for 
> working with object stores, where getting the decision to do a full GET and 
> TCP abort on seek vs smaller GETs is fundamentally different: the wrong 
> option can cost you minutes. S3A and Azure both have adaptive policies now 
> (first backward seek), but they still don't do it that well.
> Columnar formats (ORC, Parquet) should be able to say "fs.input.fadvise" 
> "random" as an option when they open files; I can imagine other options too.
> The Builder model of [~eddyxu] is the one to mimic, method for method. 
> Ideally with as much code reuse as possible


