[ 
https://issues.apache.org/jira/browse/HADOOP-15364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16726843#comment-16726843
 ] 

Steve Loughran commented on HADOOP-15364:
-----------------------------------------

[~sameer.chouhdary]

BTW, the active dev branch is 
https://github.com/steveloughran/hadoop/tree/filesystem/HADOOP-15229-openfile

h2. > We need to improve error handling and resource closing.

S3A's Invoker class is used to wrap the IO calls:

 the initial [select 
call|https://github.com/steveloughran/hadoop/blob/filesystem/HADOOP-15229-openfile/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java#L3527]
 is retried with the [normal retry 
policy|https://github.com/steveloughran/hadoop/blob/filesystem/HADOOP-15229-openfile/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3ARetryPolicy.java].
 If there are specific errors which need extra retries or fast failure, that'd be handy to know about.
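
Roughly, the retry pattern looks like the sketch below: an operation is invoked under a policy which decides, per exception and attempt count, whether to try again. All names here are illustrative stand-ins, not the real Invoker/S3ARetryPolicy API, and the real code deals with IOExceptions rather than plain RuntimeExceptions:

```java
import java.util.function.Supplier;

public class RetrySketch {

    /** Decide whether to retry; the real S3ARetryPolicy also inspects the
     *  exception type so that unrecoverable errors fail fast. */
    interface RetryPolicy {
        boolean shouldRetry(RuntimeException e, int attempt);
    }

    /** Invoke the operation, retrying as long as the policy allows. */
    static <T> T retry(RetryPolicy policy, Supplier<T> operation) {
        int attempt = 0;
        while (true) {
            try {
                return operation.get();
            } catch (RuntimeException e) {
                if (!policy.shouldRetry(e, attempt++)) {
                    throw e;
                }
            }
        }
    }

    public static void main(String[] args) {
        // Simulate a call that fails twice, then succeeds; the policy
        // permits up to three attempts.
        int[] calls = {0};
        RetryPolicy threeAttempts = (e, attempt) -> attempt < 2;
        String result = retry(threeAttempts, () -> {
            if (calls[0]++ < 2) {
                throw new RuntimeException("transient failure");
            }
            return "ok";
        });
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```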

There's no attempt to recover from failures in a read() itself; the SDK 
exceptions are simply mapped/wrapped to IOEs:
https://github.com/steveloughran/hadoop/blob/802fcc9f4d80f6d11582c175efa57ed580d0b25d/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/select/SelectInputStream.java#L210

Looking at that code, I think we should be collecting stats on failures there.
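
The wrap-don't-recover idea can be sketched as below: a failure from the SDK layer is caught and rethrown as a standard java.io.IOException with the cause preserved, rather than retried. The class and method names are illustrative, not the actual SelectInputStream code:

```java
import java.io.IOException;

public class TranslateSketch {

    /** Stand-in for an AWS SDK runtime exception. */
    static class SdkClientException extends RuntimeException {
        SdkClientException(String msg) {
            super(msg);
        }
    }

    /** Wrap an SDK failure as an IOException, keeping it as the cause
     *  so callers still see the full stack trace. */
    static IOException translate(String operation, RuntimeException e) {
        return new IOException(operation + " failed: " + e.getMessage(), e);
    }

    /** A read() which maps SDK failures to IOEs instead of recovering. */
    static int read() throws IOException {
        try {
            throw new SdkClientException("connection reset");  // simulated
        } catch (SdkClientException e) {
            throw translate("read", e);
        }
    }

    public static void main(String[] args) {
        try {
            read();
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```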



h2. >  Documentation can be improved.

Always. See 
https://github.com/steveloughran/hadoop/blob/filesystem/HADOOP-15229-openfile/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/s3_select.md

h2. > Need integration and load tests with real S3 service.

The ITest* tests do this, [in this 
package|https://github.com/steveloughran/hadoop/tree/filesystem/HADOOP-15229-openfile/hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/select]

Test policy is covered in [S3a test 
policy|https://hadoop.apache.org/docs/r3.1.1/hadoop-aws/tools/hadoop-aws/testing.html].
 

Get set up for test runs first, and then think about what more can go in. Goal: 
the full test suite takes 6-8 minutes over a long-haul link, with in-config 
options to scale up operations if you want to. For S3 Select it'd be good to 
have some other standard source files to test against; landsat.csv.gz only 
stresses the gzip and CSV logic. Something with parseable timestamps, stored in 
bzip2, would be nice. These should be public & free: our developers have 
neither the time nor necessarily the funding to create/store these themselves.

One thing which isn't tested yet is handling of bad data: what if the CSV has 
inconsistent numbers of columns per row, mixes tabs and spaces, has 
inconsistent datatypes, has datetimes which can't all be parsed, etc.? Someone 
should really write those tests, if only to have more error messages and stack 
traces for the docs.



> Add support for S3 Select to S3A
> --------------------------------
>
>                 Key: HADOOP-15364
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15364
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Major
>         Attachments: HADOOP-15364-001.patch, HADOOP-15364-002.patch, 
> HADOOP-15364-004.patch
>
>
> Expect a PoC patch for this in a couple of days; 
> * it'll depend on an SDK update to work, plus a couple of other minor 
> changes
> * Adds command line option too 
> {code}
> hadoop s3guard select -header use -compression gzip -limit 100 \
>   s3a://landsat-pds/scene_list.gz \
>   "SELECT s.entityId FROM S3OBJECT s WHERE s.cloudCover = '0.0'"
> {code}
> For wider use we'll need to implement the HADOOP-15229 so that callers can 
> pass down the expression along with any other parameters



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
