[ 
https://issues.apache.org/jira/browse/HADOOP-15229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16748145#comment-16748145
 ] 

Gabor Bota commented on HADOOP-15229:
-------------------------------------

Hey [~ste...@apache.org], thanks for working on this. I've just reviewed the 
patch, but I haven't played around with the API yet. It seems like a very 
useful feature; I can't wait to see how other components can use what it 
provides.

 

*Two issues I've found in s3_select.md*

1. The following text:
{noformat}
+Most of the Hadoop RecordReaders automatically choose a decompressor
+based on the extension of the source file. This causes problems when[...]
{noformat}
appears in the docs twice; the second occurrence may not be in the right 
place. It appears after the subtitle
{noformat}
 +### How to disable the GZip decompressor when querying Gzipped source files. 
{noformat}
and also under
{noformat}
 +### How to Disable Text File Splitting
{noformat}
Is this on purpose? (Under {{+### How to Disable Text File Splitting}} it 
seems to appear mid-sentence.)

2. Under the subtitle
{noformat}
+### "mid-query" failures on large datasets
{noformat}
there's a sentence without an ending:
{noformat}
+may only surface partway through the read. This does not result in
{noformat}
----
 

*A question on the feature itself and compatibility with object stores*
This won't work with third-party object stores that expose an S3 interface 
but do not support this feature, such as Ceph radosgw. If the feature is 
enabled against an object store that does not support it, what is the 
expected behavior?
(The configuration should be fs.s3a.select.enabled=false in that case, if I'm 
correct.) I can test this if needed.
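
For reference, a minimal sketch of how that setting might look in 
core-site.xml, assuming the property name above is correct and that a value of 
false disables the feature:
{noformat}
<property>
  <name>fs.s3a.select.enabled</name>
  <value>false</value>
</property>
{noformat}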

 

Tested against us-west-2. No new failures (other than what's discussed under 
HADOOP-16057); there were a few timeouts, but they cleared after a rerun.

> Add FileSystem builder-based openFile() API to match createFile() + S3 Select
> -----------------------------------------------------------------------------
>
>                 Key: HADOOP-15229
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15229
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: fs, fs/azure, fs/s3
>    Affects Versions: 3.2.0
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Major
>         Attachments: HADOOP-15229-001.patch, HADOOP-15229-002.patch, 
> HADOOP-15229-003.patch, HADOOP-15229-004.patch, HADOOP-15229-004.patch, 
> HADOOP-15229-005.patch, HADOOP-15229-006.patch, HADOOP-15229-007.patch, 
> HADOOP-15229-009.patch, HADOOP-15229-010.patch, HADOOP-15229-011.patch, 
> HADOOP-15229-012.patch, HADOOP-15229-013.patch, HADOOP-15229-014.patch, 
> HADOOP-15229-015.patch, HADOOP-15229-016.patch, HADOOP-15229-017.patch, 
> HADOOP-15229-018.patch, HADOOP-15229-019.patch
>
>
> Replicate HDFS-1170 and HADOOP-14365 with an API to open files.
> A key requirement of this is not HDFS, it's to put in the fadvise policy for 
> working with object stores, where getting the decision to do a full GET and 
> TCP abort on seek vs smaller GETs is fundamentally different: the wrong 
> option can cost you minutes. S3A and Azure both have adaptive policies now 
> (first backward seek), but they still don't do it that well.
> Columnar formats (ORC, Parquet) should be able to say "fs.input.fadvise" 
> "random" as an option when they open files; I can imagine other options too.
> The Builder model of [~eddyxu] is the one to mimic, method for method. 
> Ideally with as much code reuse as possible



