[jira] [Commented] (HADOOP-15229) Add FileSystem builder-based openFile() API to match createFile()

Steve Loughran (JIRA) Fri, 07 Dec 2018 04:11:58 -0800


    [ 
https://issues.apache.org/jira/browse/HADOOP-15229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16712760#comment-16712760
 ]


Steve Loughran commented on HADOOP-15229:
-----------------------------------------

bq. Being able to change the properties of the stream later is more convenient

bq. I really don't think we should make a convoluted call back structure just 
so that we can have the options available at creation.


I think it depends on the option. If it is, say, {{fadvise}}, then yes, that's 
done on the open fd. But looking at Win32's 
[::createFile()|https://docs.microsoft.com/en-us/windows/desktop/api/fileapi/nf-fileapi-createfilea]
call, you can see how providing params at create time works too.

And if you look at {{FileSystem.create()}} we have 15 overloaded variants each 
with slightly different args, which is why {{createFile()}} went in to say 
"from now on, builder with the option of FS specific flags".
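
As a rough illustration of the builder shape described above, here is a minimal 
self-contained sketch. Every name in it ({{SketchFileSystem}}, {{OpenBuilder}}, 
{{OpenedFile}}, the {{fs.unknown.hint}} key) is a hypothetical stand-in, not the 
Hadoop API; the {{opt()}}/{{must()}} split is assumed here by analogy with the 
{{createFile()}} builder. It shows one entry point taking optional hints and 
mandatory flags, with unsupported mandatory flags rejected at build time.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for a filesystem; NOT the Hadoop FileSystem class. */
class SketchFileSystem {
    // Option keys this pretend filesystem understands (assumption for the sketch).
    private static final Set<String> SUPPORTED =
        new HashSet<>(Arrays.asList("fs.input.fadvise"));

    /** One entry point instead of a growing family of overloads. */
    OpenBuilder openFile(String path) {
        return new OpenBuilder(path);
    }

    /** Result of build(): the path plus the options that were accepted. */
    static final class OpenedFile {
        final String path;
        final Map<String, String> options;

        OpenedFile(String path, Map<String, String> options) {
            this.path = path;
            // Options are frozen once the file is opened.
            this.options = Collections.unmodifiableMap(new HashMap<>(options));
        }
    }

    static final class OpenBuilder {
        private final String path;
        private final Map<String, String> optional = new HashMap<>();
        private final Map<String, String> mandatory = new HashMap<>();

        OpenBuilder(String path) {
            this.path = path;
        }

        /** Hint: silently dropped if this filesystem does not know the key. */
        OpenBuilder opt(String key, String value) {
            optional.put(key, value);
            return this;
        }

        /** Requirement: build() fails if this filesystem does not know the key. */
        OpenBuilder must(String key, String value) {
            mandatory.put(key, value);
            return this;
        }

        OpenedFile build() {
            for (String key : mandatory.keySet()) {
                if (!SUPPORTED.contains(key)) {
                    throw new IllegalArgumentException(
                        "Unsupported mandatory option: " + key);
                }
            }
            Map<String, String> accepted = new HashMap<>();
            for (Map.Entry<String, String> e : optional.entrySet()) {
                if (SUPPORTED.contains(e.getKey())) {
                    accepted.put(e.getKey(), e.getValue());
                }
            }
            accepted.putAll(mandatory);
            return new OpenedFile(path, accepted);
        }
    }
}
```

A caller would write something like 
{{fs.openFile(path).opt("fs.input.fadvise", "random").build()}}; adding a new 
option later needs no new overload, and a {{must()}} on an unknown key fails 
before any read is attempted.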

For the S3 select stuff, that openfile flag really does determine what request 
is made to open the file: getFileStatus to get existence/length and then a GET 
from an offset once the first read/positioned read is kicked off. Delaying the 
options until later would be a nightmare (what if it gets changed more than 
once, after a single byte is read, etc.). Oh, and how does the app check so 
easily for full support of options at create time?

Also, and this is significant for object stores, we can execute all HTTP 
requests in a separate thread, letting the caller do things while it kicks off 
the open. Yes, your apps could do that anyway, but they have to know that open 
time can be very slow against an object store, have their own threadpool etc. 
And with a move to async IO in the underlying SDKs (e.g. AWS SDK 2.0), the 
filestore connectors can keep their thread pool sizes under control.
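
To make the asynchronous-open point concrete, here is a small sketch under 
stated assumptions: {{AsyncOpenSketch}}, {{probeLength()}} and {{openAsync()}} 
are hypothetical names, the sleep stands in for the slow HTTP round trip, and 
the fixed length returned is a placeholder. It only shows the shape: the open 
runs on a pool handed in by the connector and the caller gets a future back.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;

/** Hypothetical sketch; NOT a Hadoop or AWS SDK class. */
class AsyncOpenSketch {

    /** Simulates the slow existence/length probe against an object store. */
    static long probeLength(String path) {
        try {
            Thread.sleep(50); // stand-in for HTTP latency
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return 1024L; // placeholder length, not a real lookup
    }

    /**
     * Starts the open on the supplied pool and returns immediately.
     * The pool belongs to the connector, so its size stays under the
     * connector's control rather than each application's.
     */
    static CompletableFuture<Long> openAsync(String path, ExecutorService pool) {
        return CompletableFuture.supplyAsync(() -> probeLength(path), pool);
    }
}
```

The caller can carry on with other setup and only block, via {{join()}} or 
{{get()}}, when it actually needs the opened file.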

Now, you could add both: the ability to set params on open (here) and the 
ability to check/set args on a stream, but you'd be a lot more restricted as to 
what you could do.

> Add FileSystem builder-based openFile() API to match createFile()
> -----------------------------------------------------------------
>
>                 Key: HADOOP-15229
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15229
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: fs, fs/azure, fs/s3
>    Affects Versions: 3.0.0
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Major
>         Attachments: HADOOP-15229-001.patch, HADOOP-15229-002.patch, 
> HADOOP-15229-003.patch, HADOOP-15229-004.patch, HADOOP-15229-004.patch, 
> HADOOP-15229-005.patch, HADOOP-15229-006.patch, HADOOP-15229-007.patch, 
> HADOOP-15229-009.patch
>
>
> Replicate HDFS-1170 and HADOOP-14365 with an API to open files.
> A key requirement of this is not HDFS, it's to put in the fadvise policy for 
> working with object stores, where getting the decision to do a full GET and 
> TCP abort on seek vs smaller GETs is fundamentally different: the wrong 
> option can cost you minutes. S3A and Azure both have adaptive policies now 
> (first backward seek), but they still don't do it that well.
> Columnar formats (ORC, Parquet) should be able to say "fs.input.fadvise" 
> "random" as an option when they open files; I can imagine other options too.
> The Builder model of [~eddyxu] is the one to mimic, method for method. 
> Ideally with as much code reuse as possible



