[jira] [Commented] (HADOOP-13007) cherry pick s3 ehancements from PrestoS3FileSystem

2024-01-04 Thread Shilun Fan (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-13007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17802984#comment-17802984
 ] 

Shilun Fan commented on HADOOP-13007:
-

Bulk update: moved all 3.4.0 non-blocker issues, please move back if it is a 
blocker. Retarget 3.5.0.

> cherry pick s3 ehancements from PrestoS3FileSystem
> --
>
> Key: HADOOP-13007
> URL: https://issues.apache.org/jira/browse/HADOOP-13007
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3
>Affects Versions: 2.9.0
>Reporter: Steve Loughran
>Priority: Minor
>
> Looking at 
> [https://github.com/prestodb/presto/blob/master/presto-hive/src/main/java/com/facebook/presto/hive/s3/PrestoS3FileSystem.java],
>  they've done some interesting things: configurable connection timeouts and, 
> retry options, statistics to count exceptions caught/re-opened, and more
> review them, if there is good stuff there, add it to S3a



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-13007) cherry pick s3 ehancements from PrestoS3FileSystem

2020-03-12 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-13007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17058166#comment-17058166
 ] 

Steve Loughran commented on HADOOP-13007:
-

h2. Presto features

I just had a big look through the code to see what is directory logic was.

* docs: 
https://prestosql.io/docs/current/connector/hive.html#amazon-s3-configuration
* source 
https://github.com/prestodb/presto/blob/master/presto-hive/src/main/java/com/facebook/presto/hive/s3/PrestoS3FileSystem.java

h3. Interesting points

* glacier objects can be skipped
* list always adds the trailing / for non-empty and then does the scan
* it doesn't create directories, so no need to recreate them


(I was about to say "demand create AWS client", but I misread it. Interesting 
thought though

h2. listLocatedStatus

does a list under path p + /

{code}
String key = keyFromPath(path);
if (!key.isEmpty()) {
key += PATH_SEPARATOR;
}

ListObjectsRequest request = new ListObjectsRequest()
.withBucketName(getBucketName(uri))
.withPrefix(key)
.withDelimiter(PATH_SEPARATOR);

{code}


mapping to stats can skip "GLACIER" == getStorageClass; they aren't visible at 
all.

{code}
private Iterator statusFromObjects(List 
objects)
{
// NOTE: for encrypted objects, S3ObjectSummary.size() used below is 
NOT correct,
// however, to get the correct size we'd need to make an additional 
request to get
// user metadata, and in this case it doesn't matter.
return objects.stream()
.filter(object -> !object.getKey().endsWith(PATH_SEPARATOR))
.filter(object -> !skipGlacierObjects || 
!isGlacierObject(object))
.map(object -> new FileStatus(
object.getSize(),
false,
1,
BLOCK_SIZE.toBytes(),
object.getLastModified().getTime(),
qualifiedPath(new Path(PATH_SEPARATOR + 
object.getKey()
.map(this::createLocatedFileStatus)
.iterator();
}
{code}

Note: this feeds into delete(); intentional or not

{code}
public boolean mkdirs(Path f, FsPermission permission)
{
// no need to do anything for S3
return true;
} 
{code}

* but getFileStatus does do: HEAD, HEAD + /, LIST + /

{code}
return new FileStatus(
getObjectSize(path, metadata),
S3_DIRECTORY_OBJECT_CONTENT_TYPE.equals(metadata.getContentType()),
1,
BLOCK_SIZE.toBytes(),
lastModifiedTime(metadata),
qualifiedPath(path));

{code}


* file length inferred from metadata -"x-amz-unencrypted-content-length" 
parsed. docs acknowlege that the results of a list are not consistent.
* dir marker content type used for "is this a dir"

h3. input stream

optimised for forward reads; 

* doesn't do a HEAD at open
* lazy seek
* skips range (default 1MB) before (non-draining) abort() and reopen
* simply opens with initial range in GET, but no limit
* maps 416 ->  EOFException 




h3. delete()

- doesn't do bulk deletes, just lists children and recursively calls delete on 
them (very, very inefficient)
-for recursive delete, if glacier files are skipped, they don't get deleted. 
Interesting idea


h3. Metrics

implements AWS SDK stats collection in 
com.facebook.presto.hive.s3.PrestoS3FileSystemMetricCollector; these feed back 
to the Presto metrics

h3. create()

if overwrite=true, skips all checks (even for dest being a dir). 
{code}
if ((!overwrite) && exists(path)) {
throw new IOException("File already exists:" + path);
}
{code}
Should we do that? it's, well, aggressive, but for apps which know what they 
are doing...

1. A new transfer manager is created for each output stream
1. The entire file is written to the staging dir
1. the transfer manager is given the file to upload; it will partition and 
upload how it chooses

so: no incremental writes, disk always used. But good simplicitly and no risk 
of partial uploads remaining around.
omitting all existence checks does make for a faster write and avoids all 404s 
sneaking in.

h2. Overall analysis

It's nicely minimal; optimised for directory trees with no markers. 
* input stream seems a lot less optimised than our code, which works better for 
backwards seeking clients.
* output stream avoids 404s by omitting all probes. Something to consider
* Metric wireup from AWS SDK looks simple enough for us to copy
* it does handle CSE where actual length < list length,  but caller has to do a 
HEAD to see this; if it runs off the result of listLocatedStatus and then did a 
seek off EOF-16, things would fail. Similarly: splitting.


What can we adopt?

# metrics
# maybe: dest is dir for overwrite=true
# dir content type get/check (ie 


{code}
isDir == (path.endswith"/ && len==0) || 
content_type==_"application/x-directory")
{code}



> 

[jira] [Commented] (HADOOP-13007) cherry pick s3 ehancements from PrestoS3FileSystem

2018-07-02 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-13007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16529981#comment-16529981
 ] 

Steve Loughran commented on HADOOP-13007:
-

thanks. 
Reviewing that code now, our retry logic is >= theirs, mostly.

They collect metrics from the AWS internals, which HADOOP-13551 has proposed

> cherry pick s3 ehancements from PrestoS3FileSystem
> --
>
> Key: HADOOP-13007
> URL: https://issues.apache.org/jira/browse/HADOOP-13007
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3
>Affects Versions: 2.9.0
>Reporter: Steve Loughran
>Priority: Minor
>
> Looking at 
> [https://github.com/prestodb/presto/blob/master/presto-hive/src/main/java/com/facebook/presto/hive/s3/PrestoS3FileSystem.java],
>  they've done some interesting things: configurable connection timeouts and, 
> retry options, statistics to count exceptions caught/re-opened, and more
> review them, if there is good stuff there, add it to S3a



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-13007) cherry pick s3 ehancements from PrestoS3FileSystem

2018-07-02 Thread Tsuyoshi Ozawa (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-13007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16529827#comment-16529827
 ] 

Tsuyoshi Ozawa commented on HADOOP-13007:
-

I fixed the link to the PrestoS3FileSystem.java. Now, it is under s3 directory.

> cherry pick s3 ehancements from PrestoS3FileSystem
> --
>
> Key: HADOOP-13007
> URL: https://issues.apache.org/jira/browse/HADOOP-13007
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3
>Affects Versions: 2.9.0
>Reporter: Steve Loughran
>Priority: Minor
>
> Looking at 
> [https://github.com/prestodb/presto/blob/master/presto-hive/src/main/java/com/facebook/presto/hive/s3/PrestoS3FileSystem.java],
>  they've done some interesting things: configurable connection timeouts and, 
> retry options, statistics to count exceptions caught/re-opened, and more
> review them, if there is good stuff there, add it to S3a



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org