[jira] [Comment Edited] (HADOOP-18884) [ABFS] Support VectorIO in ABFS Input Stream

Arnaud Nauwynck (Jira) Sat, 23 Nov 2024 09:25:47 -0800


    [ 
https://issues.apache.org/jira/browse/HADOOP-18884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17900604#comment-17900604
 ]


Arnaud Nauwynck edited comment on HADOOP-18884 at 11/23/24 5:24 PM:
--------------------------------------------------------------------

Please see the comment in the (duplicate) Jira issue [#HADOOP-19345]

Even if abfss did not support multiple GET requests, it would NOT be a problem 
to merge almost-consecutive read requests and ignore the data holes between them. 
Indeed, it is much more efficient to read 8 MB more in one Azure request than to 
open a new HTTPS connection (TCP/IP connection + TLS handshake + a small request, 
even one of 0 bytes).
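
The merging described above can be sketched roughly as follows. This is only an 
illustration, not the ABFS or Hadoop implementation: the function name 
merge_ranges, the (offset, length) tuple layout, and the max_gap / max_size 
parameters are all hypothetical. The 8 MB gap is taken from the figure above, and 
the 16 MB cap is an unconfirmed assumption.

```python
from typing import List, Tuple

MB = 1024 * 1024


def merge_ranges(
    ranges: List[Tuple[int, int]],
    max_gap: int = 8 * MB,    # assumption: tolerate holes of up to 8 MB
    max_size: int = 16 * MB,  # assumption: unconfirmed per-request size cap
) -> List[Tuple[int, int]]:
    """Coalesce (offset, length) read ranges whose gaps are small.

    Two ranges are merged when the hole between them is at most max_gap
    and the combined range stays within max_size. The bytes in the hole
    are read and then simply discarded by the caller.
    """
    merged: List[Tuple[int, int]] = []  # holds (start, end) pairs
    for offset, length in sorted(ranges):
        end = offset + length
        if merged:
            cur_start, cur_end = merged[-1]
            new_end = max(cur_end, end)
            if offset - cur_end <= max_gap and new_end - cur_start <= max_size:
                merged[-1] = (cur_start, new_end)
                continue
        merged.append((offset, end))
    # Convert back to (offset, length)
    return [(start, end - start) for start, end in merged]


# Three requested ranges: two close together, one far away.
requested = [(0, 1 * MB), (3 * MB, 1 * MB), (40 * MB, 2 * MB)]
combined = merge_ranges(requested)
print(combined)
```

With these inputs the first two ranges collapse into a single 4 MB read (including 
a 2 MB hole), while the range at 40 MB stays a separate request.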

Notice also that Azure requests are limited to 16 MB ( ? ), but are billed in 
multiples of 4 MB.
So if you read only 1 byte, you are billed for the full 4 MB anyway.

See the Azure documentation:
https://azure.microsoft.com/en-us/pricing/details/storage/blobs/
{noformat}
When using ADLS Gen2 API for transactions, read and write transactions occur 
for every 4 MB of data.
{noformat}
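
As a rough worked example of that 4 MB rule, the sketch below counts billed 
transactions for a read of a given size. It is a simplified model based only on 
the sentence quoted above. The helper name transactions_for_read is hypothetical, 
and treating any read as at least one transaction is an assumption. Actual Azure 
billing may differ.

```python
MB = 1024 * 1024
BILLING_UNIT = 4 * MB  # from the quoted pricing note: one transaction per 4 MB


def transactions_for_read(num_bytes: int, unit: int = BILLING_UNIT) -> int:
    """Number of billed transactions under a simple 'one per 4 MB' model.

    Assumption: every request counts as at least one transaction,
    even when it reads a single byte.
    """
    if num_bytes < 0:
        raise ValueError("num_bytes must be non-negative")
    # Integer ceiling division, with a floor of one transaction per request
    return max(1, -(-num_bytes // unit))


# Three separate 1-byte reads versus one merged read covering 4 MB.
separate = sum(transactions_for_read(1) for _ in range(3))
merged = transactions_for_read(4 * MB)
print(separate, merged)
```

Under this model, three tiny separate reads cost three transactions, while one 
merged 4 MB read that covers all of them costs a single transaction.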






> [ABFS] Support VectorIO in ABFS Input Stream
> --------------------------------------------
>
>                 Key: HADOOP-18884
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18884
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/azure
>    Affects Versions: 3.3.9
>            Reporter: Steve Loughran
>            Assignee: Anmol Asrani
>            Priority: Major
>
> The Hadoop vector IO APIs are supported in file:// and s3a://; there's a Hive 
> ORC patch for this, and PARQUET-2171 adds it for Parquet, after which all apps 
> using the library with a matching Hadoop version and the feature enabled will 
> get a significant speedup.
> abfs needs to support it too, which needs support for parallel GET requests for 
> different ranges.


