[ 
https://issues.apache.org/jira/browse/SPARK-48571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated SPARK-48571:
-----------------------------------
    Description: 
If we access a Spark table made of Parquet files on an object storage file system, the object store receives many requests that seem unnecessary. I will explain this with an example:

I have created a simple table with 3 files:

*business/t_filter/country=ES/data_date_part=2023-09-27/part-00000-0f52aae9-2db8-415e-93f3-8331539c0ead.c000*
*business/t_filter/country=ES/data_date_part=2023-06-01/part-00000-0f52aae9-2db8-415e-93f3-8331539c0ead.c000*
*business/t_filter/country=ES/data_date_part=2023-09-27/part-00000-f10096c1-53bc-4e2f-bc56-eba65acfa44a.c000*

and I have defined a table over business/t_filter, partitioned by country and data_date_part. Reading that table produces the following requests.
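For context, a table with this layout could be created roughly as follows. This is only an illustration: the schema, the DDL and the bucket name are my own placeholders, not taken from the report.

{code:java}
import org.apache.spark.sql.SparkSession;

public class CreateFilterTable {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("t_filter example")
        .getOrCreate();

    // Illustrative DDL; "s3a://bucket" is a placeholder.
    spark.sql("CREATE TABLE t_filter (id BIGINT, country STRING, data_date_part STRING) "
        + "USING parquet "
        + "PARTITIONED BY (country, data_date_part) "
        + "LOCATION 's3a://bucket/business/t_filter'");

    // Each read of the table triggers the LIST/HEAD/GET requests discussed below.
    spark.sql("SELECT count(*) FROM t_filter WHERE country = 'ES'").show();
    spark.stop();
  }
}
{code}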

If you use versions prior to Spark 3.5 or Hadoop 3.4 (in my case, exactly Spark 3.2 and Hadoop 3.1), the requests you get are the following -> IMAGE Spark 3.2 Hadoop 3.1

In this image we can see all the requests, among which the following problems can be found:
 * Two HEAD and two LIST requests are issued by the S3 implementation for the directories that contain the files, when a single LIST would suffice. This bug has already been resolved in -> HADOOP-18073 -> Result: IMAGE Spark 3.2 Hadoop 3.4
 * For each file, the Parquet footer is read twice. This bug is resolved in -> SPARK-42388 -> Result: IMAGE Spark 3.5 Hadoop 3.1
 * A HEAD Object request is issued twice each time a file is read. This could be reduced by extending the FileSystem interface so that it can receive the FileStatus that was already computed during the listing above (see the first sketch after this list).
 ** HADOOP-15229
 ** PARQUET-2493
 * The requests could also be reduced when reading the Parquet footer: first the footer length has to be read and then the footer itself (which contains the schema), which implies two HTTP/HTTPS requests to S3. It would be nice if there were a minimum threshold, for example 100 KB: if the file is smaller than that, the two requests are not needed and the entire file is fetched at once, since fetching 100 KB in a single request takes less time than one request for 8 bytes followed by another request for x KB. Even so, I don't know if this task makes sense.
 ** This would mean changing the implementation to take a configuration option: if it is set to -1 everything behaves as today, but if a threshold is set, files up to that size do not need the second seek, which currently repeats a GET Object (see the second sketch after this list).
[https://github.com/apache/parquet-java/blob/apache-parquet-1.14.0/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java]
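First sketch, for the FileStatus reuse: the openFile() builder added in HADOOP-15229 already lets a caller pass in a known FileStatus so that S3A can skip the extra HEAD. Something like the following, where the bucket name and class name are placeholders:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OpenWithKnownStatus {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path dir = new Path("s3a://bucket/business/t_filter/country=ES/data_date_part=2023-09-27");
    FileSystem fs = dir.getFileSystem(conf);

    // The LIST of the partition directory already returns a FileStatus
    // (including the file length) for every file in it.
    for (FileStatus status : fs.listStatus(dir)) {
      // Passing that status into openFile() means S3A does not need to
      // issue a HEAD Object just to rediscover the length of the object.
      try (FSDataInputStream in = fs.openFile(status.getPath())
          .withFileStatus(status)
          .build()
          .get()) {
        byte[] trailer = new byte[8];
        in.readFully(status.getLen() - 8, trailer); // read the Parquet trailer
      }
    }
  }
}
{code}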
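Second sketch, a minimal version of the threshold idea. The configuration key parquet.footer.read.threshold is an invented name for illustration, not an existing option:

{code:java}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;

public class FooterTailRead {

  // Invented key: -1 keeps today's two-request behaviour.
  static final String THRESHOLD_KEY = "parquet.footer.read.threshold";

  /** Reads the bytes containing the Parquet footer, using one GET for small files. */
  static byte[] readFooterBytes(FileSystem fs, FileStatus status, Configuration conf)
      throws IOException {
    long threshold = conf.getLong(THRESHOLD_KEY, -1L);
    long len = status.getLen();
    try (FSDataInputStream in = fs.open(status.getPath())) {
      if (threshold > 0 && len <= threshold) {
        // Small file: a single GET Object fetches everything, including the
        // trailing 8 bytes (4-byte footer length + "PAR1" magic).
        byte[] whole = new byte[(int) len];
        in.readFully(0, whole);
        return whole;
      }
      // Today's path: one request for the 8-byte trailer...
      byte[] trailer = new byte[8];
      in.readFully(len - 8, trailer);
      int footerLen = littleEndianInt(trailer, 0);
      // ...then a second request (a second seek) for the footer itself.
      byte[] footer = new byte[footerLen];
      in.readFully(len - 8 - footerLen, footer);
      return footer;
    }
  }

  static int littleEndianInt(byte[] b, int off) {
    return (b[off] & 0xFF)
        | (b[off + 1] & 0xFF) << 8
        | (b[off + 2] & 0xFF) << 16
        | (b[off + 3] & 0xFF) << 24;
  }
}
{code}

The trailer layout (a 4-byte little-endian footer length followed by the "PAR1" magic) is fixed by the Parquet format, so only the threshold handling is new here.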

 

With all these improvements, plus updating to the latest versions of Spark and Hadoop, the proposed example would go from more than 30 requests to 11.


> Reduce the number of accesses to S3 object storage
> --------------------------------------------------
>
>                 Key: SPARK-48571
>                 URL: https://issues.apache.org/jira/browse/SPARK-48571
>             Project: Spark
>          Issue Type: Task
>          Components: Spark Core
>    Affects Versions: 3.5.0
>            Reporter: Oliver Caballero Alvarez
>            Priority: Major
>         Attachments: Spark 3.2 Hadoop-aws 3.1.PNG, Spark 3.2 Hadoop-aws 
> 3.4.PNG, Spark 3.5 Hadoop-aws 3.1.PNG
>
>


