[jira] [Updated] (HADOOP-17984) Hadoop-aws jar is unable to read file from S3 if used with third party like MINIO

Naresh (Jira) Thu, 28 Oct 2021 14:09:05 -0700


     [ 
https://issues.apache.org/jira/browse/HADOOP-17984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Naresh updated HADOOP-17984:
----------------------------
    Description: 
Unable to read a file from S3 with Spark when the endpoint URL points to MinIO 
running within an EKS Kubernetes cluster. We are able to read and write from 
other clients and from the MinIO console, but when we read using Spark the 
resulting DataFrame is empty. dataframe.show() displays the following:

 

++
||
++
++

 

*Spark Config:*

.config("spark.hadoop.fs.s3a.endpoint", "http://127.0.0.1:9000") // minio url or port-forward to local
.config("spark.hadoop.fs.s3a.access.key", <myaccesskey>)
.config("spark.hadoop.fs.s3a.secret.key", <mysecretkey>)
.config("spark.hadoop.fs.s3a.path.style.access", true)
.config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
.config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
.config("fs.s3a.committer.staging.conflict-mode", "replace")
.config("fs.s3a.committer.name", "file")
.config("fs.s3a.committer.threads", "20")
.config("fs.s3a.threads.max", "20")
.config("fs.s3a.fast.upload.buffer", "bytebuffer")
.config("fs.s3a.fast.upload.active.blocks", "8")
.config("fs.s3a.block.size", "128M")
.config("mapred.input.dir.recursive", "true")
.config("spark.sql.parquet.binaryAsString", "true")
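
The Spark configuration documentation describes the spark.hadoop. prefix as the way to hand a property to the Hadoop configuration (spark.hadoop.abc.def=xyz adds abc.def=xyz). The settings above mix prefixed keys with bare fs.s3a.* keys. The Java sketch below only separates the two groups so each can be checked by hand; the class and method names are hypothetical, and it makes no claim about whether the bare keys take effect in a given Spark version.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Spark or Hadoop.
class SparkHadoopKeys {
    static final String PREFIX = "spark.hadoop.";

    /** Maps each "spark.hadoop.X" key to the Hadoop property name "X". */
    static Map<String, String> hadoopNames(List<String> sparkKeys) {
        Map<String, String> names = new LinkedHashMap<>();
        for (String key : sparkKeys) {
            if (key.startsWith(PREFIX)) {
                names.put(key, key.substring(PREFIX.length()));
            }
        }
        return names;
    }

    /** Returns S3A keys that were passed without the "spark.hadoop." prefix. */
    static List<String> bareS3aKeys(List<String> sparkKeys) {
        List<String> bare = new ArrayList<>();
        for (String key : sparkKeys) {
            if (key.startsWith("fs.s3a.")) {
                bare.add(key);
            }
        }
        return bare;
    }

    public static void main(String[] args) {
        // A few of the keys from the report above.
        List<String> keys = List.of(
                "spark.hadoop.fs.s3a.endpoint",
                "spark.hadoop.fs.s3a.path.style.access",
                "fs.s3a.committer.name",
                "fs.s3a.block.size",
                "spark.sql.parquet.binaryAsString");
        System.out.println("Prefixed: " + hadoopNames(keys));
        System.out.println("Bare S3A keys: " + bareS3aKeys(keys));
    }
}
```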

 

 

*JAR files:*

hadoop-aws:3.2.0

aws-java-sdk:1.12.30

spark-core_2.12:3.1.2

spark-sql_2.12:3.1.2

 

*Logs:*

DEBUG S3AFileSystem:2121: Getting path status for s3a://<mybucket>/<myfolder>/2021/test1_2021-03-23_15_21_31.592.csv  (2021/test1_2021-03-23_15_21_31.592.csv)
21/10/28 16:52:34 DEBUG S3AStorageStatistics:63: object_metadata_requests += 1  ->  1
21/10/28 16:52:34 DEBUG S3AFileSystem:2189: Found exact file: normal file
21/10/28 16:52:34 DEBUG S3AStorageStatistics:63: op_exists += 1  ->  1
21/10/28 16:52:34 DEBUG S3AStorageStatistics:63: op_get_file_status += 1  ->  2
21/10/28 16:52:34 DEBUG S3AFileSystem:2121: Getting path status for s3a://mybbucket/myfolder/test1_2021-03-23_15_21_31.592.csv  (2021/test1_2021-03-23_15_21_31.592.csv)
21/10/28 16:52:34 DEBUG S3AStorageStatistics:63: object_metadata_requests += 1  ->  2
21/10/28 16:52:34 DEBUG S3AFileSystem:2189: Found exact file: normal file
21/10/28 16:52:34 DEBUG S3AFileSystem:1899: List status for path: s3a://mybbucket/myfolder/test1_2021-03-23_15_21_31.592.csv
21/10/28 16:52:34 DEBUG S3AStorageStatistics:63: op_list_status += 1  ->  1
21/10/28 16:52:34 DEBUG S3AStorageStatistics:63: op_get_file_status += 1  ->  3
21/10/28 16:52:34 DEBUG S3AFileSystem:2121: Getting path status for s3a://mybbucket/myfolder//test1_2021-03-23_15_21_31.592.csv  (2021/test1_2021-03-23_15_21_31.592.csv)
21/10/28 16:52:34 DEBUG S3AStorageStatistics:63: object_metadata_requests += 1  ->  3
21/10/28 16:52:34 DEBUG S3AFileSystem:2189: Found exact file: normal file
21/10/28 16:52:34 DEBUG S3AFileSystem:1930: Adding: rd (not a dir): s3a://mybbucket/myfolder//test1_2021-03-23_15_21_31.592.csv
21/10/28 16:52:34 DEBUG S3AStorageStatistics:63: op_is_directory += 1  ->  2
21/10/28 16:52:34 DEBUG S3AStorageStatistics:63: op_get_file_status += 1  ->  4
21/10/28 16:52:34 DEBUG S3AFileSystem:2121: Getting path status for s3a://mybbucket/myfolder//test1_2021-03-23_15_21_31.592.csv  (2021/test1_2021-03-23_15_21_31.592.csv)
21/10/28 16:52:34 DEBUG S3AStorageStatistics:63: object_metadata_requests += 1  ->  4
21/10/28 16:52:34 DEBUG S3AFileSystem:2189: Found exact file: normal file
21/10/28 16:52:34 DEBUG S3AFileSystem:1899: List status for path: s3a://mybbucket/myfolder//test1_2021-03-23_15_21_31.592.csv
21/10/28 16:52:34 DEBUG S3AStorageStatistics:63: op_list_status += 1  ->  2
21/10/28 16:52:34 DEBUG S3AStorageStatistics:63: op_get_file_status += 1  ->  5
21/10/28 16:52:34 DEBUG S3AFileSystem:2121: Getting path status for s3a://mybbucket/myfolder/test1_2021-03-23_15_21_31.592.csv  (2021/test1_2021-03-23_15_21_31.592.csv)
21/10/28 16:52:34 DEBUG S3AStorageStatistics:63: object_metadata_requests += 1  ->  5
21/10/28 16:52:34 DEBUG S3AFileSystem:2189: Found exact file: normal file
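
To make a log excerpt like the one above easier to scan, the S3AStorageStatistics lines can be reduced to the last value reported for each counter. The Java sketch below does that, assuming each log record has been joined onto a single line and follows the "name += n  ->  total" shape shown in the excerpt; the class and method names are hypothetical. In this excerpt only metadata and listing counters appear, which a tally like this may help confirm at a glance.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of hadoop-aws.
class S3aCounterTally {
    // Matches e.g. "DEBUG S3AStorageStatistics:63: op_exists += 1  ->  1"
    private static final Pattern COUNTER = Pattern.compile(
            "S3AStorageStatistics:\\d+: (\\w+) \\+= \\d+\\s+->\\s+(\\d+)");

    /** Returns the last reported total of each counter, in first-seen order. */
    static Map<String, Long> lastValues(List<String> lines) {
        Map<String, Long> counters = new LinkedHashMap<>();
        for (String line : lines) {
            Matcher m = COUNTER.matcher(line);
            if (m.find()) {
                // A later line for the same counter overwrites the earlier total.
                counters.put(m.group(1), Long.parseLong(m.group(2)));
            }
        }
        return counters;
    }

    public static void main(String[] args) {
        // A few records copied from the excerpt above.
        List<String> sample = List.of(
                "21/10/28 16:52:34 DEBUG S3AStorageStatistics:63: object_metadata_requests += 1  ->  1",
                "21/10/28 16:52:34 DEBUG S3AFileSystem:2189: Found exact file: normal file",
                "21/10/28 16:52:34 DEBUG S3AStorageStatistics:63: op_exists += 1  ->  1",
                "21/10/28 16:52:34 DEBUG S3AStorageStatistics:63: op_get_file_status += 1  ->  2",
                "21/10/28 16:52:34 DEBUG S3AStorageStatistics:63: object_metadata_requests += 1  ->  2");
        lastValues(sample).forEach((name, total) -> System.out.println(name + " = " + total));
    }
}
```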

 

++

||

++

++



> Hadoop-aws jar is unable to read file from S3 if used with third party like 
> MINIO
> ---------------------------------------------------------------------------------
>
>                 Key: HADOOP-17984
>                 URL: https://issues.apache.org/jira/browse/HADOOP-17984
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: hadoop-thirdparty
>    Affects Versions: 3.2.0
>            Reporter: Naresh
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
