Naresh created HADOOP-17984:
-------------------------------
Summary: hadoop-aws jar is unable to read a file from S3 when used
with a third-party store like MinIO
Key: HADOOP-17984
URL: https://issues.apache.org/jira/browse/HADOOP-17984
Project: Hadoop Common
Issue Type: Bug
Components: hadoop-thirdparty
Affects Versions: 3.2.0
Reporter: Naresh
Unable to read a file from S3 from Spark when the endpoint URL points to MinIO
within an EKS Kubernetes cluster. We are able to read/write from other clients
and from the MinIO console, but when we read using Spark we get an empty
DataFrame. Calling dataframe.show() displays the following:
++
||
++
++
*Spark Config:*
.config("spark.hadoop.fs.s3a.endpoint", "http://127.0.0.1:9000") // MinIO URL, or port-forward to local
.config("spark.hadoop.fs.s3a.access.key",<myaccesskey>)
.config("spark.hadoop.fs.s3a.secret.key",<mysecretkey>)
.config("spark.hadoop.fs.s3a.path.style.access", "true")
.config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
.config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
.config("fs.s3a.committer.staging.conflict-mode", "replace")
.config("fs.s3a.committer.name", "file")
.config("fs.s3a.committer.threads", "20")
.config("fs.s3a.threads.max", "20")
.config("fs.s3a.fast.upload.buffer", "bytebuffer")
.config("fs.s3a.fast.upload.active.blocks", "8")
.config("fs.s3a.block.size", "128M")
.config("mapred.input.dir.recursive","true")
.config("spark.sql.parquet.binaryAsString", "true")
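The S3A options that matter for a MinIO endpoint can be collected in one place. A minimal sketch in Python (the helper name and all endpoint/credential values are illustrative placeholders, not the reporter's actual values):

```python
# Minimal set of S3A options for talking to an S3-compatible store
# such as MinIO. All values below are illustrative placeholders.
def s3a_minio_conf(endpoint, access_key, secret_key):
    """Build a dict of Hadoop properties for an S3-compatible endpoint."""
    return {
        "fs.s3a.endpoint": endpoint,          # e.g. the in-cluster MinIO service
        "fs.s3a.access.key": access_key,
        "fs.s3a.secret.key": secret_key,
        # MinIO serves buckets at http://host/bucket/key rather than
        # bucket.host, so path-style access must be enabled explicitly.
        "fs.s3a.path.style.access": "true",
        "fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
        # Disable SSL when the endpoint is plain http://.
        "fs.s3a.connection.ssl.enabled": "false",
    }

conf = s3a_minio_conf("http://127.0.0.1:9000", "myaccesskey", "mysecretkey")
# When building a SparkSession, each entry is applied with the
# "spark.hadoop." prefix, e.g. .config("spark.hadoop." + key, value).
```

Note that when the endpoint is HTTP rather than HTTPS, fs.s3a.connection.ssl.enabled generally has to be set to false as well, or the client will attempt a TLS handshake against the plain-HTTP port.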
*JAR files:*
hadoop-aws:3.2.0
aws-java-sdk:1.12.30
spark-core_2.12:3.1.2
spark-sql_2.12:3.1.2
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]