GitHub user squalud edited a discussion: Does Gluten support accessing S3 based on
'path-style'?
I use Alluxio's proxy to provide an S3 interface.
By setting `spark.hadoop.fs.s3a.endpoint` to
`http://<alluxio-proxy-service-name>:39999/api/v1/s3/` and setting
`spark.hadoop.fs.s3a.path.style.access` to `true` (i.e. path-style access),
I can successfully read CSV files with PySpark through URLs of the form
`s3a://data/tmp/file.csv`:
```
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Read S3 Data in PySpark") \
    .remote("sc://xx.xx.xx.xx:15002") \
    .config("spark.hadoop.fs.s3a.endpoint",
            "http://<alluxio-proxy-service-name>:39999/api/v1/s3/") \
    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
    .getOrCreate()

csv_path = "s3a://data/tmp/file.csv"
df_csv = spark.read.csv(csv_path, header=True)
df_csv.show()
```
But when I switch to Gluten and set
`spark.gluten.sql.native.arrow.reader.enabled` to `true` so that Arrow's reader
is used, I get an error:
```
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Read S3 Data in PySpark") \
    .remote("sc://xx.xx.xx.xx:15002") \
    .config("spark.hadoop.fs.s3a.endpoint",
            "http://<alluxio-proxy-service-name>:39999/api/v1/s3/") \
    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
    .config("spark.gluten.enabled", "true") \
    .config("spark.gluten.sql.native.arrow.reader.enabled", "true") \
    .config("spark.plugins", "org.apache.gluten.GlutenPlugin") \
    .getOrCreate()

csv_path = "s3a://data/tmp/file.csv"
df_csv = spark.read.csv(csv_path, header=True)
df_csv.show()
```
```
SparkConnectGrpcException: (org.apache.spark.SparkException) Job aborted due to stage failure: Task 0 in stage 6.0 failed 4 times, most recent failure: Lost task 0.3 in stage 6.0 (TID 27) (xx.xx.xx.xx executor 2): org.apache.gluten.exception.GlutenException: org.apache.gluten.exception.GlutenException: Error during calling Java code from native code: org.apache.gluten.exception.GlutenException: org.apache.gluten.exception.GlutenException: Exception: VeloxRuntimeError
Error Source: RUNTIME
Error Code: INVALID_STATE
Reason: Operator::getOutput failed for [operator: ValueStream, plan node ID: 0]: Error during calling Java code from native code: java.lang.RuntimeException: When getting information for key 'tmp/file.csv' in bucket 'data': AWS Error ACCESS_DENIED during HeadObject operation: No response body.
	at org.apache.arrow.dataset.file.JniWrapper.makeFileSystemDatasetFactory(Native Method)
	at org.apache.arrow.dataset.file.FileSystemDatasetFactory.createNative(FileSystemDatasetFactory.java:53)
	at org.apache.arrow.dataset.file.FileSystemDatasetFactory.<init>(FileSystemDatasetFactory.java:34)
	at org.apache.gluten.utils.ArrowUtil$.makeArrowDiscovery(ArrowUtil.scala:128)
	at org.apache.gluten.utils.ArrowUtil$.readArrowSchema(ArrowUtil.scala:139)
	at org.apache.gluten.utils.ArrowUtil$.readArrowFileColumnNames(ArrowUtil.scala:152)
	at org.apache.gluten.datasource.ArrowCSVFileFormat.$anonfun$buildReaderWithPartitionValues$3(ArrowCSVFileFormat.scala:128)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:217)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:279)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:129)
	at org.apache.spark.sql.execution.ArrowFileSourceScanExec$$anon$1.hasNext(ArrowFileSourceScanExec.scala:48)
	at org.apache.gluten.iterator.IteratorsV1$ReadTimeAccumulator.hasNext(IteratorsV1.scala:127)
	at scala.collection.Iterator$$anon$10.hasNext(I...
```
It seems that Arrow's reader treats the first path segment `data` as the
bucket; in other words, the `spark.hadoop.fs.s3a.path.style.access` setting
does not take effect for Gluten/Arrow. How can I make Gluten + Arrow's reader
access S3 in path-style, just like Spark's built-in reader does?
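
To help narrow this down, one check is to point pyarrow's `S3FileSystem` at the
same proxy directly, outside of Spark and Gluten. Below is a minimal sketch,
assuming pyarrow is available and the proxy accepts ordinary access/secret
keys (the credential values are placeholders, not from the setup above):
```
# Isolation test (sketch): can Arrow's own S3 filesystem reach the Alluxio
# proxy and resolve bucket 'data' / key 'tmp/file.csv'? pyarrow and the
# placeholder credentials below are assumptions, not part of the setup above.
import pyarrow.fs as pafs

fs = pafs.S3FileSystem(
    scheme="http",
    # Mirrors spark.hadoop.fs.s3a.endpoint; whether the '/api/v1/s3/' path
    # suffix is honored in an endpoint override may depend on SDK versions.
    endpoint_override="<alluxio-proxy-service-name>:39999/api/v1/s3/",
    access_key="<access-key>",  # placeholder
    secret_key="<secret-key>",  # placeholder
)

# Same bucket/key split that appears in the error message above.
print(fs.get_file_info("data/tmp/file.csv"))
```
As far as I know, Arrow's C++ S3 filesystem has historically defaulted to
path-style addressing whenever `endpoint_override` is set (newer pyarrow
versions also expose a `force_virtual_addressing` option), so if this direct
check succeeds, the failure would point at how the `fs.s3a.*` settings are
forwarded to the JNI dataset factory rather than at Arrow itself.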
GitHub link: https://github.com/apache/incubator-gluten/discussions/9412