GitHub user squalud edited a discussion: Does Gluten support accessing S3 via 'path-style' addressing?

I use Alluxio's proxy to expose an S3-compatible interface.

By setting `spark.hadoop.fs.s3a.endpoint` to
`http://<alluxio-proxy-service-name>:39999/api/v1/s3/` and setting
`spark.hadoop.fs.s3a.path.style.access` to `true` (path-style addressing), I can
use PySpark to successfully read CSV files through a URL of the form
`s3a://data/tmp/file.csv`. Note that `data` is a path under the proxy, not a bucket.
```
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Read S3 Data in PySpark") \
    .remote("sc://xx.xx.xx.xx:15002") \
    .config("spark.hadoop.fs.s3a.endpoint",
            "http://<alluxio-proxy-service-name>:39999/api/v1/s3/") \
    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
    .getOrCreate()

csv_path = "s3a://data/tmp/file.csv"
df_csv = spark.read.csv(csv_path, header=True)
df_csv.show()
```

But when I switch to Gluten and set
`spark.gluten.sql.native.arrow.reader.enabled` to `true` so that Arrow's reader
is used, I get an error:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Read S3 Data in PySpark") \
    .remote("sc://xx.xx.xx.xx:15002") \
    .config("spark.hadoop.fs.s3a.endpoint",
            "http://<alluxio-proxy-service-name>:39999/api/v1/s3/") \
    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
    .config("spark.gluten.enabled", "true") \
    .config("spark.gluten.sql.native.arrow.reader.enabled", "true") \
    .config("spark.plugins", "org.apache.gluten.GlutenPlugin") \
    .getOrCreate()

csv_path = "s3a://data/tmp/file.csv"
df_csv = spark.read.csv(csv_path, header=True)
df_csv.show()
```

```
SparkConnectGrpcException: (org.apache.spark.SparkException) Job aborted due to 
stage failure: Task 0 in stage 6.0 failed 4 times, most recent failure: Lost 
task 0.3 in stage 6.0 (TID 27) (xx.xx.xx.xx executor 2): 
org.apache.gluten.exception.GlutenException: 
org.apache.gluten.exception.GlutenException: Error during calling Java code 
from native code: org.apache.gluten.exception.GlutenException: 
org.apache.gluten.exception.GlutenException: Exception: VeloxRuntimeError
Error Source: RUNTIME
Error Code: INVALID_STATE
Reason: Operator::getOutput failed for [operator: ValueStream, plan node ID: 
0]: Error during calling Java code from native code: 
java.lang.RuntimeException: When getting information for key 'tmp/file.csv' in 
bucket 'data': AWS Error ACCESS_DENIED during HeadObject operation: No response 
body.
        at 
org.apache.arrow.dataset.file.JniWrapper.makeFileSystemDatasetFactory(Native 
Method)
        at 
org.apache.arrow.dataset.file.FileSystemDatasetFactory.createNative(FileSystemDatasetFactory.java:53)
        at 
org.apache.arrow.dataset.file.FileSystemDatasetFactory.<init>(FileSystemDatasetFactory.java:34)
        at 
org.apache.gluten.utils.ArrowUtil$.makeArrowDiscovery(ArrowUtil.scala:128)
        at 
org.apache.gluten.utils.ArrowUtil$.readArrowSchema(ArrowUtil.scala:139)
        at 
org.apache.gluten.utils.ArrowUtil$.readArrowFileColumnNames(ArrowUtil.scala:152)
        at 
org.apache.gluten.datasource.ArrowCSVFileFormat.$anonfun$buildReaderWithPartitionValues$3(ArrowCSVFileFormat.scala:128)
        at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:217)
        at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:279)
        at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:129)
        at 
org.apache.spark.sql.execution.ArrowFileSourceScanExec$$anon$1.hasNext(ArrowFileSourceScanExec.scala:48)
        at 
org.apache.gluten.iterator.IteratorsV1$ReadTimeAccumulator.hasNext(IteratorsV1.scala:127)
        at scala.collection.Iterator$$anon$10.hasNext(I...
```


It seems that Arrow's reader treats the first path segment `data` as the
bucket; that is, the `spark.hadoop.fs.s3a.path.style.access` setting is not
honored by Gluten/Arrow. How can I make Gluten with Arrow's reader access S3
via path-style addressing, just like Spark's built-in reader?
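For context, here is a minimal sketch of how the two S3 addressing styles resolve the same `s3a://` URI. The `resolve` helper is purely illustrative (it is not Gluten, Arrow, or S3A code); it only shows why a client that ignores the path-style flag ends up sending `data` as a bucket name to the proxy:

```python
def resolve(uri: str, endpoint: str, path_style: bool) -> str:
    """Illustrative resolution of s3a://<bucket>/<key> against an endpoint.

    The first URI segment is always parsed as the bucket; path_style
    controls whether it lands in the URL path or in the hostname.
    """
    without_scheme = uri.split("://", 1)[1]
    bucket, _, key = without_scheme.partition("/")
    host = endpoint.split("://", 1)[1].rstrip("/")
    if path_style:
        # path-style: the bucket appears in the URL path, so the proxy
        # sees /api/v1/s3/data/... and can treat "data" as a path.
        return f"http://{host}/{bucket}/{key}"
    # virtual-hosted style: the bucket becomes a DNS subdomain, which a
    # proxy like Alluxio's cannot serve under its fixed hostname.
    return f"http://{bucket}.{host}/{key}"

uri = "s3a://data/tmp/file.csv"
endpoint = "http://proxy:39999/api/v1/s3/"
print(resolve(uri, endpoint, True))   # bucket stays in the path
print(resolve(uri, endpoint, False))  # bucket leaks into the hostname
```

Under this sketch, only the path-style form produces a URL the proxy can answer; the virtual-hosted form issues a `HeadObject` against a hostname that does not exist behind the proxy, consistent with the `ACCESS_DENIED during HeadObject` error above.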

GitHub link: https://github.com/apache/incubator-gluten/discussions/9412

----
This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: [email protected]

