xichen01 commented on PR #4127: URL: https://github.com/apache/ozone/pull/4127#issuecomment-1419355073
@neils-dev @DaveTeng0 Another case that can reproduce this bug is using PySpark to access an Ozone key that contains "=" in the key name.

- Connect to the Ozone S3G:
```bash
[root@VM-8-3-centos ~]$ bin/pyspark \
  --conf spark.s3a.enabled=true \
  --conf spark.hadoop.fs.s3a.path.style.access=true \
  --conf spark.hadoop.fs.s3a.bucket.bucket1.access.key=id \
  --conf spark.hadoop.fs.s3a.bucket.bucket1.secret.key=secret \
  --conf spark.hadoop.fs.s3a.bucket.bucket1.endpoint=http://localhost:9878 \
  --conf "spark.driver.extraClassPath=/root/hadoop-3.3.4/share/hadoop/tools/lib/*" \
  --conf spark.hadoop.fs.s3a.change.detection.version.required=false \
  --conf spark.hadoop.fs.s3a.change.detection.mode=none \
  --conf spark.hadoop.fs.s3a.bucket.probe=0
```
- Read a key with "=" in the key name:
```bash
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.3.1
      /_/

SparkSession available as 'spark'.
>>> path = "s3a://bucket1/dt=2022"
>>> df = spark.read.text(path)
23/02/06 21:33:51 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/spark-3.3.1-bin-hadoop3/python/pyspark/sql/readwriter.py", line 421, in text
    return self._df(self._jreader.text(self._spark._sc._jvm.PythonUtils.toSeq(paths)))
  File "/root/spark-3.3.1-bin-hadoop3/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
  File "/root/spark-3.3.1-bin-hadoop3/python/pyspark/sql/utils.py", line 190, in deco
    return f(*a, **kw)
  File "/root/spark-3.3.1-bin-hadoop3/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o47.text.
: java.io.FileNotFoundException: No such file or directory: s3a://bucket1/dt%3D2022
	at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3866)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:3688)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.innerListStatus(S3AFileSystem.java:3300)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$null$20(S3AFileSystem.java:3264)
	at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:117)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$listStatus$21(S3AFileSystem.java:3263)
	at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.lambda$trackDurationOfOperation$5(IOStatisticsBinding.java:499)
	at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDuration(IOStatisticsBinding.java:444)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:2337)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:2356)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.listStatus(S3AFileSystem.java:3262)
	at org.apache.spark.util.HadoopFSUtils$.listLeafFiles(HadoopFSUtils.scala:225)
	at org.apache.spark.util.HadoopFSUtils$.$anonfun$listLeafFiles$7(HadoopFSUtils.scala:281)
	at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:293)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
	at scala.collection.TraversableLike.flatMap(TraversableLike.scala:293)
	at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:290)
	at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:198)
	at org.apache.spark.util.HadoopFSUtils$.listLeafFiles(HadoopFSUtils.scala:271)
	at org.apache.spark.util.HadoopFSUtils$.$anonfun$parallelListLeafFilesInternal$1(HadoopFSUtils.scala:95)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at scala.collection.TraversableLike.map(TraversableLike.scala:286)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
	at scala.collection.AbstractTraversable.map(Traversable.scala:108)
	at org.apache.spark.util.HadoopFSUtils$.parallelListLeafFilesInternal(HadoopFSUtils.scala:85)
	at org.apache.spark.util.HadoopFSUtils$.parallelListLeafFiles(HadoopFSUtils.scala:69)
	at org.apache.spark.sql.execution.datasources.InMemoryFileIndex$.bulkListLeafFiles(InMemoryFileIndex.scala:158)
	at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.listLeafFiles(InMemoryFileIndex.scala:131)
	at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.refresh0(InMemoryFileIndex.scala:94)
	at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.<init>(InMemoryFileIndex.scala:66)
	at org.apache.spark.sql.execution.datasources.DataSource.createInMemoryFileIndex(DataSource.scala:567)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:409)
	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:228)
	at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:210)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:210)
	at org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:645)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.lang.Thread.run(Thread.java:750)
```

Note the exception message: `java.io.FileNotFoundException: No such file or directory: s3a://bucket1/dt%3D2022`. The root cause is that the "=" in the key name is not handled correctly: it is left percent-encoded as `%3D` instead of being decoded back to `=`.
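For completeness, the repro assumes a key containing "=" already exists in `bucket1`. A minimal sketch of that setup step using boto3 against the same S3G endpoint; the key name and object body here are illustrative, not taken from the original test:

```python
import boto3

# Same placeholder endpoint/credentials as in the pyspark invocation above.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9878",
    aws_access_key_id="id",
    aws_secret_access_key="secret",
)

# Create a key whose name contains "=" (a typical Hive-style partition path).
s3.put_object(Bucket="bucket1", Key="dt=2022/part-00000", Body=b"hello\n")

# Listing under the "dt=2022" prefix is where the mis-handled encoding
# surfaces: a buggy server/client round trip reports the key as "dt%3D2022/...".
resp = s3.list_objects_v2(Bucket="bucket1", Prefix="dt=2022")
print([obj["Key"] for obj in resp.get("Contents", [])])
```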
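On the `%3D`: that is the percent-encoding of "=". A minimal illustration in plain Python of the encoding rule involved (this is not the S3A/S3G code itself): with the S3 ListObjects `encoding-type=url` behaviour, key names can come back percent-encoded, and the receiving side is expected to decode them; skipping that decode leaves the literal `%3D` seen in the exception path.

```python
from urllib.parse import quote, unquote

key = "dt=2022"

# "=" is 0x3D, so percent-encoding yields exactly the string from the
# FileNotFoundException path (s3a://bucket1/dt%3D2022).
encoded = quote(key, safe="")
assert encoded == "dt%3D2022"

# Decoding restores the original key; this is the step that is evidently
# missing (or applied inconsistently) somewhere in the round trip.
assert unquote(encoded) == key
```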
