[GitHub] [spark] southernriver opened a new pull request #30707: [SPARK-32208][SQL] Spark SQL throw Illegal character exception when load certain abnormal path of HDFS

GitBox Thu, 10 Dec 2020 05:04:28 -0800


southernriver opened a new pull request #30707:
URL: https://github.com/apache/spark/pull/30707



   ### What changes were proposed in this pull request?
   In HDFS, spaces and other special characters are allowed in a path, for 
example:
   
   > hdfs://ns1/tmp2/hive-staging/hadoop_hive_2020-07-06_17-31-29_139_7042265710400397740-1/-ext-10000/test_table=2020-06-17 18%3A00%3A00/part-00000-84396c4e-ba05-4936-afc7-db46c4251bfa.c000
   
   When we load data from such a path using any of the following file formats:
   
   ```
   org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat
   org.apache.spark.sql.execution.datasources.orc.OrcFileFormat
   org.apache.spark.sql.hive.orc.OrcFileFormat
   ```
   an exception like the following may be thrown:
   ```
   Caused by: java.net.URISyntaxException: Illegal character in path at index 136: hdfs://ns1/tmp2/hive-staging/hadoop_hive_2020-07-06_17-31-29_139_7042265710400397740-1/-ext-10000/test_table=2020-06-17 18%3A00%3A00/part-00000-84396c4e-ba05-4936-afc7-db46c4251bfa.c000
   at java.net.URI$Parser.fail(URI.java:2848)
   at java.net.URI$Parser.checkChars(URI.java:3021)
   at java.net.URI$Parser.parseHierarchical(URI.java:3105)
   at java.net.URI$Parser.parse(URI.java:3053)
   at java.net.URI.<init>(URI.java:588)
   at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:356)
   at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:352)
   at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:124)
   at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
   at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
   at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:252)
   at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:250)
   at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
   at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:256)
   ... 10 more
   ```
   
   Hadoop's `Path` class provides several constructors for building a path:
   
   
https://github.com/apache/hadoop/blob/master/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/Path.java
   
   We can fall back to constructing the `Path` from a String rather than from 
a URI.
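
To illustrate the idea without pulling in Hadoop, here is a minimal sketch that uses only `java.net.URI`. It is not the code changed by this PR: the class `PathFallbackSketch` and the helper `toUri` are hypothetical names, and the scheme/authority splitting is deliberately simplified. It assumes that Hadoop's `Path(String)` constructor behaves similarly, in that it goes through a multi-argument `URI` constructor that quotes illegal characters instead of rejecting them.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical sketch, not part of Spark or Hadoop.
class PathFallbackSketch {

    // Try the strict single-argument parser first. If the string contains a
    // character that is illegal in a URI (for example a space), fall back to
    // the multi-argument constructor, which quotes such characters.
    public static URI toUri(String location) {
        try {
            return new URI(location);
        } catch (URISyntaxException strictFailure) {
            try {
                int schemeEnd = location.indexOf("://");
                if (schemeEnd < 0) {
                    // No scheme: treat the whole string as a path.
                    return new URI(null, null, location, null, null);
                }
                String scheme = location.substring(0, schemeEnd);
                int authorityStart = schemeEnd + 3;
                int pathStart = location.indexOf('/', authorityStart);
                String authority = pathStart < 0
                        ? location.substring(authorityStart)
                        : location.substring(authorityStart, pathStart);
                String path = pathStart < 0 ? "" : location.substring(pathStart);
                return new URI(scheme, authority, path, null, null);
            } catch (URISyntaxException fallbackFailure) {
                throw new IllegalArgumentException(fallbackFailure);
            }
        }
    }

    public static void main(String[] args) {
        String location = "hdfs://ns1/tmp2/hive-staging/-ext-10000/"
                + "test_table=2020-06-17 18%3A00%3A00/part-00000";
        URI uri = toUri(location);
        System.out.println(uri);           // encoded form
        System.out.println(uri.getPath()); // decoded path
    }
}
```

With this fallback, `getPath()` returns the path with the space intact. One design point to be aware of: the multi-argument `URI` constructors always quote `%`, so in the fallback branch a segment such as `18%3A00%3A00` is treated as literal text rather than as an already-encoded `18:00:00`.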
   
   
   
   
   ### Why are the changes needed?
   ParquetFileFormat and OrcFileFormat should be able to read every path that 
HDFS itself accepts.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   manual


