lyepustin opened a new issue, #1795:
URL: https://github.com/apache/sedona/issues/1795

   **Description**:
   We have encountered an issue while using Apache Sedona's `ShapefileReader` 
to read shapefiles stored in Azure Data Lake Storage Gen2 (ADLS Gen2). The 
issue occurs specifically when the directory containing the shapefiles starts 
with an underscore (`_`).
   
   When attempting to read a shapefile from a directory named `_SILVER`, Sedona 
fails with only a generic error that does not point to the underscore as the cause:
   
   > org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist
   
   
   However, renaming the directory to remove the leading underscore (e.g., 
`SILVER` or `TEST_SILVER`) resolves the issue, and the shapefiles are read 
correctly.
   
   **Steps to Reproduce**:
   1. Store shapefiles in a directory named `_SILVER` in ADLS Gen2.
   2. Attempt to read the shapefiles using the following code:
   
      ```python
      from sedona.core.formatMapper.shapefileParser import ShapefileReader
      from sedona.spark import SedonaContext
      from sedona.utils.adapter import Adapter
   
      # Set up Sedona Context
      SedonaContext.create(spark)
   
      # Define the path to the shapefile directory
      path = "abfss://data...@yourstorageaccount.dfs.core.windows.net/_GEODATA/"
   
      # Read shapefiles
      spatial_rdd = ShapefileReader.readToGeometryRDD(sc, inputPath=path)
      df = Adapter.toDf(spatial_rdd, sparkSession=spark)
   
      # Print DataFrame count
      print(df.count())
      df.show()
      ```
   
   3. Observe that the code fails to read the shapefiles when the directory 
name starts with `_`.
   4. Rename the directory (e.g., to `GEODATA`), rerun the code, and observe 
that it works correctly.
   
   **Workarounds Attempted**:
   - Setting Spark configurations to include hidden files:
     ```python
     spark.conf.set("mapreduce.input.fileinputformat.ignore.hidden.files", "false")
     spark.conf.set("spark.hadoop.mapreduce.input.fileinputformat.input.dir.ignore.hidden.file", "false")
     ```
     These configurations did not resolve the issue.
   
   - Copying the `_GEODATA` directory to a new directory without the leading 
underscore (e.g., `TEST_GEODATA`). This workaround was successful but is not 
ideal for production environments.
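
   The copy workaround can be sketched as a small helper that derives a destination name without a leading underscore. This is only an illustration: `visible_copy_path` and its `TEST` default prefix are hypothetical names, the storage URL in the comment is a placeholder, and the actual copy step depends on the environment (on Databricks it could be something like `dbutils.fs.cp(src, dst, recurse=True)`).

```python
def visible_copy_path(path: str, prefix: str = "TEST") -> str:
    """Return a sibling path whose last component no longer starts with '_' or '.'.

    Hypothetical helper for illustration; not part of Sedona.
    Example: ".../_GEODATA/" becomes ".../TEST_GEODATA/".
    """
    trimmed = path.rstrip("/")
    parent, sep, name = trimmed.rpartition("/")
    if not name.startswith(("_", ".")):
        return path  # already visible, nothing to change
    renamed = f"{parent}{sep}{prefix}{name}"
    # Preserve a trailing slash if the caller supplied one.
    return renamed + "/" if path.endswith("/") else renamed


# Placeholder URL, assumed for this sketch only.
src = "abfss://container@account.dfs.core.windows.net/_GEODATA/"
dst = visible_copy_path(src)
# Environment-specific copy step would go here, then read from `dst`.
```

   Keeping the original directory and reading from the copy avoids touching upstream jobs that still write to the underscore-prefixed location, at the cost of duplicated storage.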
   
   **Environment Details**:
   - **Apache Sedona Version**: apache-sedona==1.7.0
   - **Pyspark extension**: pyspark-extension==2.12.0.3.5
   - **Databricks Runtime Version**: 15.4 LTS (includes Apache Spark 3.5.0, 
Scala 2.12)
   - **Azure Data Lake Storage Gen2**
   
   **Expected Behavior**:
   The `ShapefileReader` should read shapefiles from directories starting with 
an underscore (`_`) without requiring a directory rename.
   
   **Actual Behavior**:
   The `ShapefileReader` fails to read shapefiles from directories starting 
with an underscore (`_`), but works as expected when the directory is renamed 
to remove the underscore.
   
   **Additional Information**:
   - The `_` prefix in directory names is common for organizational purposes 
(e.g., `_SILVER`, `_GOLD`).
   - Other tools and libraries (e.g., Spark's native file readers) do not 
exhibit this behavior.
   - The issue may be related to how Sedona interacts with the underlying 
Hadoop FileSystem.
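
   One possible explanation, offered as an assumption rather than a confirmed diagnosis: Hadoop's `FileInputFormat` applies a built-in hidden-file filter that rejects path names starting with `_` or `.`. If Sedona's shapefile input format inherits that behavior, a directory such as `_GEODATA` would be filtered out before any file is read, which would surface as "Input path does not exist". The sketch below mimics that rule in plain Python; `is_hidden_by_hadoop` is a hypothetical helper and the storage URLs are placeholders.

```python
from urllib.parse import urlparse


def is_hidden_by_hadoop(path: str) -> bool:
    """Return True if the last path component starts with '_' or '.'.

    Mirrors the naming rule used by Hadoop's FileInputFormat hidden-file
    filter. Hypothetical helper for illustration; not a Sedona or Hadoop API.
    """
    # Drop trailing slashes so ".../_GEODATA/" yields "_GEODATA".
    name = urlparse(path).path.rstrip("/").rsplit("/", 1)[-1]
    return name.startswith(("_", "."))


# Placeholder URLs, assumed for this sketch only.
for candidate in (
    "abfss://container@account.dfs.core.windows.net/_GEODATA/",
    "abfss://container@account.dfs.core.windows.net/TEST_GEODATA/",
):
    print(candidate, "->", "hidden" if is_hidden_by_hadoop(candidate) else "visible")
```

   If this is the cause, it would also be consistent with `TEST_GEODATA` working, since only a leading underscore matches the rule.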
   
   **Request**:
   We kindly request the Sedona team to investigate this issue and provide 
guidance on whether this behavior is expected or if a fix is needed.
   
   Thank you for your support!
   
   


