lyepustin opened a new issue, #1795: URL: https://github.com/apache/sedona/issues/1795
**Description**:

We have encountered an issue while using Apache Sedona's `ShapefileReader` to read shapefiles stored in Azure Data Lake Storage Gen2 (ADLS Gen2). The issue occurs specifically when the name of the directory containing the shapefiles starts with an underscore (`_`). When attempting to read a shapefile from a directory named `_SILVER`, Sedona fails without a specific error message; the only exception raised is:

> org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist

However, renaming the directory to remove the leading underscore (e.g., `SILVER` or `TEST_SILVER`) resolves the issue, and the shapefiles are read correctly.

**Steps to Reproduce**:

1. Store shapefiles in an ADLS Gen2 directory whose name starts with an underscore (`_GEODATA` in the code below).
2. Attempt to read the shapefiles using the following code:

   ```python
   from sedona.core.formatMapper.shapefileParser import ShapefileReader
   from sedona.core.SedonaContext import SedonaContext
   from sedona.utils.adapter import Adapter

   # `spark` and `sc` are the session and context predefined in the Databricks notebook

   # Set up Sedona Context
   SedonaContext.create(spark)

   # Define the path to the shapefile directory
   path = "abfss://data...@yourstorageaccount.dfs.core.windows.net/_GEODATA/"

   # Read shapefiles
   spatial_rdd = ShapefileReader.readToGeometryRDD(sc, inputPath=path)
   df = Adapter.toDf(spatial_rdd, sparkSession=spark)

   # Print DataFrame count
   print(df.count())
   df.show()
   ```

3. Observe that the code fails to read the shapefiles when the directory name starts with `_`.
4. Rename the directory (e.g., to `GEODATA`), rerun the code, and observe that it works correctly.

**Workarounds Attempted**:

- Setting Spark configurations to include hidden files:

  ```python
  spark.conf.set("mapreduce.input.fileinputformat.ignore.hidden.files", "false")
  spark.conf.set("spark.hadoop.mapreduce.input.fileinputformat.input.dir.ignore.hidden.file", "false")
  ```

  These configurations did not resolve the issue.
- Copying the `_GEODATA` directory to a new directory without the leading underscore (e.g., `TEST_GEODATA`).
This workaround was successful but is not ideal for production environments.

**Environment Details**:

- **Apache Sedona Version**: apache-sedona==1.7.0
- **Pyspark extension**: pyspark-extension==2.12.0.3.5
- **Databricks Runtime Version**: 15.4 LTS (includes Apache Spark 3.5.0, Scala 2.12)
- **Azure Data Lake Storage Gen2**

**Expected Behavior**:

The `ShapefileReader` should read shapefiles from directories starting with an underscore (`_`) without requiring a directory rename.

**Actual Behavior**:

The `ShapefileReader` fails to read shapefiles from directories starting with an underscore (`_`), but works as expected when the directory is renamed to remove the underscore.

**Additional Information**:

- The `_` prefix in directory names is common for organizational purposes (e.g., `_SILVER`, `_GOLD`).
- Other tools and libraries (e.g., Spark's native file readers) do not exhibit this behavior.
- The issue may be related to how Sedona interacts with the underlying Hadoop FileSystem.

**Request**:

We kindly request the Sedona team to investigate this issue and provide guidance on whether this behavior is expected or if a fix is needed.

Thank you for your support!
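**Possible cause (unverified sketch)**:

To make the Hadoop suspicion above more concrete: Hadoop's `FileInputFormat` treats any path whose final component starts with `_` or `.` as hidden and drops it while listing input, which would surface as "Input path does not exist". We have not confirmed that `ShapefileReader` goes through that code path, so this is an assumption. The helper below only mirrors that naming rule in plain Python and does not touch storage; `is_hidden_by_hadoop_rule` is a hypothetical name of ours, and the `container@account` URLs are placeholders, not our real paths.

```python
import posixpath
from urllib.parse import urlparse

# Prefixes that Hadoop's FileInputFormat hidden-file filter rejects.
HIDDEN_PREFIXES = ("_", ".")


def is_hidden_by_hadoop_rule(path: str) -> bool:
    """Return True if the final path component starts with '_' or '.'.

    Hypothetical diagnostic helper: it mirrors the naming rule only and
    makes no call to Hadoop, Spark, or Sedona.
    """
    # Keep only the path part of the URL, without scheme and authority.
    url_path = urlparse(path).path
    # Drop trailing slashes so a directory URL still yields its own name.
    name = posixpath.basename(url_path.rstrip("/"))
    return name.startswith(HIDDEN_PREFIXES)


# Placeholder URLs (assumed layout, not the real storage account).
print(is_hidden_by_hadoop_rule("abfss://container@account.dfs.core.windows.net/_GEODATA/"))
print(is_hidden_by_hadoop_rule("abfss://container@account.dfs.core.windows.net/TEST_GEODATA/"))
```

If this is indeed the cause, it would also be consistent with the copy to `TEST_GEODATA` working, since that name does not start with an underscore.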