[I] Issue with Sedona ShapefileReader Handling Hidden Directories in ADLS Gen2 [sedona]

via GitHub Mon, 10 Feb 2025 06:19:41 -0800


lyepustin opened a new issue, #1795:
URL: https://github.com/apache/sedona/issues/1795

**Description**:
We have encountered an issue while using Apache Sedona's `ShapefileReader`
to read shapefiles stored in Azure Data Lake Storage Gen2 (ADLS Gen2). The
issue occurs specifically when the directory containing the shapefiles starts
with an underscore (`_`).

When attempting to read a shapefile from a directory named `_SILVER`, Sedona
fails with the following error:

> org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist

However, renaming the directory to remove the leading underscore (e.g.,
`SILVER` or `TEST_SILVER`) resolves the issue, and the shapefiles are read
correctly.

**Steps to Reproduce**:
1. Store shapefiles in a directory whose name starts with an underscore (e.g., `_GEODATA`) in ADLS Gen2.
2. Attempt to read the shapefiles using the following code:

```python
from sedona.core.formatMapper.shapefileParser import ShapefileReader
from sedona.spark import SedonaContext
from sedona.utils.adapter import Adapter

# Set up Sedona Context
# (`spark` and `sc` are the session and context predefined in a Databricks notebook)
SedonaContext.create(spark)

# Define the path to the shapefile directory
path = "abfss://data...@yourstorageaccount.dfs.core.windows.net/_GEODATA/"

# Read shapefiles
spatial_rdd = ShapefileReader.readToGeometryRDD(sc, inputPath=path)
df = Adapter.toDf(spatial_rdd, sparkSession=spark)

# Print DataFrame count
print(df.count())
df.show()
```

3. Observe that the code fails to read the shapefiles when the directory
name starts with `_`.
4. Rename the directory (e.g., to `GEODATA`), rerun the code, and observe
that it works correctly.

**Workarounds Attempted**:
- Setting Spark configurations to include hidden files:
```python
spark.conf.set("mapreduce.input.fileinputformat.ignore.hidden.files", "false")
spark.conf.set("spark.hadoop.mapreduce.input.fileinputformat.input.dir.ignore.hidden.file", "false")
```
These configurations did not resolve the issue.

- Copying the `_GEODATA` directory to a new directory without the leading
underscore (e.g., `TEST_GEODATA`). This workaround was successful but is not
ideal for production environments.
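
For reference, the copy workaround can be sketched roughly as follows. This is only an illustration: the helper names `visible_copy_target` and `copy_to_visible_dir`, the `TEST` prefix, and the placeholder storage path are hypothetical, and the actual copy is delegated to whatever function is passed in as `copy_fn`.

```python
def visible_copy_target(path, prefix="TEST"):
    """Return `path` with its last component renamed so that it no longer
    starts with '_' or '.'. Paths that are already visible are returned as-is."""
    trailing = "/" if path.endswith("/") else ""
    parent, sep, name = path.rstrip("/").rpartition("/")
    if not name.startswith(("_", ".")):
        return path
    return f"{parent}{sep}{prefix}{name}{trailing}"


def copy_to_visible_dir(src, copy_fn, prefix="TEST"):
    """Copy `src` to a sibling directory with a visible name and return the
    new path. `copy_fn(src, dst)` performs the actual (recursive) copy."""
    dst = visible_copy_target(src, prefix)
    if dst != src:
        copy_fn(src, dst)
    return dst


# Example with a placeholder path and a stand-in copy function that only logs.
src = "abfss://container@account.dfs.core.windows.net/_GEODATA/"
dst = copy_to_visible_dir(src, lambda s, d: print(f"copy {s} -> {d}"))
```

In a Databricks notebook, `copy_fn` could be `lambda s, d: dbutils.fs.cp(s, d, recurse=True)`, and `dst` would then be passed to `ShapefileReader.readToGeometryRDD`.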

**Environment Details**:
- **Apache Sedona Version**: apache-sedona==1.7.0
- **Pyspark extension**: pyspark-extension==2.12.0.3.5
- **Databricks Runtime Version**: 15.4 LTS (includes Apache Spark 3.5.0,
Scala 2.12)
- **Azure Data Lake Storage Gen2**

**Expected Behavior**:
The `ShapefileReader` should read shapefiles from directories starting with
an underscore (`_`) without requiring a directory rename.

**Actual Behavior**:
The `ShapefileReader` fails to read shapefiles from directories starting
with an underscore (`_`), but works as expected when the directory is renamed
to remove the underscore.

**Additional Information**:
- The `_` prefix in directory names is common for organizational purposes
(e.g., `_SILVER`, `_GOLD`).
- Other tools and libraries (e.g., Spark's native file readers) do not
exhibit this behavior.
- The issue may be related to how Sedona interacts with the underlying
Hadoop FileSystem.
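
One possible explanation, offered as an assumption rather than a confirmed diagnosis: Hadoop's `FileInputFormat` applies a default hidden-file filter that rejects any path whose name starts with `_` or `.`, and the exception class above suggests the reader goes through that input format. The sketch below only mimics that naming rule in plain Python to show which paths would be affected; the function names are hypothetical and the storage paths are placeholders.

```python
from urllib.parse import urlparse


def is_hidden_name(name):
    """Mimic the naming rule of Hadoop's default hidden-file filter."""
    return name.startswith(("_", "."))


def hidden_components(path):
    """Return the components of `path` that the rule above would treat as hidden."""
    parts = urlparse(path).path.split("/")
    return [p for p in parts if p and is_hidden_name(p)]


for candidate in (
    "abfss://container@account.dfs.core.windows.net/_GEODATA/",
    "abfss://container@account.dfs.core.windows.net/TEST_GEODATA/",
):
    print(candidate, "->", hidden_components(candidate) or "no hidden components")
```

If this is indeed the cause, it would also be consistent with the renamed directories (`GEODATA`, `TEST_GEODATA`) being read without problems.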

**Request**:
We kindly request the Sedona team to investigate this issue and provide
guidance on whether this behavior is expected or if a fix is needed.

Thank you for your support!

