Yunus Emre Gürses created SPARK-45519:
-----------------------------------------

             Summary: cleanSource problem on FileStreamSource for Windows env
                 Key: SPARK-45519
                 URL: https://issues.apache.org/jira/browse/SPARK-45519
             Project: Spark
          Issue Type: Bug
          Components: Structured Streaming
    Affects Versions: 3.4.1
            Reporter: Yunus Emre Gürses


We are using Spark with Scala in a Windows environment. When streaming with 
Spark, I set the *{{cleanSource}}* option to "archive" and the 
*{{sourceArchiveDir}}* option to "archived", as in the code below.
{code:java}
spark.readStream
  .option("cleanSource", "archive")
  .option("sourceArchiveDir", "archived"){code}
When I tried this in a Linux environment, I realized that the problem is with 
the paths: with *{{cleanSource}}* set to "delete", cleanup works on both Linux 
and Windows, but with "archive" it does not work on Windows. 

The problem is related to how paths are appended on Windows. There is a method
{code:java}
override protected def cleanTask(entry: FileEntry): Unit{code}
in the FileStreamSource.scala file in the 
org.apache.spark.sql.execution.streaming package. On line 569, the 
!fileSystem.rename(curPath, newPath) call is supposed to move the source file 
to the archive folder. However, when I debugged it, I noticed that curPath and 
newPath had the following values on Windows:
{code:java}
curPath: 
file:/C:/dev/be/data-integration-suite/test-data/streaming-folder/patients/patients-success.csv{code}
{code:java}
newPath: 
file:/C:/dev/be/data-integration-suite/archived/C:/dev/be/data-integration-suite/test-data/streaming-folder/patients/patients-success.csv{code}
It seems that the absolute path of the CSV file was appended when creating 
newPath, because *C:/dev/be/data-integration-suite* appears twice in newPath. 
This is probably why Spark archiving does not work. Instead, newPath should 
be: 
file:/C:/dev/be/data-integration-suite/archived/test-data/streaming-folder/patients/patients-success.csv
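
A minimal sketch of how a value like the observed newPath could arise, assuming the archive path is built by appending the source file's URI path to the archive directory as a plain string. This assumption is inferred from the debugged values above and has not been checked against the Spark source. The object and method names (ArchivePathSketch, naiveArchivePath) are hypothetical, and only java.net.URI is used, so it runs without Spark or Hadoop:

```scala
import java.net.URI

object ArchivePathSketch {
  // Hypothetical helper, not Spark's actual code: appends the source file's
  // URI path to the archive base directory by string concatenation.
  def naiveArchivePath(archiveBase: String, source: URI): String =
    archiveBase.stripSuffix("/") + source.getPath

  def main(args: Array[String]): Unit = {
    val archiveBase = "file:/C:/dev/be/data-integration-suite/archived"
    val source = new URI(
      "file:/C:/dev/be/data-integration-suite/test-data/streaming-folder/patients/patients-success.csv")

    // For a Windows file URI, the path component keeps the drive letter.
    println(source.getPath)

    // The drive letter therefore ends up in the middle of the combined path.
    println(naiveArchivePath(archiveBase, source))
  }
}
```

Under this assumption the combined value contains a "C:" component below the archive directory. Since ":" is not a valid character in a Windows directory name, a rename to such a path would be expected to fail, whereas on Linux the same concatenation produces an ordinary nested path.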



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

