[GitHub] [spark] siying opened a new pull request, #41022: [SPARK-43343][SS] FileStreamSource should disable an extra file glob check when creating DataSource

via GitHub Tue, 02 May 2023 15:20:43 -0700


siying opened a new pull request, #41022:
URL: https://github.com/apache/spark/pull/41022


   
   ### What changes were proposed in this pull request?
   When FileStreamSource creates a DataSource for a file it has already 
discovered, disable globbing in the options passed to DataSource.
   
   ### Why are the changes needed?
   They fix the following bug.
   
   For example, suppose a directory contains the following file:
   /path/abc[123]
   and a user loads the directory as stream input with 
spark.readStream.format("text").load("/path"). This throws an exception saying 
that no path matches /path/abc[123], because Spark treats abc[123] as a glob 
pattern that matches only files named abc1, abc2, or abc3.
   
   The bug is caused by a second glob pattern match within DataSource, applied 
to files that FileStreamSource has already glob matched. This second match 
treats a real file name as a path pattern. That is unexpected, so this change 
disables it.
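
The double match can be illustrated outside Spark with a minimal sketch. It uses Python's `fnmatch` as a stand-in for Hadoop's glob matching (an assumption: the two syntaxes differ in places, but both treat `[123]` as a character class). The `resolve` helper and its `glob_paths` flag are hypothetical names for illustration, not Spark's API.

```python
import fnmatch


def resolve(path, existing, glob_paths=True):
    """Hypothetical stand-in for the path check done when a DataSource is built.

    With glob_paths=True the path is interpreted as a glob pattern;
    with glob_paths=False it is compared literally.
    """
    if glob_paths:
        return [f for f in existing if fnmatch.fnmatchcase(f, path)]
    return [f for f in existing if f == path]


# The directory listing has already produced this literal file name.
existing = ["/path/abc[123]"]
discovered = "/path/abc[123]"

# Second glob pass: "[123]" is read as "one of 1, 2, 3", so the real file
# does not match its own name and nothing is found.
with_glob = resolve(discovered, existing, glob_paths=True)

# With the second pass disabled, the literal name is found.
without_glob = resolve(discovered, existing, glob_paths=False)

print(with_glob)
print(without_glob)
```

Only the first pass, the one that lists the directory, needs pattern semantics. Every path it returns is already a concrete file name, so comparing it literally is enough.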
   
   
   ### Does this PR introduce _any_ user-facing change?
   No, other than fixing the buggy behavior described above.
   
   ### How was this patch tested?
   Added unit test scenarios that failed before the fix.
   


