HeartSaVioR commented on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-808072941


   A glob path is valid as a data source path, but I'm not sure I agree
that it's also valid as the parameter of `FileStreamSink.hasMetadata()`.
   
   I can't picture what the "correct" behavior and return value would be when
a glob path is provided. Let's say there are `/output/a` and `/output/b`, and
only `/output/b` was written by a streaming query, so only it has a metadata
directory.
   
   When we provide `/output/*` as the glob path, what would we expect? I
see three possible approaches:
   
   1) Leverage the metadata in `/output/b` for reading `/output/b`, and read
`/output/a` via listing. Sounds ideal, but I'm not sure Spark does this today.
   2) Leverage the metadata in `/output/b` for reading `/output/b`. `/output/a`
is silently ignored.
   3) Don't leverage the metadata in `/output/b`, and read both directories via
listing.
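
To make the three options concrete, here is a small model in plain Python. It is not Spark code: `plan_reads` and its approach numbers are hypothetical names that only record which directories would be read, and how, under each option.

```python
from typing import Dict


def plan_reads(dirs_with_metadata: Dict[str, bool], approach: int) -> Dict[str, str]:
    """Return {directory: how it is read} for one of the three approaches.

    Hypothetical model only. "metadata" means the file list comes from the
    sink's metadata log; "listing" means the directory is listed directly.
    A directory that is skipped does not appear in the result.
    """
    plan: Dict[str, str] = {}
    for path, has_metadata in dirs_with_metadata.items():
        if approach == 1:
            # Use metadata where it exists, fall back to listing elsewhere.
            plan[path] = "metadata" if has_metadata else "listing"
        elif approach == 2:
            # Only directories with metadata are read; the rest are dropped.
            if has_metadata:
                plan[path] = "metadata"
        elif approach == 3:
            # Metadata is ignored everywhere; everything is listed.
            plan[path] = "listing"
        else:
            raise ValueError(f"unknown approach: {approach}")
    return plan


# The layout from the example: only /output/b was written by a streaming query.
layout = {"/output/a": False, "/output/b": True}
for approach in (1, 2, 3):
    print(approach, plan_reads(layout, approach))
```

Only approach 2 loses data without telling the user, which is why the choice matters.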
   
   I'm not sure which one Spark does now, but one thing is clear to me:
reasoning about the return value of `FileStreamSink.hasMetadata("/output/*")`
in the case above is very hard unless the method simply always returns false
for a glob path. If Spark expands the glob path into the valid concrete paths
and handles these paths individually (that is, the method is only ever called
with a non-glob path), the result would be pretty clear. Otherwise, I'm
confused about what we are expecting.
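
A rough sketch of the "expand first, then check each path" idea, using the local filesystem and Python's `glob` module as a stand-in for Spark and Hadoop. The helper names `has_metadata` and `has_metadata_per_path` are hypothetical, and this is not how `FileStreamSink.hasMetadata()` is implemented; the only Spark detail assumed is that the sink's metadata lives in a `_spark_metadata` subdirectory.

```python
import glob
import os
import tempfile
from typing import Dict

# Name of the metadata directory written by the file stream sink.
METADATA_DIR = "_spark_metadata"


def has_metadata(path: str) -> bool:
    """Local stand-in: true only if `path` contains a metadata directory."""
    return os.path.isdir(os.path.join(path, METADATA_DIR))


def has_metadata_per_path(pattern: str) -> Dict[str, bool]:
    """Expand the glob first, then check every concrete path on its own."""
    return {p: has_metadata(p) for p in sorted(glob.glob(pattern))}


with tempfile.TemporaryDirectory() as root:
    # Recreate the example layout: only "b" has a metadata directory.
    os.makedirs(os.path.join(root, "output", "a"))
    os.makedirs(os.path.join(root, "output", "b", METADATA_DIR))
    pattern = os.path.join(root, "output", "*")

    # Passing the raw glob gives one answer for two different directories.
    print("raw glob:", has_metadata(pattern))

    # Per-path checks give an unambiguous answer for each directory.
    for path, found in has_metadata_per_path(pattern).items():
        print(os.path.basename(path), found)
```

With per-path calls, the caller still has to decide what to do with a mixed result, but the method itself no longer has to answer an ill-defined question.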


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


