HeartSaVioR commented on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-808072941
A glob path is valid as the path of a data source, but I'm not sure I can
agree it's also valid as the parameter of `FileStreamSink.hasMetadata()`.
I can't work out what the "correct" behavior and return value would be when
a glob path is provided. Let's say there are `/output/a` and `/output/b`, and
only `/output/b` was written by a streaming query, so only it has a metadata
directory. When we provide `/output/*` as the path, what would we expect? I
see three possible approaches:
1) Leverage metadata in `/output/b` for reading `/output/b` and read
`/output/a` via listing. Sounds ideal, but I'm not sure Spark does this today.
2) Leverage metadata in `/output/b` for reading `/output/b`. `/output/a` is
silently ignored.
3) Don't leverage metadata in `/output/b` and read both directories via
listing.
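
To make the three options concrete, here is a toy Scala model. It is not
Spark code: `Dir`, `ReadPlan`, `GlobApproaches` and the `approachN` methods
are names made up for illustration, and a directory is reduced to a path plus
a flag saying whether it has a metadata directory.

```scala
// Toy model only, not Spark internals.
final case class Dir(path: String, hasMetadata: Boolean)

sealed trait ReadPlan
case object ViaMetadata extends ReadPlan // file list taken from the sink's metadata
case object ViaListing extends ReadPlan  // file list taken from listing the directory

object GlobApproaches {
  // 1) Use metadata where it exists, fall back to listing elsewhere.
  def approach1(dirs: Seq[Dir]): Map[String, ReadPlan] =
    dirs.map { d =>
      val plan: ReadPlan = if (d.hasMetadata) ViaMetadata else ViaListing
      d.path -> plan
    }.toMap

  // 2) Use metadata where it exists; directories without it are silently dropped.
  def approach2(dirs: Seq[Dir]): Map[String, ReadPlan] =
    dirs.filter(_.hasMetadata).map { d =>
      val plan: ReadPlan = ViaMetadata
      d.path -> plan
    }.toMap

  // 3) Ignore metadata entirely and list every directory.
  def approach3(dirs: Seq[Dir]): Map[String, ReadPlan] =
    dirs.map { d =>
      val plan: ReadPlan = ViaListing
      d.path -> plan
    }.toMap
}
```

With `Dir("/output/a", false)` and `Dir("/output/b", true)` as input,
approach 2 is the only one whose result has no entry for `/output/a`.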
I'm not sure which one Spark does now, but one thing is clear to me:
reasoning about the return value of `FileStreamSink.hasMetadata("/output/*")`
in the above case is very hard unless the method simply always returns false
for a glob path. If Spark expands the glob path into concrete paths and handles
these paths individually (that is, the method is only called with non-glob
paths), the result would be pretty clear. Otherwise, I'm confused about what
we are expecting.
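
A rough sketch of the "expand first, then check each path" idea, using
`java.nio` on the local filesystem as a stand-in for Hadoop's `FileSystem`.
`PerPathMetadataCheck`, `hasMetadataDir` and `checkEach` are hypothetical
names, not Spark APIs; the only Spark-specific detail assumed is that the file
sink keeps its log in a `_spark_metadata` subdirectory.

```scala
import java.nio.file.{Files, Path}

object PerPathMetadataCheck {
  // Directory name the file sink uses for its metadata log.
  val MetadataDirName = "_spark_metadata"

  // Hypothetical per-path check: only meaningful for a concrete directory.
  def hasMetadataDir(dir: Path): Boolean =
    Files.isDirectory(dir.resolve(MetadataDirName))

  // Expand `parent/<glob>` into concrete directories, then check each one,
  // so the check itself never sees a glob.
  def checkEach(parent: Path, glob: String): Map[String, Boolean] = {
    val stream = Files.newDirectoryStream(parent, glob)
    try {
      val result = Map.newBuilder[String, Boolean]
      val it = stream.iterator()
      while (it.hasNext) {
        val p = it.next()
        if (Files.isDirectory(p)) {
          result += (p.getFileName.toString -> hasMetadataDir(p))
        }
      }
      result.result()
    } finally {
      stream.close()
    }
  }
}
```

For a parent directory containing `a` (no metadata) and `b` (with a
`_spark_metadata` subdirectory), `checkEach(parent, "*")` maps `a` to false
and `b` to true, which is the unambiguous per-path answer a single call with
a glob cannot give.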