zsxwing commented on pull request #27694:
URL: https://github.com/apache/spark/pull/27694#issuecomment-651454246
The numbers are pretty impressive. Thanks a lot for your work.
My high-level comments on the PR:
- The compression codec should not be hardcoded. It's better to let the user configure the writer codec, as with other codec configs in Spark. The reader should be able to identify the codec and load the files correctly (see the first sketch after this list). Otherwise, we would need to bump the version again whenever we add support for another codec.
- The cost of bumping the file sink version is much higher than bumping the checkpoint version. The file sink metadata should stay compatible with old versions if possible, like other storage formats such as Parquet and ORC.
  - For example, in some companies, the team that generates the output may not be able to convince downstream teams to upgrade their Spark version.
  - The file sink metadata can be read by external systems. Bumping the version would require the ecosystem to build new readers.
  - If a user hits a critical issue in a new Spark version, they may want to roll back. Bumping the version would break this workflow.
- It's better to make the default behavior work like this (see the second sketch after this list):
  - Use v1 if an existing directory is already using v1, so that we can still support rollback. The user can set a flag to enable v2 explicitly.
  - Use v2 if it's a new directory.
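
To make the codec point concrete, here is a minimal sketch of one way the writer could record the codec in the log file header so that readers never have to guess it from configuration. Everything here is hypothetical: the header layout (`v2:<codec>`), the `SinkLogCodec` name, and the codec set; `java.util.zip` stands in for Spark's own compression codecs, and the writer-side codec would come from a user-facing config key (also hypothetical).

```scala
import java.io.{InputStream, OutputStream}
import java.util.zip.GZIPInputStream

// Hypothetical sketch: the writer records the codec name in the log file header,
// and the reader picks the decompressor from that header instead of from config.
object SinkLogCodec {

  /** Writes a header such as "v2:gzip" before the (possibly compressed) payload. */
  def writeHeader(out: OutputStream, codec: String): Unit = {
    out.write(s"v2:$codec\n".getBytes("UTF-8"))
  }

  /** Wraps the payload stream with the decompressor named in the header line. */
  def openPayload(header: String, in: InputStream): InputStream = {
    header.split(":", 2) match {
      case Array(_, "gzip") => new GZIPInputStream(in)
      case Array(_, "none") => in
      case other =>
        throw new IllegalArgumentException(s"Unknown codec in header: ${other.mkString(":")}")
    }
  }
}
```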
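And a rough sketch of the suggested default behavior: keep v1 for directories that already use v1 (so rollback keeps working) and use v2 only for new directories or when the user opts in. `SinkLogVersionChooser`, `detectExistingVersion`, and `userForcesV2` are illustrative names rather than real Spark APIs; the only assumption is that each log file starts with a version marker line such as `v1`.

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

import scala.io.Source

// Hypothetical sketch, not the actual Spark implementation.
object SinkLogVersionChooser {

  /** Reads the version header of one existing log file, if the directory has any. */
  private def detectExistingVersion(fs: FileSystem, metadataDir: Path): Option[Int] = {
    if (!fs.exists(metadataDir)) return None
    fs.listStatus(metadataDir).headOption.flatMap { status =>
      val in = fs.open(status.getPath)
      try {
        val lines = Source.fromInputStream(in, "UTF-8").getLines()
        // The first line of each log file is assumed to be a marker such as "v1".
        if (lines.hasNext) Some(lines.next().trim.stripPrefix("v").toInt) else None
      } finally {
        in.close()
      }
    }
  }

  /** v1 for existing v1 directories (unless the user forces v2); v2 otherwise. */
  def chooseVersion(fs: FileSystem, metadataDir: Path, userForcesV2: Boolean): Int = {
    detectExistingVersion(fs, metadataDir) match {
      case Some(existing) if !userForcesV2 => existing // keep rollback possible
      case _ => 2 // new directory, or the user explicitly opted into v2
    }
  }
}
```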
I'm wondering if we can step back and think about whether it is possible to solve the issue without bumping the version. IMO, the major issue is that the number of files increases without bound over time. This PR increases the upper bound but doesn't change the complexity. Maybe we should try to see how to reduce the number of files instead, for example by supporting data compaction. You may notice that `FileStreamSinkLog.DELETE_ACTION` is never used. This is because it was added to support data compaction, but we never implemented it.
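
To illustrate what that compaction could look like at the log level, here is a minimal sketch under stated assumptions: `LogEntry`, `rewriteIntoLargerFiles`, and the action strings below are simplified stand-ins for the real `SinkFileStatus` / `FileStreamSinkLog` classes, and the actual rewrite would of course be a Spark job rather than a one-liner.

```scala
// Hypothetical sketch of how data compaction could use delete actions in the sink log.
object CompactionSketch {
  case class LogEntry(path: String, size: Long, action: String)

  val ADD_ACTION = "add"
  val DELETE_ACTION = "delete"

  /** Placeholder for a job that rewrites many small files into fewer large ones. */
  def rewriteIntoLargerFiles(smallFiles: Seq[LogEntry]): Seq[LogEntry] =
    Seq(LogEntry("part-compacted-00000.parquet", smallFiles.map(_.size).sum, ADD_ACTION))

  /** Log entries for one compaction batch: add the new files, delete the old ones. */
  def compact(smallFiles: Seq[LogEntry]): Seq[LogEntry] = {
    val compacted = rewriteIntoLargerFiles(smallFiles)
    val deletions = smallFiles.map(_.copy(action = DELETE_ACTION))
    // Readers replaying the log see the large files added and the small files
    // marked as deleted, so the visible data set is unchanged while the file
    // count (and the metadata the driver has to hold) shrinks.
    compacted ++ deletions
  }
}
```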
I totally understand that we may not be able to solve the metadata issue without bumping the file metadata version. In that case, it's better to design a new file sink log format that solves all of the following issues, so that we don't need to bump the version again in the future:
- Data compaction support.
- Open source format. It would be great if the new file sink log format could be read by Spark and other systems. An extra benefit is that we could use Spark itself to process the metadata, so we don't have to put all the metadata in driver memory.
- Better file stream source support. Since we always append files to the file sink, when reading a file sink with streaming queries it would be great if we could locate the newly appended files quickly. For example, we could use `log file name + line number` as the streaming source offset instead; then we could jump into the right log file directly (see the sketch after this list).
- Filter push-down support when reading a file sink directory. For example, if a user queries only one partition, we shouldn't need to load the metadata of the other partitions into driver memory.
- Decouple the file sink from the streaming query. Currently, one file sink output always maps to one streaming query. The user may want to use another streaming query, or even a batch query, to write to the same file sink. But since we use the batch id as the file name, this is not supported.
- Use relative paths. Copying the file sink directory breaks it today. It would be great if we stored relative paths so that people could move or copy the directory.
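
For the "log file name + line number" offset idea, a minimal sketch could look like the following. `FileSinkLogOffset` and `newEntriesSince` are hypothetical, not existing Spark classes; a real implementation would extend Spark's streaming `Offset` contract, and the sketch assumes log file names sort lexicographically (e.g. zero-padded).

```scala
// Hypothetical sketch of a streaming source offset that points directly into the
// sink log, so a downstream file stream source can jump to newly appended entries
// without re-listing or re-reading the whole log.
case class FileSinkLogOffset(logFileName: String, lineNumber: Long) {
  // The streaming Offset contract is essentially a JSON string.
  def json: String = s"""{"logFile":"$logFileName","line":$lineNumber}"""
}

object FileSinkLogOffset {
  /**
   * Entries appended since `offset`: the remainder of the recorded log file plus
   * all lines of any newer log files. `readLines` stands in for reading one log
   * file from the metadata directory.
   */
  def newEntriesSince(
      offset: FileSinkLogOffset,
      sortedLogFiles: Seq[String],
      readLines: String => Seq[String]): Seq[String] = {
    sortedLogFiles.flatMap { f =>
      val cmp = f.compareTo(offset.logFileName)
      if (cmp < 0) Nil                                             // already fully consumed
      else if (cmp == 0) readLines(f).drop(offset.lineNumber.toInt) // resume mid-file
      else readLines(f)                                            // entirely new log file
    }
  }
}
```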