Github user jerryshao commented on the issue:
https://github.com/apache/spark/pull/13513
Thanks a lot @zsxwing and @frreiss for your comments.
On the slow scan of the compact batch: I originally planned not to merge the
latest batch, as I did before and as was also suggested above, but after
several attempts it turned out to be hard to implement with small changes. So
for now I am keeping the same implementation and adding a simple cache layer to
work around this problem. The basic compaction algorithm is still the same as
in `FileStreamSinkLog`, which I think makes it easier to maintain.
On the broken semantics: I agree that it is a real problem, although the
current code does not run into it. I changed the code to scan the compacted
batch files to retrieve missing batches. This is a little time-consuming, but
the current logic of `FileStreamSource` does not touch this part.
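
To make the two points above concrete, here is a minimal in-memory sketch in
Python. It is not the code in this PR. The class `CompactingLog`, the
`(batch_id, path)` entry shape, the cache policy (cleared on every compaction)
and the rule `(batch_id + 1) % compact_interval == 0` are all assumptions made
for illustration. It shows a log that folds older batches into a compaction
batch, answers recent lookups from a cache, and falls back to scanning the
compacted batch for a batch whose own file no longer exists.

```python
from typing import Dict, List, Tuple

# Hypothetical entry shape: (id of the batch that recorded the path, path).
Entry = Tuple[int, str]


def is_compaction_batch(batch_id: int, compact_interval: int) -> bool:
    """Assumed rule: every `compact_interval`-th batch is a compaction batch."""
    return (batch_id + 1) % compact_interval == 0


class CompactingLog:
    """In-memory stand-in for a compacting metadata log (illustration only)."""

    def __init__(self, compact_interval: int) -> None:
        self.compact_interval = compact_interval
        self.files: Dict[int, List[Entry]] = {}  # stand-in for on-disk log files
        self.cache: Dict[int, List[Entry]] = {}  # simple cache layer

    def add(self, batch_id: int, paths: List[str]) -> None:
        own = [(batch_id, p) for p in paths]
        if is_compaction_batch(batch_id, self.compact_interval):
            # Fold every existing log file into this batch, oldest first.
            merged: List[Entry] = []
            for old_id in sorted(self.files):
                merged.extend(self.files[old_id])
            self.files.clear()  # older log files are superseded
            self.cache.clear()  # assumed policy: drop cached older batches
            self.files[batch_id] = merged + own
        else:
            self.files[batch_id] = own
        self.cache[batch_id] = own

    def get(self, batch_id: int) -> List[Entry]:
        # Fast path: the batch is still in the cache.
        if batch_id in self.cache:
            return list(self.cache[batch_id])
        # Slow path: scan the first log file at or after this batch id
        # (a compacted batch) and keep only the entries of this batch.
        for file_id in sorted(self.files):
            if file_id >= batch_id:
                return [e for e in self.files[file_id] if e[0] == batch_id]
        return []


log = CompactingLog(compact_interval=3)
for i in range(5):
    log.add(i, [f"part-{i}.txt"])

print(sorted(log.files))  # [2, 3, 4]: batches 0 and 1 live inside batch 2
print(log.get(0))         # [(0, 'part-0.txt')], recovered by scanning batch 2
```

The slow path filters a whole compacted batch to recover one batch, which is
the time-consuming part; the cache keeps recent batches off that path.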