Github user jerryshao commented on the issue:

    https://github.com/apache/spark/pull/13513
  
    @zsxwing @frreiss thanks a lot for your comments.
    
    I think the semantics of `FileStreamSource.getBatch(start: Option[Offset], 
end: Offset)` still stay the same, since I overrode the `get` method in 
`FileStreamSourceLog` to filter out the compacted data.
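
The idea of keeping `get` semantics intact by filtering a compact batch back down to its own entries could be sketched roughly like this (a simplified standalone model, not the actual Spark classes; `CompactibleLog`, `FileEntry`, and the compaction rule here are illustrative assumptions):

```scala
// Illustrative sketch only: a metadata log where a compact batch stores
// all live entries seen so far, and `get` filters the aggregate so
// callers still see only that batch's own entries.
case class FileEntry(path: String, batchId: Long)

class CompactibleLog(compactInterval: Int) {
  // batchId -> entries stored for that batch; a compact batch carries
  // every earlier entry in addition to its own.
  private val log = scala.collection.mutable.Map[Long, Seq[FileEntry]]()

  private def isCompactBatch(batchId: Long): Boolean =
    (batchId + 1) % compactInterval == 0

  def add(batchId: Long, entries: Seq[FileEntry]): Unit = {
    val stored =
      if (isCompactBatch(batchId)) log.values.flatten.toSeq ++ entries
      else entries
    log(batchId) = stored
  }

  // Filter out the older entries a compact batch carries along, so the
  // caller gets the same per-batch view as for a normal batch.
  def get(batchId: Long): Option[Seq[FileEntry]] =
    log.get(batchId).map { entries =>
      if (isCompactBatch(batchId)) entries.filter(_.batchId == batchId)
      else entries
    }
}
```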
    
    Yes, it could be slow to get a batch when it happens to be a compact 
batch. I think we have two possible solutions:
    
    1. Compact on the batch after the latest metadata file (as I did 
before); this would cover most scenarios in `FileStreamSource`.
    2. Put this batch's data at the beginning of the file when doing 
compaction, so we don't need to scan the whole file to get this batch's metadata.
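
For the second option, the layout trick could look something like the sketch below (names and types are hypothetical, not from the patch): the compact batch's own entries are written first, so a reader wanting only this batch's metadata can stop at the first carried-over older entry instead of scanning the whole file.

```scala
// Illustrative sketch of solution 2: put the compact batch's own data
// at the front so a reader can stop early.
case class Entry(path: String, batchId: Long)

// Serialize this batch's new entries before the carried-over ones.
def writeCompactBatch(newEntries: Seq[Entry],
                      carriedOver: Seq[Entry]): Seq[Entry] =
  newEntries ++ carriedOver

// Read only this batch's entries: stop at the first older entry,
// avoiding a scan over the rest of the compacted file.
def readOwnEntries(batchId: Long, stored: Seq[Entry]): Seq[Entry] =
  stored.takeWhile(_.batchId == batchId)
```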
    
    Both solutions need extra work; what do you think?
