[ 
https://issues.apache.org/jira/browse/SPARK-30900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-30900:
------------------------------------

    Assignee: Jungtaek Lim

> FileStreamSource: Avoid reading compact metadata log twice if the query stops 
> from compact batch and restarts
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-30900
>                 URL: https://issues.apache.org/jira/browse/SPARK-30900
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 3.1.0
>            Reporter: Jungtaek Lim
>            Assignee: Jungtaek Lim
>            Priority: Minor
>
> When a query restarts, it may start from a compaction batch for which a 
> source metadata file already exists. One such case is when the previous run 
> succeeded in reading from the inputs but did not finalize the batch for 
> various reasons.
> In this case FileStreamSource reads the compact metadata file twice: once to 
> retrieve all files and build the seen-file map, and once more to retrieve the 
> entries in the batch. If the query has processed a huge number of inputs so 
> far, the compact metadata file becomes considerably larger, so reading it a 
> second time adds unnecessary latency to the startup batch.
> This issue tracks the effort to address this case.
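
The double read described above can be illustrated with a minimal sketch. This is not Spark's actual code: FileEntry, CompactLog, read_all and both restart functions are hypothetical names invented for illustration, and the "file" is an in-memory list whose reads are merely counted. It only shows that the seen-file map and the batch entries can both be derived from a single pass over the compact metadata.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileEntry:
    # Hypothetical log entry: one input file recorded by the source.
    path: str
    timestamp: int
    batch_id: int


class CompactLog:
    """Hypothetical stand-in for a compact metadata file; counts full reads."""

    def __init__(self, entries):
        self._entries = list(entries)
        self.read_count = 0

    def read_all(self):
        # Simulates the expensive read and parse of the whole compact file.
        self.read_count += 1
        return list(self._entries)


def restart_reading_twice(log, batch_id):
    # First read: every file seen so far, to build the seen-file map.
    seen = {e.path: e.timestamp for e in log.read_all()}
    # Second read: only the entries that belong to the restarted batch.
    batch = [e for e in log.read_all() if e.batch_id == batch_id]
    return seen, batch


def restart_reading_once(log, batch_id):
    # Single read: keep the parsed entries and derive both results from them.
    entries = log.read_all()
    seen = {e.path: e.timestamp for e in entries}
    batch = [e for e in entries if e.batch_id == batch_id]
    return seen, batch


if __name__ == "__main__":
    entries = [FileEntry("a", 1, 0), FileEntry("b", 2, 1), FileEntry("c", 3, 1)]
    twice, once = CompactLog(entries), CompactLog(entries)
    assert restart_reading_twice(twice, 1) == restart_reading_once(once, 1)
    print("reads with two passes:", twice.read_count)
    print("reads with one pass:", once.read_count)
```

Both functions return the same seen-file map and batch entries; the only difference in this sketch is how many times the full log is read, which is the cost that grows with the size of the compact metadata file.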



