[
https://issues.apache.org/jira/browse/SPARK-5221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14356873#comment-14356873
]
Jem Tucker commented on SPARK-5221:
-----------------------------------
Sorry, this is not what I meant. What I mean is that if a large file is moved
in from a different file system, and therefore must be copied, it is possible
that the file is created in one window but not available to be read until
another. Does that make sense?
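
To make the scenario concrete, here is a minimal sketch in Python. It is a
simplified model, not Spark's actual code: the names IncomingFile,
find_new_files and REMEMBER_WINDOW_S are hypothetical, the 60 second remember
window is an assumption, and so is the idea that the mod time is stamped when
the copy starts while the file only becomes visible once the copy finishes.

```python
from dataclasses import dataclass

# Assumed length of the "remember window" in seconds (illustrative value).
REMEMBER_WINDOW_S = 60


@dataclass
class IncomingFile:
    name: str
    mod_time: float    # assumed to be stamped when the copy starts
    visible_at: float  # time at which the file can first be listed and read


def find_new_files(files, current_time, already_selected,
                   remember_window=REMEMBER_WINDOW_S):
    """Simplified stand-in for the selection done on each batch."""
    threshold = current_time - remember_window  # models modTimeIgnoreThreshold
    selected = []
    for f in files:
        if f.visible_at > current_time:
            continue  # still being copied, so this scan cannot see it
        if f.name in already_selected:
            continue  # already picked up by an earlier batch
        if f.mod_time <= threshold:
            continue  # older than the remember window, so it is ignored
        selected.append(f.name)
    already_selected.update(selected)
    return selected


# The copy starts at t=115 and the file becomes visible at t=125.
big_file = IncomingFile("big.dat", mod_time=115, visible_at=125)

# Batch interval of 120 s (longer than the window): scans at t=120 and t=240.
seen_long = set()
long_first = find_new_files([big_file], 120, seen_long)   # not visible yet
long_second = find_new_files([big_file], 240, seen_long)  # mod time too old

# Batch interval of 30 s (shorter than the window): scans at t=120 and t=150.
seen_short = set()
short_first = find_new_files([big_file], 120, seen_short)
short_second = find_new_files([big_file], 150, seen_short)

print(long_first, long_second)    # [] []
print(short_first, short_second)  # [] ['big.dat']
```

With the longer batch interval the file is never selected: the first scan
cannot see it, and by the second scan its mod time is already outside the
window.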
> FileInputDStream "remember window" in certain situations causes files to be
> ignored
> ------------------------------------------------------------------------------------
>
> Key: SPARK-5221
> URL: https://issues.apache.org/jira/browse/SPARK-5221
> Project: Spark
> Issue Type: Bug
> Components: Streaming
> Affects Versions: 1.1.1, 1.2.0
> Reporter: Jem Tucker
>
> When batch times are greater than 1 minute, a file that begins to be moved
> into a directory just before FileInputDStream.findNewFiles() is called, but
> does not become visible until after the call has executed, is not included
> in that batch. The file is then ignored in the following batch as well,
> because its mod time is less than the modTimeIgnoreThreshold. This causes
> Spark Streaming to ignore data that it shouldn't, especially when large
> files are being moved into the directory.