[
https://issues.apache.org/jira/browse/SPARK-18974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15782284#comment-15782284
]
Adam Wang edited comment on SPARK-18974 at 12/28/16 7:30 AM:
-------------------------------------------------------------
Thanks for remind, this is my fault. This bug seems not so easy to solve, could
we define a Set[Path] to check if file had been read by InputDstream? Such as:
@transient private var selectedFiles = new mutable.HashSet[String]()
...
override def start() {
addOldFilesToSelectedFiles()
}
...
private def isNewFile(path: Path, currentTime: Long, modTimeIgnoreThreshold:
Long): Boolean = {
val pathStr = path.toString
..
// Reject file if it was considered earlier
if (selectedFiles.contains(pathStr)) {
return false
}
logDebug(s"$pathStr accepted with mod time $modTime")
return true
}
But it would very inefficient for lot of files directory. Are there any other
way to judge the moved new files?
was (Author: adam wang):
Thanks for remind, this is my fault. This bug seems not so easy to solve, could
we define a Set[Path] to check if file has been read by InputDstream? Such as:
@transient private var selectedFiles = new mutable.HashSet[String]()
...
override def start() {
addOldFilesToSelectedFiles()
}
...
private def isNewFile(path: Path, currentTime: Long, modTimeIgnoreThreshold:
Long): Boolean = {
val pathStr = path.toString
..
// Reject file if it was considered earlier
if (selectedFiles.contains(pathStr)) {
return false
}
logDebug(s"$pathStr accepted with mod time $modTime")
return true
}
But it would very inefficient for lot of files directory. Are there any other
way to judge the moved new files?
> FileInputDStream could not detected files which moved to the directory
> -----------------------------------------------------------------------
>
> Key: SPARK-18974
> URL: https://issues.apache.org/jira/browse/SPARK-18974
> Project: Spark
> Issue Type: Bug
> Components: DStreams
> Affects Versions: 1.6.3, 2.0.2
> Reporter: Adam Wang
>
> FileInputDStream use mod time to find new files, but if a file was moved into
> the directories it's modification time would not be changed, so
> FileInputDStream could not detect these files.
> I think a way to fix this bug is get access_time and do judgment, bug it need
> a Set of files to save all old files, it would very inefficient for lot of
> files directory.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]