[
https://issues.apache.org/jira/browse/FLINK-1081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14236585#comment-14236585
]
ASF GitHub Bot commented on FLINK-1081:
---------------------------------------
Github user chiwanpark commented on the pull request:
https://github.com/apache/incubator-flink/pull/226#issuecomment-65885299
I suggest a new implementation of this feature, and I would welcome feedback
on the idea. The feature consists of two functions.
1. `FileMonitoringFunction` emits a tuple with three fields: (modified file
path, start offset, end offset). This function implements `NonParallelInput`.
2. `FileMapFunction` (I think this function should be renamed) reads the file
at the given path and emits its contents in the given range. This function
implements `FlatMapFunction` because there is no way to link two source
functions.
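To make the second function more concrete, here is a minimal sketch of the
range-reading step in plain Java, without any Flink API. `FileRange` and
`RangeReader` are hypothetical names standing in for the tuple and for the body
of `FileMapFunction`; the sketch assumes a local file, UTF-8 text, and ranges
that are small and aligned to line boundaries.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the (file path, start offset, end offset) tuple.
final class FileRange {
    final String path;
    final long start; // inclusive byte offset
    final long end;   // exclusive byte offset

    FileRange(String path, long start, long end) {
        this.path = path;
        this.start = start;
        this.end = end;
    }
}

// Hypothetical stand-in for the flat-map step: one range in, many lines out.
final class RangeReader {
    static List<String> readLines(FileRange range) throws IOException {
        // Assumption: the range fits in memory (and in an int).
        byte[] buffer = new byte[(int) (range.end - range.start)];
        try (RandomAccessFile file = new RandomAccessFile(range.path, "r")) {
            file.seek(range.start);   // jump to the start offset
            file.readFully(buffer);   // read exactly end - start bytes
        }
        List<String> lines = new ArrayList<>();
        String text = new String(buffer, StandardCharsets.UTF_8);
        for (String line : text.split("\n")) {
            if (!line.isEmpty()) {
                lines.add(line);      // emit each non-empty line in the range
            }
        }
        return lines;
    }
}
```

In the real function the lines would go to a collector instead of a list, and
the file would be opened through the file system abstraction rather than
`RandomAccessFile`.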
When a user calls `readFileStream` in `StreamExecutionEnvironment`, the
system creates a `FileMonitoringFunction` and a `FileMapFunction`, links them,
and returns the linked pair.
With this implementation, we can fix the parallelism problem with the
monitoring instance. The user can set the degree of parallelism of the source;
in fact, this sets the degree of parallelism of the map function. Only one
instance monitors the file system.
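The split between one monitor and several readers could look roughly like the
following plain-Java sketch. All names here (`MonitoredRange`, `SingleMonitor`,
`ReaderAssignment`) are hypothetical, file sizes are passed in as a map instead
of being read from a file system, and the round-robin distribution is only an
assumption for illustration, not how the streaming runtime necessarily
partitions data.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical (path, start, end) record emitted by the monitor.
final class MonitoredRange {
    final String path;
    final long start;
    final long end;

    MonitoredRange(String path, long start, long end) {
        this.path = path;
        this.start = start;
        this.end = end;
    }
}

// Hypothetical single monitor: remembers how far each file has been handed out.
final class SingleMonitor {
    private final Map<String, Long> emittedUpTo = new HashMap<>();

    // Returns one range for every file that grew since the previous poll.
    List<MonitoredRange> poll(Map<String, Long> currentSizes) {
        List<MonitoredRange> ranges = new ArrayList<>();
        // Iterate in path order so the result is deterministic.
        for (Map.Entry<String, Long> e : new TreeMap<>(currentSizes).entrySet()) {
            long start = emittedUpTo.getOrDefault(e.getKey(), 0L);
            long end = e.getValue();
            if (end > start) {
                ranges.add(new MonitoredRange(e.getKey(), start, end));
                emittedUpTo.put(e.getKey(), end);
            }
        }
        return ranges;
    }
}

final class ReaderAssignment {
    // Assumption: ranges are spread round-robin over the parallel readers.
    static List<List<MonitoredRange>> assign(List<MonitoredRange> ranges, int parallelism) {
        List<List<MonitoredRange>> slots = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            slots.add(new ArrayList<>());
        }
        for (int i = 0; i < ranges.size(); i++) {
            slots.get(i % parallelism).add(ranges.get(i));
        }
        return slots;
    }
}
```

Because the offsets live in a single monitor, no range is handed out twice,
while the number of reader slots can be chosen freely.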
Additionally, we can reuse `FileMapFunction` to replace
`FileSourceFunction`.
How about this implementation?
> Add HDFS file-stream source for streaming
> -----------------------------------------
>
> Key: FLINK-1081
> URL: https://issues.apache.org/jira/browse/FLINK-1081
> Project: Flink
> Issue Type: Improvement
> Components: Streaming
> Affects Versions: 0.7.0-incubating
> Reporter: Gyula Fora
> Assignee: Chiwan Park
> Labels: starter
>
> Add data stream source that will monitor a selected directory on HDFS (or
> other filesystems as well) and will process all new files created.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)