Yes, that was intentional. The whole point of using a parallel engine is to process large datasets. Otherwise you could do it in Python on a single box... Remote reads will severely hurt performance and might cause a significant regression.
2014-10-17 12:04 GMT+02:00 Robert Metzger <[email protected]>:

> Did you intentionally post to the mailing list?
>
> I'm investigating the issue.
> So far, I found that the hostname has never been passed to the input split
> assigner. I guess this issue was introduced by the recent jobmanager
> changes.
> And secondly, Flink is using the fully qualified hostname, whereas HDFS is
> using the hostname only. This caused a string mismatch.
>
> I wouldn't cancel the release because we are at a point where it is faster
> to vote on a bugfix release.
> The issue is not a show stopper for using Flink. It's just slow on large
> datasets.
>
> On Fri, Oct 17, 2014 at 11:58 AM, Fabian Hueske <[email protected]>
> wrote:
>
> > This is a critical issue and sounds a bit like a release blocker for 0.7
> > to me.
> >
> > Other opinions?
> >
> > 2014-10-17 11:25 GMT+02:00 Robert Metzger (JIRA) <[email protected]>:
> >
> > > Robert Metzger created FLINK-1170:
> > > -------------------------------------
> > >
> > >     Summary: Localization of InputSplits is not working properly
> > >     Key: FLINK-1170
> > >     URL: https://issues.apache.org/jira/browse/FLINK-1170
> > >     Project: Flink
> > >     Issue Type: Bug
> > >     Components: Distributed Runtime
> > >     Reporter: Robert Metzger
> > >     Assignee: Robert Metzger
> > >
> > > While running some benchmarks, I found that Flink is not properly
> > > assigning the InputSplits.
> > >
> > > On my testing cluster, ALL splits were assigned to remote HDFS
> > > DataNodes, which causes a lot of network I/O.
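
To illustrate the string mismatch described in the quoted message (Flink reporting the fully qualified hostname while HDFS reports the hostname only), here is a minimal sketch in Python. It is not Flink's actual code or the actual fix. The function names normalize_host and is_local_split and the example hostnames are hypothetical, and the sketch assumes that comparing the first DNS label is enough to identify a node, which may not hold on clusters where short names are not unique.

```python
import ipaddress


def normalize_host(host):
    """Reduce a hostname to its lowercased first label (hypothetical helper).

    IP address literals are returned unchanged, because splitting them
    on '.' would destroy them.
    """
    host = host.strip().lower().rstrip(".")
    try:
        ipaddress.ip_address(host)
        return host
    except ValueError:
        # Not an IP literal: keep only the part before the first dot.
        return host.split(".", 1)[0]


def is_local_split(split_hosts, worker_host):
    """Return True if any host storing the split matches the worker's host."""
    worker = normalize_host(worker_host)
    return any(normalize_host(h) == worker for h in split_hosts)


# Hypothetical example: HDFS reports short names, the worker reports an FQDN.
split_hosts = ["node1", "node2"]
worker_host = "node1.example.com"

# A naive string comparison never matches, so every split looks remote.
naive_match = worker_host in split_hosts

# Comparing normalized names recognizes the split as local.
normalized_match = is_local_split(split_hosts, worker_host)
```

With the naive comparison, naive_match is False even though the data sits on the same machine, which is the kind of mismatch that would make all splits appear remote. Normalizing both sides before comparing yields True for the same inputs.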
