[jira] [Created] (FLINK-1679) Document how degree of parallelism / parallelism / slots are connected to each other
Robert Metzger created FLINK-1679:

Summary: Document how degree of parallelism / parallelism / slots are connected to each other
Key: FLINK-1679
URL: https://issues.apache.org/jira/browse/FLINK-1679
Project: Flink
Issue Type: Task
Components: Documentation
Affects Versions: 0.9
Reporter: Robert Metzger

I see too many users being confused about properly setting up Flink with respect to parallelism.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[DISCUSS] Deprecate Spargel API for 0.9
Hi all,

I would like your opinion on whether we should deprecate the Spargel API in 0.9. Gelly doesn't depend on Spargel; it actually contains it -- we have copied the relevant classes over. I think it would be a good idea to deprecate Spargel in 0.9, so that we can inform existing Spargel users that we'll eventually remove it.

Also, I think the fact that we have two Graph APIs in the documentation might be a bit confusing for newcomers. One might wonder why we have both and when to use one over the other. It might be a good idea to add a note in the Spargel guide suggesting to use Gelly instead, and a corresponding note at the beginning of the Gelly guide explaining that Spargel is now part of Gelly. Or maybe a "Gelly or Spargel?" section. What do you think?

The only thing that worries me is that the Gelly API is not very stable. Of course, we are mostly adding things, but we are planning to make some changes as well, and I'm sure more will be needed the more we use it.

Looking forward to your thoughts!

Cheers,
Vasia.
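As a minimal sketch of the deprecation mechanics being proposed (the class and method names below are illustrative stand-ins, not the actual Spargel API):

```java
// Hypothetical sketch of marking a Spargel-style entry point deprecated.
// The name "createVertexCentricIteration" is illustrative only.
public class SpargelDeprecationSketch {

    /**
     * @deprecated Spargel has been absorbed into Gelly; use Gelly's
     *             vertex-centric iteration instead.
     */
    @Deprecated
    public static String createVertexCentricIteration() {
        return "spargel-iteration";
    }

    public static void main(String[] args) {
        // Calls still compile and run, but the compiler now emits a
        // deprecation warning at every call site, which is exactly the
        // nudge existing Spargel users would get.
        System.out.println(createVertexCentricIteration());
    }
}
```

The `@Deprecated` annotation triggers compiler warnings, while the `@deprecated` Javadoc tag carries the migration hint to the docs.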
Re: [DISCUSS] Add method for each Akka message
+1 for the idea too. Should make it easier to trace/debug.

- Henry

On Tue, Mar 10, 2015 at 5:45 AM, Stephan Ewen se...@apache.org wrote:

+1, let's change this lazily: whenever we work on an action/message, we pull the handling out into a dedicated method.

On Tue, Mar 10, 2015 at 11:49 AM, Ufuk Celebi u...@apache.org wrote:

Hey all, I currently find it a little bit frustrating to navigate between different task manager operations like cancel or submit task. Some of these operations are done directly in the event loop (e.g. cancelling), whereas others forward the message to a method (e.g. submitting). For me, navigating to methods is way easier than manually scanning the event loop. Therefore, I would prefer to forward all messages to a corresponding method. Can I get some opinions on this? Would anyone be opposed? [Or is there a way in IntelliJ to do this navigation more efficiently? I couldn't find anything.]

– Ufuk
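The pattern under discussion -- keep nothing but dispatch in the receive loop and give every message type its own named handler -- can be sketched in plain Java, independent of Akka. All message and handler names here are illustrative, not Flink's actual actor code:

```java
// Plain-Java sketch of the proposed style: the "event loop" only
// dispatches, and each message type gets a dedicated method that an
// IDE can navigate to directly. Types and names are illustrative.
public class TaskManagerSketch {

    public static class SubmitTask {
        public final String taskId;
        public SubmitTask(String taskId) { this.taskId = taskId; }
    }

    public static class CancelTask {
        public final String taskId;
        public CancelTask(String taskId) { this.taskId = taskId; }
    }

    public String lastAction = "";

    // The receive loop: dispatch only, no business logic inline.
    public void receive(Object message) {
        if (message instanceof SubmitTask) {
            handleSubmitTask((SubmitTask) message);
        } else if (message instanceof CancelTask) {
            handleCancelTask((CancelTask) message);
        }
    }

    public void handleSubmitTask(SubmitTask msg) {
        lastAction = "submitted " + msg.taskId;
    }

    public void handleCancelTask(CancelTask msg) {
        lastAction = "cancelled " + msg.taskId;
    }

    public static void main(String[] args) {
        TaskManagerSketch tm = new TaskManagerSketch();
        tm.receive(new SubmitTask("task-1"));
        System.out.println(tm.lastAction);
    }
}
```

With this shape, "find usages" and "go to declaration" on `handleSubmitTask` work as usual, which is the navigation problem Ufuk describes.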
Re: [jira] [Commented] (FLINK-1679) Document how degree of parallelism / parallelism / slots are connected to each other
+1 for consistently calling it parallelism. -1 for AUTOMAX as the default.

On Thu, Mar 12, 2015 at 10:31 AM, Robert Metzger rmetz...@apache.org wrote:

We can also make the change non-API-breaking by adding an additional method and deprecating the old one. Why would the AUTOMAX parallelism eat up all cluster resources? It would only allocate all slots WITHIN the Flink cluster. Those users (= new users) who would benefit from the AUTOMAX parallelism probably have the parallelism per TaskManager set to 1 anyway. Advanced users will set their parallelism / slots configuration properly anyway.

In my experience, most users:
- have exclusive access to a test cluster in the beginning (I don't think anybody who doesn't know the system at all would start Flink on a production cluster)
- or use YARN
- do not set any parallelism for jobs or slots per TaskManager.

From these observations, I would actually set the number of slots on the TaskManagers to the number of available CPUs. And for the CLI frontend, I would by default let a job use all available slots (most users don't know that Flink allows running multiple jobs at the same time). If users want to change the behavior, they have to look into the documentation.

On Thu, Mar 12, 2015 at 10:20 AM, Fabian Hueske fhue...@gmail.com wrote:

+1 for going consistently with parallelism. However, these are API-breaking changes and we need to mark them deprecated before throwing them out, IMO. I am not comfortable with using AUTOMAX as a default. This is fine on dedicated setups like YARN sessions, but will consume all available resources of a cluster if a user forgets to set the -p flag (or fix the DOP in the program). There is already a default-parallelism flag in the config and that value should be used, IMO.

2015-03-12 10:07 GMT+01:00 Robert Metzger (JIRA) j...@apache.org:

[ https://issues.apache.org/jira/browse/FLINK-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358345#comment-14358345 ]

Robert Metzger commented on FLINK-1679:

I would suggest removing all occurrences of degreeOfParallelism in the system and replacing them with parallelism everywhere. The CLI frontend, for example, also calls it {{-p}}, not {{-dop}}. I would also suggest setting the parallelism by default to {{AUTOMAX}} in the CliFrontend.

Document how degree of parallelism / parallelism / slots are connected to each other

Key: FLINK-1679
URL: https://issues.apache.org/jira/browse/FLINK-1679
Project: Flink
Issue Type: Task
Components: Documentation
Affects Versions: 0.9
Reporter: Robert Metzger
Assignee: Ufuk Celebi

I see too many users being confused about properly setting up Flink with respect to parallelism.
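The precedence being debated in this thread can be sketched as a small resolution function. This is purely illustrative of the proposal (explicit -p wins, then the configured default, then either 1 or "all available slots" under AUTOMAX); it is not Flink's actual CliFrontend logic:

```java
// Illustrative sketch of the parallelism precedence under discussion.
// UNSET stands in for "the user did not specify a value".
public class ParallelismSketch {

    public static final int UNSET = -1;

    public static int resolve(int cliParallelism, int configDefault,
                              int totalSlots, boolean autoMaxByDefault) {
        if (cliParallelism != UNSET) {
            return cliParallelism;   // user passed -p explicitly
        }
        if (configDefault != UNSET) {
            return configDefault;    // default parallelism from the config
        }
        // The contested case: AUTOMAX would take every slot in the
        // Flink cluster, the conservative default is 1.
        return autoMaxByDefault ? totalSlots : 1;
    }
}
```

Fabian's objection targets only the last branch: on a shared cluster, falling through to `totalSlots` consumes everything, whereas on a dedicated YARN session it is harmless.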
Re: [DISCUSS] Make a release to be announced at ApacheCon
Have you run the 20 builds with the new shading code? With the new shading, the TaskManagerFailsITCase should no longer fail. If it still does, then we have to look into it again.

On Thu, Mar 12, 2015 at 2:01 PM, Stephan Ewen se...@apache.org wrote:

I am also big time skeptical. There are some remaining stability issues with 0.9:
- Apparently a bug in the task canceling
- Blocking Data Exchange is premature at this point
- TaskManager startup is not robust
- TaskManager / JobManager registration is not robust
- Streaming fault tolerance needs more testing before we can make an assessment

I think this needs a few more weeks...

On Thu, Mar 12, 2015 at 1:57 PM, Aljoscha Krettek aljos...@apache.org wrote:

I would like to get the Expression API for Java in there, as well.

On Thu, Mar 12, 2015 at 11:59 AM, Ufuk Celebi u...@apache.org wrote:

On Thu, Mar 12, 2015 at 10:11 AM, Robert Metzger rmetz...@apache.org wrote:

So you're saying regarding the release you don't feel very confident that we manage to fork off release-0.9 next week?

Yes. At the moment I would be uncomfortable with forking off. Regarding the failing tests: I thought that some failing jobs were related to my changes, but after looking into it, it was a false alarm. See comments here: https://github.com/apache/flink/pull/475
Building Flink takes long time now =(
It seems like after a few commits between today and yesterday, building Flink takes a very long time. I used to be able to run mvn clean install -DskipTests in about 17 minutes, but after pulling this morning, it has been 25 minutes and the build has not completed yet. I am using Mac OS X with JDK 7.

- Henry
Re: [DISCUSS] Make a release to be announced at ApacheCon
On Thursday, March 12, 2015, Till Rohrmann till.rohrm...@gmail.com wrote:

Have you run the 20 builds with the new shading code? With the new shading, the TaskManagerFailsITCase should no longer fail. If it still does, then we have to look into it again.

No, I rebased on Monday before shading. Let me rebase and rerun tonight.
Re: [DISCUSS] Make a release to be announced at ApacheCon
I would like to get the Expression API for Java in there, as well.

On Thu, Mar 12, 2015 at 11:59 AM, Ufuk Celebi u...@apache.org wrote:

On Thu, Mar 12, 2015 at 10:11 AM, Robert Metzger rmetz...@apache.org wrote:

So you're saying regarding the release you don't feel very confident that we manage to fork off release-0.9 next week?

Yes. At the moment I would be uncomfortable with forking off. Regarding the failing tests: I thought that some failing jobs were related to my changes, but after looking into it, it was a false alarm. See comments here: https://github.com/apache/flink/pull/475
Re: [DISCUSS] Make a release to be announced at ApacheCon
I am also big time skeptical. There are some remaining stability issues with 0.9:
- Apparently a bug in the task canceling
- Blocking Data Exchange is premature at this point
- TaskManager startup is not robust
- TaskManager / JobManager registration is not robust
- Streaming fault tolerance needs more testing before we can make an assessment

I think this needs a few more weeks...

On Thu, Mar 12, 2015 at 1:57 PM, Aljoscha Krettek aljos...@apache.org wrote:

I would like to get the Expression API for Java in there, as well.

On Thu, Mar 12, 2015 at 11:59 AM, Ufuk Celebi u...@apache.org wrote:

On Thu, Mar 12, 2015 at 10:11 AM, Robert Metzger rmetz...@apache.org wrote:

So you're saying regarding the release you don't feel very confident that we manage to fork off release-0.9 next week?

Yes. At the moment I would be uncomfortable with forking off. Regarding the failing tests: I thought that some failing jobs were related to my changes, but after looking into it, it was a false alarm. See comments here: https://github.com/apache/flink/pull/475
[jira] [Created] (FLINK-1695) Create machine learning library
Till Rohrmann created FLINK-1695:

Summary: Create machine learning library
Key: FLINK-1695
URL: https://issues.apache.org/jira/browse/FLINK-1695
Project: Flink
Issue Type: Improvement
Reporter: Till Rohrmann
Assignee: Till Rohrmann

Create the infrastructure for Flink's machine learning library. This includes the creation of the module structure and the implementation of basic types such as vectors and matrices.
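The "basic types such as vectors and matrices" mentioned above could look roughly like the following dense-vector sketch. The class name and API are illustrative stand-ins, not FlinkML code:

```java
// Minimal dense-vector sketch of the kind of basic type an ML module
// needs; illustrative only, not the actual FlinkML Vector type.
public class DenseVectorSketch {

    public final double[] data;

    public DenseVectorSketch(double... data) {
        this.data = data;
    }

    // Inner (dot) product, the workhorse of linear-model training.
    public double dot(DenseVectorSketch other) {
        double sum = 0.0;
        for (int i = 0; i < data.length; i++) {
            sum += data[i] * other.data[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        DenseVectorSketch a = new DenseVectorSketch(1, 2, 3);
        DenseVectorSketch b = new DenseVectorSketch(4, 5, 6);
        System.out.println(a.dot(b)); // 1*4 + 2*5 + 3*6
    }
}
```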
[jira] [Created] (FLINK-1696) Add multiple linear regression to ML library
Till Rohrmann created FLINK-1696:

Summary: Add multiple linear regression to ML library
Key: FLINK-1696
URL: https://issues.apache.org/jira/browse/FLINK-1696
Project: Flink
Issue Type: Improvement
Reporter: Till Rohrmann
Assignee: Till Rohrmann

Add multiple linear regression to ML library.
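For reference, the algorithm the issue asks for can be sketched as batch gradient descent on the least-squares objective. This toy version (no intercept, no regularization, dense arrays instead of DataSets) only illustrates the math, not the FlinkML implementation:

```java
// Toy batch-gradient-descent sketch of multiple linear regression:
// minimize (1/2n) * sum_i (w . x_i - y_i)^2 over the weight vector w.
// Illustrative only; not FlinkML code.
public class LinearRegressionSketch {

    public static double[] fit(double[][] x, double[] y,
                               double learningRate, int iterations) {
        int n = x.length;
        int d = x[0].length;
        double[] w = new double[d]; // start at the zero vector

        for (int it = 0; it < iterations; it++) {
            double[] grad = new double[d];
            for (int i = 0; i < n; i++) {
                double prediction = 0.0;
                for (int j = 0; j < d; j++) {
                    prediction += w[j] * x[i][j];
                }
                double error = prediction - y[i];
                for (int j = 0; j < d; j++) {
                    grad[j] += error * x[i][j];
                }
            }
            // Average gradient over all samples, then take a step.
            for (int j = 0; j < d; j++) {
                w[j] -= learningRate * grad[j] / n;
            }
        }
        return w;
    }
}
```

On data generated exactly by y = 2*x1 + 3*x2, the recovered weights converge to (2, 3); the Flink version would distribute the per-sample gradient computation instead of looping locally.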
Re: [DISCUSS] Make a release to be announced at ApacheCon
On Thu, Mar 12, 2015 at 10:11 AM, Robert Metzger rmetz...@apache.org wrote:

So you're saying regarding the release you don't feel very confident that we manage to fork off release-0.9 next week?

Yes. At the moment I would be uncomfortable with forking off. Regarding the failing tests: I thought that some failing jobs were related to my changes, but after looking into it, it was a false alarm. See comments here: https://github.com/apache/flink/pull/475
[jira] [Created] (FLINK-1693) Deprecate the Spargel API
Vasia Kalavri created FLINK-1693:

Summary: Deprecate the Spargel API
Key: FLINK-1693
URL: https://issues.apache.org/jira/browse/FLINK-1693
Project: Flink
Issue Type: Task
Components: Spargel
Affects Versions: 0.9
Reporter: Vasia Kalavri

For the upcoming 0.9 release, we should mark all user-facing methods from the Spargel API as deprecated, with a warning that we are going to remove it at some point. We should also add a comment in the docs and point people to Gelly.
[jira] [Created] (FLINK-1694) Change the split between create/run of a vertex-centric iteration
Vasia Kalavri created FLINK-1694:

Summary: Change the split between create/run of a vertex-centric iteration
Key: FLINK-1694
URL: https://issues.apache.org/jira/browse/FLINK-1694
Project: Flink
Issue Type: Task
Components: Gelly
Reporter: Vasia Kalavri

Currently, the vertex-centric API in Gelly looks like this:

{code:java}
Graph inputGraph = ... // create graph
VertexCentricIteration iteration = inputGraph.createVertexCentricIteration();
... // configure the iteration
Graph newGraph = inputGraph.runVertexCentricIteration(iteration);
{code}

We have this create/run split in order to expose the iteration object and be able to call the public methods of VertexCentricIteration. However, this is not very nice and might lead to errors if create and run are mistakenly called on different graph objects.

One suggestion is to change this to the following:

{code:java}
VertexCentricIteration iteration = inputGraph.createVertexCentricIteration();
... // configure the iteration
Graph newGraph = iteration.result();
{code}

or to go with a single run call, where we add an IterationConfiguration object as a parameter and don't expose the iteration object to the user at all:

{code:java}
IterationConfiguration parameters = ...
Graph newGraph = inputGraph.runVertexCentricIteration(parameters);
{code}

and we can also have a simplified method where no configuration is passed. What do you think? Personally, I like the second option a bit more.

-Vasia.
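The safety argument for the second option can be demonstrated with a small plain-Java mock. All types below are stand-ins, not Gelly classes:

```java
// Plain-Java mock of the issue's second option: the user passes a
// configuration object to a single run call and never holds an
// iteration object, so create and run cannot accidentally be invoked
// on two different graphs. Everything here is illustrative.
public class IterationConfigSketch {

    public static class IterationConfiguration {
        public int maxIterations = 10;
        public String name = "unnamed";
    }

    public static class Graph {
        public final String id;

        public Graph(String id) {
            this.id = id;
        }

        // One call bundles create + configure + run, eliminating the
        // mismatched-graph hazard of the current create/run split.
        public Graph runVertexCentricIteration(IterationConfiguration cfg) {
            return new Graph(id + "/" + cfg.name + "x" + cfg.maxIterations);
        }
    }

    public static void main(String[] args) {
        IterationConfiguration cfg = new IterationConfiguration();
        cfg.name = "sssp";
        cfg.maxIterations = 20;
        Graph result = new Graph("g").runVertexCentricIteration(cfg);
        System.out.println(result.id);
    }
}
```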
Re: [jira] [Commented] (FLINK-1679) Document how degree of parallelism / parallelism / slots are connected to each other
+1 for unifying the way to set the parallelism and deprecating the old methods.

We had the AUTOMAX discussion before in the corresponding pull request. It seems that there are two orthogonal views on how resources should be allocated by default. I strongly agree with Robert. Users have exclusive access to resources or use a resource manager (YARN). They are often unaware of the parallelism and are turned off by the bad performance with a parallelism of 1. Setting AUTOMAX by default gives the best possible Flink experience. After all, Flink doesn't even support proper sharing of resources at the moment, so scenarios where multiple users manually set the parallelism will cause problems with job canceling due to unavailable resources and missing queuing features. Let's leave it up to the advanced users to set the granularity of the parallelism and provide the best out-of-the-box experience for Flink novices.

Best regards,
Max

On Thu, Mar 12, 2015 at 10:31 AM, Robert Metzger rmetz...@apache.org wrote:

We can also make the change non-API-breaking by adding an additional method and deprecating the old one. Why would the AUTOMAX parallelism eat up all cluster resources? It would only allocate all slots WITHIN the Flink cluster. Those users (= new users) who would benefit from the AUTOMAX parallelism probably have the parallelism per TaskManager set to 1 anyway. Advanced users will set their parallelism / slots configuration properly anyway.

In my experience, most users:
- have exclusive access to a test cluster in the beginning (I don't think anybody who doesn't know the system at all would start Flink on a production cluster)
- or use YARN
- do not set any parallelism for jobs or slots per TaskManager.

From these observations, I would actually set the number of slots on the TaskManagers to the number of available CPUs. And for the CLI frontend, I would by default let a job use all available slots (most users don't know that Flink allows running multiple jobs at the same time). If users want to change the behavior, they have to look into the documentation.

On Thu, Mar 12, 2015 at 10:20 AM, Fabian Hueske fhue...@gmail.com wrote:

+1 for going consistently with parallelism. However, these are API-breaking changes and we need to mark them deprecated before throwing them out, IMO. I am not comfortable with using AUTOMAX as a default. This is fine on dedicated setups like YARN sessions, but will consume all available resources of a cluster if a user forgets to set the -p flag (or fix the DOP in the program). There is already a default-parallelism flag in the config and that value should be used, IMO.

2015-03-12 10:07 GMT+01:00 Robert Metzger (JIRA) j...@apache.org:

[ https://issues.apache.org/jira/browse/FLINK-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358345#comment-14358345 ]

Robert Metzger commented on FLINK-1679:

I would suggest removing all occurrences of degreeOfParallelism in the system and replacing them with parallelism everywhere. The CLI frontend, for example, also calls it {{-p}}, not {{-dop}}. I would also suggest setting the parallelism by default to {{AUTOMAX}} in the CliFrontend.

Document how degree of parallelism / parallelism / slots are connected to each other

Key: FLINK-1679
URL: https://issues.apache.org/jira/browse/FLINK-1679
Project: Flink
Issue Type: Task
Components: Documentation
Affects Versions: 0.9
Reporter: Robert Metzger
Assignee: Ufuk Celebi

I see too many users being confused about properly setting up Flink with respect to parallelism.
Re: [jira] [Commented] (FLINK-1106) Deprecate old Record API
And I'm +1 for removing the old API with the next release.

2015-03-10 17:38 GMT+01:00 Fabian Hueske fhue...@gmail.com:

Yeah, I spotted a good amount of optimizer tests that depend on the Record API. I implemented the last optimizer tests with the new API and would volunteer to port the other optimizer tests.

2015-03-10 16:32 GMT+01:00 Stephan Ewen (JIRA) j...@apache.org:

[ https://issues.apache.org/jira/browse/FLINK-1106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14355063#comment-14355063 ]

Stephan Ewen commented on FLINK-1106:

A bit of test coverage depends on the deprecated API. We would need to port at least some of the tests to the new API. We can probably drop some subsumed / obsolete tests.

Deprecate old Record API

Key: FLINK-1106
URL: https://issues.apache.org/jira/browse/FLINK-1106
Project: Flink
Issue Type: Task
Components: Java API
Affects Versions: 0.7.0-incubating
Reporter: Robert Metzger
Assignee: Robert Metzger
Priority: Critical
Fix For: 0.7.0-incubating

For the upcoming 0.7 release, we should mark all user-facing methods of the old Record Java API as deprecated, with a warning that we are going to remove it at some point. I would suggest waiting one or two releases after the 0.7 release (given our current release cycle). I'll start a mailing-list discussion at some point regarding this.
[jira] [Created] (FLINK-1688) Add socket sink
Márton Balassi created FLINK-1688:

Summary: Add socket sink
Key: FLINK-1688
URL: https://issues.apache.org/jira/browse/FLINK-1688
Project: Flink
Issue Type: Sub-task
Components: Streaming
Reporter: Márton Balassi
Priority: Trivial

Add a sink that writes output to a socket. I'd consider two options: one which implements a socket server and one which implements a client.
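The client variant of the proposed sink can be sketched with plain java.net sockets: each record becomes one line written to a remote endpoint. This is an illustrative stand-in only; it is not Flink's streaming sink API:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;

// Sketch of the "client" option for the socket sink: connect to a
// remote host and write each record as one line. Names and the
// line-per-record format are assumptions for illustration.
public class SocketSinkSketch {

    public static void writeRecords(String host, int port, String[] records)
            throws IOException {
        try (Socket socket = new Socket(host, port);
             PrintWriter out = new PrintWriter(socket.getOutputStream(), true)) {
            for (String record : records) {
                out.println(record); // autoflush pushes each record out
            }
        }
    }

    // Demo: a throwaway server on an ephemeral port receives the records.
    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(0)) {
            Thread sender = new Thread(() -> {
                try {
                    writeRecords("localhost", server.getLocalPort(),
                                 new String[] {"a", "b", "c"});
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            });
            sender.start();
            try (Socket client = server.accept();
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(client.getInputStream()))) {
                System.out.println(in.readLine()); // first record received
            }
            sender.join();
        }
    }
}
```

The "server" option would invert the roles: the sink opens a ServerSocket and downstream consumers connect to it.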
Re: [DISCUSS] Make a release to be announced at ApacheCon
I will follow up again with Sally this week about whether there is any special messaging or communication needed for ApacheCon from our side.

- Henry

On Tue, Mar 10, 2015 at 3:20 AM, Robert Metzger rmetz...@apache.org wrote:

Hey, what's the status on this? There is one week left until we are going to fork off a branch for 0.9, if we stick to the suggested timeline. The initial email said "I am very much in favor of doing this, under the strong condition that we are very confident that the master has grown to be stable enough." I think it is time to evaluate whether we are confident that the master is stable.

Best, Robert

On Wed, Mar 4, 2015 at 9:42 AM, Robert Metzger rmetz...@apache.org wrote:

+1 for Marton as a release manager. Thank you!

On Tue, Mar 3, 2015 at 7:56 PM, Henry Saputra henry.sapu...@gmail.com wrote:

Ah, thanks Márton. So we are charting toward a concept similar to Spark RDD staged execution =P I suppose there will be a runtime configuration or hint to tell the Flink JobManager which execution mode is preferred?

- Henry

On Tue, Mar 3, 2015 at 2:09 AM, Márton Balassi balassi.mar...@gmail.com wrote:

Hi Henry,

Batch mode is a new execution mode for batch Flink jobs where, instead of pipelining the whole execution, the job is scheduled in stages, thus materializing the intermediate result before continuing to the next operators. For implications see [1].

[1] http://www.slideshare.net/KostasTzoumas/flink-internals, pages 18-21.

On Mon, Mar 2, 2015 at 11:39 PM, Henry Saputra henry.sapu...@gmail.com wrote:

Hi Stephan,

What is the Batch mode feature in the list?

- Henry

On Mon, Mar 2, 2015 at 5:03 AM, Stephan Ewen se...@apache.org wrote:

Hi all!

ApacheCon is coming up, and it is the 15th anniversary of the Apache Software Foundation. In the course of the conference, Apache would like to make a series of announcements. If we manage to make a release during (or shortly before) ApacheCon, they will announce it through their channels.

I am very much in favor of doing this, under the strong condition that we are very confident that the master has grown to be stable enough (there are major changes in the distributed runtime since version 0.8 that we are still stabilizing). No use in a widely announced build that does not have the quality. Flink now has many new features that warrant a release soon (once we fix the last quirks in the new distributed runtime). Notable new features are:
- Gelly
- Streaming windows
- Flink on Tez
- Expression API
- Distributed Runtime on Akka
- Batch mode
- Maybe even a first ML library version
- Some streaming fault tolerance

Robert proposed a feature freeze mid-March for that. His key dates were:

Feature freeze (forking off release-0.9): March 17
RC1 vote: March 24

The RC1 vote is 20 days before ApacheCon (April 13). For the last three releases, the average voting time was 20 days:

R 0.8.0 -- 14 days
R 0.7.0 -- 22 days
R 0.6 -- 26 days

Please share your opinion on this!

Greetings,
Stephan