Re: [Proposal] Simulate setting for application launch
The external system recording was just an example, not a specific use case. The idea is to provide comprehensive information to populateDAG as to the context it is being called under. It is akin to the test mode or simulate flag that you see with various utilities. The platform cannot control what populateDAG does, even without this information, in multiple calls that you mention the application can return different DAGs by depending on any external factor such as time of day or some external variable. This is to merely provide more context information in the config. It is upto the application to do what it wishes with it. On Tue, Dec 19, 2017 at 2:28 PM, Vlad Rozovwrote: > -0.5: populateDAG() may be called by the platform as many times as it > needs (even in case it calls it only once now to launch an application). > Passing different parameters to populateDAG() in simulate launch mode and > actual launch may lead to different DAG being constructed for those two > modes. Can't the use case you described be handled by a plugin? > > Thank you, > > Vlad > > > On 12/19/17 10:06, Sanjay Pujare wrote: > >> +1 although I prefer something that is more enforceable. So I like the >> idea >> of another method but that introduces incompatibility so may be in 4.0? >> >> On Tue, Dec 19, 2017 at 9:40 AM, Munagala Ramanath < >> amberar...@yahoo.com.invalid> wrote: >> >> +1 >>> Ram >>> On Tuesday, December 19, 2017, 8:33:21 AM PST, Pramod Immaneni < >>> pra...@datatorrent.com> wrote: >>> >>> I have a mini proposal. The command get-app-package-info runs the >>> populateDAG method of an application to construct the DAG but does not >>> actually launch the DAG. An application developer does not know in which >>> context the populateDAG is being called. For example, if they are >>> recording >>> application starts in an external system from populateDAG, they will have >>> false entries there. This can be solved in different ways such as >>> introducing another method in StreamingApplication or more parameters >>> to populateDAG but a non disruptive option would be to add a property in >>> the configuration object that is passed to populateDAG to indicate if it >>> is >>> simulate/test mode or real launch. An application developer can use this >>> property to take the appropriate actions. >>> >>> Thanks >>> >>> >>> >
Re: [Proposal] Simulate setting for application launch
-0.5: populateDAG() may be called by the platform as many times as it needs (even in case it calls it only once now to launch an application). Passing different parameters to populateDAG() in simulate launch mode and actual launch may lead to different DAG being constructed for those two modes. Can't the use case you described be handled by a plugin? Thank you, Vlad On 12/19/17 10:06, Sanjay Pujare wrote: +1 although I prefer something that is more enforceable. So I like the idea of another method but that introduces incompatibility so may be in 4.0? On Tue, Dec 19, 2017 at 9:40 AM, Munagala Ramanath < amberar...@yahoo.com.invalid> wrote: +1 Ram On Tuesday, December 19, 2017, 8:33:21 AM PST, Pramod Immaneni < pra...@datatorrent.com> wrote: I have a mini proposal. The command get-app-package-info runs the populateDAG method of an application to construct the DAG but does not actually launch the DAG. An application developer does not know in which context the populateDAG is being called. For example, if they are recording application starts in an external system from populateDAG, they will have false entries there. This can be solved in different ways such as introducing another method in StreamingApplication or more parameters to populateDAG but a non disruptive option would be to add a property in the configuration object that is passed to populateDAG to indicate if it is simulate/test mode or real launch. An application developer can use this property to take the appropriate actions. Thanks
Re: [Proposal] Simulate setting for application launch
+1 although I prefer something that is more enforceable. So I like the idea of another method but that introduces incompatibility so may be in 4.0? On Tue, Dec 19, 2017 at 9:40 AM, Munagala Ramanath < amberar...@yahoo.com.invalid> wrote: > +1 > Ram > On Tuesday, December 19, 2017, 8:33:21 AM PST, Pramod Immaneni < > pra...@datatorrent.com> wrote: > > I have a mini proposal. The command get-app-package-info runs the > populateDAG method of an application to construct the DAG but does not > actually launch the DAG. An application developer does not know in which > context the populateDAG is being called. For example, if they are recording > application starts in an external system from populateDAG, they will have > false entries there. This can be solved in different ways such as > introducing another method in StreamingApplication or more parameters > to populateDAG but a non disruptive option would be to add a property in > the configuration object that is passed to populateDAG to indicate if it is > simulate/test mode or real launch. An application developer can use this > property to take the appropriate actions. > > Thanks > >
Re: [Proposal] Simulate setting for application launch
+1 Ram On Tuesday, December 19, 2017, 8:33:21 AM PST, Pramod Immaneniwrote: I have a mini proposal. The command get-app-package-info runs the populateDAG method of an application to construct the DAG but does not actually launch the DAG. An application developer does not know in which context the populateDAG is being called. For example, if they are recording application starts in an external system from populateDAG, they will have false entries there. This can be solved in different ways such as introducing another method in StreamingApplication or more parameters to populateDAG but a non disruptive option would be to add a property in the configuration object that is passed to populateDAG to indicate if it is simulate/test mode or real launch. An application developer can use this property to take the appropriate actions. Thanks
[Proposal] Simulate setting for application launch
I have a mini proposal. The command get-app-package-info runs the populateDAG method of an application to construct the DAG but does not actually launch the DAG. An application developer does not know in which context the populateDAG is being called. For example, if they are recording application starts in an external system from populateDAG, they will have false entries there. This can be solved in different ways such as introducing another method in StreamingApplication or more parameters to populateDAG but a non disruptive option would be to add a property in the configuration object that is passed to populateDAG to indicate if it is simulate/test mode or real launch. An application developer can use this property to take the appropriate actions. Thanks
Re: [Discuss] Design of the python execution operator
Hi Ananth, >From your explanation, it looks like the threads overall allow you to achieve two things. Have some sort of overall timeout if by which a tuple doesn't finish processing then it is flagged as such. Second, it doesn't block processing of subsequent tuples and you can still process them meeting the SLA. By checkpoint, however, I think you should try to have a resolution one way or the other for all the tuples received within the checkpoint period or every window boundary (see idempotency below), otherwise, there is a chance of data loss in case of operator restarts. If a loss is acceptable for stragglers you could let straggler processing continue beyond checkpoint boundary and let them finish when they can. You could support both behaviors by use of a property. Furthermore, you may not want all threads stuck with stragglers and then you are back to square one so you may need to stop processing stragglers beyond a certain thread usage threshold. Is there a way to interrupt the processing of the engine? Then there is the question of idempotency. I suspect it would be difficult to maintain it unless you wait for processing to finish for all tuples received during the window every window boundary. You may provide an option for relaxing the strict guarantees for the stragglers like mentioned above. Pramod On Thu, Dec 14, 2017 at 10:49 AM, Ananth Gwrote: > Hello Pramod, > > Thanks for the comments. I adjusted the title of the JIRA. Here is what I > was thinking for the worker pool implementation. > > - The main reason ( which I forgot to mention in the design points below ) > is that the java embedded engine allows only the thread that created the > instance to execute the python logic. This is more because of the JNI > specification itself. Some hints here https://stackoverflow.com/ > questions/18056347/jni-calling-java-from-c-with-multiple-threads < > https://stackoverflow.com/questions/18056347/jni-calling-java-from-c-with- > multiple-threads> and here http://journals.ecs.soton.ac. > uk/java/tutorial/native1.1/implementing/sync.html < > http://journals.ecs.soton.ac.uk/java/tutorial/native1.1/ > implementing/sync.html> > > - This essentially means that the main operator thread will have to call > the python code execution logic if the design were otherwise. > > - Since the end user can choose to can write any kind of logic including > blocking I/O as part of the implementation, I did not want to stall the > operator thread for any usage pattern. > > - In fact there is only one overall interpreter in the JVM process space > and the interpreter thread is just a JNI wrapper around it to account for > the JNI limitations above. > > - It is for the very same reason, there is an API in the implementation to > support for registering Shared Modules across all of the interpreter > threads. Use cases for this exist when there is a global variable provided > by the underlying Python library and loading it multiple times can cause > issues. Hence the API to register a shared module which can be used by all > of the Interpreter Threads. > > - The operator submits to a work request queue and consumes from a > response queue for each of the interpreter thread. There exists one request > and one response queue per interpreter thread. > > - The stragglers will get drained from the response queue for a previously > submitted request queue. > > - The other reason why I chose to implement it this ways is also for some > of the use case that I foresee in the ML scoring scenarios. In fraud > systems, if I have a strict SLA to score a model, the main thread in the > operator is not helping me implement this pattern at all. The caller to the > Apex application will need to proceed if the scoring gets delayed due to > whatever reason. However the scoring can continue on the interpreter thread > and can be drained later ( It is just that the caller did not make use of > this result but can still be persisted for operators consuming from the > straggler port. > > - There are 3 output ports for this operator. DefaultOutputPort, > stragglersPort and an errorPort. > > - Some libraries like Tensorflow can become really heavy. Tensorflow > models can execute a tensorflow DAG as part of a model scoring > implementation and hence I wanted to take the approach of a worker pool. > Yes your point is valid if we wait for the stragglers to complete in a > given window. The current implementation does not force to wait for all of > the stragglers to complete. The stragglers are emitted only when there is a > new tuple that is being processed. i.e. when a new tuple arrives for > scoring , the straggler response queue is checked if there are any entries > and if yes, the responses are emitted into the stragglerPort. This > essentially means that there are situations when the straggler port is > emitting the result for a request submitted in the previous window. This > also implies that idempotency cannot be
[jira] [Resolved] (APEXCORE-798) Exclude log4j.properties from engine-test.jar
[ https://issues.apache.org/jira/browse/APEXCORE-798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vlad Rozov resolved APEXCORE-798. - Resolution: Fixed Fix Version/s: 3.7.0 > Exclude log4j.properties from engine-test.jar > - > > Key: APEXCORE-798 > URL: https://issues.apache.org/jira/browse/APEXCORE-798 > Project: Apache Apex Core > Issue Type: Bug >Reporter: Vlad Rozov >Assignee: Vlad Rozov >Priority: Minor > Fix For: 3.7.0 > > > log4j.properties is supposed to be excluded from the engine test jar based on > the > {code:xml} > > org.apache.maven.plugins > maven-jar-plugin > 2.4 > > true > > > true > repository > > > > log4j.properties > > > > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (APEXCORE-798) Exclude log4j.properties from engine-test.jar
[ https://issues.apache.org/jira/browse/APEXCORE-798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16296967#comment-16296967 ] ASF GitHub Bot commented on APEXCORE-798: - vrozov closed pull request #590: APEXCORE-798 Exclude log4j.properties from engine-test.jar URL: https://github.com/apache/apex-core/pull/590 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/engine/pom.xml b/engine/pom.xml index 4294c8abfc..93e7d976e5 100644 --- a/engine/pom.xml +++ b/engine/pom.xml @@ -49,7 +49,7 @@ - + **/yarn-site.xml diff --git a/engine/src/test/java/com/datatorrent/stram/StramMiniClusterTest.java b/engine/src/test/java/com/datatorrent/stram/StramMiniClusterTest.java index bbb7c1c49b..fb02593e98 100644 --- a/engine/src/test/java/com/datatorrent/stram/StramMiniClusterTest.java +++ b/engine/src/test/java/com/datatorrent/stram/StramMiniClusterTest.java @@ -129,6 +129,7 @@ public static void setup() throws InterruptedException, IOException conf.set("yarn.scheduler.capacity.root.queues", "default"); conf.set("yarn.scheduler.capacity.root.default.capacity", "100"); conf.setBoolean(YarnConfiguration.NM_DISK_HEALTH_CHECK_ENABLE, false); +conf.setFloat(YarnConfiguration.NM_MAX_PER_DISK_UTILIZATION_PERCENTAGE, 100.0F); conf.set(YarnConfiguration.NM_ADMIN_USER_ENV, String.format("JAVA_HOME=%s,CLASSPATH=%s", System.getProperty("java.home"), getTestRuntimeClasspath())); conf.set(YarnConfiguration.NM_ENV_WHITELIST, YarnConfiguration.DEFAULT_NM_ENV_WHITELIST.replaceAll("JAVA_HOME,*", "")); This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Exclude log4j.properties from engine-test.jar > - > > Key: APEXCORE-798 > URL: https://issues.apache.org/jira/browse/APEXCORE-798 > Project: Apache Apex Core > Issue Type: Bug >Reporter: Vlad Rozov >Assignee: Vlad Rozov >Priority: Minor > > log4j.properties is supposed to be excluded from the engine test jar based on > the > {code:xml} > > org.apache.maven.plugins > maven-jar-plugin > 2.4 > > true > > > true > repository > > > > log4j.properties > > > > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)