Re: [Proposal] Simulate setting for application launch

2017-12-19 Thread Pramod Immaneni
The external system recording was just an example, not a specific use case.
The idea is to provide comprehensive information to populateDAG as to the
context it is being called under. It is akin to the test mode or simulate
flag that you see with various utilities. The platform cannot control what
populateDAG does, even without this information, in multiple calls that you
mention the application can return different DAGs by depending on
any external factor such as time of day or some external variable. This is
to merely provide more context information in the config. It is upto the
application to do what it wishes with it.

On Tue, Dec 19, 2017 at 2:28 PM, Vlad Rozov  wrote:

> -0.5: populateDAG() may be called by the platform as many times as it
> needs (even in case it calls it only once now to launch an application).
> Passing different parameters to populateDAG() in simulate launch mode and
> actual launch may lead to different DAG being constructed for those two
> modes. Can't the use case you described be handled by a plugin?
>
> Thank you,
>
> Vlad
>
>
> On 12/19/17 10:06, Sanjay Pujare wrote:
>
>> +1 although I prefer something that is more enforceable. So I like the
>> idea
>> of another method but that introduces incompatibility so may be in 4.0?
>>
>> On Tue, Dec 19, 2017 at 9:40 AM, Munagala Ramanath <
>> amberar...@yahoo.com.invalid> wrote:
>>
>>   +1
>>> Ram
>>>  On Tuesday, December 19, 2017, 8:33:21 AM PST, Pramod Immaneni <
>>> pra...@datatorrent.com> wrote:
>>>
>>>   I have a mini proposal. The command get-app-package-info runs the
>>> populateDAG method of an application to construct the DAG but does not
>>> actually launch the DAG. An application developer does not know in which
>>> context the populateDAG is being called. For example, if they are
>>> recording
>>> application starts in an external system from populateDAG, they will have
>>> false entries there. This can be solved in different ways such as
>>> introducing another method in StreamingApplication or more parameters
>>> to populateDAG but a non disruptive option would be to add a property in
>>> the configuration object that is passed to populateDAG to indicate if it
>>> is
>>> simulate/test mode or real launch. An application developer can use this
>>> property to take the appropriate actions.
>>>
>>> Thanks
>>>
>>>
>>>
>


Re: [Proposal] Simulate setting for application launch

2017-12-19 Thread Vlad Rozov
-0.5: populateDAG() may be called by the platform as many times as it 
needs (even in case it calls it only once now to launch an application). 
Passing different parameters to populateDAG() in simulate launch mode 
and actual launch may lead to different DAG being constructed for those 
two modes. Can't the use case you described be handled by a plugin?


Thank you,

Vlad

On 12/19/17 10:06, Sanjay Pujare wrote:

+1 although I prefer something that is more enforceable. So I like the idea
of another method but that introduces incompatibility so may be in 4.0?

On Tue, Dec 19, 2017 at 9:40 AM, Munagala Ramanath <
amberar...@yahoo.com.invalid> wrote:


  +1
Ram
 On Tuesday, December 19, 2017, 8:33:21 AM PST, Pramod Immaneni <
pra...@datatorrent.com> wrote:

  I have a mini proposal. The command get-app-package-info runs the
populateDAG method of an application to construct the DAG but does not
actually launch the DAG. An application developer does not know in which
context the populateDAG is being called. For example, if they are recording
application starts in an external system from populateDAG, they will have
false entries there. This can be solved in different ways such as
introducing another method in StreamingApplication or more parameters
to populateDAG but a non disruptive option would be to add a property in
the configuration object that is passed to populateDAG to indicate if it is
simulate/test mode or real launch. An application developer can use this
property to take the appropriate actions.

Thanks






Re: [Proposal] Simulate setting for application launch

2017-12-19 Thread Sanjay Pujare
+1 although I prefer something that is more enforceable. So I like the idea
of another method but that introduces incompatibility so may be in 4.0?

On Tue, Dec 19, 2017 at 9:40 AM, Munagala Ramanath <
amberar...@yahoo.com.invalid> wrote:

>  +1
> Ram
> On Tuesday, December 19, 2017, 8:33:21 AM PST, Pramod Immaneni <
> pra...@datatorrent.com> wrote:
>
>  I have a mini proposal. The command get-app-package-info runs the
> populateDAG method of an application to construct the DAG but does not
> actually launch the DAG. An application developer does not know in which
> context the populateDAG is being called. For example, if they are recording
> application starts in an external system from populateDAG, they will have
> false entries there. This can be solved in different ways such as
> introducing another method in StreamingApplication or more parameters
> to populateDAG but a non disruptive option would be to add a property in
> the configuration object that is passed to populateDAG to indicate if it is
> simulate/test mode or real launch. An application developer can use this
> property to take the appropriate actions.
>
> Thanks
>
>


Re: [Proposal] Simulate setting for application launch

2017-12-19 Thread Munagala Ramanath
 +1
Ram
On Tuesday, December 19, 2017, 8:33:21 AM PST, Pramod Immaneni 
 wrote:  
 
 I have a mini proposal. The command get-app-package-info runs the
populateDAG method of an application to construct the DAG but does not
actually launch the DAG. An application developer does not know in which
context the populateDAG is being called. For example, if they are recording
application starts in an external system from populateDAG, they will have
false entries there. This can be solved in different ways such as
introducing another method in StreamingApplication or more parameters
to populateDAG but a non disruptive option would be to add a property in
the configuration object that is passed to populateDAG to indicate if it is
simulate/test mode or real launch. An application developer can use this
property to take the appropriate actions.

Thanks
  

[Proposal] Simulate setting for application launch

2017-12-19 Thread Pramod Immaneni
I have a mini proposal. The command get-app-package-info runs the
populateDAG method of an application to construct the DAG but does not
actually launch the DAG. An application developer does not know in which
context the populateDAG is being called. For example, if they are recording
application starts in an external system from populateDAG, they will have
false entries there. This can be solved in different ways such as
introducing another method in StreamingApplication or more parameters
to populateDAG but a non disruptive option would be to add a property in
the configuration object that is passed to populateDAG to indicate if it is
simulate/test mode or real launch. An application developer can use this
property to take the appropriate actions.

Thanks


Re: [Discuss] Design of the python execution operator

2017-12-19 Thread Pramod Immaneni
Hi Ananth,

>From your explanation, it looks like the threads overall allow you to
achieve two things. Have some sort of overall timeout if by which a tuple
doesn't finish processing then it is flagged as such. Second, it doesn't
block processing of subsequent tuples and you can still process them
meeting the SLA. By checkpoint, however, I think you should try to have a
resolution one way or the other for all the tuples received within the
checkpoint period or every window boundary (see idempotency below),
otherwise, there is a chance of data loss in case of operator restarts. If
a loss is acceptable for stragglers you could let straggler processing
continue beyond checkpoint boundary and let them finish when they can. You
could support both behaviors by use of a property. Furthermore, you may not
want all threads stuck with stragglers and then you are back to square one
so you may need to stop processing stragglers beyond a certain thread usage
threshold. Is there a way to interrupt the processing of the engine?

Then there is the question of idempotency. I suspect it would be difficult
to maintain it unless you wait for processing to finish for all tuples
received during the window every window boundary. You may provide an option
for relaxing the strict guarantees for the stragglers like mentioned above.

Pramod

On Thu, Dec 14, 2017 at 10:49 AM, Ananth G  wrote:

> Hello Pramod,
>
> Thanks for the comments. I adjusted the title of the JIRA. Here is what I
> was thinking for the worker pool implementation.
>
> - The main reason ( which I forgot to mention in the design points below )
> is that the java embedded engine allows only the thread that created the
> instance to execute the python logic. This is more because of the JNI
> specification itself. Some hints here https://stackoverflow.com/
> questions/18056347/jni-calling-java-from-c-with-multiple-threads <
> https://stackoverflow.com/questions/18056347/jni-calling-java-from-c-with-
> multiple-threads> and here http://journals.ecs.soton.ac.
> uk/java/tutorial/native1.1/implementing/sync.html <
> http://journals.ecs.soton.ac.uk/java/tutorial/native1.1/
> implementing/sync.html>
>
> - This essentially means that the main operator thread will have to call
> the python code execution logic if the design were otherwise.
>
> - Since the end user can choose to can write any kind of logic including
> blocking I/O as part of the implementation, I did not want to stall the
> operator thread for any usage pattern.
>
> - In fact there is only one overall interpreter in the JVM process space
> and the interpreter thread is just a JNI wrapper around it to account for
> the JNI limitations above.
>
> - It is for the very same reason, there is an API in the implementation to
> support for registering Shared Modules across all of the interpreter
> threads. Use cases for this exist when there is a global variable provided
> by the underlying Python library and loading it multiple times can cause
> issues. Hence the API to register a shared module which can be used by all
> of the Interpreter Threads.
>
> - The operator submits to a work request queue and consumes from a
> response queue for each of the interpreter thread. There exists one request
> and one response queue per interpreter thread.
>
> - The stragglers will get drained from the response queue for a previously
> submitted request queue.
>
> - The other reason why I chose to implement it this ways is also for some
> of the use case that I foresee in the ML scoring scenarios. In fraud
> systems, if I have a strict SLA to score a model, the main thread in the
> operator is not helping me implement this pattern at all. The caller to the
> Apex application will need to proceed if the scoring gets delayed due to
> whatever reason. However the scoring can continue on the interpreter thread
> and can be drained later ( It is just that the caller did not make use of
> this result but can still be persisted for operators consuming from the
> straggler port.
>
> - There are 3 output ports for this operator. DefaultOutputPort,
> stragglersPort and an errorPort.
>
> - Some libraries like Tensorflow can become really heavy. Tensorflow
> models can execute a tensorflow DAG as part of a model scoring
> implementation and hence I wanted to take the approach of a worker pool.
> Yes your point is valid if we wait for the stragglers to complete in a
> given window. The current implementation does not force to wait for all of
> the stragglers to complete. The stragglers are emitted only when there is a
> new tuple that is being processed. i.e. when a new tuple arrives for
> scoring , the straggler response queue is checked if there are any entries
> and if yes, the responses are emitted into the stragglerPort. This
> essentially means that there are situations when the straggler port is
> emitting the result for a request submitted in the previous window. This
> also implies that idempotency cannot be 

[jira] [Resolved] (APEXCORE-798) Exclude log4j.properties from engine-test.jar

2017-12-19 Thread Vlad Rozov (JIRA)

 [ 
https://issues.apache.org/jira/browse/APEXCORE-798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vlad Rozov resolved APEXCORE-798.
-
   Resolution: Fixed
Fix Version/s: 3.7.0

> Exclude log4j.properties from engine-test.jar
> -
>
> Key: APEXCORE-798
> URL: https://issues.apache.org/jira/browse/APEXCORE-798
> Project: Apache Apex Core
>  Issue Type: Bug
>Reporter: Vlad Rozov
>Assignee: Vlad Rozov
>Priority: Minor
> Fix For: 3.7.0
>
>
> log4j.properties is supposed to be excluded from the engine test jar based on 
> the 
> {code:xml}
> 
>   org.apache.maven.plugins
>   maven-jar-plugin
>   2.4
>   
> true
> 
>   
> true
> repository
>   
> 
> 
>   log4j.properties
> 
>   
> 
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (APEXCORE-798) Exclude log4j.properties from engine-test.jar

2017-12-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/APEXCORE-798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16296967#comment-16296967
 ] 

ASF GitHub Bot commented on APEXCORE-798:
-

vrozov closed pull request #590: APEXCORE-798 Exclude log4j.properties from 
engine-test.jar
URL: https://github.com/apache/apex-core/pull/590
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/engine/pom.xml b/engine/pom.xml
index 4294c8abfc..93e7d976e5 100644
--- a/engine/pom.xml
+++ b/engine/pom.xml
@@ -49,7 +49,7 @@
   
 
 
-  
+  
 
 **/yarn-site.xml
   
diff --git 
a/engine/src/test/java/com/datatorrent/stram/StramMiniClusterTest.java 
b/engine/src/test/java/com/datatorrent/stram/StramMiniClusterTest.java
index bbb7c1c49b..fb02593e98 100644
--- a/engine/src/test/java/com/datatorrent/stram/StramMiniClusterTest.java
+++ b/engine/src/test/java/com/datatorrent/stram/StramMiniClusterTest.java
@@ -129,6 +129,7 @@ public static void setup() throws InterruptedException, 
IOException
 conf.set("yarn.scheduler.capacity.root.queues", "default");
 conf.set("yarn.scheduler.capacity.root.default.capacity", "100");
 conf.setBoolean(YarnConfiguration.NM_DISK_HEALTH_CHECK_ENABLE, false);
+conf.setFloat(YarnConfiguration.NM_MAX_PER_DISK_UTILIZATION_PERCENTAGE, 
100.0F);
 conf.set(YarnConfiguration.NM_ADMIN_USER_ENV, 
String.format("JAVA_HOME=%s,CLASSPATH=%s", System.getProperty("java.home"), 
getTestRuntimeClasspath()));
 conf.set(YarnConfiguration.NM_ENV_WHITELIST, 
YarnConfiguration.DEFAULT_NM_ENV_WHITELIST.replaceAll("JAVA_HOME,*", ""));
 


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Exclude log4j.properties from engine-test.jar
> -
>
> Key: APEXCORE-798
> URL: https://issues.apache.org/jira/browse/APEXCORE-798
> Project: Apache Apex Core
>  Issue Type: Bug
>Reporter: Vlad Rozov
>Assignee: Vlad Rozov
>Priority: Minor
>
> log4j.properties is supposed to be excluded from the engine test jar based on 
> the 
> {code:xml}
> 
>   org.apache.maven.plugins
>   maven-jar-plugin
>   2.4
>   
> true
> 
>   
> true
> repository
>   
> 
> 
>   log4j.properties
> 
>   
> 
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)