[jira] [Created] (APEXMALHAR-2255) Use latest java sdk for couchbase

2016-09-20 Thread Priyanka Gugale (JIRA)
Priyanka Gugale created APEXMALHAR-2255:
---

 Summary: Use latest java sdk for couchbase
 Key: APEXMALHAR-2255
 URL: https://issues.apache.org/jira/browse/APEXMALHAR-2255
 Project: Apache Apex Malhar
  Issue Type: Test
Reporter: Priyanka Gugale


Right now the Couchbase connector in Malhar uses the couchbase-client library, 
which is outdated. We should instead use the java-client library from Couchbase 
to connect to the Couchbase server.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (APEXMALHAR-2254) File input operator is not idempotent with closing files on replay

2016-09-20 Thread Pramod Immaneni (JIRA)

 [ 
https://issues.apache.org/jira/browse/APEXMALHAR-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pramod Immaneni updated APEXMALHAR-2254:

Description: 
With the file input operator, on a replay in a failure scenario, the same data 
is output as before the failure for every window that is replayed after the 
checkpoint. To do this, the operator keeps track of the files and offsets for 
every window and replays the data based on that. 

However, suppose that before the failure the processing of a file finished and 
the file was closed just before the end of a window, and the next file was 
opened and processed in a new window. In the replay, the closing of the first 
file then does not happen in the earlier window but in the later one. This 
can cause problems if an operator depends on the closing of files also 
happening in an idempotent manner.

Improve the operator to save the closing and opening of files in the idempotent 
state as well, so that these events also happen in an idempotent manner.

  was:
With the file input operator, on a replay, the same data is replayed for the 
windows that are being replayed after checkpoint. To do this the operator keeps 
track of the files and offsets for every window and replays the data based on 
that. 

However, suppose that before the failure the processing of a file finished and 
the file was closed just before the end of a window, and the next file was 
opened and processed in a new window. In the replay, the closing of the first 
file then does not happen in the earlier window but in the later one. This 
can cause problems if an operator depends on the closing of files also 
happening in an idempotent manner.

Improve the operator to save the closing and opening of files in the idempotent 
state as well so that it can also happen in an idempotent manner.


> File input operator is not idempotent with closing files on replay
> --
>
> Key: APEXMALHAR-2254
> URL: https://issues.apache.org/jira/browse/APEXMALHAR-2254
> Project: Apache Apex Malhar
>  Issue Type: Bug
>Reporter: Pramod Immaneni
>Assignee: Pramod Immaneni
>





[jira] [Created] (APEXMALHAR-2254) File input operator is not idempotent with closing files on replay

2016-09-20 Thread Pramod Immaneni (JIRA)
Pramod Immaneni created APEXMALHAR-2254:
---

 Summary: File input operator is not idempotent with closing files on replay
 Key: APEXMALHAR-2254
 URL: https://issues.apache.org/jira/browse/APEXMALHAR-2254
 Project: Apache Apex Malhar
  Issue Type: Bug
Reporter: Pramod Immaneni
Assignee: Pramod Immaneni


With the file input operator, on a replay, the same data is replayed for the 
windows that are replayed after the checkpoint. To do this, the operator keeps 
track of the files and offsets for every window and replays the data based on 
that. 

However, suppose that before the failure the processing of a file finished and 
the file was closed just before the end of a window, and the next file was 
opened and processed in a new window. In the replay, the closing of the first 
file then does not happen in the earlier window but in the later one. This 
can cause problems if an operator depends on the closing of files also 
happening in an idempotent manner.

Improve the operator to save the closing and opening of files in the idempotent 
state as well, so that these events also happen in an idempotent manner.
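
To make the proposal concrete, here is a minimal, language-neutral sketch in 
Python of recording open and close events per window alongside the read 
offsets. The class and method names (FileEventLog, record, replay) are 
hypothetical and are not part of the Malhar file input operator; the sketch 
only illustrates the idea of keeping file lifecycle events in the per-window 
idempotent state.

```python
class FileEventLog:
    """Hypothetical per-window log of file events (not a Malhar class)."""

    def __init__(self):
        # window id -> ordered list of (kind, path, offset) tuples
        self._events = {}

    def record(self, window_id, kind, path, offset=None):
        """Remember an open/read/close event in the window where it happened."""
        self._events.setdefault(window_id, []).append((kind, path, offset))

    def replay(self, window_id):
        """Return the events of a window in the order they were recorded."""
        return list(self._events.get(window_id, []))


# Original run: a.txt is finished and closed at the end of window 1,
# and b.txt is opened in window 2.
log = FileEventLog()
log.record(1, "open", "a.txt")
log.record(1, "read", "a.txt", 120)
log.record(1, "close", "a.txt")
log.record(2, "open", "b.txt")
log.record(2, "read", "b.txt", 40)

# On replay, the close of a.txt is emitted in window 1 again, not window 2.
for window_id in (1, 2):
    print(window_id, log.replay(window_id))
```

Because the close event is stored with the window in which it originally 
occurred, a downstream operator that reacts to file closes would see the same 
sequence on replay.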





[jira] [Resolved] (APEXMALHAR-2247) Add iteration feature in SpillableArrayListImpl and generalize SerdeListSlice to SerdeCollectionSlice

2016-09-20 Thread David Yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/APEXMALHAR-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Yan resolved APEXMALHAR-2247.
---
   Resolution: Fixed
Fix Version/s: 3.6.0

> Add iteration feature in SpillableArrayListImpl and generalize SerdeListSlice 
> to SerdeCollectionSlice
> -
>
> Key: APEXMALHAR-2247
> URL: https://issues.apache.org/jira/browse/APEXMALHAR-2247
> Project: Apache Apex Malhar
>  Issue Type: New Feature
>Reporter: David Yan
>Assignee: David Yan
> Fix For: 3.6.0
>
>






[GitHub] apex-malhar pull request #405: APEXMALHAR-2130 #resolve Added a spillable ma...

2016-09-20 Thread davidyan74
Github user davidyan74 closed the pull request at:

https://github.com/apache/apex-malhar/pull/405


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (APEXMALHAR-2130) implement scalable windowed storage

2016-09-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/APEXMALHAR-2130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15508172#comment-15508172
 ] 

ASF GitHub Bot commented on APEXMALHAR-2130:


Github user davidyan74 closed the pull request at:

https://github.com/apache/apex-malhar/pull/405


> implement scalable windowed storage
> ---
>
> Key: APEXMALHAR-2130
> URL: https://issues.apache.org/jira/browse/APEXMALHAR-2130
> Project: Apache Apex Malhar
>  Issue Type: Task
>Reporter: bright chen
>Assignee: David Yan
>
> This feature is used for supporting windowing.
> The storage needs to have the following features:
> 1. Spillable key value storage (integrate with APEXMALHAR-2026)
> 2. Upon checkpoint, it saves a snapshot for the entire data set with the 
> checkpointing window id.  This should be done incrementally (ManagedState) to 
> avoid wasting space with unchanged data
> 3. When recovering, it takes the recovery window id and restores to that 
> snapshot
> 4. When a window is committed, all windows with a lower ID should be purged 
> from the store.
> 5. It should implement the WindowedStorage and WindowedKeyedStorage 
> interfaces, and because of 2 and 3, we may want to add methods to the 
> WindowedStorage interface so that the implementation of WindowedOperator can 
> notify the storage of checkpointing, recovering and committing of a window.
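
The checkpoint, recovery and commit behavior in points 2 to 4 can be sketched 
roughly as below. This is a simplified Python illustration with hypothetical 
names (SnapshotStore, checkpoint, recover, committed); it is not the 
WindowedStorage interface, and it copies the full data set on every checkpoint 
instead of saving incrementally as the issue asks.

```python
import copy


class SnapshotStore:
    """Hypothetical key-value store with per-window snapshots."""

    def __init__(self):
        self.data = {}
        self.snapshots = {}  # checkpoint window id -> copy of the data

    def put(self, key, value):
        self.data[key] = value

    def checkpoint(self, window_id):
        # Simplification: a full copy, not an incremental snapshot.
        self.snapshots[window_id] = copy.deepcopy(self.data)

    def recover(self, window_id):
        # Restore the data set to the snapshot taken at the recovery window.
        self.data = copy.deepcopy(self.snapshots[window_id])

    def committed(self, window_id):
        # Purge every snapshot with a lower window id than the committed one.
        for old_id in [w for w in self.snapshots if w < window_id]:
            del self.snapshots[old_id]


store = SnapshotStore()
store.put("a", 1)
store.checkpoint(10)
store.put("a", 2)
store.checkpoint(20)
store.put("a", 3)
store.recover(10)    # data goes back to the state at window 10
store.committed(20)  # the snapshot for window 10 is no longer needed
```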





[jira] [Commented] (APEXMALHAR-2247) Add iteration feature in SpillableArrayListImpl and generalize SerdeListSlice to SerdeCollectionSlice

2016-09-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/APEXMALHAR-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15508171#comment-15508171
 ] 

ASF GitHub Bot commented on APEXMALHAR-2247:


Github user asfgit closed the pull request at:

https://github.com/apache/apex-malhar/pull/418


> Add iteration feature in SpillableArrayListImpl and generalize SerdeListSlice 
> to SerdeCollectionSlice
> -
>
> Key: APEXMALHAR-2247
> URL: https://issues.apache.org/jira/browse/APEXMALHAR-2247
> Project: Apache Apex Malhar
>  Issue Type: New Feature
>Reporter: David Yan
>Assignee: David Yan
>






[GitHub] apex-malhar pull request #418: APEXMALHAR-2247 #resolve Added iteration feat...

2016-09-20 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/apex-malhar/pull/418




[jira] [Updated] (APEXMALHAR-2253) Iterator for spillable data structure should throw ConcurrentModificationException during the iteration

2016-09-20 Thread Siyuan Hua (JIRA)

 [ 
https://issues.apache.org/jira/browse/APEXMALHAR-2253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siyuan Hua updated APEXMALHAR-2253:
---
Assignee: David Yan

> Iterator for spillable data structure should throw 
> ConcurrentModificationException during the iteration
> ---
>
> Key: APEXMALHAR-2253
> URL: https://issues.apache.org/jira/browse/APEXMALHAR-2253
> Project: Apache Apex Malhar
>  Issue Type: Improvement
>Reporter: Siyuan Hua
>Assignee: David Yan
>
> Right now, the iterator leaves the spillable data structure mutable during 
> iteration, which violates the Java Collections contract. We should try to fix 
> this when there are more users.





[jira] [Created] (APEXMALHAR-2253) Iterator for spillable data structure should throw ConcurrentModificationException during the iteration

2016-09-20 Thread Siyuan Hua (JIRA)
Siyuan Hua created APEXMALHAR-2253:
--

 Summary: Iterator for spillable data structure should throw ConcurrentModificationException during the iteration
 Key: APEXMALHAR-2253
 URL: https://issues.apache.org/jira/browse/APEXMALHAR-2253
 Project: Apache Apex Malhar
  Issue Type: Improvement
Reporter: Siyuan Hua


Right now, the iterator leaves the spillable data structure mutable during 
iteration, which violates the Java Collections contract. We should try to fix 
this when there are more users.
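
For illustration, the fail-fast behavior can be sketched with a modification 
counter, which is the general technique behind Java's 
ConcurrentModificationException. The sketch below is a Python analogue with a 
hypothetical FailFastList class; RuntimeError stands in for the Java 
exception, and nothing here reflects the actual spillable implementation.

```python
class FailFastList:
    """Hypothetical list whose iterator fails if the list changes under it."""

    def __init__(self):
        self._items = []
        self._mod_count = 0  # bumped on every structural modification

    def append(self, item):
        self._items.append(item)
        self._mod_count += 1

    def __iter__(self):
        expected = self._mod_count  # snapshot taken when iteration starts
        index = 0
        while index < len(self._items):
            if self._mod_count != expected:
                # Stand-in for ConcurrentModificationException.
                raise RuntimeError("collection modified during iteration")
            yield self._items[index]
            index += 1


items = FailFastList()
items.append("a")
items.append("b")

failed = False
try:
    for item in items:
        items.append("c")  # modifying while iterating
except RuntimeError:
    failed = True
```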





[jira] [Commented] (APEXCORE-474) Unifier placement during M*1 deployment

2016-09-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/APEXCORE-474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15508067#comment-15508067
 ] 

ASF GitHub Bot commented on APEXCORE-474:
-

GitHub user sandeshh opened a pull request:

https://github.com/apache/apex-core/pull/395

[WIP] APEXCORE-474 In M*1 case, deploy the unifier in the same container as…

… downstream

Opening this work-in-progress PR to get early feedback. The only changes I am 
making are to unit tests.

There are 4 more failing tests.

@tweise and @PramodSSImmaneni Please review.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sandeshh/apex-core APEXCORE-474

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/apex-core/pull/395.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #395


commit 91a8470b7e465ba767372956d3311238d720c23e
Author: sandeshh 
Date:   2016-09-20T22:56:23Z

APEXCORE-474 In M*1 case, deploy the unifier in the same container as 
downstream




> Unifier placement during M*1 deployment
> ---
>
> Key: APEXCORE-474
> URL: https://issues.apache.org/jira/browse/APEXCORE-474
> Project: Apache Apex Core
>  Issue Type: Improvement
>Reporter: Sandesh
>Assignee: Sandesh
>
> During M*1 deployment, the unifier is deployed in a separate container, but 
> there is no advantage in doing that. 
> It is better to make the unifier THREAD_LOCAL with the downstream operator.
> ( https://issues.apache.org/jira/browse/APEXCORE-482 )
> Note:
> I recently saw a Kafka ETL app that had a total of 18 containers allocated, 
> of which 5 were allocated for default unifiers. This also means that a lot 
> of time is spent in SerDe. 
> Implementing this feature will improve performance greatly.





[GitHub] apex-core pull request #395: [WIP] APEXCORE-474 In M*1 case, deploy the unif...

2016-09-20 Thread sandeshh
GitHub user sandeshh opened a pull request:

https://github.com/apache/apex-core/pull/395

[WIP] APEXCORE-474 In M*1 case, deploy the unifier in the same container 
as…

… downstream

Opening this work-in-progress PR to get early feedback. The only changes I am 
making are to unit tests.

There are 4 more failing tests.

@tweise and @PramodSSImmaneni Please review.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sandeshh/apex-core APEXCORE-474

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/apex-core/pull/395.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #395


commit 91a8470b7e465ba767372956d3311238d720c23e
Author: sandeshh 
Date:   2016-09-20T22:56:23Z

APEXCORE-474 In M*1 case, deploy the unifier in the same container as 
downstream






[jira] [Updated] (APEXMALHAR-2244) Optimize WindowedStorage and Spillable data structures for time series

2016-09-20 Thread David Yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/APEXMALHAR-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Yan updated APEXMALHAR-2244:
--
Description: 
The spillable data structures currently do not make any assumptions about the 
key that is used in Managed State. As a result, they use ManagedStateImpl to 
interface with Managed State, with time buckets that are based on the Apex 
window id. But for the WindowedStorage used by WindowedOperator, the key to the 
storage is a window, which is event-time based. Using the default 
ManagedStateImpl would be very inefficient for event-time-based keys, since it 
would write data that belongs to the same window to different time buckets.

At a high level, the following roughly summarizes what needs to be done:

1. a way to tell the spillable data structures to use 
ManagedTimeUnifiedStateImpl
2. a way to tell the spillable data structures how to extract the timestamp 
from the key. Note that in the case of WindowedOperator, the timestamp should 
be the end timestamp of the window (beginTimeMillis + durationMillis), not the 
begin timestamp.
3. a way to tell the spillable data structures how to assign the time bucket 
given that timestamp
4. given point 3, the spillable implementations of WindowedStorage will need to 
take a config parameter that specifies the span of each time bucket in millis
5. only purge a time bucket when all keys that belong to that time bucket have 
been removed and the Apex window id of the first window in which the keys were 
all removed has been committed
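
As a rough illustration of points 2 to 5, the sketch below derives a time 
bucket from the end timestamp of a window and tracks when a bucket becomes 
purgeable. It is written in Python with hypothetical names (time_bucket, 
BucketPurgeTracker) and assumes fixed-span buckets computed by integer 
division; the actual bucket assignment in Managed State may differ.

```python
def window_end_millis(begin_time_millis, duration_millis):
    """Point 2: use the end timestamp of the window, not the begin timestamp."""
    return begin_time_millis + duration_millis


def time_bucket(timestamp_millis, bucket_span_millis):
    """Points 3 and 4: assumed fixed-span bucketing via integer division."""
    return timestamp_millis // bucket_span_millis


class BucketPurgeTracker:
    """Point 5: hypothetical bookkeeping for when a bucket may be purged."""

    def __init__(self):
        self.live_keys = {}   # bucket -> set of keys still present
        self.emptied_at = {}  # bucket -> Apex window id in which it emptied

    def add(self, bucket, key):
        self.live_keys.setdefault(bucket, set()).add(key)
        self.emptied_at.pop(bucket, None)  # bucket is in use again

    def remove(self, bucket, key, apex_window_id):
        keys = self.live_keys.get(bucket)
        if not keys:
            return  # unknown or already empty: keep the first emptying window
        keys.discard(key)
        if not keys:
            self.emptied_at[bucket] = apex_window_id

    def purgeable(self, committed_window_id):
        """Buckets emptied in a window that has since been committed."""
        return sorted(b for b, w in self.emptied_at.items()
                      if w <= committed_window_id)


bucket = time_bucket(window_end_millis(120_000, 30_000), 60_000)
tracker = BucketPurgeTracker()
tracker.add(bucket, "w1")
tracker.add(bucket, "w2")
tracker.remove(bucket, "w1", apex_window_id=7)
tracker.remove(bucket, "w2", apex_window_id=9)
```

With a deterministic mapping from key to bucket like this, a lookup for a 
given key only needs to consult a single bucket.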



  was:
The spillable data structures currently do not make any assumptions about the 
key that is used in Managed State. As a result, they use ManagedStateImpl to 
interface with Managed State, with time buckets that are based on the Apex 
window id. But for the WindowedStorage used by WindowedOperator, the key to the 
storage is a window, which is event-time based. Using the default 
ManagedStateImpl would be very inefficient for event-time-based keys, since it 
would write data that belongs to the same window to different time buckets.

On a high level, the below summarizes roughly what needs to be done:

1. a way to tell the spillable data structures to use the 
ManagedTimeUnifiedStateImpl
2. a way to tell the spillable data structures how to extract the timestamp 
from the key. Note that in the case of WindowedOperator, the timestamp should 
be the end timestamp of the window (beginTimeMillis + durationMillis), not the 
begin timestamp.
3. a way to tell the spillable data structures how to assign the time bucket 
given that timestamp
4. only purge a time bucket when all keys that belong to that time bucket are 
removed and the apex window id of the first window in which the keys are all 
removed has been committed




> Optimize WindowedStorage and Spillable data structures for time series
> --
>
> Key: APEXMALHAR-2244
> URL: https://issues.apache.org/jira/browse/APEXMALHAR-2244
> Project: Apache Apex Malhar
>  Issue Type: Sub-task
>Reporter: David Yan
>Assignee: Siyuan Hua
>





[jira] [Commented] (APEXMALHAR-2244) Optimize WindowedStorage and Spillable data structures for time series

2016-09-20 Thread Thomas Weise (JIRA)

[ 
https://issues.apache.org/jira/browse/APEXMALHAR-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15507450#comment-15507450
 ] 

Thomas Weise commented on APEXMALHAR-2244:
--

How will the spillable data structure know which time bucket to look in for a 
given key (on get)? It should not have to search multiple time buckets, since 
that is unnecessary in this case and the number of time buckets can be high.

> Optimize WindowedStorage and Spillable data structures for time series
> --
>
> Key: APEXMALHAR-2244
> URL: https://issues.apache.org/jira/browse/APEXMALHAR-2244
> Project: Apache Apex Malhar
>  Issue Type: Sub-task
>Reporter: David Yan
>Assignee: Siyuan Hua
>
> The spillable data structures currently do not make any assumptions about 
> the key that is used in Managed State. As a result, they use 
> ManagedStateImpl to interface with Managed State, with time buckets that 
> are based on the Apex window id. But for the WindowedStorage used by 
> WindowedOperator, the key to the storage is a window, which is event-time 
> based. Using the default ManagedStateImpl would be very inefficient for 
> event-time-based keys, since it would write data that belongs to the same 
> window to different time buckets.
> At a high level, the following roughly summarizes what needs to be done:
> 1. a way to tell the spillable data structures to use 
> ManagedTimeUnifiedStateImpl
> 2. a way to tell the spillable data structures how to extract the timestamp 
> from the key. Note that in the case of WindowedOperator, the timestamp should 
> be the end timestamp of the window (beginTimeMillis + durationMillis), not 
> the begin timestamp.
> 3. a way to tell the spillable data structures how to assign the time bucket 
> given that timestamp
> 4. only purge a time bucket when all keys that belong to that time bucket 
> have been removed and the Apex window id of the first window in which the 
> keys were all removed has been committed





[jira] [Updated] (APEXMALHAR-2244) Optimize WindowedStorage and Spillable data structures for time series

2016-09-20 Thread David Yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/APEXMALHAR-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Yan updated APEXMALHAR-2244:
--
Description: 
The spillable data structures currently do not make any assumptions about the 
key that is used in Managed State. As a result, they use ManagedStateImpl to 
interface with Managed State, with time buckets that are based on the Apex 
window id. But for the WindowedStorage used by WindowedOperator, the key to the 
storage is a window, which is event-time based. Using the default 
ManagedStateImpl would be very inefficient for event-time-based keys, since it 
would write data that belongs to the same window to different time buckets.

At a high level, the following roughly summarizes what needs to be done:

1. a way to tell the spillable data structures to use 
ManagedTimeUnifiedStateImpl
2. a way to tell the spillable data structures how to extract the timestamp 
from the key. Note that in the case of WindowedOperator, the timestamp should 
be the end timestamp of the window (beginTimeMillis + durationMillis), not the 
begin timestamp.
3. a way to tell the spillable data structures how to assign the time bucket 
given that timestamp
4. only purge a time bucket when all keys that belong to that time bucket have 
been removed and the Apex window id of the first window in which the keys were 
all removed has been committed



  was:
The spillable data structures currently do not make any assumptions about the 
key that is used in Managed State. As a result, they use ManagedStateImpl to 
interface with Managed State. But for the WindowedStorage used by 
WindowedOperator, the key to the storage is a window, which is event-time 
based. Using the default ManagedStateImpl would be wrong for event-time-based 
keys, since ManagedStateImpl appears to purge data based on the Apex window id 
(process-time based).

At a high level, the following roughly summarizes what needs to be done:

1. a way to tell the spillable data structures to use the 
ManagedTimeUnifiedStateImpl
2. a way to tell the spillable data structures how to extract the timestamp 
from the key. Note that in the case of WindowedOperator, the timestamp should 
be the end timestamp of the window (beginTimeMillis + durationMillis), not the 
begin timestamp.
3. a way to tell the spillable data structures how to assign the time bucket 
given that timestamp
4. only purge a time bucket when all keys that belong to that time bucket are 
removed and the apex window id of the first window in which the keys are all 
removed has been committed




> Optimize WindowedStorage and Spillable data structures for time series
> --
>
> Key: APEXMALHAR-2244
> URL: https://issues.apache.org/jira/browse/APEXMALHAR-2244
> Project: Apache Apex Malhar
>  Issue Type: Sub-task
>Reporter: David Yan
>Assignee: Siyuan Hua
>





Re: Python support

2016-09-20 Thread Sasha Parfenov
+1 on both executing Python code in an operator and high level API for
constructing Pipelines in Python.

There is a large user base of engineers and data scientists who use
Python on a regular basis for crunching through big data.  Providing them
with a powerful new platform for big data processing, wrapped in a familiar
language, will open Apex to a much broader user base and help grow the
project.

Given the potentially new user base of Python developers, it may make sense
to prioritize the high level API for pipeline construction.  This will
allow users to build simple applications with existing library operators,
and we can get feedback on what areas they would like to see improved next
- custom Python operator support or more built-in library operators.

Thanks,
Sasha

On Thu, Sep 15, 2016 at 2:06 PM, Thomas Weise  wrote:

> Hi,
>
> Python (not Jython) seems to be a popular language and frequently used for
> data analysis, especially where flexibility matters. It has a comprehensive
> library and it is generally considered low barrier to entry. I have also
> seen Python used in critical back-end components, although that's probably
> not very common?
>
> I think Python support could potentially expand the user base for Apex.
> There are 2 main areas that can be considered:
>
> 1) Support to execute Python code through an operator
> 2) A client API that lets users construct pipelines in Python
>
> The former can exist without the latter. And it would enable users to
> leverage existing code that otherwise would have to be rewritten in a JVM
> language. The engine could ship scripts/packages so they are automatically
> distributed on the cluster.
>
> A useful client API probably requires back-end support for lambda functions
> and more complex UDFs.
>
> Would be great to get some feedback, especially from those that have
> experience with Python, on how an integration could potentially open up new
> use cases for Apex.
>
> Thanks,
> Thomas
>


[jira] [Assigned] (APEXMALHAR-2252) Document AbstractManagedStateImpl subclasses

2016-09-20 Thread Chandni Singh (JIRA)

 [ 
https://issues.apache.org/jira/browse/APEXMALHAR-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh reassigned APEXMALHAR-2252:
-

Assignee: Chandni Singh

> Document AbstractManagedStateImpl subclasses
> 
>
> Key: APEXMALHAR-2252
> URL: https://issues.apache.org/jira/browse/APEXMALHAR-2252
> Project: Apache Apex Malhar
>  Issue Type: Task
>Reporter: Thomas Weise
>Assignee: Chandni Singh
>
> For the user, the time bucketing concept is very important to understand. We 
> should highlight the differences between the subclasses in the documentation 
> and also consider renaming the subclasses for clarity.





[jira] [Updated] (APEXMALHAR-2252) Document AbstractManagedStateImpl subclasses

2016-09-20 Thread Thomas Weise (JIRA)

 [ 
https://issues.apache.org/jira/browse/APEXMALHAR-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Weise updated APEXMALHAR-2252:
-
Description: For the user, the time bucketing concept is very important to 
understand. We should highlight the differences between the subclasses in the 
documentation and also consider renaming the subclasses for clarity.

> Document AbstractManagedStateImpl subclasses
> 
>
> Key: APEXMALHAR-2252
> URL: https://issues.apache.org/jira/browse/APEXMALHAR-2252
> Project: Apache Apex Malhar
>  Issue Type: Task
>Reporter: Thomas Weise
>
> For the user, the time bucketing concept is very important to understand. We 
> should highlight the differences between the subclasses in the documentation 
> and also consider renaming the subclasses for clarity.





[jira] [Created] (APEXMALHAR-2252) Document AbstractManagedStateImpl subclasses

2016-09-20 Thread Thomas Weise (JIRA)
Thomas Weise created APEXMALHAR-2252:


 Summary: Document AbstractManagedStateImpl subclasses
 Key: APEXMALHAR-2252
 URL: https://issues.apache.org/jira/browse/APEXMALHAR-2252
 Project: Apache Apex Malhar
  Issue Type: Task
Reporter: Thomas Weise








Re: Python support

2016-09-20 Thread Tushar Gosavi
+1 on this feature.

We could use py4j, or communicate with a Python process through pipes,
to run Python code from the JVM.
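
A minimal sketch of the pipe-based option follows, assuming a simple
line-per-record protocol over stdin and stdout. In Apex the parent side would
be a JVM operator; here both sides are Python so that the example is
self-contained. WORKER_SOURCE and run_through_worker are hypothetical names,
and the upper-casing logic is only a placeholder for user code.

```python
import subprocess
import sys

# Hypothetical worker: reads one record per line and writes one result per line.
WORKER_SOURCE = """
import sys
for line in sys.stdin:
    sys.stdout.write(line.strip().upper() + "\\n")
    sys.stdout.flush()
"""


def run_through_worker(records):
    """Send each record to a child Python process over pipes, collect replies."""
    proc = subprocess.Popen(
        [sys.executable, "-c", WORKER_SOURCE],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )
    results = []
    try:
        for record in records:
            proc.stdin.write(record + "\n")
            proc.stdin.flush()  # make sure the worker sees the record now
            results.append(proc.stdout.readline().rstrip("\n"))
    finally:
        proc.stdin.close()      # end of input lets the worker loop exit
        proc.stdout.close()
        proc.wait()
    return results


if __name__ == "__main__":
    print(run_through_worker(["apex", "malhar"]))
```

Flushing after every record keeps the exchange synchronous, which is simple
but costs a round trip per record; batching records would reduce that cost.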

- Tushar.



On Fri, Sep 16, 2016 at 12:10 PM, Thomas Weise  wrote:
> Jython is not a replacement for Python; it seems to be fairly limited. We
> would need the ability to run Python with all its libraries.
>
> Thomas
>
> On Thu, Sep 15, 2016 at 11:25 PM, David Yan  wrote:
>
>> On a very high level, we can build a Python framework in Apex by having a
>> Python binding on our high level API that generates Jython operators with
>> the business logic written by users in Python, along with existing
>> connectors.
>>
>> David
>>
>> On Sep 15, 2016 11:00 PM, "Chinmay Kolhatkar" 
>> wrote:
>>
>> > Strongly +1 on this. One thing that proves this is useful for Apex is
>> > hadoop streaming where python is used write map-reduce jobs. This not
>> only
>> > will increase the reach in development world but also would be appealing
>> to
>> > administrators to write an app as they are usually aware of python.
>> >
>> >
>> > Few suggestions (not in specific order):
>> > 1. As a part of supporting python execution in operator code, we should
>> > provide a complete lifecycle of an operator to be specified from python.
>> >
>> > 2. I would personally not worry about providing python binding for low
>> > level apex client APIs like addOperator, addStream etc... If one has to
>> do
>> > it, I think its best to use JAVA api as the most power of those low level
>> > APIs can be leveraged there.
>> >
>> > 3. For client APIs, I would rather suggest we focus on high level APIs
>> like
>> > apex stream API (malhar-stream). We should provide a complete python
>> > binding for them. Python is very useful when it comes to functional
>> > programming and Stream API provide exactly that.
>> >
>> > 4. Thinking at a very high level, I don't think we need any change in
>> > apex-core for this. This could be another project in Malhar itself. There
>> > are Python libraries like py4j, pyjnius or JPype which allow access to Java
>> > objects from Python.
>> > Basically, we just need to establish the right bridge between the Java and
>> > Python VMs. We need to be thoughtful about performance, as these bridges
>> > across programming languages are costly.
>> >
>> > 5. We need to decide what code execution will look like. For example,
>> > should a py file be an alternative to Application.java in the package?
>> > This means the starting point is the Apex CLI, i.e. Java. Hence, instead of
>> > finding classes implementing StreamingApplication, apexcli needs to find the
>> > py file which contains the definition of the DAG.
>> > OR should the flow start with "__main__" of the Python file and end up in
>> > Java?
>> >
>> > 6. This might be too early, but it is important to emphasize that we need to
>> > plan for writing examples and documentation for the Python binding.
>> >
>> > -Chinmay.
>> >
>> >
>> >
>> > On Fri, Sep 16, 2016 at 2:36 AM, Thomas Weise  wrote:
>> >
>> > > Hi,
>> > >
>> > > Python (not Jython) seems to be a popular language and frequently used
>> > > for data analysis, especially where flexibility matters. It has a
>> > > comprehensive library and it is generally considered low barrier to
>> > > entry. I have also seen Python used in critical back-end components,
>> > > although that's probably not very common?
>> > >
>> > > I think Python support could potentially expand the user base for Apex.
>> > > There are 2 main areas that can be considered:
>> > >
>> > > 1) Support to execute Python code through an operator
>> > > 2) A client API that lets users construct pipelines in Python
>> > >
>> > > The former can exist without the latter. And it would enable users to
>> > > leverage existing code that otherwise would have to be rewritten in a
>> > > JVM language. The engine could ship scripts/packages so they are
>> > > automatically distributed on the cluster.
>> > >
>> > > A useful client API probably requires back-end support for lambda
>> > > functions and more complex UDFs.
>> > >
>> > > Would be great to get some feedback, especially from those that have
>> > > experience with Python, on how an integration could potentially open up
>> > > new use cases for Apex.
>> > >
>> > > Thanks,
>> > > Thomas
>> > >
>> >
>>


[jira] [Moved] (APEXMALHAR-2251) Slice hashCode should not depend on offset

2016-09-20 Thread Thomas Weise (JIRA)

 [ 
https://issues.apache.org/jira/browse/APEXMALHAR-2251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Weise moved APEXCORE-537 to APEXMALHAR-2251:
---

Workflow: Default workflow, editable Closed status  (was: jira)
 Key: APEXMALHAR-2251  (was: APEXCORE-537)
 Project: Apache Apex Malhar  (was: Apache Apex Core)

> Slice hashCode should not depend on offset
> --
>
> Key: APEXMALHAR-2251
> URL: https://issues.apache.org/jira/browse/APEXMALHAR-2251
> Project: Apache Apex Malhar
>  Issue Type: Bug
>Reporter: bright chen
>
> The 





[jira] [Created] (APEXCORE-537) Slice hashCode should not depend on offset

2016-09-20 Thread bright chen (JIRA)
bright chen created APEXCORE-537:


 Summary: Slice hashCode should not depend on offset
 Key: APEXCORE-537
 URL: https://issues.apache.org/jira/browse/APEXCORE-537
 Project: Apache Apex Core
  Issue Type: Bug
Reporter: bright chen


The 





[jira] [Resolved] (APEXMALHAR-2245) WindowBoundedMapCache.remove not working when key not in cache

2016-09-20 Thread Thomas Weise (JIRA)

 [ 
https://issues.apache.org/jira/browse/APEXMALHAR-2245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Weise resolved APEXMALHAR-2245.
--
   Resolution: Fixed
Fix Version/s: 3.6.0

> WindowBoundedMapCache.remove not working when key not in cache
> --
>
> Key: APEXMALHAR-2245
> URL: https://issues.apache.org/jira/browse/APEXMALHAR-2245
> Project: Apache Apex Malhar
>  Issue Type: Bug
>Reporter: David Yan
>Assignee: David Yan
> Fix For: 3.6.0
>
>
> WindowBoundedMapCache.remove is a no-op if the key is not in the cache, 
> resulting in the entry not being removed in the underlying storage
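
To illustrate the behavior described above, here is a small sketch of a
write-back cache in which remove must be recorded even when the key is not
cached. WriteBackCacheSketch and its methods are hypothetical and only loosely
modeled on the issue text; this is not the Malhar WindowBoundedMapCache code.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical write-back cache used only to illustrate the issue text.
final class WriteBackCacheSketch {
  private final Map<String, String> store;                   // stands in for the underlying storage
  private final Map<String, String> cache = new HashMap<>(); // pending puts
  private final Set<String> removedKeys = new HashSet<>();   // pending removals

  WriteBackCacheSketch(Map<String, String> store) {
    this.store = store;
  }

  void put(String key, String value) {
    removedKeys.remove(key);
    cache.put(key, value);
  }

  // Problematic variant: records the removal only when the key is cached.
  void removeOnlyIfCached(String key) {
    if (cache.remove(key) != null) {
      removedKeys.add(key);
    }
  }

  // Intended behavior: always record the removal, cached or not.
  void remove(String key) {
    cache.remove(key);
    removedKeys.add(key);
  }

  // Applies pending puts and removals to the underlying storage.
  void flush() {
    store.putAll(cache);
    store.keySet().removeAll(removedKeys);
    cache.clear();
    removedKeys.clear();
  }

  // Returns true if the entry is gone from storage after remove + flush.
  static boolean removedFromStore(boolean useFixedRemove) {
    Map<String, String> store = new HashMap<>();
    store.put("k", "v"); // present in storage, absent from the cache
    WriteBackCacheSketch c = new WriteBackCacheSketch(store);
    if (useFixedRemove) {
      c.remove("k");
    } else {
      c.removeOnlyIfCached("k");
    }
    c.flush();
    return !store.containsKey("k");
  }
}
```

With the conditional variant the entry survives in storage because the removal
is never recorded; recording the key unconditionally lets the flush delete it.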





[jira] [Updated] (APEXMALHAR-2245) WindowBoundedMapCache.remove is a no-op if the key is not in the cache, resulting in the entry not being removed in the underlying storage

2016-09-20 Thread Thomas Weise (JIRA)

 [ 
https://issues.apache.org/jira/browse/APEXMALHAR-2245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Weise updated APEXMALHAR-2245:
-
Description: WindowBoundedMapCache.remove is a no-op if the key is not in 
the cache, resulting in the entry not being removed in the underlying storage

> WindowBoundedMapCache.remove is a no-op if the key is not in the cache, 
> resulting in the entry not being removed in the underlying storage
> --
>
> Key: APEXMALHAR-2245
> URL: https://issues.apache.org/jira/browse/APEXMALHAR-2245
> Project: Apache Apex Malhar
>  Issue Type: Bug
>Reporter: David Yan
>Assignee: David Yan
>
> WindowBoundedMapCache.remove is a no-op if the key is not in the cache, 
> resulting in the entry not being removed in the underlying storage





[jira] [Updated] (APEXMALHAR-2246) The underlying map of SpillableByteArrayListMultimapImpl uses a primitive byte[] as a key, which won't work because it does not compare the underlying bytes

2016-09-20 Thread Thomas Weise (JIRA)

 [ 
https://issues.apache.org/jira/browse/APEXMALHAR-2246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Weise updated APEXMALHAR-2246:
-
Description: 

Apache Apex MalharAPEXMALHAR-2246
The underlying map of SpillableByteArrayListMultimapImpl uses a primitive 
byte[] as a key, which won't work because it does not compare the underlying 
bytes

> The underlying map of SpillableByteArrayListMultimapImpl uses a primitive 
> byte[] as a key, which won't work because it does not compare the underlying 
> bytes
> 
>
> Key: APEXMALHAR-2246
> URL: https://issues.apache.org/jira/browse/APEXMALHAR-2246
> Project: Apache Apex Malhar
>  Issue Type: Bug
>Reporter: David Yan
>Assignee: David Yan
> Fix For: 3.6.0
>
>
>   
> Apache Apex MalharAPEXMALHAR-2246
> The underlying map of SpillableByteArrayListMultimapImpl uses a primitive 
> byte[] as a key, which won't work because it does not compare the underlying 
> bytes





[jira] [Updated] (APEXMALHAR-2246) The underlying map of SpillableByteArrayListMultimapImpl uses a primitive byte[] as a key, which won't work because it does not compare the underlying bytes

2016-09-20 Thread Thomas Weise (JIRA)

 [ 
https://issues.apache.org/jira/browse/APEXMALHAR-2246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Weise updated APEXMALHAR-2246:
-
Description: The underlying map of SpillableByteArrayListMultimapImpl uses 
a primitive byte[] as a key, which won't work because it does not compare the 
underlying bytes  (was:
Apache Apex MalharAPEXMALHAR-2246
The underlying map of SpillableByteArrayListMultimapImpl uses a primitive 
byte[] as a key, which won't work because it does not compare the underlying 
bytes)

> The underlying map of SpillableByteArrayListMultimapImpl uses a primitive 
> byte[] as a key, which won't work because it does not compare the underlying 
> bytes
> 
>
> Key: APEXMALHAR-2246
> URL: https://issues.apache.org/jira/browse/APEXMALHAR-2246
> Project: Apache Apex Malhar
>  Issue Type: Bug
>Reporter: David Yan
>Assignee: David Yan
> Fix For: 3.6.0
>
>
> The underlying map of SpillableByteArrayListMultimapImpl uses a primitive 
> byte[] as a key, which won't work because it does not compare the underlying 
> bytes
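
The problem described above can be reproduced with plain JDK collections. In
this sketch, java.nio.ByteBuffer is used as a stand-in for a content-comparing
wrapper (the pull request title mentions Slice); the class and method names
are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

final class ByteArrayKeySketch {
  // A byte[] key uses identity-based equals/hashCode, so an array with the
  // same content created later does not find the entry.
  static boolean lookupWithRawArrayKey() {
    Map<byte[], String> map = new HashMap<>();
    map.put("key".getBytes(StandardCharsets.UTF_8), "value");
    return map.containsKey("key".getBytes(StandardCharsets.UTF_8));
  }

  // ByteBuffer compares and hashes its remaining bytes, so two wrappers
  // around equal content are equal keys.
  static boolean lookupWithWrappedKey() {
    Map<ByteBuffer, String> map = new HashMap<>();
    map.put(ByteBuffer.wrap("key".getBytes(StandardCharsets.UTF_8)), "value");
    return map.containsKey(ByteBuffer.wrap("key".getBytes(StandardCharsets.UTF_8)));
  }

  public static void main(String[] args) {
    System.out.println("raw byte[] key found: " + lookupWithRawArrayKey());
    System.out.println("wrapped key found:    " + lookupWithWrappedKey());
  }
}
```

The first lookup fails and the second succeeds, which is why a map keyed by
raw byte[] cannot serve lookups by content.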





[jira] [Commented] (APEXMALHAR-2246) The underlying map of SpillableByteArrayListMultimapImpl uses a primitive byte[] as a key, which won't work because it does not compare the underlying bytes

2016-09-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/APEXMALHAR-2246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15507248#comment-15507248
 ] 

ASF GitHub Bot commented on APEXMALHAR-2246:


Github user asfgit closed the pull request at:

https://github.com/apache/apex-malhar/pull/417


> The underlying map of SpillableByteArrayListMultimapImpl uses a primitive 
> byte[] as a key, which won't work because it does not compare the underlying 
> bytes
> 
>
> Key: APEXMALHAR-2246
> URL: https://issues.apache.org/jira/browse/APEXMALHAR-2246
> Project: Apache Apex Malhar
>  Issue Type: Bug
>Reporter: David Yan
>Assignee: David Yan
> Fix For: 3.6.0
>
>






[jira] [Resolved] (APEXMALHAR-2246) The underlying map of SpillableByteArrayListMultimapImpl uses a primitive byte[] as a key, which won't work because it does not compare the underlying bytes

2016-09-20 Thread Thomas Weise (JIRA)

 [ 
https://issues.apache.org/jira/browse/APEXMALHAR-2246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Weise resolved APEXMALHAR-2246.
--
   Resolution: Fixed
Fix Version/s: 3.6.0

> The underlying map of SpillableByteArrayListMultimapImpl uses a primitive 
> byte[] as a key, which won't work because it does not compare the underlying 
> bytes
> 
>
> Key: APEXMALHAR-2246
> URL: https://issues.apache.org/jira/browse/APEXMALHAR-2246
> Project: Apache Apex Malhar
>  Issue Type: Bug
>Reporter: David Yan
>Assignee: David Yan
> Fix For: 3.6.0
>
>






[GitHub] apex-malhar pull request #417: APEXMALHAR-2246 #resolve use Slice instead of...

2016-09-20 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/apex-malhar/pull/417


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] apex-malhar pull request #368: APEXMALHAR-2184: Add documentation for File I...

2016-09-20 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/apex-malhar/pull/368




[jira] [Updated] (APEXMALHAR-2184) Add documentation for FileSystem Input Operator

2016-09-20 Thread Munagala V. Ramanath (JIRA)

 [ 
https://issues.apache.org/jira/browse/APEXMALHAR-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Munagala V. Ramanath updated APEXMALHAR-2184:
-
Fix Version/s: 3.6.0

> Add documentation for FileSystem Input Operator
> ---
>
> Key: APEXMALHAR-2184
> URL: https://issues.apache.org/jira/browse/APEXMALHAR-2184
> Project: Apache Apex Malhar
>  Issue Type: Documentation
>Reporter: Priyanka Gugale
>Assignee: Priyanka Gugale
>Priority: Minor
> Fix For: 3.6.0
>
>






[jira] [Commented] (APEXMALHAR-2184) Add documentation for FileSystem Input Operator

2016-09-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/APEXMALHAR-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15506948#comment-15506948
 ] 

ASF GitHub Bot commented on APEXMALHAR-2184:


Github user asfgit closed the pull request at:

https://github.com/apache/apex-malhar/pull/368


> Add documentation for FileSystem Input Operator
> ---
>
> Key: APEXMALHAR-2184
> URL: https://issues.apache.org/jira/browse/APEXMALHAR-2184
> Project: Apache Apex Malhar
>  Issue Type: Documentation
>Reporter: Priyanka Gugale
>Assignee: Priyanka Gugale
>Priority: Minor
>






[jira] [Commented] (APEXMALHAR-2245) WindowBoundedMapCache.remove is a no-op if the key is not in the cache, resulting in the entry not being removed in the underlying storage

2016-09-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/APEXMALHAR-2245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15506925#comment-15506925
 ] 

ASF GitHub Bot commented on APEXMALHAR-2245:


Github user asfgit closed the pull request at:

https://github.com/apache/apex-malhar/pull/416


> WindowBoundedMapCache.remove is a no-op if the key is not in the cache, 
> resulting in the entry not being removed in the underlying storage
> --
>
> Key: APEXMALHAR-2245
> URL: https://issues.apache.org/jira/browse/APEXMALHAR-2245
> Project: Apache Apex Malhar
>  Issue Type: Bug
>Reporter: David Yan
>Assignee: David Yan
>






[GitHub] apex-malhar pull request #416: APEXMALHAR-2245 #resolve Add the key in remov...

2016-09-20 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/apex-malhar/pull/416




[GitHub] apex-malhar pull request #413: APEXMALHAR-2230 simplify the kafka input oper...

2016-09-20 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/apex-malhar/pull/413




Re: Improving Apex relaunch time.

2016-09-20 Thread Sandesh Hegde
How about re-launching the app from the same location?

If they want to store the state at all, we can provide a savepoint feature.

On Tue, Sep 20, 2016 at 4:39 AM Tushar Gosavi 
wrote:

> We have observed that application relaunch takes a long time.
> One major reason for the delay in application startup during relaunch
> is the time taken to copy the state of the existing application to the new
> application. This state can grow to GBs, and the copy is performed in a
> single thread before the new application is submitted to YARN.
>
> The state of the previous application consists of:
> - jars
> - stram checkpoint/recovery file
> - events
> - container file
> - stats recording, if enabled
> - operator checkpoints
> - operator data
>
> We could avoid copying debugging data like the stats recording, which can
> run into TBs for a long-running application and is not required for the
> functioning of the new application.
>
> Similarly, operator checkpoints could be read in parallel when the operators
> are launched for the first time. This will also help in copying only the
> required checkpoints, and the copying will be done in parallel by multiple
> containers/threads.
>
> For operator data stored in the application directory, we could copy it
> completely for now, but in the future we could provide a callback which will
> allow an operator partition to read only the required state from the
> previous location.
>
> Let me know your thoughts on this.
>
> Regards,
> - Tushar.
>


Improving Apex relaunch time.

2016-09-20 Thread Tushar Gosavi
We have observed that application relaunch takes a long time.
One major reason for the delay in application startup during relaunch
is the time taken to copy the state of the existing application to the new
application. This state can grow to GBs, and the copy is performed in a
single thread before the new application is submitted to YARN.

The state of the previous application consists of:
- jars
- stram checkpoint/recovery file
- events
- container file
- stats recording, if enabled
- operator checkpoints
- operator data

We could avoid copying debugging data like the stats recording, which can
run into TBs for a long-running application and is not required for the
functioning of the new application.

Similarly, operator checkpoints could be read in parallel when the operators
are launched for the first time. This will also help in copying only the
required checkpoints, and the copying will be done in parallel by multiple
containers/threads.
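
A rough sketch of the two ideas so far (leave out debugging data, and copy the
rest in parallel), assuming a local directory tree handled with java.nio.file
instead of the distributed file system, and a thread pool in the client
instead of containers. The directory names "jars", "checkpoints" and "stats"
and the class RelaunchCopySketch are placeholders, not the actual Apex
application directory layout or code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class RelaunchCopySketch {
  // Copies each top-level directory of oldApp into newApp on a thread pool,
  // leaving out the directories named in skip. Returns the sorted names of
  // the directories that were copied.
  static List<String> copyState(Path oldApp, Path newApp, Set<String> skip, int threads) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Path> dirs;
      try (Stream<Path> entries = Files.list(oldApp)) {
        dirs = entries
            .filter(Files::isDirectory)
            .filter(d -> !skip.contains(d.getFileName().toString()))
            .collect(Collectors.toList());
      }
      List<Future<String>> pending = new ArrayList<>();
      for (Path dir : dirs) {
        pending.add(pool.submit(() -> copyDir(dir, newApp.resolve(dir.getFileName()))));
      }
      List<String> copied = new ArrayList<>();
      for (Future<String> task : pending) {
        copied.add(task.get()); // waits for the task and surfaces any failure
      }
      Collections.sort(copied);
      return copied;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    } catch (Exception e) {
      throw new IllegalStateException("state copy failed", e);
    } finally {
      pool.shutdown();
    }
  }

  // Copies the regular files directly under from (one level only).
  private static String copyDir(Path from, Path to) throws IOException {
    Files.createDirectories(to);
    List<Path> files;
    try (Stream<Path> entries = Files.list(from)) {
      files = entries.filter(Files::isRegularFile).collect(Collectors.toList());
    }
    for (Path file : files) {
      Files.copy(file, to.resolve(file.getFileName()));
    }
    return from.getFileName().toString();
  }

  // Builds a tiny fake application directory and copies it, skipping "stats".
  static List<String> demo() {
    try {
      Path oldApp = Files.createTempDirectory("old-app");
      Path newApp = Files.createTempDirectory("new-app");
      for (String name : new String[] {"jars", "checkpoints", "stats"}) {
        Path dir = Files.createDirectory(oldApp.resolve(name));
        Files.write(dir.resolve("part-0"), name.getBytes(StandardCharsets.UTF_8));
      }
      List<String> copied = copyState(oldApp, newApp, Collections.singleton("stats"), 4);
      if (Files.exists(newApp.resolve("stats"))) {
        throw new IllegalStateException("skipped directory was copied");
      }
      return copied;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

This still copies eagerly before launch; reading checkpoints lazily from the
containers, as proposed, would move the copyDir step into each operator's
startup path instead.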

For operator data stored in the application directory, we could copy it
completely for now, but in the future we could provide a callback which will
allow an operator partition to read only the required state from the
previous location.

Let me know your thoughts on this.

Regards,
- Tushar.


[jira] [Created] (APEXMALHAR-2250) AbstractFileInputOperator.DirectoryScanner does not handle directories correctly.

2016-09-20 Thread Tushar Gosavi (JIRA)
Tushar Gosavi created APEXMALHAR-2250:
-

 Summary: AbstractFileInputOperator.DirectoryScanner does not 
handle directories correctly.
 Key: APEXMALHAR-2250
 URL: https://issues.apache.org/jira/browse/APEXMALHAR-2250
 Project: Apache Apex Malhar
  Issue Type: Bug
Reporter: Tushar Gosavi


The default DirectoryScanner defined in AbstractFileInputOperator does not 
handle directories correctly. If there is a directory in the configured path, 
it gets added as a file to the pendingFile list, and when the operator tries 
to open it for reading, it fails. The operator keeps retrying for the 
configured number of times and then ignores the file.

The fix would be to not return directory names among the scanned file names in 
the first place.
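
A minimal sketch of the proposed fix, filtering directories out at scan time.
It uses java.nio.file on a local directory as a stand-in for the file system
API the operator works against; FileOnlyScanSketch and scanFiles are
hypothetical names, not the DirectoryScanner code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class FileOnlyScanSketch {
  // Returns the sorted names of regular files directly under dir.
  // Sub-directories are left out, so they never reach the pending list.
  static List<String> scanFiles(Path dir) {
    try (Stream<Path> entries = Files.list(dir)) {
      return entries
          .filter(Files::isRegularFile)
          .map(p -> p.getFileName().toString())
          .sorted()
          .collect(Collectors.toList());
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Scans a temporary directory holding two files and one sub-directory.
  static List<String> demo() {
    try {
      Path dir = Files.createTempDirectory("scan");
      Files.createFile(dir.resolve("a.txt"));
      Files.createFile(dir.resolve("b.txt"));
      Files.createDirectory(dir.resolve("subdir"));
      return scanFiles(dir);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Filtering at scan time avoids the open-and-retry cycle entirely, since the
directory is never queued as a file.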


