[jira] [Resolved] (PIO-115) Cache name-to-ID lookups for Storage app & channel

2017-08-29 Thread Mars Hall (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIO-115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mars Hall resolved PIO-115.
---
Resolution: Fixed

> Cache name-to-ID lookups for Storage app & channel
> --
>
> Key: PIO-115
> URL: https://issues.apache.org/jira/browse/PIO-115
> Project: PredictionIO
> Issue Type: Improvement
> Components: Core
> Affects Versions: 0.11.0-incubating
> Reporter: Mars Hall
> Assignee: Mars Hall
>
> When stress testing the Universal Recommender with high-concurrency HTTP/REST 
> queries, we observed that the majority of Elasticsearch traffic consisted of 
> requests resolving the Storage app's name & channel, over and over 
> again. In this case, [each per-query call to 
> `LEventStore.findByEntity`|https://github.com/heroku/predictionio-engine-ur/blob/master/src/main/scala/URAlgorithm.scala#L694]
>  re-resolves the app name to an ID.
> Implement memoization for the function that performs these name-to-ID 
> lookups, so that only one set of lookups is performed per process for each 
> app+channel combination.
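
The memoization requested above could look roughly like the following Python sketch. It is not the PredictionIO implementation (which is Scala); the `APPS` and `CHANNELS` tables, the `stats` counter, and the function names are all hypothetical stand-ins for the storage backend, used only to show that repeated lookups for the same app+channel pair reach storage once per process.

```python
from functools import lru_cache
from typing import Optional, Tuple

# Hypothetical in-memory stand-ins for the metadata store (assumptions).
APPS = {"ur": 1}
CHANNELS = {(1, "mobile"): 7}
stats = {"lookups": 0}  # counts simulated storage round trips


def resolve_uncached(app_name: str, channel_name: Optional[str]) -> Tuple[int, Optional[int]]:
    """Resolve names to IDs; every call simulates one set of storage queries."""
    stats["lookups"] += 1
    if app_name not in APPS:
        raise ValueError(f"Invalid app name {app_name}")
    app_id = APPS[app_name]
    if channel_name is None:
        return app_id, None
    key = (app_id, channel_name)
    if key not in CHANNELS:
        raise ValueError(f"Invalid channel name {channel_name}")
    return app_id, CHANNELS[key]


@lru_cache(maxsize=None)
def resolve(app_name: str, channel_name: Optional[str]) -> Tuple[int, Optional[int]]:
    """Memoized wrapper: one set of lookups per process per app+channel pair."""
    return resolve_uncached(app_name, channel_name)
```

Because `lru_cache` does not cache raised exceptions, a failed lookup (for example, an app that does not exist yet) is retried on the next call rather than being remembered. Whether stale entries need to be invalidated when an app is renamed or deleted is a design question this sketch does not address.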



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (PIO-114) Elasticsearch 5.x StorageClient basic HTTP authentication

2017-08-29 Thread Mars Hall (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIO-114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mars Hall resolved PIO-114.
---
Resolution: Fixed

> Elasticsearch 5.x StorageClient basic HTTP authentication
> -
>
> Key: PIO-114
> URL: https://issues.apache.org/jira/browse/PIO-114
> Project: PredictionIO
> Issue Type: New Feature
> Components: Core
> Affects Versions: 0.11.0-incubating
> Reporter: Mars Hall
> Assignee: Mars Hall
>
> Add optional username-password configuration for the new Elasticsearch 5 
> client; in {{conf/pio-env.sh}} config:
> {code}
> # Optional basic HTTP auth
> PIO_STORAGE_SOURCES_ELASTICSEARCH_USERNAME=my-name
> PIO_STORAGE_SOURCES_ELASTICSEARCH_PASSWORD=my-secret
> {code}
> These credentials are sent with each Elasticsearch request in an HTTP Basic 
> Authorization header.
> This enables the use of hosted, public-cloud Elasticsearch clusters, such as [Bonsai 
> on Heroku|https://elements.heroku.com/addons/bonsai].
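
For readers unfamiliar with HTTP Basic authentication, the header described above is the Base64 encoding of `username:password`, prefixed with `Basic `. The Python sketch below shows only that encoding step. The helper names are hypothetical, and treating a missing username or password as "send no auth header" is an assumption about how the optional setting behaves, not something the issue states.

```python
import base64
from typing import Dict, Mapping


def basic_auth_header(username: str, password: str) -> Dict[str, str]:
    """Build an HTTP Basic Authorization header from a username and password."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}


def auth_headers_from_env(env: Mapping[str, str]) -> Dict[str, str]:
    """Read the optional credentials named in the issue from an environment mapping.

    Assumption: if either variable is unset, no Authorization header is sent.
    """
    username = env.get("PIO_STORAGE_SOURCES_ELASTICSEARCH_USERNAME")
    password = env.get("PIO_STORAGE_SOURCES_ELASTICSEARCH_PASSWORD")
    if username is None or password is None:
        return {}
    return basic_auth_header(username, password)
```

Passing `os.environ` as `env` would read the values exported by `conf/pio-env.sh`. Basic credentials are only encoded, not encrypted, so they are normally sent over HTTPS.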





[jira] [Resolved] (PIO-106) Elasticsearch 5.x StorageClient should reuse RestClient

2017-08-29 Thread Mars Hall (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIO-106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mars Hall resolved PIO-106.
---
Resolution: Fixed

> Elasticsearch 5.x StorageClient should reuse RestClient
> ---
>
> Key: PIO-106
> URL: https://issues.apache.org/jira/browse/PIO-106
> Project: PredictionIO
> Issue Type: Improvement
> Components: Core
> Affects Versions: 0.11.0-incubating
> Reporter: Mars Hall
> Assignee: Mars Hall
>
> When using the proposed [PIO-105 Batch 
> Predictions|https://issues.apache.org/jira/browse/PIO-105] feature with an 
> engine that queries Elasticsearch in {{Algorithm#predict}}, Elasticsearch's 
> REST interface appears to become overloaded, and the Spark job ends up being 
> killed by errors like:
> {noformat}
> [ERROR] [ESChannels] Failed to access to /pio_meta/channels/_search
> [ERROR] [Utils] Aborting task
> [ERROR] [ESApps] Failed to access to /pio_meta/apps/_search
> [ERROR] [Executor] Exception in task 747.0 in stage 1.0 (TID 749)
> [ERROR] [Executor] Exception in task 735.0 in stage 1.0 (TID 737)
> [ERROR] [Common$] Invalid app name ur
> [ERROR] [Utils] Aborting task
> [ERROR] [URAlgorithm] Error when read recent events: 
> java.lang.IllegalArgumentException: Invalid app name ur
> [ERROR] [Executor] Exception in task 749.0 in stage 1.0 (TID 751)
> [ERROR] [Utils] Aborting task
> [ERROR] [Executor] Exception in task 748.0 in stage 1.0 (TID 750)
> [WARN] [TaskSetManager] Lost task 749.0 in stage 1.0 (TID 751, localhost, 
> executor driver): java.net.BindException: Can't assign requested address
>   at sun.nio.ch.Net.connect0(Native Method)
>   at sun.nio.ch.Net.connect(Net.java:454)
>   at sun.nio.ch.Net.connect(Net.java:446)
>   at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:648)
>   at 
> org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor.processSessionRequests(DefaultConnectingIOReactor.java:273)
>   at 
> org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor.processEvents(DefaultConnectingIOReactor.java:139)
>   at 
> org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor.execute(AbstractMultiworkerIOReactor.java:348)
>   at 
> org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager.execute(PoolingNHttpClientConnectionManager.java:192)
>   at 
> org.apache.http.impl.nio.client.CloseableHttpAsyncClientBase$1.run(CloseableHttpAsyncClientBase.java:64)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> After these errors happen & the job is killed, Elasticsearch immediately 
> recovers. It responds to queries normally. I researched what could cause this 
> and found an [old issue in the main Elasticsearch 
> repo|https://github.com/elastic/elasticsearch/issues/3647]. With the hints 
> given therein about *using keep-alive in the ES client* to avoid these 
> performance issues, I investigated how PredictionIO's [Elasticsearch 
> StorageClient|https://github.com/apache/incubator-predictionio/tree/develop/storage/elasticsearch/src/main/scala/org/apache/predictionio/data/storage/elasticsearch]
>  manages its connections.
> I found that unlike the other StorageClients (Elasticsearch1, HBase, JDBC), 
> Elasticsearch creates a new underlying connection, an Elasticsearch 
> RestClient, for 
> [every|https://github.com/apache/incubator-predictionio/blob/develop/storage/elasticsearch/src/main/scala/org/apache/predictionio/data/storage/elasticsearch/ESApps.scala#L80]
>  
> [single|https://github.com/apache/incubator-predictionio/blob/develop/storage/elasticsearch/src/main/scala/org/apache/predictionio/data/storage/elasticsearch/ESApps.scala#L157]
>  
> [query|https://github.com/apache/incubator-predictionio/blob/develop/storage/elasticsearch/src/main/scala/org/apache/predictionio/data/storage/elasticsearch/ESChannels.scala#L78]
>  & 
> [interaction|https://github.com/apache/incubator-predictionio/blob/develop/storage/elasticsearch/src/main/scala/org/apache/predictionio/data/storage/elasticsearch/ESEngineInstances.scala#L205]
>  with its API. As a result, *there is no way Elasticsearch TCP connections 
> can be reused via HTTP keep-alive*.
> High-performance workloads with Elasticsearch 5.x will suffer from these 
> issues unless we refactor Elasticsearch StorageClient to share the underlying 
> RestClient instead of [building a new one every time the client is 
> used|https://github.com/apache/incubator-predictionio/blob/develop/storage/elasticsearch/src/main/scala/org/apache/predictionio/data/storage/elasticsearch/StorageClient.scala#L31].
> There are certainly different approaches we could take to sharing a 
> RestClient so that its keep-alive behavior may work as designed:
> * maintain a singleton RestClient that is reused throughout the ES storage 
> classes
> * create a RestClient on-demand and pass it as an argument to ES 

[jira] [Commented] (PIO-106) Elasticsearch 5.x StorageClient should reuse RestClient

2017-08-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PIO-106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16145871#comment-16145871
 ] 

ASF GitHub Bot commented on PIO-106:


Github user asfgit closed the pull request at:

https://github.com/apache/incubator-predictionio/pull/421



[jira] [Commented] (PIO-115) Cache name-to-ID lookups for Storage app & channel

2017-08-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PIO-115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16145855#comment-16145855
 ] 

ASF GitHub Bot commented on PIO-115:


Github user asfgit closed the pull request at:

https://github.com/apache/incubator-predictionio/pull/424




[GitHub] incubator-predictionio issue #421: Elasticsearch 5.x singleton client with a...

2017-08-29 Thread mars
Github user mars commented on the issue:

https://github.com/apache/incubator-predictionio/pull/421
  
I will resolve these conflicts today and then merge this PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-predictionio pull request #427: [PIO-116] PySpark Support

2017-08-29 Thread marevol
GitHub user marevol opened a pull request:

https://github.com/apache/incubator-predictionio/pull/427

[PIO-116] PySpark Support

This PR provides PySpark support with minimal PIO changes.

1. Support pyspark in pio-shell
2. Add Python files to use pyspark
3. Add a --main-py-file option to "pio train" to submit a .py file to Spark

Note that these changes apply to Spark 2.x only
(because they expect to use SparkML).

A sample project is available at:
https://github.com/jpioug/predictionio-template-iris
(Scala code is used for the prediction API.)

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/marevol/incubator-predictionio pyspark

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-predictionio/pull/427.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #427


commit ee28fcf139c6ac8184d990cbdc4d43b00ff483fd
Author: Shinsuke Sugaya 
Date:   2017-08-22T09:47:05Z

add pyspark sub-command

commit 97f0343691ff1ca98f1ce65fc8ad3e25df6cd15b
Author: Shinsuke Sugaya 
Date:   2017-08-27T14:16:18Z

replace with values.toString

commit 2970397a6024f17872011979edcae1712f8a4362
Author: Shinsuke Sugaya 
Date:   2017-08-28T10:04:24Z

add --main-py-file option to train






[jira] [Commented] (PIO-116) PySpark Support

2017-08-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PIO-116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144973#comment-16144973
 ] 

ASF GitHub Bot commented on PIO-116:


GitHub user marevol opened a pull request:

https://github.com/apache/incubator-predictionio/pull/427


> PySpark Support
> ---
>
> Key: PIO-116
> URL: https://issues.apache.org/jira/browse/PIO-116
> Project: PredictionIO
> Issue Type: New Feature
> Components: Core
> Reporter: Shinsuke Sugaya
> Assignee: Shinsuke Sugaya
>
> This provides PySpark support with minimal PIO changes.
> 1. Support pyspark in pio-shell
> 2. Add Python files to use pyspark
> 3. Add a --main-py-file option to "pio train" to submit a .py file to Spark
> Note that these changes apply to Spark 2.x only
> (because they expect to use SparkML).
> A sample project is available at:
> https://github.com/jpioug/predictionio-template-iris
> (Scala code is used for the prediction API.)





[jira] [Created] (PIO-116) PySpark Support

2017-08-29 Thread Shinsuke Sugaya (JIRA)
Shinsuke Sugaya created PIO-116:
---

Summary: PySpark Support
Key: PIO-116
URL: https://issues.apache.org/jira/browse/PIO-116
Project: PredictionIO
Issue Type: New Feature
Components: Core
Reporter: Shinsuke Sugaya
Assignee: Shinsuke Sugaya


This provides PySpark support with minimal PIO changes.

1. Support pyspark in pio-shell
2. Add Python files to use pyspark
3. Add a --main-py-file option to "pio train" to submit a .py file to Spark

Note that these changes apply to Spark 2.x only
(because they expect to use SparkML).

A sample project is available at:
https://github.com/jpioug/predictionio-template-iris
(Scala code is used for the prediction API.)


