[jira] [Updated] (PIO-114) Elasticsearch 5.x StorageClient basic HTTP authentication

2017-08-14 Thread Mars Hall (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIO-114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mars Hall updated PIO-114:
--
External issue URL: 
https://github.com/apache/incubator-predictionio/pull/421

> Elasticsearch 5.x StorageClient basic HTTP authentication
> -
>
> Key: PIO-114
> URL: https://issues.apache.org/jira/browse/PIO-114
> Project: PredictionIO
>  Issue Type: New Feature
>  Components: Core
>Affects Versions: 0.11.0-incubating
>Reporter: Mars Hall
>Assignee: Mars Hall
>
> Add optional username-password configuration for the new Elasticsearch 5 
> client; in {{conf/pio-env.sh}} config:
> {code}
> # Optional basic HTTP auth
> PIO_STORAGE_SOURCES_ELASTICSEARCH_USERNAME=my-name
> PIO_STORAGE_SOURCES_ELASTICSEARCH_PASSWORD=my-secret
> {code}
> These credentials are sent in each Elasticsearch request as an HTTP Basic 
> Authorization header.
> Enables use of public-cloud, hosted Elasticsearch clusters, such as [Bonsai 
> on Heroku](https://elements.heroku.com/addons/bonsai).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[GitHub] incubator-predictionio pull request #421: Elasticsearch singleton client wit...

2017-08-14 Thread mars
GitHub user mars opened a pull request:

https://github.com/apache/incubator-predictionio/pull/421

Elasticsearch singleton client with authentication

Fixes both [PIO-106](https://issues.apache.org/jira/browse/PIO-106) & 
[PIO-114](https://issues.apache.org/jira/browse/PIO-114), replacing 
https://github.com/apache/incubator-predictionio/pull/372. These are combined 
because they each heavily revise the same class.

## Authentication

Adds optional username and password configuration for the new Elasticsearch 5 
client. In `conf/pio-env.sh`:

```bash
# Optional basic HTTP auth
PIO_STORAGE_SOURCES_ELASTICSEARCH_USERNAME=my-name
PIO_STORAGE_SOURCES_ELASTICSEARCH_PASSWORD=my-secret
```

These credentials are sent with each Elasticsearch request in an HTTP Basic 
Authorization header.

This enables use of public-cloud, hosted Elasticsearch clusters, such as [Bonsai 
on Heroku](https://elements.heroku.com/addons/bonsai).
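
As a rough illustration of what "an HTTP Basic Authorization header" means here, the sketch below builds the header value from a username and password using only the Java standard library. The object name `BasicAuthSketch` and the idea of reading the settings from a plain `Map` are assumptions made for this example; the PR itself may wire the credentials into the client differently.

```scala
import java.nio.charset.StandardCharsets
import java.util.Base64

// Hypothetical helper, not part of PredictionIO.
object BasicAuthSketch {
  // Basic auth header value: "Basic " + base64("username:password").
  def headerValue(username: String, password: String): String = {
    val raw = s"$username:$password".getBytes(StandardCharsets.UTF_8)
    "Basic " + Base64.getEncoder.encodeToString(raw)
  }

  // Yields a header only when both settings are present, since the
  // configuration is optional.
  def fromEnv(env: Map[String, String]): Option[String] =
    for {
      user <- env.get("PIO_STORAGE_SOURCES_ELASTICSEARCH_USERNAME")
      pass <- env.get("PIO_STORAGE_SOURCES_ELASTICSEARCH_PASSWORD")
    } yield headerValue(user, pass)
}
```

Calling `BasicAuthSketch.fromEnv(sys.env)` would return `None` when either variable is unset, so unauthenticated clusters keep working unchanged.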

## Singleton client

This PR moves to a singleton Elasticsearch RestClient, which has built-in 
HTTP keep-alive and TCP connection pooling. Running on this branch, we've seen 
a 2x speed-up in predictions from the Universal Recommender with ES5, and the 
dreaded "cannot assign requested address" 😱 Elasticsearch connection errors 
have completely disappeared. Running `pio batchpredict` for 160K queries 
results in only 7 total TCP connections to Elasticsearch. Previously, the 
connection count would climb to ~25,000 before further connections were refused.

**This fundamentally changes the interface for the new [Elasticsearch 5.x REST client](https://github.com/apache/incubator-predictionio/tree/develop/storage/elasticsearch/src/main/scala/org/apache/predictionio/data/storage/elasticsearch)** 
introduced with PredictionIO 0.11.0-incubating. With this changeset, the 
`client` is a single instance of 
[`org.elasticsearch.client.RestClient`](https://github.com/elastic/elasticsearch/blob/master/client/rest/src/main/java/org/elasticsearch/client/RestClient.java).

🚨 **As a result of this change, any engine template that directly uses 
the Elasticsearch 5 StorageClient will require an update for compatibility.** 
The change looks like this:

### Original 

```scala
val client: StorageClient = … // code to instantiate client
val restClient: RestClient = client.open()
try {
  restClient.performRequest(…)
} finally {
  restClient.close()
}
```

### With this PR

```scala
val client: RestClient = … // code to instantiate client
client.performRequest(…)
```

*There is no longer any need to balance `open` & `close`; cleanup is handled by a new 
`CleanupFunctions` hook that this PR adds to the framework.*

[Universal Recommender](https://github.com/actionml/universal-recommender) 
is the only template that I know of which directly uses the ES StorageClient 
outside of PredictionIO core. See the example 
[UR changes for compatibility with this PR](https://github.com/heroku/predictionio-engine-ur/compare/esclient-singleton).

### Elasticsearch StorageClient changes

* reimplemented as singleton
* installs a cleanup function

See 
[StorageClient](https://github.com/apache/incubator-predictionio/compare/develop...mars:esclient-singleton?expand=1#diff-2926f4cfd93ccb02320e2a9503ccd223)
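
To make the singleton idea concrete, here is a minimal sketch of sharing one lazily created client across all callers. `PooledClient` is a hypothetical stand-in for an expensive, connection-pooling client such as `RestClient`, and `SharedClient` is an illustrative name; neither is taken from the PR, whose actual implementation is in the linked diff.

```scala
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical stand-in for a client that owns a connection pool.
final class PooledClient(val id: Int) {
  def performRequest(path: String): String = s"client-$id GET $path"
}

// Illustrative singleton holder, not PredictionIO's StorageClient.
object SharedClient {
  private val created = new AtomicInteger(0)

  // A `lazy val` is initialized once, thread-safely, on first access,
  // so every caller reuses the same instance and its connections.
  lazy val client: PooledClient = new PooledClient(created.incrementAndGet())

  // Number of client instances constructed so far (for demonstration).
  def instancesCreated: Int = created.get()
}
```

However many times `SharedClient.client.performRequest(...)` is called, only one `PooledClient` is ever constructed, which is the property that lets keep-alive and pooling take effect.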

### Core changes

A new 
[`CleanupFunctions`](https://github.com/apache/incubator-predictionio/compare/develop...mars:esclient-singleton?expand=1#diff-2a958821ac58f019fbce38540c775f19) 
hook has been added which enables developers of storage modules to register 
anonymous functions with `CleanupFunctions.add { … }` to be executed after 
Spark-related commands/workflows. The hook is called in a 
`finally { CleanupFunctions.run() }` from within:

* `pio import`
* `pio export`
* `pio train`
* `pio batchpredict`

Apologies for the huge indentation shifts from the requisite try-finally 
blocks:

```scala
try {
  // Freshly indented code.
} finally {
  CleanupFunctions.run()
}
```
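
For readers unfamiliar with this kind of hook, the following is a minimal sketch of how a registry supporting `add { … }` and `run()` could work. It is deliberately named `CleanupHooksSketch` because it is an assumption-laden illustration, not the PR's `CleanupFunctions`; the real hook may differ in ordering, error handling, and thread safety.

```scala
import scala.collection.mutable.ListBuffer

// Hypothetical registry, written only to illustrate the pattern.
object CleanupHooksSketch {
  private val hooks = ListBuffer.empty[() => Unit]

  // The by-name parameter allows the `add { ... }` call style; the block
  // is stored and not evaluated until `run()` is called.
  def add(hook: => Unit): Unit = synchronized { hooks += (() => hook) }

  // Runs every registered block in registration order.
  def run(): Unit = synchronized { hooks.foreach(h => h()) }
}

object CleanupDemo {
  def main(args: Array[String]): Unit = {
    // A storage module would register its cleanup when it creates the client.
    CleanupHooksSketch.add { println("closing shared client") }
    try {
      println("running workflow")
    } finally {
      CleanupHooksSketch.run()
    }
  }
}
```

Because the hook runs in `finally`, registered cleanup executes even when the workflow throws, which is why callers no longer need their own `close()`.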

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mars/incubator-predictionio esclient-singleton-with-auth

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-predictionio/pull/421.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #421


commit f30f27bcc09a397efb42a7923938beceaeac37bf
Author: Mars Hall 
Date:   2017-08-08T23:29:15Z

Migrate to singleton Elasticsearch client to use underlying connection 
pooling (PoolingNHttpClientConnectionManager)

commit d99927089a41cb85f525cb74bdf394eed4686bf2
Author: Mars Hall 
Date:   2017-08-10T03:00:58Z


[GitHub] incubator-predictionio pull request #420: [PIO-106] Elasticsearch 5.x Storag...

2017-08-14 Thread mars
Github user mars closed the pull request at:

https://github.com/apache/incubator-predictionio/pull/420


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-predictionio issue #420: [PIO-106] Elasticsearch 5.x StorageClient...

2017-08-14 Thread mars
Github user mars commented on the issue:

https://github.com/apache/incubator-predictionio/pull/420
  
Closing in favor of 
https://github.com/apache/incubator-predictionio/pull/421




[GitHub] incubator-predictionio issue #372: Elasticsearch basic HTTP authentication

2017-08-14 Thread mars
Github user mars commented on the issue:

https://github.com/apache/incubator-predictionio/pull/372
  
Closing in favor of 
https://github.com/apache/incubator-predictionio/pull/421





[jira] [Commented] (PIO-106) Elasticsearch 5.x StorageClient should reuse RestClient

2017-08-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PIO-106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16126322#comment-16126322
 ] 

ASF GitHub Bot commented on PIO-106:


Github user mars closed the pull request at:

https://github.com/apache/incubator-predictionio/pull/420


> Elasticsearch 5.x StorageClient should reuse RestClient
> ---
>
> Key: PIO-106
> URL: https://issues.apache.org/jira/browse/PIO-106
> Project: PredictionIO
>  Issue Type: Improvement
>  Components: Core
>Affects Versions: 0.11.0-incubating
>Reporter: Mars Hall
>Assignee: Mars Hall
>
> When using the proposed [PIO-105 Batch 
> Predictions|https://issues.apache.org/jira/browse/PIO-105] feature with an 
> engine that queries Elasticsearch in {{Algorithm#predict}}, Elasticsearch's 
> REST interface appears to become overloaded, ending with the Spark job being 
> killed from errors like:
> {noformat}
> [ERROR] [ESChannels] Failed to access to /pio_meta/channels/_search
> [ERROR] [Utils] Aborting task
> [ERROR] [ESApps] Failed to access to /pio_meta/apps/_search
> [ERROR] [Executor] Exception in task 747.0 in stage 1.0 (TID 749)
> [ERROR] [Executor] Exception in task 735.0 in stage 1.0 (TID 737)
> [ERROR] [Common$] Invalid app name ur
> [ERROR] [Utils] Aborting task
> [ERROR] [URAlgorithm] Error when read recent events: 
> java.lang.IllegalArgumentException: Invalid app name ur
> [ERROR] [Executor] Exception in task 749.0 in stage 1.0 (TID 751)
> [ERROR] [Utils] Aborting task
> [ERROR] [Executor] Exception in task 748.0 in stage 1.0 (TID 750)
> [WARN] [TaskSetManager] Lost task 749.0 in stage 1.0 (TID 751, localhost, 
> executor driver): java.net.BindException: Can't assign requested address
>   at sun.nio.ch.Net.connect0(Native Method)
>   at sun.nio.ch.Net.connect(Net.java:454)
>   at sun.nio.ch.Net.connect(Net.java:446)
>   at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:648)
>   at 
> org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor.processSessionRequests(DefaultConnectingIOReactor.java:273)
>   at 
> org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor.processEvents(DefaultConnectingIOReactor.java:139)
>   at 
> org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor.execute(AbstractMultiworkerIOReactor.java:348)
>   at 
> org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager.execute(PoolingNHttpClientConnectionManager.java:192)
>   at 
> org.apache.http.impl.nio.client.CloseableHttpAsyncClientBase$1.run(CloseableHttpAsyncClientBase.java:64)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> After these errors happen & the job is killed, Elasticsearch immediately 
> recovers. It responds to queries normally. I researched what could cause this 
> and found an [old issue in the main Elasticsearch 
> repo|https://github.com/elastic/elasticsearch/issues/3647]. With the hints 
> given therein about *using keep-alive in the ES client* to avoid these 
> performance issues, I investigated how PredictionIO's [Elasticsearch 
> StorageClient|https://github.com/apache/incubator-predictionio/tree/develop/storage/elasticsearch/src/main/scala/org/apache/predictionio/data/storage/elasticsearch]
>  manages its connections.
> I found that unlike the other StorageClients (Elasticsearch1, HBase, JDBC), 
> the Elasticsearch 5.x StorageClient creates a new underlying connection, an 
> Elasticsearch RestClient, for 
> [every|https://github.com/apache/incubator-predictionio/blob/develop/storage/elasticsearch/src/main/scala/org/apache/predictionio/data/storage/elasticsearch/ESApps.scala#L80]
>  
> [single|https://github.com/apache/incubator-predictionio/blob/develop/storage/elasticsearch/src/main/scala/org/apache/predictionio/data/storage/elasticsearch/ESApps.scala#L157]
>  
> [query|https://github.com/apache/incubator-predictionio/blob/develop/storage/elasticsearch/src/main/scala/org/apache/predictionio/data/storage/elasticsearch/ESChannels.scala#L78]
>  & 
> [interaction|https://github.com/apache/incubator-predictionio/blob/develop/storage/elasticsearch/src/main/scala/org/apache/predictionio/data/storage/elasticsearch/ESEngineInstances.scala#L205]
>  with its API. As a result, *there is no way Elasticsearch TCP connections 
> can be reused via HTTP keep-alive*.
> High-performance workloads with Elasticsearch 5.x will suffer from these 
> issues unless we refactor Elasticsearch StorageClient to share the underlying 
> RestClient instead of [building a new one every time the client is 
> used|https://github.com/apache/incubator-predictionio/blob/develop/storage/elasticsearch/src/main/scala/org/apache/predictionio/data/storage/elasticsearch/StorageClient.scala#L31].
> There are certainly different approaches we could take to sharing a 
> RestClient so that its keep-alive behavior may work as designed:
> * maintain a 

[jira] [Updated] (PIO-114) Elasticsearch 5.x StorageClient basic HTTP authentication

2017-08-14 Thread Mars Hall (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIO-114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mars Hall updated PIO-114:
--
Description: 
Add optional username-password configuration for the new Elasticsearch 5 
client; in {{conf/pio-env.sh}} config:


{code}
# Optional basic HTTP auth
PIO_STORAGE_SOURCES_ELASTICSEARCH_USERNAME=my-name
PIO_STORAGE_SOURCES_ELASTICSEARCH_PASSWORD=my-secret
{code}

These credentials are sent in each Elasticsearch request as an HTTP Basic 
Authorization header.

Enables use of public-cloud, hosted Elasticsearch clusters, such as [Bonsai on 
Heroku](https://elements.heroku.com/addons/bonsai).

  was:
Add optional username-password configuration for the new Elasticsearch 5 
client; in {conf/pio-env.sh} config:


{code}
# Optional basic HTTP auth
PIO_STORAGE_SOURCES_ELASTICSEARCH_USERNAME=my-name
PIO_STORAGE_SOURCES_ELASTICSEARCH_PASSWORD=my-secret
{code}

These credentials are sent in each Elasticsearch request as an HTTP Basic 
Authorization header.

Enables use of public-cloud, hosted Elasticsearch clusters, such as [Bonsai on 
Heroku](https://elements.heroku.com/addons/bonsai).


> Elasticsearch 5.x StorageClient basic HTTP authentication
> -
>
> Key: PIO-114
> URL: https://issues.apache.org/jira/browse/PIO-114
> Project: PredictionIO
>  Issue Type: New Feature
>  Components: Core
>Affects Versions: 0.11.0-incubating
>Reporter: Mars Hall
>Assignee: Mars Hall
>
> Add optional username-password configuration for the new Elasticsearch 5 
> client; in {{conf/pio-env.sh}} config:
> {code}
> # Optional basic HTTP auth
> PIO_STORAGE_SOURCES_ELASTICSEARCH_USERNAME=my-name
> PIO_STORAGE_SOURCES_ELASTICSEARCH_PASSWORD=my-secret
> {code}
> These credentials are sent in each Elasticsearch request as an HTTP Basic 
> Authorization header.
> Enables use of public-cloud, hosted Elasticsearch clusters, such as [Bonsai 
> on Heroku](https://elements.heroku.com/addons/bonsai).





[jira] [Created] (PIO-114) Elasticsearch 5.x StorageClient basic HTTP authentication

2017-08-14 Thread Mars Hall (JIRA)
Mars Hall created PIO-114:
-

 Summary: Elasticsearch 5.x StorageClient basic HTTP authentication
 Key: PIO-114
 URL: https://issues.apache.org/jira/browse/PIO-114
 Project: PredictionIO
  Issue Type: New Feature
  Components: Core
Affects Versions: 0.11.0-incubating
Reporter: Mars Hall
Assignee: Mars Hall


Add optional username-password configuration for the new Elasticsearch 5 
client; in {conf/pio-env.sh} config:


{code:shell}
# Optional basic HTTP auth
PIO_STORAGE_SOURCES_ELASTICSEARCH_USERNAME=my-name
PIO_STORAGE_SOURCES_ELASTICSEARCH_PASSWORD=my-secret
{code}

These credentials are sent in each Elasticsearch request as an HTTP Basic 
Authorization header.

Enables use of public-cloud, hosted Elasticsearch clusters, such as [Bonsai on 
Heroku](https://elements.heroku.com/addons/bonsai).


