[GitHub] incubator-predictionio pull request #420: [PIO-106] Elasticsearch 5.x Storag...

2017-08-14 Thread mars
Github user mars closed the pull request at:

https://github.com/apache/incubator-predictionio/pull/420


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-predictionio pull request #420: [PIO-106] Elasticsearch 5.x Storag...

2017-08-10 Thread mars
Github user mars commented on a diff in the pull request:


https://github.com/apache/incubator-predictionio/pull/420#discussion_r132601642
  
--- Diff: 
storage/elasticsearch/src/main/scala/org/apache/predictionio/data/storage/elasticsearch/ESEvaluationInstances.scala
 ---
@@ -110,28 +104,24 @@ class ESEvaluationInstances(client: ESClient, config: 
StorageClientConfig, index
 error(s"Failed to access to /$index/$estype/$id", e)
 None
 } finally {
-  restClient.close()
+  client.close()
--- End diff --

This `close` should be removed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-predictionio pull request #420: [PIO-106] Elasticsearch 5.x Storag...

2017-08-10 Thread mars
GitHub user mars opened a pull request:

https://github.com/apache/incubator-predictionio/pull/420

[PIO-106] Elasticsearch 5.x StorageClient should reuse RestClient

Implements [PIO-106](https://issues.apache.org/jira/browse/PIO-106)

This PR moves to a singleton Elasticsearch RestClient which has built-in 
HTTP keep-alive and TCP connection pooling. Running on this branch, we've seen 
a 2x speed-up in predictions from the Universal Recommender with ES5, and the 
feared "cannot bind" 😱 Elasticsearch connection errors have completely 
disappeared. Running `pio batchpredict` for 170K queries results in only 7 
total TCP connections to Elasticsearch. Previously that would escalate to 
~25,000 connections before denying further connections.

**This fundamentally changes the interface for the new [Elasticsearch 5.x 
REST 
client](https://github.com/apache/incubator-predictionio/tree/develop/storage/elasticsearch/src/main/scala/org/apache/predictionio/data/storage/elasticsearch)**
 introduced with PredictionIO 0.11.0-incubating. With this changeset, the 
`client` is a single instance of 
[`org.elasticsearch.client.RestClient`](https://github.com/elastic/elasticsearch/blob/master/client/rest/src/main/java/org/elasticsearch/client/RestClient.java).

🚨 **As a result of this change, any engine templates that directly use 
the Elasticsearch 5 StorageClient would require an update for compatibility.** 
The change is this:

### Original 

```scala
val client: StorageClient = … // code to instantiate client
val restClient: RestClient = client.open()
try {
  restClient.performRequest(…)
} finally {
  restClient.close()
}
```

### With this PR

```scala
val client: RestClient = … // code to instantiate client
client.performRequest(…)
```

*No more balancing `open` & `close` as this is handled by using a new 
`CleanupFunctions` hook added to the framework in this PR.*

[Universal Recommender](https://github.com/actionml/universal-recommender) 
is the only template that I know of which directly uses the ES StorageClient 
outside of PredictionIO core. See the [UR changes for compatibility with this 
PR](https://github.com/heroku/predictionio-engine-ur/compare/esclient-singleton).

### Elasticsearch StorageClient changes

* reimplemented as singleton
* installs a cleanup function

See 
[StorageClient](https://github.com/apache/incubator-predictionio/compare/develop...mars:esclient-singleton?expand=1#diff-2926f4cfd93ccb02320e2a9503ccd223)

### Core changes

A new 
[`CleanupFunctions`](https://github.com/apache/incubator-predictionio/compare/develop...mars:esclient-singleton?expand=1#diff-2a958821ac58f019fbce38540c775f19)
 hook has been added which enables developers of storage modules to register 
anonymous functions with `CleanupFunctions.add { … }` to be executed after 
Spark-related commands/workflows. The hook is called in a `finally { 
CleanupFunctions.run() }` from within:

* `pio import`
* `pio export`
* `pio train`
* `pio batchpredict`

Apologies for the huge indentation shifts from the requisite try-finally 
blocks:

```scala
try {
  // Freshly indented code.
} finally {
  CleanupFunctions.run()
}
```

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mars/incubator-predictionio esclient-singleton

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-predictionio/pull/420.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #420


commit f30f27bcc09a397efb42a7923938beceaeac37bf
Author: Mars Hall 
Date:   2017-08-08T23:29:15Z

Migrate to singleton Elasticsearch client to use underlying connection 
pooling (PoolingNHttpClientConnectionManager)

commit d99927089a41cb85f525cb74bdf394eed4686bf2
Author: Mars Hall 
Date:   2017-08-10T03:00:58Z

Log stacktrace for Storage initialization errors.

commit dc4c31cbcddbb3b281d52b8099e210adc546d1ed
Author: Mars Hall 
Date:   2017-08-10T22:55:38Z

Remove shade rule that breaks Elasticsearch 5 client

commit 7634a7ab720239d5f8efda85f67b26bdaff797f8
Author: Mars Hall 
Date:   2017-08-10T22:59:01Z

Collect & run cleanup functions to allow spark-submit processes to end 
gracefully.

commit 5953451f40e554eafa887328122c794edbbd8f1d
Author: Mars Hall 
Date:   2017-08-11T00:06:24Z

Rename CleanupFunctions to match object name




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is