ASF GitHub Bot commented on PIO-106:

GitHub user mars opened a pull request:


    [PIO-106] Elasticsearch 5.x StorageClient should reuse RestClient

    Implements [PIO-106](https://issues.apache.org/jira/browse/PIO-106)
    This PR moves to a singleton Elasticsearch RestClient which has built-in 
HTTP keep-alive and TCP connection pooling. Running on this branch, we've seen 
a 2x speed-up in predictions from the Universal Recommender with ES5, and the 
feared "cannot bind" 😱 Elasticsearch connection errors have completely 
disappeared. Running `pio batchpredict` for 170K queries results in only 7 
total TCP connections to Elasticsearch. Previously that would escalate to 
~25,000 connections before denying further connections.
    **This fundamentally changes the interface for the new [Elasticsearch 5.x 
 introduced with PredictionIO 0.11.0-incubating. With this changeset, the 
`client` is a single instance of 
    🚨 **As a result of this change, any engine templates that directly use the 
Elasticsearch 5 StorageClient would require an update for compatibility.** The 
change is this:
    ### Original 
    val client: StorageClient = … // code to instantiate client
    val restClient: RestClient = client.open()
    try {
    } finally {
    ### With this PR
    val client: RestClient = … // code to instantiate client
    *No more balancing `open` & `close` as this is handled by using a new 
`CleanupFunctions` hook added to the framework in this PR.*
    [Universal Recommender](https://github.com/actionml/universal-recommender) 
is the only template that I know of which directly uses the ES StorageClient 
outside of PredictionIO core. See the [UR changes for compatibility with this 
    ### Elasticsearch StorageClient changes
    * reimplemented as singleton
    * installs a cleanup function
    ### Core changes
    A new 
 hook has been added which enables developers of storage modules to register 
anonymous functions with `CleanupFunctions.add { … }` to be executed after 
Spark-related commands/workflows. The hook is called in a `finally { 
CleanupFunctions.run() }` from within:
    * `pio import`
    * `pio export`
    * `pio train`
    * `pio batchpredict`
    Apologies for the huge indentation shifts from the requisite try-finally 
    try {
      // Freshly indented code.
    } finally {

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mars/incubator-predictionio esclient-singleton

Alternatively you can review and apply these changes as the patch at:


To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #420
commit f30f27bcc09a397efb42a7923938beceaeac37bf
Author: Mars Hall <m...@heroku.com>
Date:   2017-08-08T23:29:15Z

    Migrate to singleton Elasticsearch client to use underlying connection 
pooling (PoolingNHttpClientConnectionManager)

commit d99927089a41cb85f525cb74bdf394eed4686bf2
Author: Mars Hall <m...@heroku.com>
Date:   2017-08-10T03:00:58Z

    Log stacktrace for Storage initialization errors.

commit dc4c31cbcddbb3b281d52b8099e210adc546d1ed
Author: Mars Hall <m...@heroku.com>
Date:   2017-08-10T22:55:38Z

    Remove shade rule that breaks Elasticsearch 5 client

commit 7634a7ab720239d5f8efda85f67b26bdaff797f8
Author: Mars Hall <m...@heroku.com>
Date:   2017-08-10T22:59:01Z

    Collect & run cleanup functions to allow spark-submit processes to end 

commit 5953451f40e554eafa887328122c794edbbd8f1d
Author: Mars Hall <m...@heroku.com>
Date:   2017-08-11T00:06:24Z

    Rename CleanupFunctions to match object name


> Elasticsearch 5.x StorageClient should reuse RestClient
> -------------------------------------------------------
>                 Key: PIO-106
>                 URL: https://issues.apache.org/jira/browse/PIO-106
>             Project: PredictionIO
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.11.0-incubating
>            Reporter: Mars Hall
>            Assignee: Mars Hall
> When using the proposed [PIO-105 Batch 
> Predictions|https://issues.apache.org/jira/browse/PIO-105] feature with an 
> engine that queries Elasticsearch in {{Algorithm#predict}}, Elasticsearch's 
> REST interface appears to become overloaded, ending with the Spark job being 
> killed from errors like:
> {noformat}
> [ERROR] [ESChannels] Failed to access to /pio_meta/channels/_search
> [ERROR] [Utils] Aborting task
> [ERROR] [ESApps] Failed to access to /pio_meta/apps/_search
> [ERROR] [Executor] Exception in task 747.0 in stage 1.0 (TID 749)
> [ERROR] [Executor] Exception in task 735.0 in stage 1.0 (TID 737)
> [ERROR] [Common$] Invalid app name ur
> [ERROR] [Utils] Aborting task
> [ERROR] [URAlgorithm] Error when read recent events: 
> java.lang.IllegalArgumentException: Invalid app name ur
> [ERROR] [Executor] Exception in task 749.0 in stage 1.0 (TID 751)
> [ERROR] [Utils] Aborting task
> [ERROR] [Executor] Exception in task 748.0 in stage 1.0 (TID 750)
> [WARN] [TaskSetManager] Lost task 749.0 in stage 1.0 (TID 751, localhost, 
> executor driver): java.net.BindException: Can't assign requested address
>   at sun.nio.ch.Net.connect0(Native Method)
>   at sun.nio.ch.Net.connect(Net.java:454)
>   at sun.nio.ch.Net.connect(Net.java:446)
>   at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:648)
>   at 
> org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor.processSessionRequests(DefaultConnectingIOReactor.java:273)
>   at 
> org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor.processEvents(DefaultConnectingIOReactor.java:139)
>   at 
> org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor.execute(AbstractMultiworkerIOReactor.java:348)
>   at 
> org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager.execute(PoolingNHttpClientConnectionManager.java:192)
>   at 
> org.apache.http.impl.nio.client.CloseableHttpAsyncClientBase$1.run(CloseableHttpAsyncClientBase.java:64)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> After these errors happen & the job is killed, Elasticsearch immediately 
> recovers. It responds to queries normally. I researched what could cause this 
> and found an [old issue in the main Elasticsearch 
> repo|https://github.com/elastic/elasticsearch/issues/3647]. With the hints 
> given therein about *using keep-alive in the ES client* to avoid these 
> performance issues, I investigated how PredictionIO's [Elasticsearch 
> StorageClient|https://github.com/apache/incubator-predictionio/tree/develop/storage/elasticsearch/src/main/scala/org/apache/predictionio/data/storage/elasticsearch]
>  manages its connections.
> I found that unlike the other StorageClients (Elasticsearch1, HBase, JDBC), 
> Elasticsearch creates a new underlying connection, an Elasticsearch 
> RestClient, for 
> [every|https://github.com/apache/incubator-predictionio/blob/develop/storage/elasticsearch/src/main/scala/org/apache/predictionio/data/storage/elasticsearch/ESApps.scala#L80]
> [single|https://github.com/apache/incubator-predictionio/blob/develop/storage/elasticsearch/src/main/scala/org/apache/predictionio/data/storage/elasticsearch/ESApps.scala#L157]
> [query|https://github.com/apache/incubator-predictionio/blob/develop/storage/elasticsearch/src/main/scala/org/apache/predictionio/data/storage/elasticsearch/ESChannels.scala#L78]
>  & 
> [interaction|https://github.com/apache/incubator-predictionio/blob/develop/storage/elasticsearch/src/main/scala/org/apache/predictionio/data/storage/elasticsearch/ESEngineInstances.scala#L205]
>  with its API. As a result, *there is no way Elasticsearch TCP connections 
> can be reused via HTTP keep-alive*.
> High-performance workloads with Elasticsearch 5.x will suffer from these 
> issues unless we refactor Elasticsearch StorageClient to share the underlying 
> RestClient instead of [building a new one everytime the client is 
> used|https://github.com/apache/incubator-predictionio/blob/develop/storage/elasticsearch/src/main/scala/org/apache/predictionio/data/storage/elasticsearch/StorageClient.scala#L31].
> There are certainly different approaches we could take to sharing a 
> RestClient so that its keep-alive behavior may work as designed:
> * maintain a singleton RestClient that is reused throughout the ES storage 
> classes
> * create a RestClient on-demand and pass it as an argument to ES storage 
> methods
> * other ideas?

This message was sent by Atlassian JIRA

Reply via email to