[GitHub] [metamodel] arjansh commented on a change in pull request #229: Upgrade to Elasticsearch 7.3.1

GitBox Fri, 18 Oct 2019 07:30:27 -0700

arjansh commented on a change in pull request #229: Upgrade to Elasticsearch 
7.3.1
URL: https://github.com/apache/metamodel/pull/229#discussion_r336520030


 ##########
 File path: 
elasticsearch/rest/src/main/java/org/apache/metamodel/elasticsearch/rest/ElasticSearchRestDataSet.java
 ##########
 @@ -50,12 +50,15 @@ public ElasticSearchRestDataSet(final RestHighLevelClient 
client, final SearchRe
     @Override
     public void closeNow() {
         final String scrollId = _searchResponse.getScrollId();
-        final ClearScrollRequest clearScrollRequest = new ClearScrollRequest();
-        clearScrollRequest.addScrollId(scrollId);
-        try {
-            client.clearScroll((ClearScrollRequest) clearScrollRequest, 
RequestOptions.DEFAULT);
-        } catch (IOException e) {
-            logger.warn("Could not clear scroll.", e);
+        
+        if (scrollId != null) {
+            final ClearScrollRequest clearScrollRequest = new 
ClearScrollRequest();
+            clearScrollRequest.addScrollId(scrollId);
+            try {
+                client.clearScroll((ClearScrollRequest) clearScrollRequest, 
RequestOptions.DEFAULT);
+            } catch (final IOException e) {
+                logger.warn("Could not clear scroll.", e);
+            }
 
 Review comment:
   It lives on the server and yes, it is a relatively expensive operation. 
Typically the scroll id is used when iterating over larger sets of data, so you 
do a query and get a thousand results, out of which only 400 are returned by 
the first HTTP request, then you use the scroll id to get the second 400, and 
you clear the scroll id, once you're done with your `DataSet`. This is a 
typical use case for which you use the scrolling mechanism and you should not 
have to clear it often.
   
   What we do in MetaModel however is that when updating data using the 
`DeleteAndInsertBuilder` it first gets all rows to update (resulting in a 
scroll id, which is also retained on the Elasticsearch cluster), then it 
deletes all these rows, then it updates them locally (that is the Row objects) 
and then it inserts them into the actual table.
   
   If you update data by quickly in succession calling MetaModel's update 
mechanism using the `DeleteAndInsertBuilder` it will quickly create a lot of 
"scrolls" of which MetaModel only allows 500. Within the 
`DeleteAndInsertBuilder` we however don't know on beforehand if the update 
query which is fired will only update one (or just a few) records (in case we 
could opt not to use the scrolling mechanism at all, because of the overhead it 
introduces) or if it will update thousands of rows.
   
   So if we want to optimize that, then we should probably come up with 
something which I think is out of scope for this PR.
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [metamodel] arjansh commented on a change in pull request #229: Upgrade to Elasticsearch 7.3.1

Reply via email to