[ 
https://issues.apache.org/jira/browse/BEAM-12093?focusedWorklogId=587110&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-587110
 ]

ASF GitHub Bot logged work on BEAM-12093:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 22/Apr/21 07:59
            Start Date: 22/Apr/21 07:59
    Worklog Time Spent: 10m 
      Work Description: echauchot commented on pull request #14347:
URL: https://github.com/apache/beam/pull/14347#issuecomment-824628922


   > @echauchot Thanks for the review, I'll work my way through your comments 
and suggestions.
   > 
   > > Besides, Evan, as you know ES very well and seem interested in 
contributing, would you be interested in adding yourself to the ES Owners file 
and the Jira ES label?
   > 
   > I'd be very happy to. I've added myself to the ES owners file now, happy 
to lend a hand reviewing! Thanks.
   > 
   > With respect to Jira, could you please either grant me permission to 
assign myself to the ES label, or assign me to the label yourself if that is 
the preferred workflow? I have an account on issues.apache.org/jira, but I 
believe it only has permission to create tickets.
   
   I just added you to the contributor role and as lead on the elasticsearch 
component. @kennknowles, please let me know if there is any problem with 
assigning a non-committer as lead of a component. If so, I'll put myself back. 
If not, @egalpin, feel free to ping me if you have questions about the IO in 
the future.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 587110)
    Time Spent: 5h 20m  (was: 5h 10m)

> Overhaul ElasticsearchIO#Write
> ------------------------------
>
>                 Key: BEAM-12093
>                 URL: https://issues.apache.org/jira/browse/BEAM-12093
>             Project: Beam
>          Issue Type: Improvement
>          Components: io-java-elasticsearch
>            Reporter: Evan Galpin
>            Priority: P2
>              Labels: elasticsearch
>          Time Spent: 5h 20m
>  Remaining Estimate: 0h
>
> The current ElasticsearchIO#Write is great, but there are two related areas 
> which could be improved:
>  # Separation of concerns
>  # Bulk API batch size optimization
>  
> Presently, the Write transform has 2 responsibilities which are coupled and 
> inseparable by users:
>  # Convert input documents into Bulk API entities, serializing based on user 
> settings (partial update, delete, upsert, etc)
>  # Batch the converted Bulk API entities together and interface with the 
> target ES cluster
>  
> Having these 2 roles tightly coupled means testing requires an available 
> Elasticsearch cluster, making unit testing almost impossible. Allowing access 
> to the serialized documents would make unit testing much easier for pipeline 
> developers, among numerous other benefits to having separation between 
> serialization and IO.
> Relatedly, the batching of entities when creating Bulk API payloads is 
> currently limited by the lesser of the Beam Runner's bundling semantics and 
> the `ElasticsearchIO#Write#maxBatchSize` setting. This is understandable for 
> portability between runners, but it also means most Bulk payloads contain 
> only a few (1-5) entities. By using Stateful Processing to better adhere to 
> the `ElasticsearchIO#Write#maxBatchSize` setting, we have been able to reduce 
> the number of indexing requests in an Elasticsearch cluster by a factor of 
> 50-100. Separating the roles of document serialization and IO allows 
> supporting multiple IO techniques with minimal and understandable code.
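
The two responsibilities described above can be pictured with a small,
language-neutral sketch. It is written in plain Python and is not the Beam or
ElasticsearchIO API: the function names `to_bulk_entity` and `batch_entities`
are hypothetical, and an in-memory generator stands in for Beam's Stateful
Processing. It only shows that serialization can be tested without a cluster,
and that batching can be applied as an independent step afterwards.

```python
import json
from typing import Dict, Iterable, Iterator, List


def to_bulk_entity(doc: Dict, index: str, doc_id: str) -> str:
    """Serialize one document into a Bulk API "index" entity.

    A Bulk API entity is newline-delimited JSON: an action line followed by
    the document source line. No cluster is needed to produce or inspect it.
    (Hypothetical helper, not part of ElasticsearchIO.)
    """
    action = {"index": {"_index": index, "_id": doc_id}}
    return json.dumps(action) + "\n" + json.dumps(doc) + "\n"


def batch_entities(entities: Iterable[str], max_batch_size: int) -> Iterator[List[str]]:
    """Group already-serialized entities into batches of at most max_batch_size.

    Assumption: a simple in-memory buffer replaces the stateful buffering a
    Beam pipeline would use; only the batching idea is illustrated here.
    """
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    batch: List[str] = []
    for entity in entities:
        batch.append(entity)
        if len(batch) >= max_batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly smaller, batch.
        yield batch


def to_bulk_payload(batch: List[str]) -> str:
    """Concatenate one batch of entities into a single Bulk request body."""
    return "".join(batch)
```

In this sketch, each list yielded by `batch_entities` would become one Bulk
request body via `to_bulk_payload`, so the number of requests depends on
`max_batch_size` rather than on how the input happened to be chunked upstream.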



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
