[jira] [Commented] (BEAM-3201) ElasticsearchIO should deal with documents id

Nicholas Verbeck (JIRA) Tue, 21 Nov 2017 07:25:38 -0800

    [ 
https://issues.apache.org/jira/browse/BEAM-3201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16260893#comment-16260893
 ]


Nicholas Verbeck commented on BEAM-3201:
----------------------------------------

[~echauchot] The reason I ask about _index and _type is to support dynamic 
indexes and types. Implementing the document id that this story aims for 
requires changing the metadata anyway, so why not change it all together and 
get 3 features out of one? withIndex and withType wouldn't change, since those 
are treated as fallback values when the data is not provided in the metadata. 
The same would apply to the document id, but a fallback id would defeat the 
purpose of the bulk API in this case. We'd just add withDocumentIdField as 
described here, plus two more options, withTypeField and withIndexField, and 
then modify the metadata only as needed based on the available data and 
configuration. If a field lookup fails to find something, it falls back to the 
default provided.

Example metadata for all 3: 
{code:json}
{ "index" : { "_index" : "test", "_type" : "type1", "_id" : "1" } }
{code}
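
To make the fallback behaviour concrete, here is a minimal sketch of how the action line could be assembled. It is not the ElasticsearchIO implementation: the class and method names (BulkMetadataSketch, orDefault, actionLine) are hypothetical, and it assumes the per-document values have already been extracted and that defaults for index and type are always configured.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BulkMetadataSketch {

    /** Returns the extracted value if present and non-empty, otherwise the configured default. */
    static String orDefault(String extracted, String fallback) {
        return (extracted == null || extracted.isEmpty()) ? fallback : extracted;
    }

    /** Escapes backslashes and double quotes so the value is safe inside a JSON string. */
    static String escape(String value) {
        return value.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    /**
     * Builds the bulk "index" action line. index, type and id are the values
     * looked up in the document (null when the lookup found nothing);
     * defaultIndex and defaultType stand in for withIndex/withType and are
     * assumed to be non-null.
     */
    static String actionLine(String index, String type, String id,
                             String defaultIndex, String defaultType) {
        Map<String, String> meta = new LinkedHashMap<>();
        meta.put("_index", orDefault(index, defaultIndex));
        meta.put("_type", orDefault(type, defaultType));
        // No fallback for the id: when it is missing, Elasticsearch generates one.
        if (id != null && !id.isEmpty()) {
            meta.put("_id", id);
        }
        StringBuilder sb = new StringBuilder("{ \"index\" : { ");
        boolean first = true;
        for (Map.Entry<String, String> e : meta.entrySet()) {
            if (!first) {
                sb.append(", ");
            }
            sb.append('"').append(e.getKey()).append("\" : \"")
              .append(escape(e.getValue())).append('"');
            first = false;
        }
        return sb.append(" } }").toString();
    }
}
```

Calling actionLine("test", "type1", "1", "fallback", "doc") yields the example line above. With all three lookups returning null, only _index and _type are emitted, using the defaults.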

Here are the docs describing what I'm referring to with defaults and provided 
metadata via the bulk api 
https://www.elastic.co/guide/en/elasticsearch/reference/5.5/docs-bulk.html

I had created BEAM-3222 to implement dynamic types and indexes, as I really 
need them right now to support auto scaling of ES indexes via index templates. 
But in researching the work needed for that, I found that both of these stories 
are essentially the same: just different fields within the same metadata. 

As for user functions for providing the fields, I think that work should just 
be handled further up the pipeline while the document is still in object form. 
Deserializing, adding fields, and serializing again is a lot of extra work that 
would slow a pipeline's throughput down. This is why I linked the JsonPath 
project for fetching the field: it doesn't require a full deserialization of 
the document but walks its tokens, for performance. 
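
To illustrate the idea of fetching one field without deserializing the whole document, here is a rough sketch. It is a hand-rolled stand-in, not JsonPath and not its API: TopLevelFieldScanner and its methods are hypothetical, it only finds string values of top-level keys, and it returns the raw text between the quotes without decoding escape sequences.

```java
public class TopLevelFieldScanner {

    /** Returns the index of the quote closing the string that opens at 'open', or -1. */
    static int closingQuote(String json, int open) {
        for (int i = open + 1; i < json.length(); i++) {
            char c = json.charAt(i);
            if (c == '\\') {
                i++; // skip the escaped character
            } else if (c == '"') {
                return i;
            }
        }
        return -1;
    }

    /** Advances past whitespace and returns the new position. */
    static int skipWhitespace(String json, int i) {
        while (i < json.length() && Character.isWhitespace(json.charAt(i))) {
            i++;
        }
        return i;
    }

    /**
     * Walks the document once and returns the string value of the given
     * top-level key, or null if the key is absent or its value is not a string.
     */
    static String topLevelString(String json, String field) {
        int depth = 0;
        int i = 0;
        int n = json.length();
        while (i < n) {
            char c = json.charAt(i);
            if (c == '{' || c == '[') {
                depth++;
                i++;
            } else if (c == '}' || c == ']') {
                depth--;
                i++;
            } else if (c == '"') {
                int end = closingQuote(json, i);
                if (end < 0) {
                    return null; // unterminated string
                }
                String token = json.substring(i + 1, end);
                i = end + 1;
                int j = skipWhitespace(json, i);
                // A string at depth 1 followed by ':' is a top-level key.
                if (depth == 1 && j < n && json.charAt(j) == ':') {
                    j = skipWhitespace(json, j + 1);
                    if (token.equals(field)) {
                        if (j < n && json.charAt(j) == '"') {
                            int valueEnd = closingQuote(json, j);
                            return valueEnd < 0 ? null : json.substring(j + 1, valueEnd);
                        }
                        return null; // value is not a string
                    }
                    i = j; // keep scanning from the start of the value
                }
            } else {
                i++;
            }
        }
        return null;
    }
}
```

For example, topLevelString("{\"user\":{\"id\":\"nested\"},\"id\":\"42\"}", "id") skips the nested id and returns "42". A real implementation would more likely rely on JsonPath or a streaming JSON parser than on a scanner like this.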


> ElasticsearchIO should deal with documents id
> ---------------------------------------------
>
>                 Key: BEAM-3201
>                 URL: https://issues.apache.org/jira/browse/BEAM-3201
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-java-extensions
>            Reporter: Etienne Chauchot
>            Assignee: Chet Aldrich
>
> Today the ESIO only inserts the payload of the ES documents. Elasticsearch 
> generates a document id for each record inserted. So each new insertion is 
> considered as a new document. Users want to be able to update documents using 
> the IO. So, for the write part of the IO, users should be able to provide a 
> document id so that they could update already stored documents. Providing an 
> id for the documents could also help the user with idempotency.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
