[
https://issues.apache.org/jira/browse/BEAM-3201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16260893#comment-16260893
]
Nicholas Verbeck commented on BEAM-3201:
----------------------------------------
[~echauchot] The reason I ask about _index and _type is to support dynamic
indexes and types. Implementing the document id that this story aims for
requires changing the metadata anyway, so why not change it all together and
get 3 features out of one change? withIndex and withType wouldn't change, as
those are treated as fallback values when the data is not provided in the
metadata. The same goes for document id, but that would defeat the purpose of
the bulk API in this case. We'd just add withDocumentIdField as described here,
as well as two more fields, withTypeField and withIndexField, and then only
modify the metadata field as needed based on the available data and
configuration. If the field lookups fail to find something, they fall back to
the defaults provided.
Example metadata for all 3:
{code:json}
{ "index" : { "_index" : "test", "_type" : "type1", "_id" : "1" } }
{code}
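A minimal sketch of the fallback behaviour described above, assuming the field
values have already been looked up in the document (null when the lookup finds
nothing). The class and method names are hypothetical and are not part of
ElasticsearchIO; the escaping is deliberately simplistic.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of ElasticsearchIO: builds the action line
// that precedes each document in a bulk request.
class BulkMetadataSketch {

    // Use the value found in the document, else the configured default
    // (which may itself be null if nothing was configured).
    static String orDefault(String extracted, String fallback) {
        return (extracted == null || extracted.isEmpty()) ? fallback : extracted;
    }

    // Minimal escaping for backslashes and double quotes only; control
    // characters are not handled in this sketch.
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Only emit a metadata field when there is a value for it.
    static void putIfPresent(Map<String, String> meta, String key, String value) {
        if (value != null) {
            meta.put(key, value);
        }
    }

    static String actionLine(String extractedIndex, String extractedType,
                             String extractedId,
                             String defaultIndex, String defaultType) {
        Map<String, String> meta = new LinkedHashMap<>();
        putIfPresent(meta, "_index", orDefault(extractedIndex, defaultIndex));
        putIfPresent(meta, "_type", orDefault(extractedType, defaultType));
        // No default for the id in this sketch: when it is omitted,
        // Elasticsearch generates one for the "index" action.
        putIfPresent(meta, "_id", orDefault(extractedId, null));

        StringBuilder sb = new StringBuilder("{ \"index\" : {");
        String sep = " ";
        for (Map.Entry<String, String> e : meta.entrySet()) {
            sb.append(sep).append('"').append(e.getKey()).append("\" : \"")
              .append(escape(e.getValue())).append('"');
            sep = ", ";
        }
        if (!meta.isEmpty()) {
            sb.append(' ');
        }
        return sb.append("} }").toString();
    }

    public static void main(String[] args) {
        // All three values found in the document.
        System.out.println(actionLine("test", "type1", "1", "fallback", "doc"));
        // Nothing found: index and type fall back, id is left out.
        System.out.println(actionLine(null, null, null, "fallback", "doc"));
    }
}
```

When nothing is found and no defaults are configured, the metadata object is
simply left empty, and the index and type given in the request URL apply, as
the bulk API docs describe.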
Here are the docs describing what I'm referring to, with defaults and provided
metadata via the bulk API:
https://www.elastic.co/guide/en/elasticsearch/reference/5.5/docs-bulk.html
I had created BEAM-3222 to implement dynamic types and indexes, as I really
need them right now to support auto scaling of ES indexes via index templates.
But in researching the work needed for that, I found out that both of these
stories are really the same work: just different fields within the same
metadata.
As for user functions for providing the fields, I think that work should just
be handled further up the pipeline while the document is still in object form.
Deserializing, adding fields, and re-serializing is a lot of extra work that
would slow a pipeline's throughput down. This is why I linked the JsonPath
project for fetching the field: it doesn't require a full deserialization of
the document but token-walks it for performance.
> ElasticsearchIO should deal with documents id
> ---------------------------------------------
>
> Key: BEAM-3201
> URL: https://issues.apache.org/jira/browse/BEAM-3201
> Project: Beam
> Issue Type: Improvement
> Components: sdk-java-extensions
> Reporter: Etienne Chauchot
> Assignee: Chet Aldrich
>
> Today the ESIO only inserts the payload of the ES documents. Elasticsearch
> generates a document id for each record inserted. So each new insertion is
> considered as a new document. Users want to be able to update documents using
> the IO. So, for the write part of the IO, users should be able to provide a
> document id so that they could update already stored documents. Providing an
> id for the documents could also help the user on idempotency.