[jira] [Commented] (BEAM-3201) ElasticsearchIO should allow the user to optionally pass id, type and index per document

2018-03-26 Thread Chet Aldrich (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16414759#comment-16414759
 ] 

Chet Aldrich commented on BEAM-3201:


Oh ok, great, good to hear!



> ElasticsearchIO should allow the user to optionally pass id, type and index
> per document
> ---------------------------------------------------------------------------
>
> Key: BEAM-3201
> URL: https://issues.apache.org/jira/browse/BEAM-3201
> Project: Beam
> Issue Type: Improvement
> Components: io-java-elasticsearch
> Reporter: Etienne Chauchot
> Assignee: Chet Aldrich
> Priority: Major
>
> *Dynamic document id*: Today the ESIO only inserts the payload of the ES
> documents. Elasticsearch generates a document id for each record inserted, so
> each new insertion is treated as a new document. Users want to be able to
> update documents using the IO. So, for the write part of the IO, users should
> be able to provide a document id so that they can update already stored
> documents. Providing an id for the documents could also help users achieve
> idempotency.
> *Dynamic ES type and ES index*: In some cases (streaming pipelines with high
> throughput), partitioning the PCollection in order to feed different ESIO
> instances (pointing to different index/type) is not very practical; users
> would like to be able to set the ES index/type per document.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-3201) ElasticsearchIO should allow the user to optionally pass id, type and index per document

2018-03-21 Thread Chet Aldrich (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16409079#comment-16409079
 ] 

Chet Aldrich commented on BEAM-3201:


Hey all, sorry I kinda vanished, just been really busy. I'll get back on this. 
I'll open a PR as is to start and we can go from there. 



[jira] [Commented] (BEAM-3201) ElasticsearchIO should allow the user to optionally pass id, type and index per document

2018-01-16 Thread Chet Aldrich (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16327588#comment-16327588
 ] 

Chet Aldrich commented on BEAM-3201:


Hey [~jeroens], I'm in progress on a PR for this, just need to actually get the 
rest of the code out the door. I'm gonna spend some time this week to knock 
this out, and then we can start code review.

> ElasticsearchIO should allow the user to optionally pass id, type and index
> per document
> ---------------------------------------------------------------------------
>
> Key: BEAM-3201
> URL: https://issues.apache.org/jira/browse/BEAM-3201
> Project: Beam
> Issue Type: Improvement
> Components: sdk-java-extensions
> Reporter: Etienne Chauchot
> Assignee: Chet Aldrich
> Priority: Major
>
> *Dynamic document id*: Today the ESIO only inserts the payload of the ES
> documents. Elasticsearch generates a document id for each record inserted, so
> each new insertion is treated as a new document. Users want to be able to
> update documents using the IO. So, for the write part of the IO, users should
> be able to provide a document id so that they can update already stored
> documents. Providing an id for the documents could also help users achieve
> idempotency.
> *Dynamic ES type and ES index*: In some cases (streaming pipelines with high
> throughput), partitioning the PCollection in order to feed different ESIO
> instances (pointing to different index/type) is not very practical; users
> would like to be able to set the ES index/type per document.





[jira] [Commented] (BEAM-3201) ElasticsearchIO should deal with documents id

2017-11-28 Thread Chet Aldrich (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16269318#comment-16269318
 ] 

Chet Aldrich commented on BEAM-3201:


[~echauchot] [~nerdynick] sounds like we have reached a rough agreement on the 
design, at least enough for me to start coding something up and show you guys 
the PR. To summarize: 

We will keep the API as PCollection<String>.

Three optional methods will be added, one for each of the following metadata
fields: _id, _index, _type. Each will take a function that accepts a JSON
object and returns a String, which will be placed in the corresponding
metadata field.

If any of these methods has been called, parse the String into JSON once and
share the parsed object across all of the functions, reusing the
deserialization for speed.

Run the configured functions for each element in the PCollection.
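
As a rough illustration of the summary above, here is a minimal sketch of how
the three optional functions could feed the metadata of an Elasticsearch bulk
action line. It is not the actual ElasticsearchIO code: the class name
BulkMetadataSketch, the method actionLine, and the pluggable parser (a stand-in
for whatever JSON library ends up being used) are all assumptions made for
illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;
import java.util.StringJoiner;
import java.util.function.Function;

/** Hypothetical sketch: none of these names belong to the real ElasticsearchIO API. */
final class BulkMetadataSketch {
  // Stand-in for a real JSON parser (assumption); returns the parsed document.
  private final Function<String, Map<String, Object>> parser;
  // Optional user-supplied extractors; null means "not configured".
  private final Function<Map<String, Object>, String> idFn;
  private final Function<Map<String, Object>, String> indexFn;
  private final Function<Map<String, Object>, String> typeFn;

  BulkMetadataSketch(
      Function<String, Map<String, Object>> parser,
      Function<Map<String, Object>, String> idFn,
      Function<Map<String, Object>, String> indexFn,
      Function<Map<String, Object>, String> typeFn) {
    this.parser = Objects.requireNonNull(parser);
    this.idFn = idFn;
    this.indexFn = indexFn;
    this.typeFn = typeFn;
  }

  /** Builds the bulk action line that would precede one document payload. */
  String actionLine(String document) {
    if (idFn == null && indexFn == null && typeFn == null) {
      // Nothing configured: skip parsing and emit an action with no metadata.
      return "{ \"index\" : {} }";
    }
    // Parse once; every configured extractor reuses the same parsed document.
    Map<String, Object> parsed = parser.apply(document);
    Map<String, String> meta = new LinkedHashMap<>();
    if (indexFn != null) meta.put("_index", indexFn.apply(parsed));
    if (typeFn != null) meta.put("_type", typeFn.apply(parsed));
    if (idFn != null) meta.put("_id", idFn.apply(parsed));

    StringJoiner fields = new StringJoiner(", ", "{ \"index\" : { ", " } }");
    for (Map.Entry<String, String> e : meta.entrySet()) {
      fields.add(quote(e.getKey()) + " : " + quote(e.getValue()));
    }
    return fields.toString();
  }

  // Minimal JSON string quoting: escapes backslashes and double quotes only.
  private static String quote(String s) {
    Objects.requireNonNull(s, "extractor returned null");
    return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
  }
}
```

Because the payload itself is never modified here, the metadata stays out of
the document body, and a pipeline that configures none of the functions pays
no parsing cost.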

I'm going to start coding this up based on what I said above, PR will come 
soon. Let me know if I'm missing something important, and I'll edit the PR 
accordingly.


> ElasticsearchIO should deal with documents id
> ---------------------------------------------
>
> Key: BEAM-3201
> URL: https://issues.apache.org/jira/browse/BEAM-3201
> Project: Beam
> Issue Type: Improvement
> Components: sdk-java-extensions
> Reporter: Etienne Chauchot
> Assignee: Chet Aldrich
>
> Today the ESIO only inserts the payload of the ES documents. Elasticsearch
> generates a document id for each record inserted, so each new insertion is
> treated as a new document. Users want to be able to update documents using
> the IO. So, for the write part of the IO, users should be able to provide a
> document id so that they can update already stored documents. Providing an
> id for the documents could also help users achieve idempotency.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-3201) ElasticsearchIO should deal with documents id

2017-11-22 Thread Chet Aldrich (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16263472#comment-16263472
 ] 

Chet Aldrich commented on BEAM-3201:


[~echauchot] First of all, thanks for getting that all sorted for me.

[~nerdynick] 

{quote}That said, if you want to have dynamic index/type (meaning do not use
ConnectionConfiguration.withIndex and ConnectionConfiguration.withType) and
also a dynamic id depending on the document itself, we should add 3 optional
user-defined functions so that the user can provide them. I guess it makes the
withDocumentIdField(String fieldName) redundant. So we should not implement
it.{quote}

According to what Etienne said here, it seems like if we want to go this route 
we may want to rethink the design for this, especially since I agree with him 
about not polluting the document payload. 

However, I'm not necessarily sold on why this is necessary in the first place. 
Could you ([~nerdynick]) elaborate more on why your use case requires 
dynamically changing the index and type that you're writing on a per-element 
basis? Why not just split up the elements and write to a separate index via a 
separate sink with a different `ConnectionConfiguration`? IMHO one write 
operation should write to only one index, since, for example, it'd be odd to be 
writing entries to two different DB tables depending on a given element instead 
of just splitting them up into separate PCollections and _then_ writing them 
out to the different tables with separate sinks. 

This opinion is only based on my current understanding of what you're trying to 
accomplish though. Feel free to enlighten me if an assumption I made about your 
use case is incorrect. 

Would appreciate input from both of you on whether this use case is needed, and 
if it is, whether we should rethink how we're approaching this so we don't 
pollute the document payload with metadata. 












[jira] [Commented] (BEAM-3201) ElasticsearchIO should deal with documents id

2017-11-20 Thread Chet Aldrich (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16260187#comment-16260187
 ] 

Chet Aldrich commented on BEAM-3201:


[~nerdynick] Thanks for the lib reference, I'll look into it.

I'll also take a look at dealing with those other cases, I think that should be 
pretty similar in theory, but I'll confirm. 

> ElasticsearchIO should deal with documents id
> ---------------------------------------------
>
> Key: BEAM-3201
> URL: https://issues.apache.org/jira/browse/BEAM-3201
> Project: Beam
> Issue Type: Improvement
> Components: sdk-java-extensions
> Reporter: Etienne Chauchot
> Assignee: Etienne Chauchot
>
> Today the ESIO only inserts the payload of the ES documents. Elasticsearch
> generates a document id for each record inserted, so each new insertion is
> treated as a new document. Users want to be able to update documents using
> the IO. So, for the write part of the IO, users should be able to provide a
> document id so that they can update already stored documents. Providing an
> id for the documents could also help users achieve idempotency.





[jira] [Commented] (BEAM-3201) ElasticsearchIO should deal with documents id

2017-11-20 Thread Chet Aldrich (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16260188#comment-16260188
 ] 

Chet Aldrich commented on BEAM-3201:


Also [~echauchot] you can feel free to assign this to me when you get the 
chance. 



[jira] [Commented] (BEAM-3201) ElasticsearchIO should deal with documents id

2017-11-16 Thread Chet Aldrich (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16256316#comment-16256316
 ] 

Chet Aldrich commented on BEAM-3201:


Hey, so I'd be happy to take this ticket on, and the design seems reasonable.

I have one question about the design above: 

The API for writing currently takes a PCollection<String>, as defined
[here|https://github.com/apache/beam/blob/master/sdks/java/io/elasticsearch/src/main/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.java#L731],
and not a PCollection of parsed JSON objects. I suppose we can convert the
String that is passed in to a JSONObject or some similar construct and then try
to find the field specified in `withDocumentIdField`. I'm assuming that we
_don't_ want to change the element type of the input PCollection, right? We
would instead just throw an exception if a String that is passed in is not
valid JSON.
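
To make the question concrete, here is a minimal sketch of the lookup being
described: parse the incoming String, read the configured id field, and fail
fast when the input is not valid JSON. The names DocumentIdSketch and extractId
are hypothetical, the parser argument is a stand-in for a real JSON library,
and rejecting a document that lacks the id field is an extra assumption that
the design above does not settle.

```java
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch of the withDocumentIdField idea; not ElasticsearchIO code. */
final class DocumentIdSketch {
  /**
   * Returns the value of idField from the document as a String.
   *
   * @param parser stand-in for a JSON parser (assumption); expected to throw a
   *     RuntimeException when the input is not valid JSON
   */
  static String extractId(
      String document, String idField, Function<String, Map<String, Object>> parser) {
    final Map<String, Object> parsed;
    try {
      parsed = parser.apply(document);
    } catch (RuntimeException e) {
      // The input type stays String, so invalid JSON is reported per element.
      throw new IllegalArgumentException("Document is not valid JSON: " + document, e);
    }
    if (parsed == null) {
      throw new IllegalArgumentException("Document is not a JSON object: " + document);
    }
    Object id = parsed.get(idField);
    if (id == null) {
      // Assumption: a missing id field is treated as an error rather than
      // falling back to an id generated by Elasticsearch.
      throw new IllegalArgumentException("Document has no field '" + idField + "'");
    }
    return id.toString();
  }
}
```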





