Hi all, First, thank you all for your answers and especially, Etienne for your time, advises and kindness :) @Jean-Baptiste, any help on this module is welcome of course.
@Chamikara Jayalath, my aswers are inline. Have a good day ! Ludovic Le mer. 5 févr. 2020 à 20:15, Chamikara Jayalath <chamik...@google.com> a écrit : > > > On Wed, Feb 5, 2020 at 6:35 AM Etienne Chauchot <echauc...@apache.org> > wrote: > >> Still there is something I don't agree with is that IOs can be tested on >> mock. We don't really test IO behavior with mocks: there is always special >> behaviors that cannot be reproduced in mocks (split, load, with corner >> cases etc...). There was in the past IOs that were tested using mocks and >> that happened to be nonfunctional. >> >> Regarding ITests we have very few comparing to UTests and they are not as >> closely observed as UTests. >> >> Etienne >> On 05/02/2020 11:32, Jean-Baptiste Onofre wrote: >> >> Hi, >> >> We talked in the past about multiple/single module. >> >> IMHO the always preferred goal is to have a single module. However, it’s >> tricky when we have such difference, including on the user facing API. So, >> I would go with module per version, or use a specified version for a target >> Beam release. >> >> For the test, we should distinguish utest from itest. Utest can be done >> with mock, the purpose is really to test the IO behavior. Then we can have >> itest using concrete ES instance. >> >> Anyway, I’m OK with the proposal and I would like to work on this IO (I >> have other improvements coming on other IOs anyway) with you guys (and >> Ludovic especially). >> >> Regards >> JB >> >> Le 5 févr. 2020 à 10:44, Etienne Chauchot <echauc...@apache.org> a écrit >> : >> >> Hi all, >> >> We had a long discussion with Ludovic about this IO. I'd like to share >> with you to keep you informed and also gather your opinions >> >> 1. regarding version support: ES v2 is no more maintained by Elastic >> since 2018/02 so we plan to remove it from the IO. In the past we already >> retired versions (like spark 1.6 for instance). >> >> > My only concern here is that there might be users who use the existing > module who might not be able to easily upgrade the Beam version if we > remove it. But given that V2 is 5 versions behind the latest release this > might be OK. > It seems we have a consensus on this. I think there should be another general discussion on the long term support of our prefered tool IO modules. > > >> 2. regarding the user: the aim is to unlock some new features (listed by >> Ludovic) and give the user more flexibility on his request. For that, it >> requires to use high level java ES client in place of the low level REST >> client (that was used because it is the only one compatible with all ES >> versions). We plan to replace the API (json document in and out) by more >> complete standard ES objects that contain de request logic (insert/update, >> doc routing etc...) and the data. There are already IOs like SpannerIO that >> use similar objects in input PCollection rather than pure POJOs. >> >> > Won't this be a breaking change for all users ? IMO using POJOs in > PCollections is safer since we have to worry about changes to the > underlying client library API. Exception would be when underlying client > library offers a backwards compatibility guarantee that we can rely on for > the foreseeable future (for example, BQ TableRow). > Agreed but actually, there will be POJOs in order to abstract Elasticsearch's version support. The following third point explains this. > > >> 3. regarding multiple/single module: the aim is to have only one >> production code to ease the maintenance. The problem is that using high >> level client makes the code dependent to an ES lib version. We would like >> to make it invisible to the user. He should select only one jar and the IO >> should decide the lib to use behind the scene. We are thinking about using >> one module and sub-modules per version and use relocation, wrappers and a >> factory that detects the version the IO actually points to to instantiate >> the correct client version. It would also require to have DTOs in the IO >> because the high level ES java objects are not exactly the same among the >> ES versions. >> >> +1 for adding a level of indirection to make this easy for users. > >> 4. regarding tests: the aim is always to target real ES backends to have >> relevant tests (for reasons I already explained in another thread). The >> problem is that es-test-framework used today is version dependent and is a >> pain to use. We plan on using test containers per version (validated by ES >> dev advocate) and launching them as part of the UTests. Obviously we will >> launch only one container at the time per version and do all the test with >> it to avoid paying the cost of launch too much. And the tests will be >> shipped in per-version sub-modules and not in test dedicated modules like >> it is now. >> >> > Using a real ES backend for unit tests can be expensive ? Ideally we > should use a Fake (if one available) or mocking (test test out > functionality) and use real backend for IT tests that can be expensive. If > this is a local container that can be shared between unit tests with > reasonable efficiency that is OK. I'm mainly worried about introducing > flakes into unit tests due to network errors or slowness. > On this point I understand it and I agree as well, but I think Etienne is right, IO modules should be exceptions to this rule. It is really difficult to really test an IO without a real backend and Docker helps a lot here. But I take the point, these tests must be as small as posssible (one per Elasticsearch version). > > Thanks, > Cham > > >> WDYT ? >> >> Best ! >> >> Etienne >> On 30/01/2020 17:55, Alexey Romanenko wrote: >> >> I’m second for this question. We have a similar (maybe a bit less >> painful) issue for KafkaIO and it would be useful to have a general >> strategy for such cases about how to deal with that. >> >> On 24 Jan 2020, at 21:54, Kenneth Knowles <k...@apache.org> wrote: >> >> Would it make sense to have different version-specialized connectors with >> a common core library and common API package? >> >> On Fri, Jan 24, 2020 at 11:52 AM Chamikara Jayalath <chamik...@google.com> >> wrote: >> >>> Thanks for the contribution. I agree with Alexey that we should try to >>> add any new features brought in with the new PR into existing connector >>> instead of trying to maintain two implementations. >>> >>> Thanks, >>> Cham >>> >>> On Fri, Jan 24, 2020 at 9:01 AM Alexey Romanenko < >>> aromanenko....@gmail.com> wrote: >>> >>>> Hi Ludovic, >>>> >>>> Thank you for working on this and sharing the details with us. This is >>>> really great job! >>>> >>>> As I recall, we already have some support of Elasticsearch7 in current >>>> ElasticsearchIO (afaik, at least they are compatible), thanks to Zhong Chen >>>> and Etienne Chauchot, who were working on adding this [1][2] and it should >>>> be released in Beam 2.19. >>>> >>>> Would you think you can leverage this in your work on adding new >>>> Elasticsearch7 features? IMHO, supporting two different related IOs can be >>>> quite tough task and I‘d rather raise my hand to add a new functionality >>>> into existing IO than creating a new one, if it’s possible. >>>> >>>> [1] https://issues.apache.org/jira/browse/BEAM-5192 >>>> [2] https://github.com/apache/beam/pull/10433 >>>> >>>> On 22 Jan 2020, at 19:23, Ludovic Boutros <boutr...@gmail.com> wrote: >>>> >>>> Dear all, >>>> >>>> I have written a completely reworked Elasticsearch 7+ IO module. >>>> It can be found here: >>>> https://github.com/ludovic-boutros/beam/tree/fresh-reworked-elasticsearch-io-v7/sdks/java/io/elasticsearch7 >>>> >>>> This is a quite advance WIP work but I'm a quite new user of Apache >>>> Beam and I would like to get some help on this :) >>>> >>>> I can create a JIRA issue now but I prefer to wait for your wise avises >>>> first. >>>> >>>> *Why a new module ?* >>>> >>>> The current module was compliant with Elasticsearch 2.x, 5.x and 6.x. >>>> This seems to be a good point but so many things have been changed since >>>> Elasticsearch 2.x. >>>> >>>> >>> Probably this is not correct anymore due to >>> https://github.com/apache/beam/pull/10433 ? >>> >>> >>>> Elasticsearch 7.x is now partially supported (document type are >>>> removed, occ, updates...). >>>> >>>> A fresh new module, only compliant with the last version of >>>> Elasticsearch, can easily benefit a lot from the last evolutions of >>>> Elasticsearch (Java High Level Http Client). >>>> >>>> It is therefore far simpler than the current one. >>>> >>>> *Error management* >>>> >>>> Currently, errors are caught and transformed into simple exceptions. >>>> This is not always what is needed. If we would like to do specific >>>> processing on these errors (send documents in error topics for instance), >>>> it is not possible with the current module. >>>> >>>> >>> Seems like this is some sort of a dead letter queue implementation.. >>> This will be a very good feature to add to the existing connector. >>> >>> >>>> >>>> *Philosophy* >>>> >>>> This module directly uses the Elasticsearch Java client classes as >>>> inputs and outputs. >>>> >>>> This way you can configure any options you need directly in the >>>> `DocWriteRequest` objects. >>>> >>>> For instance: >>>> - If you need to use external versioning ( >>>> https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html#index-versioning), >>>> you can. >>>> - If you need to use an ingest pipelines, you can. >>>> - If you need to configure an update document/script, you can. >>>> - If you need to use upserts, you can. >>>> >>>> Actually, you should be able to do everything you can do directly with >>>> Elasticsearch. >>>> >>>> Furthermore, it should be easier to keep updating the module with >>>> future Elasticsearch evolutions. >>>> >>>> *Write outputs* >>>> >>>> Two outputs are available: >>>> - Successful indexing output ; >>>> - Failed indexing output. >>>> >>>> They are available in a `WriteResult` object. >>>> >>>> These two outputs are represented by >>>> `PCollection<BulkItemResponseContainer>` objects. >>>> >>>> A `BulkItemResponseContainer` contains: >>>> - the original index request ; >>>> - the Elasticsearch response ; >>>> - a batch id. >>>> >>>> You can apply any process afterwards (reprocessing, alerting, ...). >>>> >>>> *Read input* >>>> >>>> You can read documents from Elasticsearch with this module. >>>> You can specify a `QueryBuilder` in order to filter the retrieved >>>> documents. >>>> By default, it retrieves the whole document collection. >>>> >>>> If the Elasticsearch index is sharded, multiple slices can be used >>>> during fetch. That many bundles are created. The maximum bundle count is >>>> equal to the index shard count. >>>> >>>> Thank you ! >>>> >>>> Ludovic >>>> >>>> >>>> >> >>