Re: A new reworked Elasticsearch 7+ IO module

Ludovic Boutros Thu, 06 Feb 2020 00:48:13 -0800

Hi all,

First, thank you all for your answers and especially, Etienne for your
time, advises and kindness :)
@Jean-Baptiste, any help on this module is welcome of course.


@Chamikara Jayalath, my aswers are inline.

Have a good day !

Ludovic

Le mer. 5 févr. 2020 à 20:15, Chamikara Jayalath <chamik...@google.com> a
écrit :

>
>
> On Wed, Feb 5, 2020 at 6:35 AM Etienne Chauchot <echauc...@apache.org>
> wrote:
>
>> Still there is something I don't agree with is that IOs can be tested on
>> mock. We don't really test IO behavior with mocks: there is always special
>> behaviors that cannot be reproduced in mocks (split, load, with corner
>> cases etc...). There was in the past IOs that were tested using mocks and
>> that happened to be nonfunctional.
>>
>> Regarding ITests we have very few comparing to UTests and they are not as
>> closely observed as UTests.
>>
>> Etienne
>> On 05/02/2020 11:32, Jean-Baptiste Onofre wrote:
>>
>> Hi,
>>
>> We talked in the past about multiple/single module.
>>
>> IMHO the always preferred goal is to have a single module. However, it’s
>> tricky when we have such difference, including on the user facing API. So,
>> I would go with module per version, or use a specified version for a target
>> Beam release.
>>
>> For the test, we should distinguish utest from itest. Utest can be done
>> with mock, the purpose is really to test the IO behavior. Then we can have
>> itest using concrete ES instance.
>>
>> Anyway, I’m OK with the proposal and I would like to work on this IO (I
>> have other improvements coming on other IOs anyway) with you guys (and
>> Ludovic especially).
>>
>> Regards
>> JB
>>
>> Le 5 févr. 2020 à 10:44, Etienne Chauchot <echauc...@apache.org> a écrit
>> :
>>
>> Hi all,
>>
>> We had a long discussion with Ludovic about this IO. I'd like to share
>> with you to keep you informed and also gather your opinions
>>
>> 1. regarding version support: ES v2 is no more maintained by Elastic
>> since 2018/02 so we plan to remove it from the IO. In the past we already
>> retired versions (like spark 1.6 for instance).
>>
>>
> My only concern here is that there might be users who use the existing
> module who might not be able to easily upgrade the Beam version if we
> remove it. But given that V2 is 5 versions behind the latest release this
> might be OK.
>

It seems we have a consensus on this.
I think there should be another general discussion on the long term support
of our prefered tool IO modules.


>
>
>> 2. regarding the user: the aim is to unlock some new features (listed by
>> Ludovic) and give the user more flexibility on his request. For that, it
>> requires to use high level java ES client in place of the low level REST
>> client (that was used because it is the only one compatible with all ES
>> versions). We plan to replace the API (json document in and out) by more
>> complete standard ES objects that contain de request logic (insert/update,
>> doc routing etc...) and the data. There are already IOs like SpannerIO that
>> use similar objects in input PCollection rather than pure POJOs.
>>
>>
> Won't this be a breaking change for all users ? IMO using POJOs in
> PCollections is safer since we have to worry about changes to the
> underlying client library API. Exception would be when underlying client
> library offers a backwards compatibility guarantee that we can rely on for
> the foreseeable future (for example, BQ TableRow).
>

Agreed but actually, there will be POJOs in order to abstract
Elasticsearch's version support. The following third point explains this.


>
>
>> 3. regarding multiple/single module: the aim is to have only one
>> production code to ease the maintenance.  The problem is that using high
>> level client makes the code dependent to an ES lib version. We would like
>> to make it invisible to the user. He should select only one jar and the IO
>> should decide the lib to use behind the scene. We are thinking about using
>> one module and sub-modules per version and use relocation, wrappers and a
>> factory that detects the version the IO actually points to to instantiate
>> the correct client version. It would also require to have DTOs in the IO
>> because the high level ES java objects are not exactly the same among the
>> ES versions.
>>
>> +1 for adding a level of indirection to make this easy for users.
>
>> 4. regarding tests: the aim is always to target real ES backends to have
>> relevant tests (for reasons I already explained in another thread). The
>> problem is that es-test-framework used today is version dependent and is a
>> pain to use. We plan on using test containers per version (validated by ES
>> dev advocate) and launching them as part of the UTests. Obviously we will
>> launch only one container at the time per version and do all the test with
>> it to avoid paying the cost of launch too much. And the tests will be
>> shipped in per-version sub-modules and not in test dedicated modules like
>> it is now.
>>
>>
> Using a real ES backend for unit tests can be expensive ? Ideally we
> should use a Fake (if one available) or mocking (test test out
> functionality) and use real backend for IT tests that can be expensive. If
> this is a local container that can be shared between unit tests with
> reasonable efficiency that is OK. I'm mainly worried about introducing
> flakes into unit tests due to network errors or slowness.
>

On this point I understand it and I agree as well, but I think Etienne is
right, IO modules should be exceptions to this rule.
It is really difficult to really test an IO without a real backend and
Docker helps a lot here.
But I take the point, these tests must be as small as posssible (one per
Elasticsearch version).




>
> Thanks,
> Cham
>
>
>> WDYT ?
>>
>> Best !
>>
>> Etienne
>> On 30/01/2020 17:55, Alexey Romanenko wrote:
>>
>> I’m second for this question. We have a similar (maybe a bit less
>> painful) issue for KafkaIO and it would be useful to have a general
>> strategy for such cases about how to deal with that.
>>
>> On 24 Jan 2020, at 21:54, Kenneth Knowles <k...@apache.org> wrote:
>>
>> Would it make sense to have different version-specialized connectors with
>> a common core library and common API package?
>>
>> On Fri, Jan 24, 2020 at 11:52 AM Chamikara Jayalath <chamik...@google.com>
>> wrote:
>>
>>> Thanks for the contribution. I agree with Alexey that we should try to
>>> add any new features brought in with the new PR into existing connector
>>> instead of trying to maintain two implementations.
>>>
>>> Thanks,
>>> Cham
>>>
>>> On Fri, Jan 24, 2020 at 9:01 AM Alexey Romanenko <
>>> aromanenko....@gmail.com> wrote:
>>>
>>>> Hi Ludovic,
>>>>
>>>> Thank you for working on this and sharing the details with us. This is
>>>> really great job!
>>>>
>>>> As I recall, we already have some support of Elasticsearch7 in current
>>>> ElasticsearchIO (afaik, at least they are compatible), thanks to Zhong Chen
>>>> and Etienne Chauchot, who were working on adding this [1][2] and it should
>>>> be released in Beam 2.19.
>>>>
>>>> Would you think you can leverage this in your work on adding new
>>>> Elasticsearch7 features? IMHO, supporting two different related IOs can be
>>>> quite tough task and I‘d rather raise my hand to add a new functionality
>>>> into existing IO than creating a new one, if it’s possible.
>>>>
>>>> [1] https://issues.apache.org/jira/browse/BEAM-5192
>>>> [2] https://github.com/apache/beam/pull/10433
>>>>
>>>> On 22 Jan 2020, at 19:23, Ludovic Boutros <boutr...@gmail.com> wrote:
>>>>
>>>> Dear all,
>>>>
>>>> I have written a completely reworked Elasticsearch 7+ IO module.
>>>> It can be found here:
>>>> https://github.com/ludovic-boutros/beam/tree/fresh-reworked-elasticsearch-io-v7/sdks/java/io/elasticsearch7
>>>>
>>>> This is a quite advance WIP work but I'm a quite new user of Apache
>>>> Beam and I would like to get some help on this :)
>>>>
>>>> I can create a JIRA issue now but I prefer to wait for your wise avises
>>>> first.
>>>>
>>>> *Why a new module ?*
>>>>
>>>> The current module was compliant with Elasticsearch 2.x, 5.x and 6.x.
>>>> This seems to be a good point but so many things have been changed since
>>>> Elasticsearch 2.x.
>>>>
>>>>
>>> Probably this is not correct anymore due to
>>> https://github.com/apache/beam/pull/10433 ?
>>>
>>>
>>>> Elasticsearch 7.x is now partially supported (document type are
>>>> removed, occ, updates...).
>>>>
>>>> A fresh new module, only compliant with the last version of
>>>> Elasticsearch, can easily benefit a lot from the last evolutions of
>>>> Elasticsearch (Java High Level Http Client).
>>>>
>>>> It is therefore far simpler than the current one.
>>>>
>>>> *Error management*
>>>>
>>>> Currently, errors are caught and transformed into simple exceptions.
>>>> This is not always what is needed. If we would like to do specific
>>>> processing on these errors (send documents in error topics for instance),
>>>> it is not possible with the current module.
>>>>
>>>>
>>> Seems like this is some sort of a dead letter queue implementation..
>>> This will be a very good feature to add to the existing connector.
>>>
>>>
>>>>
>>>> *Philosophy*
>>>>
>>>> This module directly uses the Elasticsearch Java client classes as
>>>> inputs and outputs.
>>>>
>>>> This way you can configure any options you need directly in the
>>>> `DocWriteRequest` objects.
>>>>
>>>> For instance:
>>>> - If you need to use external versioning (
>>>> https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html#index-versioning),
>>>> you can.
>>>> - If you need to use an ingest pipelines, you can.
>>>> - If you need to configure an update document/script, you can.
>>>> - If you need to use upserts, you can.
>>>>
>>>> Actually, you should be able to do everything you can do directly with
>>>> Elasticsearch.
>>>>
>>>> Furthermore, it should be easier to keep updating the module with
>>>> future Elasticsearch evolutions.
>>>>
>>>> *Write outputs*
>>>>
>>>> Two outputs are available:
>>>> - Successful indexing output ;
>>>> - Failed indexing output.
>>>>
>>>> They are available in a `WriteResult` object.
>>>>
>>>> These two outputs are represented by
>>>> `PCollection<BulkItemResponseContainer>` objects.
>>>>
>>>> A `BulkItemResponseContainer` contains:
>>>> - the original index request ;
>>>> - the Elasticsearch response ;
>>>> - a batch id.
>>>>
>>>> You can apply any process afterwards (reprocessing, alerting, ...).
>>>>
>>>> *Read input*
>>>>
>>>> You can read documents from Elasticsearch with this module.
>>>> You can specify a `QueryBuilder` in order to filter the retrieved
>>>> documents.
>>>> By default, it retrieves the whole document collection.
>>>>
>>>> If the Elasticsearch index is sharded, multiple slices can be used
>>>> during fetch. That many bundles are created. The maximum bundle count is
>>>> equal to the index shard count.
>>>>
>>>> Thank you !
>>>>
>>>> Ludovic
>>>>
>>>>
>>>>
>>
>>

Re: A new reworked Elasticsearch 7+ IO module

Reply via email to