Re: A new reworked Elasticsearch 7+ IO module

Chamikara Jayalath Wed, 05 Feb 2020 11:15:21 -0800

On Wed, Feb 5, 2020 at 6:35 AM Etienne Chauchot <[email protected]>
wrote:


> Still there is something I don't agree with is that IOs can be tested on
> mock. We don't really test IO behavior with mocks: there is always special
> behaviors that cannot be reproduced in mocks (split, load, with corner
> cases etc...). There was in the past IOs that were tested using mocks and
> that happened to be nonfunctional.
>
> Regarding ITests we have very few comparing to UTests and they are not as
> closely observed as UTests.
>
> Etienne
> On 05/02/2020 11:32, Jean-Baptiste Onofre wrote:
>
> Hi,
>
> We talked in the past about multiple/single module.
>
> IMHO the always preferred goal is to have a single module. However, it’s
> tricky when we have such difference, including on the user facing API. So,
> I would go with module per version, or use a specified version for a target
> Beam release.
>
> For the test, we should distinguish utest from itest. Utest can be done
> with mock, the purpose is really to test the IO behavior. Then we can have
> itest using concrete ES instance.
>
> Anyway, I’m OK with the proposal and I would like to work on this IO (I
> have other improvements coming on other IOs anyway) with you guys (and
> Ludovic especially).
>
> Regards
> JB
>
> Le 5 févr. 2020 à 10:44, Etienne Chauchot <[email protected]> a écrit :
>
> Hi all,
>
> We had a long discussion with Ludovic about this IO. I'd like to share
> with you to keep you informed and also gather your opinions
>
> 1. regarding version support: ES v2 is no more maintained by Elastic since
> 2018/02 so we plan to remove it from the IO. In the past we already retired
> versions (like spark 1.6 for instance).
>
>
My only concern here is that there might be users who use the existing
module who might not be able to easily upgrade the Beam version if we
remove it. But given that V2 is 5 versions behind the latest release this
might be OK.


> 2. regarding the user: the aim is to unlock some new features (listed by
> Ludovic) and give the user more flexibility on his request. For that, it
> requires to use high level java ES client in place of the low level REST
> client (that was used because it is the only one compatible with all ES
> versions). We plan to replace the API (json document in and out) by more
> complete standard ES objects that contain de request logic (insert/update,
> doc routing etc...) and the data. There are already IOs like SpannerIO that
> use similar objects in input PCollection rather than pure POJOs.
>
>
Won't this be a breaking change for all users ? IMO using POJOs in
PCollections is safer since we have to worry about changes to the
underlying client library API. Exception would be when underlying client
library offers a backwards compatibility guarantee that we can rely on for
the foreseeable future (for example, BQ TableRow).


> 3. regarding multiple/single module: the aim is to have only one
> production code to ease the maintenance.  The problem is that using high
> level client makes the code dependent to an ES lib version. We would like
> to make it invisible to the user. He should select only one jar and the IO
> should decide the lib to use behind the scene. We are thinking about using
> one module and sub-modules per version and use relocation, wrappers and a
> factory that detects the version the IO actually points to to instantiate
> the correct client version. It would also require to have DTOs in the IO
> because the high level ES java objects are not exactly the same among the
> ES versions.
>
> +1 for adding a level of indirection to make this easy for users.

> 4. regarding tests: the aim is always to target real ES backends to have
> relevant tests (for reasons I already explained in another thread). The
> problem is that es-test-framework used today is version dependent and is a
> pain to use. We plan on using test containers per version (validated by ES
> dev advocate) and launching them as part of the UTests. Obviously we will
> launch only one container at the time per version and do all the test with
> it to avoid paying the cost of launch too much. And the tests will be
> shipped in per-version sub-modules and not in test dedicated modules like
> it is now.
>
>
Using a real ES backend for unit tests can be expensive ? Ideally we should
use a Fake (if one available) or mocking (test test out functionality) and
use real backend for IT tests that can be expensive. If this is a local
container that can be shared between unit tests with reasonable efficiency
that is OK. I'm mainly worried about introducing flakes into unit tests due
to network errors or slowness.

Thanks,
Cham


> WDYT ?
>
> Best !
>
> Etienne
> On 30/01/2020 17:55, Alexey Romanenko wrote:
>
> I’m second for this question. We have a similar (maybe a bit less painful)
> issue for KafkaIO and it would be useful to have a general strategy for
> such cases about how to deal with that.
>
> On 24 Jan 2020, at 21:54, Kenneth Knowles <[email protected]> wrote:
>
> Would it make sense to have different version-specialized connectors with
> a common core library and common API package?
>
> On Fri, Jan 24, 2020 at 11:52 AM Chamikara Jayalath <[email protected]>
> wrote:
>
>> Thanks for the contribution. I agree with Alexey that we should try to
>> add any new features brought in with the new PR into existing connector
>> instead of trying to maintain two implementations.
>>
>> Thanks,
>> Cham
>>
>> On Fri, Jan 24, 2020 at 9:01 AM Alexey Romanenko <
>> [email protected]> wrote:
>>
>>> Hi Ludovic,
>>>
>>> Thank you for working on this and sharing the details with us. This is
>>> really great job!
>>>
>>> As I recall, we already have some support of Elasticsearch7 in current
>>> ElasticsearchIO (afaik, at least they are compatible), thanks to Zhong Chen
>>> and Etienne Chauchot, who were working on adding this [1][2] and it should
>>> be released in Beam 2.19.
>>>
>>> Would you think you can leverage this in your work on adding new
>>> Elasticsearch7 features? IMHO, supporting two different related IOs can be
>>> quite tough task and I‘d rather raise my hand to add a new functionality
>>> into existing IO than creating a new one, if it’s possible.
>>>
>>> [1] https://issues.apache.org/jira/browse/BEAM-5192
>>> [2] https://github.com/apache/beam/pull/10433
>>>
>>> On 22 Jan 2020, at 19:23, Ludovic Boutros <[email protected]> wrote:
>>>
>>> Dear all,
>>>
>>> I have written a completely reworked Elasticsearch 7+ IO module.
>>> It can be found here:
>>> https://github.com/ludovic-boutros/beam/tree/fresh-reworked-elasticsearch-io-v7/sdks/java/io/elasticsearch7
>>>
>>> This is a quite advance WIP work but I'm a quite new user of Apache Beam
>>> and I would like to get some help on this :)
>>>
>>> I can create a JIRA issue now but I prefer to wait for your wise avises
>>> first.
>>>
>>> *Why a new module ?*
>>>
>>> The current module was compliant with Elasticsearch 2.x, 5.x and 6.x.
>>> This seems to be a good point but so many things have been changed since
>>> Elasticsearch 2.x.
>>>
>>>
>> Probably this is not correct anymore due to
>> https://github.com/apache/beam/pull/10433 ?
>>
>>
>>> Elasticsearch 7.x is now partially supported (document type are removed,
>>> occ, updates...).
>>>
>>> A fresh new module, only compliant with the last version of
>>> Elasticsearch, can easily benefit a lot from the last evolutions of
>>> Elasticsearch (Java High Level Http Client).
>>>
>>> It is therefore far simpler than the current one.
>>>
>>> *Error management*
>>>
>>> Currently, errors are caught and transformed into simple exceptions.
>>> This is not always what is needed. If we would like to do specific
>>> processing on these errors (send documents in error topics for instance),
>>> it is not possible with the current module.
>>>
>>>
>> Seems like this is some sort of a dead letter queue implementation.. This
>> will be a very good feature to add to the existing connector.
>>
>>
>>>
>>> *Philosophy*
>>>
>>> This module directly uses the Elasticsearch Java client classes as
>>> inputs and outputs.
>>>
>>> This way you can configure any options you need directly in the
>>> `DocWriteRequest` objects.
>>>
>>> For instance:
>>> - If you need to use external versioning (
>>> https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html#index-versioning),
>>> you can.
>>> - If you need to use an ingest pipelines, you can.
>>> - If you need to configure an update document/script, you can.
>>> - If you need to use upserts, you can.
>>>
>>> Actually, you should be able to do everything you can do directly with
>>> Elasticsearch.
>>>
>>> Furthermore, it should be easier to keep updating the module with future
>>> Elasticsearch evolutions.
>>>
>>> *Write outputs*
>>>
>>> Two outputs are available:
>>> - Successful indexing output ;
>>> - Failed indexing output.
>>>
>>> They are available in a `WriteResult` object.
>>>
>>> These two outputs are represented by
>>> `PCollection<BulkItemResponseContainer>` objects.
>>>
>>> A `BulkItemResponseContainer` contains:
>>> - the original index request ;
>>> - the Elasticsearch response ;
>>> - a batch id.
>>>
>>> You can apply any process afterwards (reprocessing, alerting, ...).
>>>
>>> *Read input*
>>>
>>> You can read documents from Elasticsearch with this module.
>>> You can specify a `QueryBuilder` in order to filter the retrieved
>>> documents.
>>> By default, it retrieves the whole document collection.
>>>
>>> If the Elasticsearch index is sharded, multiple slices can be used
>>> during fetch. That many bundles are created. The maximum bundle count is
>>> equal to the index shard count.
>>>
>>> Thank you !
>>>
>>> Ludovic
>>>
>>>
>>>
>
>

Re: A new reworked Elasticsearch 7+ IO module

Reply via email to