Re: A new reworked Elasticsearch 7+ IO module

Etienne Chauchot Thu, 06 Feb 2020 01:51:13 -0800

Hi,

Thanks all for your comments, my comments are inline


On 06/02/2020 09:47, Ludovic Boutros wrote:

Hi all,

First, thank you all for your answers and especially, Etienne for yourtime, advises and kindness :)

@Jean-Baptiste, any help on this module is welcome of course.

@Chamikara Jayalath, my aswers are inline.

Have a good day !

Ludovic

Le mer. 5 févr. 2020 à 20:15, Chamikara Jayalath <chamik...@google.com<mailto:chamik...@google.com>> a écrit :




    On Wed, Feb 5, 2020 at 6:35 AM Etienne Chauchot
    <echauc...@apache.org <mailto:echauc...@apache.org>> wrote:

        Still there is something I don't agree with is that IOs can be
        tested on mock. We don't really test IO behavior with mocks:
        there is always special behaviors that cannot be reproduced in
        mocks (split, load, with corner cases etc...). There was in
        the past IOs that were tested using mocks and that happened to
        be nonfunctional.

        Regarding ITests we have very few comparing to UTests and they
        are not as closely observed as UTests.

        Etienne

        On 05/02/2020 11:32, Jean-Baptiste Onofre wrote:

        Hi,

        We talked in the past about multiple/single module.

        IMHO the always preferred goal is to have a single module.
        However, it’s tricky when we have such difference, including
        on the user facing API. So, I would go with module per
        version, or use a specified version for a target Beam release.

        For the test, we should distinguish utest from itest. Utest
        can be done with mock, the purpose is really to test the IO
        behavior. Then we can have itest using concrete ES instance.

        Anyway, I’m OK with the proposal and I would like to work on
        this IO (I have other improvements coming on other IOs
        anyway) with you guys (and Ludovic especially).

        Regards
        JB

        Le 5 févr. 2020 à 10:44, Etienne Chauchot
        <echauc...@apache.org <mailto:echauc...@apache.org>> a écrit :

        Hi all,

        We had a long discussion with Ludovic about this IO. I'd
        like to share with you to keep you informed and also gather
        your opinions

        1. regarding version support: ES v2 is no more maintained by
        Elastic since 2018/02 so we plan to remove it from the IO.
        In the past we already retired versions (like spark 1.6 for
        instance).


    My only concern here is that there might be users who use the
    existing module who might not be able to easily upgrade the Beam
    version if we remove it. But given that V2 is 5 versions behind
    the latest release this might be OK.


It seems we have a consensus on this.

I think there should be another general discussion on the long termsupport of our prefered tool IO modules.


=> yes, consensus, let's drop ESV2

        2. regarding the user: the aim is to unlock some new
        features (listed by Ludovic) and give the user more
        flexibility on his request. For that, it requires to use
        high level java ES client in place of the low level REST
        client (that was used because it is the only one compatible
        with all ES versions). We plan to replace the API (json
        document in and out) by more complete standard ES objects
        that contain de request logic (insert/update, doc routing
        etc...) and the data. There are already IOs like SpannerIO
        that use similar objects in input PCollection rather than
        pure POJOs.


    Won't this be a breaking change for all users ? IMO using POJOs in
    PCollections is safer since we have to worry about changes to the
    underlying client library API. Exception would be when underlying
    client library offers a backwards compatibility guarantee that we
    can rely on for the foreseeable future (for example, BQ TableRow).

Agreed but actually, there will be POJOs in order to abstractElasticsearch's version support. The following third point explains this.

=> indeed it will be a breaking change, hence this email to get aconsensus on that. Also I think our wrappers of ES request objects willoffer a backward compatible as the underlying objects

        3. regarding multiple/single module: the aim is to have only
        one production code to ease the maintenance.  The problem is
        that using high level client makes the code dependent to an
        ES lib version. We would like to make it invisible to the
        user. He should select only one jar and the IO should decide
        the lib to use behind the scene. We are thinking about using
        one module and sub-modules per version and use relocation,
        wrappers and a factory that detects the version the IO
        actually points to to instantiate the correct client
        version. It would also require to have DTOs in the IO
        because the high level ES java objects are not exactly the
        same among the ES versions.

    +1 for adding a level of indirection to make this easy for users.

        4. regarding tests: the aim is always to target real ES
        backends to have relevant tests (for reasons I already
        explained in another thread). The problem is that
        es-test-framework used today is version dependent and is a
        pain to use. We plan on using test containers per version
        (validated by ES dev advocate) and launching them as part of
        the UTests. Obviously we will launch only one container at
        the time per version and do all the test with it to avoid
        paying the cost of launch too much. And the tests will be
        shipped in per-version sub-modules and not in test dedicated
        modules like it is now.


    Using a real ES backend for unit tests can be expensive ? Ideally
    we should use a Fake (if one available) or mocking (test test out
    functionality) and use real backend for IT tests that can be
    expensive. If this is a local container that can be shared between
    unit tests with reasonable efficiency that is OK. I'm mainly
    worried about introducing flakes into unit tests due to network
    errors or slowness.

On this point I understand it and I agree as well, but I think Etienneis right, IO modules should be exceptions to this rule.It is really difficult to really test an IO without a real backend andDocker helps a lot here.But I take the point, these tests must be as small as posssible (oneper Elasticsearch version).

=> yes IMHO real backends are mandatory for IO testing: no good splittest without a proper backend also it happened several times that theoutput of the backend has changed its format and thus the IO had toparse differently. Such a failure would not have been detected with amock. Also all the advanced features (dynamic routing, retrials, partialupdates etc..) need a real backend. These specifics are part of thebasic work of the IO and as such should be tested in UTests IMHO.

That being said, to reduce flakiness we will reduce backend consumptionto the strict minimum as it was done for embedded backend: as fewcontainers as possible, limit the parallelism to 1, disable optionalfeatures etc...


Best

Etienne




    Thanks,
    Cham

        WDYT ?

        Best !

        Etienne

        On 30/01/2020 17:55, Alexey Romanenko wrote:

        I’m second for this question. We have a similar (maybe a
        bit less painful) issue for KafkaIO and it would be useful
        to have a general strategy for such cases about how to deal
        with that.

        On 24 Jan 2020, at 21:54, Kenneth Knowles <k...@apache.org
        <mailto:k...@apache.org>> wrote:

        Would it make sense to have different version-specialized
        connectors with a common core library and common API package?

        On Fri, Jan 24, 2020 at 11:52 AM Chamikara Jayalath
        <chamik...@google.com <mailto:chamik...@google.com>> wrote:

            Thanks for the contribution. I agree with Alexey that
            we should try to add any new features brought in with
            the new PR into existing connector instead of trying
            to maintain two implementations.

            Thanks,
            Cham

            On Fri, Jan 24, 2020 at 9:01 AM Alexey Romanenko
            <aromanenko....@gmail.com
            <mailto:aromanenko....@gmail.com>> wrote:

                Hi Ludovic,

                Thank you for working on this and sharing the
                details with us. This is really great job!

                As I recall, we already have some support
                of Elasticsearch7 in current ElasticsearchIO
                (afaik, at least they are compatible), thanks to
                Zhong Chen and Etienne Chauchot, who were working
                on adding this [1][2] and it should be released in
                Beam 2.19.

                Would you think you can leverage this in your work
                on adding new Elasticsearch7 features? IMHO,
                supporting two different related IOs can be quite
                tough task and I‘d rather raise my hand to add a
                new functionality into existing IO than creating a
                new one, if it’s possible.

                [1] https://issues.apache.org/jira/browse/BEAM-5192
                [2] https://github.com/apache/beam/pull/10433

                On 22 Jan 2020, at 19:23, Ludovic Boutros
                <boutr...@gmail.com <mailto:boutr...@gmail.com>>
                wrote:

                Dear all,

                I have written a completely reworked
                Elasticsearch 7+ IO module.
                It can be found here:
                
https://github.com/ludovic-boutros/beam/tree/fresh-reworked-elasticsearch-io-v7/sdks/java/io/elasticsearch7

                This is a quite advance WIP work but I'm a quite
                new user of Apache Beam and I would like to get
                some help on this :)

                I can create a JIRA issue now but I prefer to
                wait for your wise avises first.

                _Why a new module ?_

                The current module was compliant with
                Elasticsearch 2.x, 5.x and 6.x. This seems to be
                a good point but so many things have been changed
                since Elasticsearch 2.x.



            Probably this is not correct anymore due to
            https://github.com/apache/beam/pull/10433 ?

                Elasticsearch 7.x is now partially supported
                (document type are removed, occ, updates...).

                A fresh new module, only compliant with the last
                version of Elasticsearch, can easily benefit a
                lot from the last evolutions of Elasticsearch
                (Java High Level Http Client).

                It is therefore far simpler than the current one.

                _Error management_

                Currently, errors are caught and transformed into
                simple exceptions. This is not always what is
                needed. If we would like to do specific
                processing on these errors (send documents in
                error topics for instance), it is not possible
                with the current module.



            Seems like this is some sort of a dead letter queue
            implementation.. This will be a very good feature to
            add to the existing connector.

_Philosophy_

This module directly uses the Elasticsearch Java
client classes as inputs and outputs.

This way you can configure any options you need
directly in the `DocWriteRequest` objects.

For instance:
- If you need to use external versioning

(https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html#index-versioning),
you can.
- If you need to use an ingest pipelines, you can.
- If you need to configure an update
document/script, you can.
- If you need to use upserts, you can.

Actually, you should be able to do everything you
can do directly with Elasticsearch.

Furthermore, it should be easier to keep updating
the module with future Elasticsearch evolutions.

_Write outputs_

Two outputs are available:
- Successful indexing output ;
- Failed indexing output.

They are available in a `WriteResult` object.

These two outputs are represented by
`PCollection<BulkItemResponseContainer>` objects.

A `BulkItemResponseContainer` contains:
- the original index request ;
- the Elasticsearch response ;
- a batch id.

You can apply any process afterwards
(reprocessing, alerting, ...).

_Read input_

You can read documents from Elasticsearch with
this module.
You can specify a `QueryBuilder` in order to
filter the retrieved documents.
By default, it retrieves the whole document
collection.

If the Elasticsearch index is sharded, multiple
slices can be used during fetch. That many
bundles are created. The maximum bundle count is
equal to the index shard count.

Thank you !

Ludovic

Re: A new reworked Elasticsearch 7+ IO module

Reply via email to