Chitral Verma created GRIFFIN-326:
-------------------------------------

             Summary: New implementation for Elasticsearch Data Connector 
(Batch)
                 Key: GRIFFIN-326
                 URL: https://issues.apache.org/jira/browse/GRIFFIN-326
             Project: Griffin
          Issue Type: Improvement
            Reporter: Chitral Verma


The current implementation of the Elasticsearch data connector relies on sending POST requests 
from the driver, using either SQL or search mode for query filtering.

This implementation has the following potential issues:
 * Data is fetched for indexes (database scopes in ES) in bulk via a single call on 
the driver. If the index holds a lot of data, the large response payload creates a 
bottleneck on the driver.
 * The driver then has to parse this response payload and parallelize it. This is another 
driver-side bottleneck, as each JSON record must be mapped to a fixed schema in a 
type-safe manner (a simplified sketch of this pattern follows the list).
 * Only _host_, _port_ and _version_ are available as options to configure the 
connection to the ES node or cluster.
 * Source partitioning is not carried forward when the records are parallelized; 
the records end up shuffled by Spark's default partitioning.
 * Even though this implementation is a first-class member of Apache Griffin, 
it is based on the _custom_ connector trait.
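
For illustration, the driver-side pattern described in the first two points is roughly the following. This is a simplified sketch, not the actual Griffin connector code; the endpoint and the parseHits helper are hypothetical.
{code:java}
// Simplified sketch of the driver-side fetch-and-parallelize pattern described
// above; NOT the actual Griffin connector code. The endpoint and the parseHits
// helper are hypothetical.
import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets

import org.apache.spark.sql.{DataFrame, SparkSession}

// Placeholder for the per-record JSON-to-schema mapping done on the driver.
def parseHits(responseBody: String): Seq[String] = Seq.empty

def fetchOnDriver(spark: SparkSession, searchUrl: String, queryJson: String): DataFrame = {
  // 1. A single POST request from the driver pulls the data for the whole index,
  //    so the entire response payload must fit in driver memory.
  val conn = new URL(searchUrl).openConnection().asInstanceOf[HttpURLConnection]
  conn.setRequestMethod("POST")
  conn.setDoOutput(true)
  conn.setRequestProperty("Content-Type", "application/json")
  conn.getOutputStream.write(queryJson.getBytes(StandardCharsets.UTF_8))
  val responseBody = scala.io.Source.fromInputStream(conn.getInputStream).mkString

  // 2. The driver parses every JSON hit and only then parallelizes the records,
  //    so the source partitioning of the index is lost.
  import spark.implicits._
  spark.createDataset(parseHits(responseBody)).toDF("raw_json")
}
{code}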

The proposed implementation aims to:
 * Deprecate the current implementation in favor of the official 
[elasticsearch-hadoop|https://github.com/elastic/elasticsearch-hadoop/tree/master/spark/sql-20] library.
 * This library is built on the DataSource API available in Spark 2.2.x+ and thus 
brings support for filter pushdown, column pruning, a unified read/write interface and 
additional optimizations.
 * Many more configuration options become available for ES connectivity; see 
[ConfigurationOptions|https://github.com/elastic/elasticsearch-hadoop/blob/master/mr/src/main/java/org/elasticsearch/hadoop/cfg/ConfigurationOptions.java].
 * Filters can be applied as expressions directly on the DataFrame and are 
pushed down to the source automatically.

The new implementation will look something like this:
{code:java}
sparkSession.read.format("es").options( ??? ).load("<resource_name>")
{code}
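
A slightly fuller sketch is shown below, assuming the elasticsearch-spark-20 artifact is on the classpath; the node address, index name and column names used here are illustrative only. Any of the settings from the ConfigurationOptions class linked above can be passed through the same option/options calls.
{code:java}
// A minimal sketch, assuming the elasticsearch-spark-20 artifact is on the classpath.
// The node address, index name ("griffin/demo") and column names are illustrative only.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .appName("es-connector-sketch")
  .master("local[*]")
  .getOrCreate()

val df = spark.read
  .format("es")                       // shorthand for org.elasticsearch.spark.sql
  .option("es.nodes", "localhost")    // replaces the current host option
  .option("es.port", "9200")          // replaces the current port option
  .load("griffin/demo")               // index (and optionally type) as the resource name

// Filters and projections expressed on the DataFrame are pushed down
// to Elasticsearch by the connector where possible.
val filtered = df
  .filter(col("status") === "active")
  .select("id", "status")

filtered.show()

// Writes go through the same unified API.
filtered.write.format("es").save("griffin/demo_out")
{code}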



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
