[
https://issues.apache.org/jira/browse/GRIFFIN-326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chitral Verma resolved GRIFFIN-326.
-----------------------------------
Assignee: Chitral Verma
Resolution: Fixed
> New implementation for Elasticsearch Data Connector (Batch)
> -----------------------------------------------------------
>
> Key: GRIFFIN-326
> URL: https://issues.apache.org/jira/browse/GRIFFIN-326
> Project: Griffin
> Issue Type: Sub-task
> Reporter: Chitral Verma
> Assignee: Chitral Verma
> Priority: Major
> Time Spent: 4h 20m
> Remaining Estimate: 0h
>
> The current Elasticsearch connector implementation relies on sending POST
> requests from the driver, using either SQL or search mode for query filtering.
> This implementation has the following potential issues:
> * Data is fetched for indexes (database scopes in ES) in bulk via a single
> call on the driver. If the index holds a lot of data, the large response
> payload creates a bottleneck on the driver.
> * Further, the driver then needs to parse this response payload and
> parallelize it; this is another driver-side bottleneck, as each JSON record
> must be mapped to a fixed schema in a type-safe manner.
> * Only _host_, _port_ and _version_ are available as options to configure
> the connection to the ES node or cluster.
> * Source partitioning logic is not carried forward when parallelizing the
> records; the records are randomized by Spark's default partitioning.
> * Even though this implementation is a first-class member of Apache Griffin,
> it is based on the _custom_ connector trait.
> The proposed implementation aims to:
> * Deprecate the current implementation in favor of the official
> [elasticsearch-hadoop|https://github.com/elastic/elasticsearch-hadoop/tree/master/spark/sql-20]
> library.
> * This library is built on the DataSource API introduced in Spark 2.2.x+ and
> thus brings support for filter pushdown, column pruning, unified reads and
> writes, and additional optimizations.
> * Many configuration options are available for ES connectivity, [check
> here|https://github.com/elastic/elasticsearch-hadoop/blob/master/mr/src/main/java/org/elasticsearch/hadoop/cfg/ConfigurationOptions.java].
> * Any filters can be applied as expressions directly on the data frame and
> are pushed down automatically to the source.
> The new implementation will look something like:
> {code:scala}
> sparkSession.read.format("es").options( ??? ).load("<resource_name>"){code}
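> A fuller sketch of such a batch read might look like the following. This is
> only illustrative: the host, port, index name and filter are placeholder
> assumptions, not the actual Griffin configuration, and it assumes the
> elasticsearch-hadoop Spark SQL artifact is on the classpath.
> {code:scala}
> import org.apache.spark.sql.SparkSession
>
> val spark = SparkSession.builder()
>   .appName("es-batch-read-sketch")
>   .master("local[*]") // local mode for illustration only
>   .getOrCreate()
>
> // Connection settings come from elasticsearch-hadoop's ConfigurationOptions
> // (es.nodes, es.port, ...); "es" is the short alias for the
> // org.elasticsearch.spark.sql data source.
> val df = spark.read
>   .format("es")
>   .option("es.nodes", "localhost") // assumed host
>   .option("es.port", "9200")       // assumed port
>   .load("my_index")                // assumed index name
>
> // Filters expressed on the DataFrame are pushed down to Elasticsearch
> // by the connector, so only matching documents leave the cluster.
> val filtered = df.filter("status = 'active'") // assumed field/value
> {code}
> Note that, unlike the current connector, each executor pulls its own shard
> partitions, so no single-node bottleneck forms on the driver.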
--
This message was sent by Atlassian Jira
(v8.3.4#803005)