[ https://issues.apache.org/jira/browse/GRIFFIN-326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chitral Verma resolved GRIFFIN-326.
-----------------------------------
      Assignee: Chitral Verma
    Resolution: Fixed

> New implementation for Elasticsearch Data Connector (Batch)
> -----------------------------------------------------------
>
>                 Key: GRIFFIN-326
>                 URL: https://issues.apache.org/jira/browse/GRIFFIN-326
>             Project: Griffin
>          Issue Type: Sub-task
>            Reporter: Chitral Verma
>            Assignee: Chitral Verma
>            Priority: Major
>          Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> The current implementation of the Elasticsearch connector relies on sending 
> POST requests from the driver, using either SQL or search mode for query 
> filtering.
> This implementation has the following potential issues:
>  * Data is fetched for indexes (database scopes in ES) in bulk via a single 
> call on the driver. If the index holds a lot of data, the large response 
> payload creates a bottleneck on the driver.
>  * Further, the driver then needs to parse this response payload and 
> parallelize it; this is again a driver-side bottleneck, as each JSON record 
> must be mapped to a fixed schema in a type-safe manner.
>  * Only _host_, _port_ and _version_ are available as options to configure 
> the connection to the ES node or cluster.
>  * Source partitioning logic is not carried forward when parallelizing 
> records; the records are shuffled randomly by Spark's default partitioning.
>  * Even though this implementation is a first-class member of Apache Griffin, 
> it is based on the _custom_ connector trait.
> The proposed implementation aims to:
>  * Deprecate the current implementation in favor of the official 
> [elasticsearch-hadoop|https://github.com/elastic/elasticsearch-hadoop/tree/master/spark/sql-20] 
> library.
>  * This library is built on the Spark DataSource API (Spark 2.2.x+) and thus 
> brings support for filter pushdown, column pruning, unified read/write paths 
> and additional optimizations.
>  * Many configuration options are available for ES connectivity; [check 
> here|https://github.com/elastic/elasticsearch-hadoop/blob/master/mr/src/main/java/org/elasticsearch/hadoop/cfg/ConfigurationOptions.java] 
> for the full list.
>  * Any filters can be applied as expressions directly on the DataFrame and 
> are pushed down automatically to the source (see the sketch after the snippet 
> below).
> The new implementation will look something like this:
> {code:java}
> sparkSession.read.format("es").options( ??? ).load("<resource_name>")
> {code}
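> A fuller sketch of this read path, assuming the elasticsearch-spark (sql-20) 
> artifact is on the classpath and a reachable ES node; the host, port, 
> resource name and the _status_ field below are placeholders for illustration, 
> not part of this issue:
> {code:java}
> import org.apache.spark.sql.SparkSession
>
> val spark = SparkSession.builder()
>   .appName("es-connector-sketch")
>   .master("local[*]")
>   .getOrCreate()
>
> val df = spark.read
>   .format("es")                    // short name registered by elasticsearch-hadoop
>   .option("es.nodes", "localhost") // see ConfigurationOptions for all keys
>   .option("es.port", "9200")
>   .load("griffin_demo/doc")        // "<index>/<type>" resource (placeholder)
>
> // Filters expressed on the DataFrame are translated to ES query DSL and
> // pushed down, so only matching documents leave the cluster.
> val filtered = df.filter(df("status") === "active")
> filtered.explain(true)             // PushedFilters should list the predicate
> {code}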



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
