[jira] [Work logged] (GRIFFIN-326) New implementation for Elasticsearch Data Connector (Batch)

2020-06-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/GRIFFIN-326?focusedWorklogId=441228&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-441228
 ]

ASF GitHub Bot logged work on GRIFFIN-326:
--

Author: ASF GitHub Bot
Created on: 04/Jun/20 11:09
Start Date: 04/Jun/20 11:09
Worklog Time Spent: 10m 
  Work Description: chitralverma commented on pull request #569:
URL: https://github.com/apache/griffin/pull/569#issuecomment-638782290


   @guoyuepeng It seems a build has been running for this for a month now. Can 
you check this?
   
   
https://travis-ci.org/github/apache/griffin/builds/694599100?utm_source=github_status&utm_medium=notification



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 441228)
Time Spent: 4h 20m  (was: 4h 10m)

> New implementation for Elasticsearch Data Connector (Batch)
> ---
>
> Key: GRIFFIN-326
> URL: https://issues.apache.org/jira/browse/GRIFFIN-326
> Project: Griffin
>  Issue Type: Sub-task
>Reporter: Chitral Verma
>Priority: Major
>  Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> The current implementation of the Elasticsearch connector relies on sending 
> POST requests from the driver, using either SQL or search mode for query 
> filtering.
> This implementation has the following potential issues:
>  * Data is fetched for indexes (database scopes in ES) in bulk via a single 
> call on the driver. If the index holds a lot of data, the large response 
> payload creates a bottleneck on the driver.
>  * The driver must then parse this response payload and parallelize it. This 
> is again a driver-side bottleneck, as each JSON record needs to be mapped to 
> a set schema in a type-safe manner.
>  * Only _host_, _port_ and _version_ are available as options to configure 
> the connection to the ES node or cluster.
>  * Source partitioning logic is not carried forward when parallelizing 
> records; the records are shuffled by Spark's default partitioning.
>  * Even though this implementation is a first-class member of Apache Griffin, 
> it is based on the _custom_ connector trait.
> The proposed implementation aims to:
>  * Deprecate the current implementation in favor of the official 
> [elasticsearch-hadoop|https://github.com/elastic/elasticsearch-hadoop/tree/master/spark/sql-20]
>  library.
>  * This library is built on the DataSource API available in Spark 2.2.x+ and 
> thus brings support for filter pushdown, column pruning, unified read and 
> write, and additional optimizations.
>  * Many configuration options are available for ES connectivity; see 
> [ConfigurationOptions|https://github.com/elastic/elasticsearch-hadoop/blob/master/mr/src/main/java/org/elasticsearch/hadoop/cfg/ConfigurationOptions.java].
>  * Any filters can be applied as expressions directly on the data frame and 
> are pushed down automatically to the source.
> The new implementation will look something like:
> {code:java}
> sparkSession.read.format("es").options( ??? ).load(""){code}
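The proposed read path can be sketched with the elasticsearch-hadoop Spark SQL integration. This is a minimal sketch, not the final connector: the node address, index name, and filter column below are hypothetical placeholders, and it assumes the `elasticsearch-spark-20` artifact is on the classpath and an ES node is reachable.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark: SparkSession = SparkSession.builder()
  .appName("es-batch-connector-sketch")
  .master("local[*]")
  .getOrCreate()

// "es" is the short format name registered by elasticsearch-hadoop.
// Any key from ConfigurationOptions (es.nodes, es.port, es.net.*, ...)
// can be passed via .option(...), replacing the old host/port/version trio.
val df: DataFrame = spark.read
  .format("es")
  .option("es.nodes", "localhost")   // placeholder node address
  .option("es.port", "9200")
  .load("my_index")                  // placeholder index name

// Filters expressed on the DataFrame are translated to ES query DSL and
// pushed down to the source, so only matching documents reach Spark.
val filtered = df.filter(df("status") === "active")
filtered.show()
```

Because the library implements the DataSource API, the executors read from the index shards in parallel, which avoids the single driver-side fetch and parse described above.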



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (GRIFFIN-326) New implementation for Elasticsearch Data Connector (Batch)

2020-06-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/GRIFFIN-326?focusedWorklogId=441227&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-441227
 ]

ASF GitHub Bot logged work on GRIFFIN-326:
--

Author: ASF GitHub Bot
Created on: 04/Jun/20 11:01
Start Date: 04/Jun/20 11:01
Worklog Time Spent: 10m 
  Work Description: asfgit closed pull request #569:
URL: https://github.com/apache/griffin/pull/569


   





Issue Time Tracking
---

Worklog Id: (was: 441227)
Time Spent: 4h 10m  (was: 4h)



[jira] [Work logged] (GRIFFIN-326) New implementation for Elasticsearch Data Connector (Batch)

2020-06-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/GRIFFIN-326?focusedWorklogId=441224&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-441224
 ]

ASF GitHub Bot logged work on GRIFFIN-326:
--

Author: ASF GitHub Bot
Created on: 04/Jun/20 10:56
Start Date: 04/Jun/20 10:56
Worklog Time Spent: 10m 
  Work Description: guoyuepeng edited a comment on pull request #569:
URL: https://github.com/apache/griffin/pull/569#issuecomment-638776604


   LGTM, will merge it.





Issue Time Tracking
---

Worklog Id: (was: 441224)
Time Spent: 4h  (was: 3h 50m)



[jira] [Work logged] (GRIFFIN-326) New implementation for Elasticsearch Data Connector (Batch)

2020-06-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/GRIFFIN-326?focusedWorklogId=441223&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-441223
 ]

ASF GitHub Bot logged work on GRIFFIN-326:
--

Author: ASF GitHub Bot
Created on: 04/Jun/20 10:56
Start Date: 04/Jun/20 10:56
Worklog Time Spent: 10m 
  Work Description: guoyuepeng commented on pull request #569:
URL: https://github.com/apache/griffin/pull/569#issuecomment-638776604


   Reviewed, will merge it.





Issue Time Tracking
---

Worklog Id: (was: 441223)
Time Spent: 3h 50m  (was: 3h 40m)



[jira] [Work logged] (GRIFFIN-326) New implementation for Elasticsearch Data Connector (Batch)

2020-06-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/GRIFFIN-326?focusedWorklogId=439685&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-439685
 ]

ASF GitHub Bot logged work on GRIFFIN-326:
--

Author: ASF GitHub Bot
Created on: 01/Jun/20 19:12
Start Date: 01/Jun/20 19:12
Worklog Time Spent: 10m 
  Work Description: chitralverma edited a comment on pull request #569:
URL: https://github.com/apache/griffin/pull/569#issuecomment-637049574


   @guoyuepeng @wankunde I've made the necessary changes in this PR to fix the 
failures, and the build is now a success. Can you please review and merge this 
now?
   
   Key changes since the failures:
   - As per @icesmartjuan's suggestion, updated the version of `typescript`.
   - As per @wankunde's suggestion, refactored `FileBasedDataConnectorTest`.
   - Refactored `ElasticSearchDataConnectorTest` as it was too 
resource-intensive for the builder.
   - Minor changes to other connectors, which were suppressing read exceptions.





Issue Time Tracking
---

Worklog Id: (was: 439685)
Time Spent: 3.5h  (was: 3h 20m)



[jira] [Work logged] (GRIFFIN-326) New implementation for Elasticsearch Data Connector (Batch)

2020-06-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/GRIFFIN-326?focusedWorklogId=439683&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-439683
 ]

ASF GitHub Bot logged work on GRIFFIN-326:
--

Author: ASF GitHub Bot
Created on: 01/Jun/20 19:11
Start Date: 01/Jun/20 19:11
Worklog Time Spent: 10m 
  Work Description: chitralverma commented on pull request #569:
URL: https://github.com/apache/griffin/pull/569#issuecomment-637049574


   @guoyuepeng @wankunde I've made the necessary changes in this PR to fix the 
failures; can you please review and merge this now?
   
   Key changes since the failures:
   - As per @icesmartjuan's suggestion, updated the version of `typescript`.
   - As per @wankunde's suggestion, refactored `FileBasedDataConnector`.
   - Refactored `ElasticSearchDataConnectorTest` as it was too 
resource-intensive for the builder.
   - Minor changes to other connectors, which were suppressing read exceptions.





Issue Time Tracking
---

Worklog Id: (was: 439683)
Time Spent: 3h 10m  (was: 3h)



[jira] [Work logged] (GRIFFIN-326) New implementation for Elasticsearch Data Connector (Batch)

2020-06-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/GRIFFIN-326?focusedWorklogId=439684&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-439684
 ]

ASF GitHub Bot logged work on GRIFFIN-326:
--

Author: ASF GitHub Bot
Created on: 01/Jun/20 19:11
Start Date: 01/Jun/20 19:11
Worklog Time Spent: 10m 
  Work Description: chitralverma edited a comment on pull request #569:
URL: https://github.com/apache/griffin/pull/569#issuecomment-637049574


   @guoyuepeng @wankunde I've made the necessary changes in this PR to fix the 
failures, and the build is now a success. Can you please review and merge this 
now?
   
   Key changes since the failures:
   - As per @icesmartjuan's suggestion, updated the version of `typescript`.
   - As per @wankunde's suggestion, refactored `FileBasedDataConnector`.
   - Refactored `ElasticSearchDataConnectorTest` as it was too 
resource-intensive for the builder.
   - Minor changes to other connectors, which were suppressing read exceptions.





Issue Time Tracking
---

Worklog Id: (was: 439684)
Time Spent: 3h 20m  (was: 3h 10m)



[jira] [Work logged] (GRIFFIN-326) New implementation for Elasticsearch Data Connector (Batch)

2020-05-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/GRIFFIN-326?focusedWorklogId=433610&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-433610
 ]

ASF GitHub Bot logged work on GRIFFIN-326:
--

Author: ASF GitHub Bot
Created on: 15/May/20 09:13
Start Date: 15/May/20 09:13
Worklog Time Spent: 10m 
  Work Description: chitralverma commented on pull request #569:
URL: https://github.com/apache/griffin/pull/569#issuecomment-629126531


   @icesmartjuan thanks, I'll try that and commit.





Issue Time Tracking
---

Worklog Id: (was: 433610)
Time Spent: 3h  (was: 2h 50m)



[jira] [Work logged] (GRIFFIN-326) New implementation for Elasticsearch Data Connector (Batch)

2020-05-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/GRIFFIN-326?focusedWorklogId=433609&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-433609
 ]

ASF GitHub Bot logged work on GRIFFIN-326:
--

Author: ASF GitHub Bot
Created on: 15/May/20 09:12
Start Date: 15/May/20 09:12
Worklog Time Spent: 10m 
  Work Description: icesmartjuan commented on pull request #569:
URL: https://github.com/apache/griffin/pull/569#issuecomment-629126063


   Hi @chitralverma, the type seems to have been added in TypeScript 2.7; can 
you please give it a try? (It works fine for me now.)
   - Change the typescript version from `"typescript": "~2.3.3",` to 
`"typescript": "~2.7",` in `package.json`.
   - Remove `node_modules` and `package-lock.json`.
   - Re-run `npm install`.
   - Then run `npm start`; it should now succeed.





Issue Time Tracking
---

Worklog Id: (was: 433609)
Time Spent: 2h 50m  (was: 2h 40m)



[jira] [Work logged] (GRIFFIN-326) New implementation for Elasticsearch Data Connector (Batch)

2020-05-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/GRIFFIN-326?focusedWorklogId=433016&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-433016
 ]

ASF GitHub Bot logged work on GRIFFIN-326:
--

Author: ASF GitHub Bot
Created on: 14/May/20 07:03
Start Date: 14/May/20 07:03
Worklog Time Spent: 10m 
  Work Description: guoyuepeng commented on pull request #569:
URL: https://github.com/apache/griffin/pull/569#issuecomment-628432592


   Re-triggering the build.





Issue Time Tracking
---

Worklog Id: (was: 433016)
Time Spent: 2h 40m  (was: 2.5h)






[jira] [Work logged] (GRIFFIN-326) New implementation for Elasticsearch Data Connector (Batch)

2020-05-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/GRIFFIN-326?focusedWorklogId=433010=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-433010
 ]

ASF GitHub Bot logged work on GRIFFIN-326:
--

Author: ASF GitHub Bot
Created on: 14/May/20 06:40
Start Date: 14/May/20 06:40
Worklog Time Spent: 10m 
  Work Description: guoyuepeng commented on pull request #569:
URL: https://github.com/apache/griffin/pull/569#issuecomment-628422615


   retest this, please





Issue Time Tracking
---

Worklog Id: (was: 433010)
Time Spent: 2.5h  (was: 2h 20m)






[jira] [Work logged] (GRIFFIN-326) New implementation for Elasticsearch Data Connector (Batch)

2020-05-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/GRIFFIN-326?focusedWorklogId=432989=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-432989
 ]

ASF GitHub Bot logged work on GRIFFIN-326:
--

Author: ASF GitHub Bot
Created on: 14/May/20 05:45
Start Date: 14/May/20 05:45
Worklog Time Spent: 10m 
  Work Description: guoyuepeng commented on pull request #569:
URL: https://github.com/apache/griffin/pull/569#issuecomment-628400718


   > Anyway to retrigger a build without a blank commit?
   
   let me figure out how to trigger the build.





Issue Time Tracking
---

Worklog Id: (was: 432989)
Time Spent: 2h 20m  (was: 2h 10m)






[jira] [Work logged] (GRIFFIN-326) New implementation for Elasticsearch Data Connector (Batch)

2020-05-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/GRIFFIN-326?focusedWorklogId=431991=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-431991
 ]

ASF GitHub Bot logged work on GRIFFIN-326:
--

Author: ASF GitHub Bot
Created on: 08/May/20 05:55
Start Date: 08/May/20 05:55
Worklog Time Spent: 10m 
  Work Description: wankunde commented on pull request #569:
URL: https://github.com/apache/griffin/pull/569#issuecomment-625645288


   @chitralverma 
   
   https://github.com/DefinitelyTyped/DefinitelyTyped/issues/43977
   
   There seems to be a mismatch with the npm packages. @guoyuepeng, can you 
help with this?





Issue Time Tracking
---

Worklog Id: (was: 431991)
Time Spent: 2h 10m  (was: 2h)






[jira] [Work logged] (GRIFFIN-326) New implementation for Elasticsearch Data Connector (Batch)

2020-05-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/GRIFFIN-326?focusedWorklogId=429861=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-429861
 ]

ASF GitHub Bot logged work on GRIFFIN-326:
--

Author: ASF GitHub Bot
Created on: 02/May/20 18:04
Start Date: 02/May/20 18:04
Worklog Time Spent: 10m 
  Work Description: chitralverma commented on pull request #569:
URL: https://github.com/apache/griffin/pull/569#issuecomment-622992211


   @wankunde the build has failed again due to some error on the UI side; 
   I'm not sure of the fix here.





Issue Time Tracking
---

Worklog Id: (was: 429861)
Time Spent: 2h  (was: 1h 50m)






[jira] [Work logged] (GRIFFIN-326) New implementation for Elasticsearch Data Connector (Batch)

2020-04-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/GRIFFIN-326?focusedWorklogId=421137=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-421137
 ]

ASF GitHub Bot logged work on GRIFFIN-326:
--

Author: ASF GitHub Bot
Created on: 13/Apr/20 07:39
Start Date: 13/Apr/20 07:39
Worklog Time Spent: 10m 
  Work Description: DragonPrince1992 commented on issue #569: [GRIFFIN-326] 
New Data Connector for Elasticsearch
URL: https://github.com/apache/griffin/pull/569#issuecomment-612790960
 
 
   Can I add you on WeChat? It would be easier to communicate.
   
   
   
   
   
   
   
   
   
   





[jira] [Work logged] (GRIFFIN-326) New implementation for Elasticsearch Data Connector (Batch)

2020-04-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/GRIFFIN-326?focusedWorklogId=421124=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-421124
 ]

ASF GitHub Bot logged work on GRIFFIN-326:
--

Author: ASF GitHub Bot
Created on: 13/Apr/20 07:18
Start Date: 13/Apr/20 07:18
Worklog Time Spent: 10m 
  Work Description: DragonPrince1992 commented on issue #569: [GRIFFIN-326] 
New Data Connector for Elasticsearch
URL: https://github.com/apache/griffin/pull/569#issuecomment-612784719
 
 
   I can't get it. Can I fix this via configuration, under the file 
griffin-master\measure\src\main\resources\env-batch.json?
   
   
   
   
   
   
   





[jira] [Work logged] (GRIFFIN-326) New implementation for Elasticsearch Data Connector (Batch)

2020-04-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/GRIFFIN-326?focusedWorklogId=421119=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-421119
 ]

ASF GitHub Bot logged work on GRIFFIN-326:
--

Author: ASF GitHub Bot
Created on: 13/Apr/20 07:09
Start Date: 13/Apr/20 07:09
Worklog Time Spent: 10m 
  Work Description: chitralverma commented on issue #569: [GRIFFIN-326] New 
Data Connector for Elasticsearch
URL: https://github.com/apache/griffin/pull/569#issuecomment-612782390
 
 
   also, you have to use `es.nodes` not `es.hostname`
   
   This PR has not been closed yet
   
 



Issue Time Tracking
---

Worklog Id: (was: 421119)
Time Spent: 1.5h  (was: 1h 20m)






[jira] [Work logged] (GRIFFIN-326) New implementation for Elasticsearch Data Connector (Batch)

2020-04-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/GRIFFIN-326?focusedWorklogId=421117=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-421117
 ]

ASF GitHub Bot logged work on GRIFFIN-326:
--

Author: ASF GitHub Bot
Created on: 13/Apr/20 07:07
Start Date: 13/Apr/20 07:07
Worklog Time Spent: 10m 
  Work Description: DragonPrince1992 commented on issue #569: [GRIFFIN-326] 
New Data Connector for Elasticsearch
URL: https://github.com/apache/griffin/pull/569#issuecomment-612781759
 
 
   
   
   
   I could not see healthy graphics; how do I solve this?
   
   





[jira] [Work logged] (GRIFFIN-326) New implementation for Elasticsearch Data Connector (Batch)

2020-04-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/GRIFFIN-326?focusedWorklogId=421110=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-421110
 ]

ASF GitHub Bot logged work on GRIFFIN-326:
--

Author: ASF GitHub Bot
Created on: 13/Apr/20 06:58
Start Date: 13/Apr/20 06:58
Worklog Time Spent: 10m 
  Work Description: DragonPrince1992 commented on issue #569: [GRIFFIN-326] 
New Data Connector for Elasticsearch
URL: https://github.com/apache/griffin/pull/569#issuecomment-612779282
 
 
   The ES hostname is configured correctly, but after the job finishes, querying ES returns no data. What could be the reason?
 



Issue Time Tracking
---

Worklog Id: (was: 421110)
Time Spent: 1h  (was: 50m)






[jira] [Work logged] (GRIFFIN-326) New implementation for Elasticsearch Data Connector (Batch)

2020-04-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/GRIFFIN-326?focusedWorklogId=420898=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-420898
 ]

ASF GitHub Bot logged work on GRIFFIN-326:
--

Author: ASF GitHub Bot
Created on: 12/Apr/20 07:06
Start Date: 12/Apr/20 07:06
Worklog Time Spent: 10m 
  Work Description: wankunde commented on issue #569: [GRIFFIN-326] New 
Data Connector for Elasticsearch
URL: https://github.com/apache/griffin/pull/569#issuecomment-612574030
 
 
   @chitralverma @guoyuepeng 
   
   `FileBasedDataConnectorTest` generates some test data files in the 
`file://${getClass.getResource("/").getPath}` directory; I am not sure whether Travis 
has restrictions on this directory. If this problem occurs again, 
@chitralverma could put those files directly into the test resources directory.
   
   @chitralverma could you update your PR with a small change and try again? 
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 420898)
Time Spent: 50m  (was: 40m)

> New implementation for Elasticsearch Data Connector (Batch)
> ---
>
> Key: GRIFFIN-326
> URL: https://issues.apache.org/jira/browse/GRIFFIN-326
> Project: Griffin
>  Issue Type: Sub-task
>Reporter: Chitral Verma
>Priority: Major
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> The current implementation of the Elasticsearch data connector relies on 
> sending POST requests from the driver, using either SQL or search mode for 
> query filtering.
> This implementation has the following potential issues:
>  * Data is fetched for indexes (database scopes in ES) in bulk via a single 
> call on the driver. If the index holds a lot of data, the large response 
> payload creates a bottleneck on the driver.
>  * Further, the driver then needs to parse this response payload and 
> parallelize it; this is again a driver-side bottleneck, as each JSON record 
> needs to be mapped to a set schema in a type-safe manner.
>  * _host_, _port_ and _version_ are the only available options for configuring 
> the connection to the ES node or cluster.
>  * Source partitioning logic is not carried forward when parallelizing 
> records; the records are randomized by Spark's default partitioning.
>  * Even though this implementation is a first-class member of Apache Griffin, 
> it is based on a _custom_ connector trait.
> The proposed implementation aims to:
>  * Deprecate the current implementation in favor of the official 
> [elasticsearch-hadoop|https://github.com/elastic/elasticsearch-hadoop/tree/master/spark/sql-20]
>  library.
>  * This library is built on the DataSource API available in Spark 2.2.x+ and 
> thus brings support for filter pushdown, column pruning, unified reads and 
> writes, and additional optimizations.
>  * Many configuration options are available for ES connectivity; [check 
> here|https://github.com/elastic/elasticsearch-hadoop/blob/master/mr/src/main/java/org/elasticsearch/hadoop/cfg/ConfigurationOptions.java].
>  * Any filters can be applied as expressions directly on the data frame and 
> are pushed down automatically to the source.
> The new implementation will look something like:
> {code:java}
> sparkSession.read.format("es").options( ??? ).load(""){code}





[jira] [Work logged] (GRIFFIN-326) New implementation for Elasticsearch Data Connector (Batch)

2020-04-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/GRIFFIN-326?focusedWorklogId=420850&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-420850
 ]

ASF GitHub Bot logged work on GRIFFIN-326:
--

Author: ASF GitHub Bot
Created on: 12/Apr/20 01:01
Start Date: 12/Apr/20 01:01
Worklog Time Spent: 10m 
  Work Description: guoyuepeng commented on issue #569: [GRIFFIN-326] New 
Data Connector for Elasticsearch
URL: https://github.com/apache/griffin/pull/569#issuecomment-612545643
 
 
   No output has been received in the last 10m0s, this potentially indicates a 
stalled build or something wrong with the build itself.
   Check the details on how to adjust your build configuration on: 
https://docs.travis-ci.com/user/common-build-problems/#build-times-out-because-no-output-was-received
   
   @wankunde 
   The same issue as before?
 



Issue Time Tracking
---

Worklog Id: (was: 420850)
Time Spent: 40m  (was: 0.5h)

> New implementation for Elasticsearch Data Connector (Batch)
> ---
>
> Key: GRIFFIN-326
> URL: https://issues.apache.org/jira/browse/GRIFFIN-326
> Project: Griffin
>  Issue Type: Sub-task
>Reporter: Chitral Verma
>Priority: Major
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> The current implementation of the Elasticsearch data connector relies on 
> sending POST requests from the driver, using either SQL or search mode for 
> query filtering.
> This implementation has the following potential issues:
>  * Data is fetched for indexes (database scopes in ES) in bulk via a single 
> call on the driver. If the index holds a lot of data, the large response 
> payload creates a bottleneck on the driver.
>  * Further, the driver then needs to parse this response payload and 
> parallelize it; this is again a driver-side bottleneck, as each JSON record 
> needs to be mapped to a set schema in a type-safe manner.
>  * _host_, _port_ and _version_ are the only available options for configuring 
> the connection to the ES node or cluster.
>  * Source partitioning logic is not carried forward when parallelizing 
> records; the records are randomized by Spark's default partitioning.
>  * Even though this implementation is a first-class member of Apache Griffin, 
> it is based on a _custom_ connector trait.
> The proposed implementation aims to:
>  * Deprecate the current implementation in favor of the official 
> [elasticsearch-hadoop|https://github.com/elastic/elasticsearch-hadoop/tree/master/spark/sql-20]
>  library.
>  * This library is built on the DataSource API available in Spark 2.2.x+ and 
> thus brings support for filter pushdown, column pruning, unified reads and 
> writes, and additional optimizations.
>  * Many configuration options are available for ES connectivity; [check 
> here|https://github.com/elastic/elasticsearch-hadoop/blob/master/mr/src/main/java/org/elasticsearch/hadoop/cfg/ConfigurationOptions.java].
>  * Any filters can be applied as expressions directly on the data frame and 
> are pushed down automatically to the source.
> The new implementation will look something like:
> {code:java}
> sparkSession.read.format("es").options( ??? ).load(""){code}





[jira] [Work logged] (GRIFFIN-326) New implementation for Elasticsearch Data Connector (Batch)

2020-04-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/GRIFFIN-326?focusedWorklogId=420849&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-420849
 ]

ASF GitHub Bot logged work on GRIFFIN-326:
--

Author: ASF GitHub Bot
Created on: 12/Apr/20 00:58
Start Date: 12/Apr/20 00:58
Worklog Time Spent: 10m 
  Work Description: guoyuepeng commented on issue #569: [GRIFFIN-326] New 
Data Connector for Elasticsearch
URL: https://github.com/apache/griffin/pull/569#issuecomment-612545643
 
 
   Not sure what happened on CI, let me check.
 



Issue Time Tracking
---

Worklog Id: (was: 420849)
Time Spent: 0.5h  (was: 20m)

> New implementation for Elasticsearch Data Connector (Batch)
> ---
>
> Key: GRIFFIN-326
> URL: https://issues.apache.org/jira/browse/GRIFFIN-326
> Project: Griffin
>  Issue Type: Sub-task
>Reporter: Chitral Verma
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The current implementation of the Elasticsearch data connector relies on 
> sending POST requests from the driver, using either SQL or search mode for 
> query filtering.
> This implementation has the following potential issues:
>  * Data is fetched for indexes (database scopes in ES) in bulk via a single 
> call on the driver. If the index holds a lot of data, the large response 
> payload creates a bottleneck on the driver.
>  * Further, the driver then needs to parse this response payload and 
> parallelize it; this is again a driver-side bottleneck, as each JSON record 
> needs to be mapped to a set schema in a type-safe manner.
>  * _host_, _port_ and _version_ are the only available options for configuring 
> the connection to the ES node or cluster.
>  * Source partitioning logic is not carried forward when parallelizing 
> records; the records are randomized by Spark's default partitioning.
>  * Even though this implementation is a first-class member of Apache Griffin, 
> it is based on a _custom_ connector trait.
> The proposed implementation aims to:
>  * Deprecate the current implementation in favor of the official 
> [elasticsearch-hadoop|https://github.com/elastic/elasticsearch-hadoop/tree/master/spark/sql-20]
>  library.
>  * This library is built on the DataSource API available in Spark 2.2.x+ and 
> thus brings support for filter pushdown, column pruning, unified reads and 
> writes, and additional optimizations.
>  * Many configuration options are available for ES connectivity; [check 
> here|https://github.com/elastic/elasticsearch-hadoop/blob/master/mr/src/main/java/org/elasticsearch/hadoop/cfg/ConfigurationOptions.java].
>  * Any filters can be applied as expressions directly on the data frame and 
> are pushed down automatically to the source.
> The new implementation will look something like:
> {code:java}
> sparkSession.read.format("es").options( ??? ).load(""){code}





[jira] [Work logged] (GRIFFIN-326) New implementation for Elasticsearch Data Connector (Batch)

2020-04-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/GRIFFIN-326?focusedWorklogId=417480&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417480
 ]

ASF GitHub Bot logged work on GRIFFIN-326:
--

Author: ASF GitHub Bot
Created on: 07/Apr/20 06:45
Start Date: 07/Apr/20 06:45
Worklog Time Spent: 10m 
  Work Description: chitralverma commented on issue #569: [GRIFFIN-326] New 
Data Connector for Elasticsearch
URL: https://github.com/apache/griffin/pull/569#issuecomment-610206820
 
 
   @wankunde @guoyuepeng Can you please review this? Thanks.
   
   Also, I think the build is stuck.
 



Issue Time Tracking
---

Worklog Id: (was: 417480)
Time Spent: 20m  (was: 10m)

> New implementation for Elasticsearch Data Connector (Batch)
> ---
>
> Key: GRIFFIN-326
> URL: https://issues.apache.org/jira/browse/GRIFFIN-326
> Project: Griffin
>  Issue Type: Improvement
>Reporter: Chitral Verma
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The current implementation of the Elasticsearch data connector relies on 
> sending POST requests from the driver, using either SQL or search mode for 
> query filtering.
> This implementation has the following potential issues:
>  * Data is fetched for indexes (database scopes in ES) in bulk via a single 
> call on the driver. If the index holds a lot of data, the large response 
> payload creates a bottleneck on the driver.
>  * Further, the driver then needs to parse this response payload and 
> parallelize it; this is again a driver-side bottleneck, as each JSON record 
> needs to be mapped to a set schema in a type-safe manner.
>  * _host_, _port_ and _version_ are the only available options for configuring 
> the connection to the ES node or cluster.
>  * Source partitioning logic is not carried forward when parallelizing 
> records; the records are randomized by Spark's default partitioning.
>  * Even though this implementation is a first-class member of Apache Griffin, 
> it is based on a _custom_ connector trait.
> The proposed implementation aims to:
>  * Deprecate the current implementation in favor of the official 
> [elasticsearch-hadoop|https://github.com/elastic/elasticsearch-hadoop/tree/master/spark/sql-20]
>  library.
>  * This library is built on the DataSource API available in Spark 2.2.x+ and 
> thus brings support for filter pushdown, column pruning, unified reads and 
> writes, and additional optimizations.
>  * Many configuration options are available for ES connectivity; [check 
> here|https://github.com/elastic/elasticsearch-hadoop/blob/master/mr/src/main/java/org/elasticsearch/hadoop/cfg/ConfigurationOptions.java].
>  * Any filters can be applied as expressions directly on the data frame and 
> are pushed down automatically to the source.
> The new implementation will look something like:
> {code:java}
> sparkSession.read.format("es").options( ??? ).load(""){code}





[jira] [Work logged] (GRIFFIN-326) New implementation for Elasticsearch Data Connector (Batch)

2020-04-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/GRIFFIN-326?focusedWorklogId=416685&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-416685
 ]

ASF GitHub Bot logged work on GRIFFIN-326:
--

Author: ASF GitHub Bot
Created on: 06/Apr/20 15:30
Start Date: 06/Apr/20 15:30
Worklog Time Spent: 10m 
  Work Description: chitralverma commented on pull request #569: 
[GRIFFIN-326] New Data Connector for Elasticsearch
URL: https://github.com/apache/griffin/pull/569
 
 
   **What changes were proposed in this pull request?**
   
   This ticket proposes the following changes:
   - Deprecate the current implementation in favor of a direct implementation 
on top of the official 
[elasticsearch-hadoop](https://github.com/elastic/elasticsearch-hadoop/tree/master/spark/sql-20)
 library.
   - This library is built on the DataSource API available in Spark 2.2.x+ and 
thus brings support for filter pushdown, column pruning, unified reads and 
writes, and additional optimizations.
   - Many configuration options are available for ES connectivity; [check 
here](https://github.com/elastic/elasticsearch-hadoop/blob/master/mr/src/main/java/org/elasticsearch/hadoop/cfg/ConfigurationOptions.java).
   - Any filters can be applied as expressions directly on the data frame and 
are pushed down automatically to the source.
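   
   A minimal usage sketch of the new connector (hedged: the node address, port, 
query and index name below are placeholder assumptions, not Griffin's final 
config structure; the `es.*` keys are standard elasticsearch-hadoop options 
from `ConfigurationOptions`):
   
   ```scala
   // Options passed to the DataFrameReader; keys come from
   // org.elasticsearch.hadoop.cfg.ConfigurationOptions.
   val esOptions: Map[String, String] = Map(
     "es.nodes" -> "localhost", // comma-separated list of ES nodes
     "es.port"  -> "9200",      // HTTP port of the nodes
     "es.query" -> """{"query":{"match_all":{}}}""" // optional query DSL filter
   )
   
   // "my_index" is a placeholder index name.
   val df = sparkSession.read
     .format("es")
     .options(esOptions)
     .load("my_index")
   ```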
   
   **Does this PR introduce any user-facing change?**
   Yes. As mentioned above, the old connector has been deprecated, and the 
config structure for the Elasticsearch data connector has changed.
   
   **How was this patch tested?**
   Griffin test suite and additional unit test cases.
 



Issue Time Tracking
---

Worklog Id: (was: 416685)
Remaining Estimate: 0h
Time Spent: 10m

> New implementation for Elasticsearch Data Connector (Batch)
> ---
>
> Key: GRIFFIN-326
> URL: https://issues.apache.org/jira/browse/GRIFFIN-326
> Project: Griffin
>  Issue Type: Improvement
>Reporter: Chitral Verma
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The current implementation of Elasticsearch relies on sending post requests 
> from the driver using either SQL or search mode for query filtering.
> This implementation has the following potential issues,
>  * Data is fetched for indexes (database scopes in ES) in bulk via 1 call on 
> the driver. If the index has a lot of data, due to the big response payload, 
> a bottleneck would be created on the driver.
>  * Further, the driver then needs to parse this response payload and then 
> parallelize it, this is again a driver side bottleneck as each JSON record 
> needs to be mapped to a set schema in a type-safe manner.
>  * Only _host_, _port_ and _version_ are the available options to configure 
> the connection to the ES node or cluster.
>  * Source partitioning logic is not carried forward when parallelizing 
> records, the records will be randomized due to the Spark's default 
> partitioning
>  * Even though this implementation is a first-class member of Apache Griffin, 
> yet it's based on the _custom_ connector trait.
> The proposed implementation aims to,
>  * Deprecate the current implementation in favor of the direct official 
> [elasticsearch-hadoop|[https://github.com/elastic/elasticsearch-hadoop/tree/master/spark/sql-20]]
>  library.
>  * This library is built on DataSource API built on spark 2.2.x+ and thus 
> brings support for filter pushdowns, column pruning, unified read and write 
> and additional optimizations.
>  * Many configuration options are available for ES connectivity, [check 
> here|[https://github.com/elastic/elasticsearch-hadoop/blob/master/mr/src/main/java/org/elasticsearch/hadoop/cfg/ConfigurationOptions.java]]
>  * Any filters can be applied as expressions directly on the data frame and 
> are pushed automatically to the source.
> The new implementation will look something like,
> {code:java}
> sparkSession.read.format("es").options( ??? ).load(""){code}


