[jira] [Work logged] (GRIFFIN-326) New implementation for Elasticsearch Data Connector (Batch)

2020-06-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/GRIFFIN-326?focusedWorklogId=441228=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-441228
 ]

ASF GitHub Bot logged work on GRIFFIN-326:
--

Author: ASF GitHub Bot
Created on: 04/Jun/20 11:09
Start Date: 04/Jun/20 11:09
Worklog Time Spent: 10m 
  Work Description: chitralverma commented on pull request #569:
URL: https://github.com/apache/griffin/pull/569#issuecomment-638782290


   @guoyuepeng It seems a build has been running for this for a month now. Can 
you check this?
   
   
https://travis-ci.org/github/apache/griffin/builds/694599100?utm_source=github_status_medium=notification



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 441228)
Time Spent: 4h 20m  (was: 4h 10m)

> New implementation for Elasticsearch Data Connector (Batch)
> ---
>
> Key: GRIFFIN-326
> URL: https://issues.apache.org/jira/browse/GRIFFIN-326
> Project: Griffin
> Issue Type: Sub-task
> Reporter: Chitral Verma
> Priority: Major
> Time Spent: 4h 20m
> Remaining Estimate: 0h
>
> The current Elasticsearch connector implementation relies on sending POST 
> requests from the driver, using either SQL or search mode for query filtering.
> This implementation has the following potential issues:
>  * Data for an index (the database scope in ES) is fetched in bulk via a 
> single call on the driver. If the index holds a lot of data, the large 
> response payload creates a bottleneck on the driver.
>  * The driver then has to parse this response payload and parallelize it. 
> This is another driver-side bottleneck, because each JSON record needs to be 
> mapped to a set schema in a type-safe manner.
>  * Only _host_, _port_ and _version_ are available as options to configure 
> the connection to the ES node or cluster.
>  * Source partitioning logic is not carried forward when parallelizing 
> records; the records will be randomized by Spark's default partitioning.
>  * Even though this implementation is a first-class member of Apache Griffin, 
> it is based on the _custom_ connector trait.
> The proposed implementation aims to:
>  * Deprecate the current implementation in favor of the official 
> [elasticsearch-hadoop|https://github.com/elastic/elasticsearch-hadoop/tree/master/spark/sql-20]
>  library.
>  * This library is built on the DataSource API for Spark 2.2.x+ and thus 
> brings support for filter pushdown, column pruning, unified read and write, 
> and additional optimizations.
>  * Many configuration options are available for ES connectivity, [check 
> here|https://github.com/elastic/elasticsearch-hadoop/blob/master/mr/src/main/java/org/elasticsearch/hadoop/cfg/ConfigurationOptions.java].
>  * Any filters can be applied as expressions directly on the data frame and 
> are pushed automatically to the source.
> The new implementation will look something like:
> {code:java}
> sparkSession.read.format("es").options( ??? ).load(""){code}
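
As a rough illustration of what could go into the options( ??? ) call above, the sketch below assembles a plain options map in Java. The class and method names are hypothetical and not taken from the pull request. The keys es.nodes, es.port and es.nodes.wan.only are elasticsearch-hadoop settings; which of them Griffin would actually expose is an assumption here. The Spark read itself is shown only in a comment, since it needs Spark and elasticsearch-hadoop on the classpath.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class EsOptionsSketch {

    // Hypothetical helper: maps the host and port settings mentioned in the issue
    // onto elasticsearch-hadoop's "es.nodes" and "es.port" keys.
    public static Map<String, String> esOptions(String host, int port, Map<String, String> extra) {
        Map<String, String> options = new LinkedHashMap<>();
        options.put("es.nodes", host);
        options.put("es.port", Integer.toString(port));
        // Any further connector settings are passed through unchanged,
        // and may override the two keys set above.
        options.putAll(extra);
        return options;
    }

    public static void main(String[] args) {
        Map<String, String> extra = new LinkedHashMap<>();
        extra.put("es.nodes.wan.only", "true");
        Map<String, String> options = esOptions("localhost", 9200, extra);
        System.out.println(options);
        // With Spark and elasticsearch-hadoop available, the map would be used roughly as:
        //   sparkSession.read().format("es").options(options).load("<index>");
        // Filters applied to the resulting data frame are then candidates for pushdown.
    }
}
```

Passing extra settings through unchanged keeps the helper independent of the full list in ConfigurationOptions.java, so new connector options would not require code changes.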



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [griffin] chitralverma commented on pull request #569: [GRIFFIN-326] New Data Connector for Elasticsearch

2020-06-04 Thread GitBox


chitralverma commented on pull request #569:
URL: https://github.com/apache/griffin/pull/569#issuecomment-638782290


   @guoyuepeng It seems a build has been running for this for a month now. Can 
you check this?
   
   
https://travis-ci.org/github/apache/griffin/builds/694599100?utm_source=github_status_medium=notification







[GitHub] [griffin] guoyuepeng commented on a change in pull request #573: Elastic Search index for each application instance

2020-06-04 Thread GitBox


guoyuepeng commented on a change in pull request #573:
URL: https://github.com/apache/griffin/pull/573#discussion_r435173641



##
File path: service/src/main/resources/application.properties
##
@@ -62,6 +62,7 @@ fs.defaultFS=
 elasticsearch.host=localhost
 elasticsearch.port=9200
 elasticsearch.scheme=http
+index=griffin

Review comment:
   elasticsearch.index makes sense to me.
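
A minimal sketch of how a namespaced key such as elasticsearch.index might be read alongside the other elasticsearch.* settings, using java.util.Properties. The class name, the sample index value and the fallback to "griffin" are assumptions for illustration, not code from the pull request.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

public class EsIndexPropertySketch {

    // Reads the index name from the namespaced key.
    // The "griffin" fallback mirrors the value in the diff and is an assumption.
    public static String indexName(Properties props) {
        return props.getProperty("elasticsearch.index", "griffin");
    }

    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        // Inline stand-in for application.properties; the index value is hypothetical.
        props.load(new StringReader(
                "elasticsearch.host=localhost\n"
              + "elasticsearch.port=9200\n"
              + "elasticsearch.scheme=http\n"
              + "elasticsearch.index=griffin_instance_a\n"));
        System.out.println(indexName(props));
    }
}
```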









[jira] [Work logged] (GRIFFIN-326) New implementation for Elasticsearch Data Connector (Batch)

2020-06-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/GRIFFIN-326?focusedWorklogId=441227=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-441227
 ]

ASF GitHub Bot logged work on GRIFFIN-326:
--

Author: ASF GitHub Bot
Created on: 04/Jun/20 11:01
Start Date: 04/Jun/20 11:01
Worklog Time Spent: 10m 
  Work Description: asfgit closed pull request #569:
URL: https://github.com/apache/griffin/pull/569


   





Issue Time Tracking
---

Worklog Id: (was: 441227)
Time Spent: 4h 10m  (was: 4h)






[GitHub] [griffin] asfgit closed pull request #569: [GRIFFIN-326] New Data Connector for Elasticsearch

2020-06-04 Thread GitBox


asfgit closed pull request #569:
URL: https://github.com/apache/griffin/pull/569


   







[jira] [Work logged] (GRIFFIN-326) New implementation for Elasticsearch Data Connector (Batch)

2020-06-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/GRIFFIN-326?focusedWorklogId=441224=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-441224
 ]

ASF GitHub Bot logged work on GRIFFIN-326:
--

Author: ASF GitHub Bot
Created on: 04/Jun/20 10:56
Start Date: 04/Jun/20 10:56
Worklog Time Spent: 10m 
  Work Description: guoyuepeng edited a comment on pull request #569:
URL: https://github.com/apache/griffin/pull/569#issuecomment-638776604


   LGTM, will merge it.





Issue Time Tracking
---

Worklog Id: (was: 441224)
Time Spent: 4h  (was: 3h 50m)






[GitHub] [griffin] guoyuepeng edited a comment on pull request #569: [GRIFFIN-326] New Data Connector for Elasticsearch

2020-06-04 Thread GitBox


guoyuepeng edited a comment on pull request #569:
URL: https://github.com/apache/griffin/pull/569#issuecomment-638776604


   LGTM, will merge it.







[jira] [Work logged] (GRIFFIN-326) New implementation for Elasticsearch Data Connector (Batch)

2020-06-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/GRIFFIN-326?focusedWorklogId=441223=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-441223
 ]

ASF GitHub Bot logged work on GRIFFIN-326:
--

Author: ASF GitHub Bot
Created on: 04/Jun/20 10:56
Start Date: 04/Jun/20 10:56
Worklog Time Spent: 10m 
  Work Description: guoyuepeng commented on pull request #569:
URL: https://github.com/apache/griffin/pull/569#issuecomment-638776604


   reviewed, will merge it.





Issue Time Tracking
---

Worklog Id: (was: 441223)
Time Spent: 3h 50m  (was: 3h 40m)






[GitHub] [griffin] guoyuepeng commented on pull request #569: [GRIFFIN-326] New Data Connector for Elasticsearch

2020-06-04 Thread GitBox


guoyuepeng commented on pull request #569:
URL: https://github.com/apache/griffin/pull/569#issuecomment-638776604


   reviewed, will merge it.


