[jira] [Work logged] (GRIFFIN-326) New implementation for Elasticsearch Data Connector (Batch)
[ https://issues.apache.org/jira/browse/GRIFFIN-326?focusedWorklogId=441228&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-441228 ]

ASF GitHub Bot logged work on GRIFFIN-326:
------------------------------------------

Author: ASF GitHub Bot
Created on: 04/Jun/20 11:09
Start Date: 04/Jun/20 11:09
Worklog Time Spent: 10m

Work Description: chitralverma commented on pull request #569:
URL: https://github.com/apache/griffin/pull/569#issuecomment-638782290

@guoyuepeng Seems like a build has been running for this for a month now. Can you check this?
https://travis-ci.org/github/apache/griffin/builds/694599100?utm_source=github_status&utm_medium=notification

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
-------------------

Worklog Id: (was: 441228)
Time Spent: 4h 20m (was: 4h 10m)

> New implementation for Elasticsearch Data Connector (Batch)
> -----------------------------------------------------------
>
> Key: GRIFFIN-326
> URL: https://issues.apache.org/jira/browse/GRIFFIN-326
> Project: Griffin
> Issue Type: Sub-task
> Reporter: Chitral Verma
> Priority: Major
> Time Spent: 4h 20m
> Remaining Estimate: 0h
>
> The current Elasticsearch implementation relies on sending POST requests
> from the driver, using either SQL or search mode for query filtering.
> This implementation has the following potential issues:
> * Data is fetched for indexes (database scopes in ES) in bulk via a single call
> on the driver. If the index holds a lot of data, the large response payload
> creates a bottleneck on the driver.
> * The driver then needs to parse this response payload and parallelize it.
> This is another driver-side bottleneck, as each JSON record must be mapped
> to a set schema in a type-safe manner.
> * Only _host_, _port_ and _version_ are available as options to configure
> the connection to the ES node or cluster.
> * Source partitioning logic is not carried forward when parallelizing
> records; the records get shuffled randomly by Spark's default partitioning.
> * Even though this implementation is a first-class member of Apache Griffin,
> it is based on the _custom_ connector trait.
> The proposed implementation aims to:
> * Deprecate the current implementation in favor of the official
> [elasticsearch-hadoop|https://github.com/elastic/elasticsearch-hadoop/tree/master/spark/sql-20]
> library.
> * This library is built on the DataSource API of Spark 2.2.x+ and thus
> brings support for filter pushdown, column pruning, unified read and write,
> and additional optimizations.
> * Many configuration options are available for ES connectivity, [listed
> here|https://github.com/elastic/elasticsearch-hadoop/blob/master/mr/src/main/java/org/elasticsearch/hadoop/cfg/ConfigurationOptions.java].
> * Any filters can be applied as expressions directly on the data frame and
> are pushed down automatically to the source.
> The new implementation will look something like:
> {code:java}
> sparkSession.read.format("es").options( ??? ).load(""){code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
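The read path sketched in the {code} block of the issue description can be fleshed out as follows. This is a minimal illustrative sketch, not Griffin's actual connector code: the option keys ("es.nodes", "es.port", "es.query") are real settings defined in org.elasticsearch.hadoop.cfg.ConfigurationOptions, but the helper class and its defaults are hypothetical.

```java
// Sketch of assembling the elasticsearch-hadoop connection options that the
// proposed connector would pass to DataFrameReader.options(...).
// The keys are real elasticsearch-hadoop settings; the class is illustrative.
import java.util.LinkedHashMap;
import java.util.Map;

public class EsOptionsSketch {

    /** Builds the option map for an ES read; query may be null. */
    static Map<String, String> esOptions(String nodes, int port, String query) {
        Map<String, String> opts = new LinkedHashMap<>();
        opts.put("es.nodes", nodes);               // comma-separated list of ES nodes
        opts.put("es.port", String.valueOf(port)); // HTTP port, typically 9200
        if (query != null) {
            opts.put("es.query", query);           // raw query forwarded to ES
        }
        return opts;
    }

    public static void main(String[] args) {
        Map<String, String> opts = esOptions("localhost", 9200, "{\"match_all\": {}}");
        System.out.println(opts.get("es.nodes"));
        // With spark-sql and the elasticsearch-spark module on the classpath,
        // the read from the description would then be:
        //   sparkSession.read().format("es").options(opts).load("my-index");
        // Filters applied to the resulting DataFrame are pushed to the source.
    }
}
```

Keeping the option map separate from the read call makes it easy to validate connection settings before any Spark job is launched.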
[jira] [Work logged] (GRIFFIN-326) New implementation for Elasticsearch Data Connector (Batch)
[ https://issues.apache.org/jira/browse/GRIFFIN-326?focusedWorklogId=441227&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-441227 ]

ASF GitHub Bot logged work on GRIFFIN-326:
------------------------------------------

Author: ASF GitHub Bot
Created on: 04/Jun/20 11:01
Start Date: 04/Jun/20 11:01
Worklog Time Spent: 10m

Work Description: asfgit closed pull request #569:
URL: https://github.com/apache/griffin/pull/569

Issue Time Tracking
-------------------

Worklog Id: (was: 441227)
Time Spent: 4h 10m (was: 4h)
[jira] [Work logged] (GRIFFIN-326) New implementation for Elasticsearch Data Connector (Batch)
[ https://issues.apache.org/jira/browse/GRIFFIN-326?focusedWorklogId=441224&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-441224 ]

ASF GitHub Bot logged work on GRIFFIN-326:
------------------------------------------

Author: ASF GitHub Bot
Created on: 04/Jun/20 10:56
Start Date: 04/Jun/20 10:56
Worklog Time Spent: 10m

Work Description: guoyuepeng edited a comment on pull request #569:
URL: https://github.com/apache/griffin/pull/569#issuecomment-638776604

LGTM, will merge it.

Issue Time Tracking
-------------------

Worklog Id: (was: 441224)
Time Spent: 4h (was: 3h 50m)
[jira] [Work logged] (GRIFFIN-326) New implementation for Elasticsearch Data Connector (Batch)
[ https://issues.apache.org/jira/browse/GRIFFIN-326?focusedWorklogId=441223&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-441223 ]

ASF GitHub Bot logged work on GRIFFIN-326:
------------------------------------------

Author: ASF GitHub Bot
Created on: 04/Jun/20 10:56
Start Date: 04/Jun/20 10:56
Worklog Time Spent: 10m

Work Description: guoyuepeng commented on pull request #569:
URL: https://github.com/apache/griffin/pull/569#issuecomment-638776604

Reviewed, will merge it.

Issue Time Tracking
-------------------

Worklog Id: (was: 441223)
Time Spent: 3h 50m (was: 3h 40m)
[jira] [Work logged] (GRIFFIN-326) New implementation for Elasticsearch Data Connector (Batch)
[ https://issues.apache.org/jira/browse/GRIFFIN-326?focusedWorklogId=439685&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-439685 ]

ASF GitHub Bot logged work on GRIFFIN-326:
------------------------------------------

Author: ASF GitHub Bot
Created on: 01/Jun/20 19:12
Start Date: 01/Jun/20 19:12
Worklog Time Spent: 10m

Work Description: chitralverma edited a comment on pull request #569:
URL: https://github.com/apache/griffin/pull/569#issuecomment-637049574

@guoyuepeng @wankunde I've made the necessary changes in this PR to fix the failures, and the build now succeeds. Can you please review and merge it?

Key changes since the failures:
- As per @icesmartjuan's suggestion, updated the version of `typescript`.
- As per @wankunde's suggestion, refactored `FileBasedDataConnectorTest`.
- Refactored `ElasticSearchDataConnectorTest`, as it was too resource-intensive for the builder.
- Minor changes to other connectors, as they were suppressing read exceptions.

Issue Time Tracking
-------------------

Worklog Id: (was: 439685)
Time Spent: 3.5h (was: 3h 20m)
[jira] [Work logged] (GRIFFIN-326) New implementation for Elasticsearch Data Connector (Batch)
[ https://issues.apache.org/jira/browse/GRIFFIN-326?focusedWorklogId=439683&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-439683 ]

ASF GitHub Bot logged work on GRIFFIN-326:
------------------------------------------

Author: ASF GitHub Bot
Created on: 01/Jun/20 19:11
Start Date: 01/Jun/20 19:11
Worklog Time Spent: 10m

Work Description: chitralverma commented on pull request #569:
URL: https://github.com/apache/griffin/pull/569#issuecomment-637049574

@guoyuepeng @wankunde I've made the necessary changes in this PR to fix the failures. Can you please review and merge it now?

Key changes since the failures:
- As per @icesmartjuan's suggestion, updated the version of `typescript`.
- As per @wankunde's suggestion, refactored `FileBasedDataConnector`.
- Refactored `ElasticSearchDataConnectorTest`, as it was too resource-intensive for the builder.
- Minor changes to other connectors, as they were suppressing read exceptions.

Issue Time Tracking
-------------------

Worklog Id: (was: 439683)
Time Spent: 3h 10m (was: 3h)
[jira] [Work logged] (GRIFFIN-326) New implementation for Elasticsearch Data Connector (Batch)
[ https://issues.apache.org/jira/browse/GRIFFIN-326?focusedWorklogId=439684&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-439684 ]

ASF GitHub Bot logged work on GRIFFIN-326:
------------------------------------------

Author: ASF GitHub Bot
Created on: 01/Jun/20 19:11
Start Date: 01/Jun/20 19:11
Worklog Time Spent: 10m

Work Description: chitralverma edited a comment on pull request #569:
URL: https://github.com/apache/griffin/pull/569#issuecomment-637049574

@guoyuepeng @wankunde I've made the necessary changes in this PR to fix the failures, and the build now succeeds. Can you please review and merge it?

Key changes since the failures:
- As per @icesmartjuan's suggestion, updated the version of `typescript`.
- As per @wankunde's suggestion, refactored `FileBasedDataConnector`.
- Refactored `ElasticSearchDataConnectorTest`, as it was too resource-intensive for the builder.
- Minor changes to other connectors, as they were suppressing read exceptions.

Issue Time Tracking
-------------------

Worklog Id: (was: 439684)
Time Spent: 3h 20m (was: 3h 10m)
[jira] [Work logged] (GRIFFIN-326) New implementation for Elasticsearch Data Connector (Batch)
[ https://issues.apache.org/jira/browse/GRIFFIN-326?focusedWorklogId=433610&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-433610 ]

ASF GitHub Bot logged work on GRIFFIN-326:
------------------------------------------

Author: ASF GitHub Bot
Created on: 15/May/20 09:13
Start Date: 15/May/20 09:13
Worklog Time Spent: 10m

Work Description: chitralverma commented on pull request #569:
URL: https://github.com/apache/griffin/pull/569#issuecomment-629126531

@icesmartjuan thanks, I'll try that and commit.

Issue Time Tracking
-------------------

Worklog Id: (was: 433610)
Time Spent: 3h (was: 2h 50m)
[jira] [Work logged] (GRIFFIN-326) New implementation for Elasticsearch Data Connector (Batch)
[ https://issues.apache.org/jira/browse/GRIFFIN-326?focusedWorklogId=433609&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-433609 ]

ASF GitHub Bot logged work on GRIFFIN-326:
------------------------------------------

Author: ASF GitHub Bot
Created on: 15/May/20 09:12
Start Date: 15/May/20 09:12
Worklog Time Spent: 10m

Work Description: icesmartjuan commented on pull request #569:
URL: https://github.com/apache/griffin/pull/569#issuecomment-629126063

Hi @chitralverma, the type seems to have been added in TypeScript 2.7; can you please give it a try? (It works fine for me now.)
- Change the typescript version from `"typescript": "~2.3.3",` to `"typescript": "~2.7",` in `package.json`.
- Remove `node_modules` and `package-lock.json`.
- Re-run `npm install`.
- Then `npm start`; it should run successfully.

Issue Time Tracking
-------------------

Worklog Id: (was: 433609)
Time Spent: 2h 50m (was: 2h 40m)
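The upgrade steps suggested in the comment above amount to the following shell sketch. A stand-in `package.json` line is created here purely so the edit can be demonstrated; in the real repository you would edit Griffin's existing `package.json`, whose location the comment does not specify.

```shell
# Stand-in for the relevant line of package.json (illustrative only).
printf '"typescript": "~2.3.3",\n' > package.json

# Step 1: point the typescript version range at ~2.7, as suggested.
sed -i 's/"typescript": "~2.3.3",/"typescript": "~2.7",/' package.json
cat package.json

# Steps 2-4, per the comment (not run here, as they need the real repo):
#   rm -rf node_modules package-lock.json
#   npm install
#   npm start
```

Removing `node_modules` and the lockfile before reinstalling forces npm to resolve the new version range from scratch instead of reusing the pinned 2.3.3 tree.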
[jira] [Work logged] (GRIFFIN-326) New implementation for Elasticsearch Data Connector (Batch)
[ https://issues.apache.org/jira/browse/GRIFFIN-326?focusedWorklogId=433016&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-433016 ]

ASF GitHub Bot logged work on GRIFFIN-326:
------------------------------------------

Author: ASF GitHub Bot
Created on: 14/May/20 07:03
Start Date: 14/May/20 07:03
Worklog Time Spent: 10m

Work Description: guoyuepeng commented on pull request #569:
URL: https://github.com/apache/griffin/pull/569#issuecomment-628432592

re-trigger build

Issue Time Tracking
-------------------

Worklog Id: (was: 433016)
Time Spent: 2h 40m (was: 2.5h)
[jira] [Work logged] (GRIFFIN-326) New implementation for Elasticsearch Data Connector (Batch)
[ https://issues.apache.org/jira/browse/GRIFFIN-326?focusedWorklogId=433010&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-433010 ]

ASF GitHub Bot logged work on GRIFFIN-326:
--
Author: ASF GitHub Bot
Created on: 14/May/20 06:40
Start Date: 14/May/20 06:40
Worklog Time Spent: 10m
Work Description: guoyuepeng commented on pull request #569:
URL: https://github.com/apache/griffin/pull/569#issuecomment-628422615

retest this, please

Issue Time Tracking
---
Worklog Id: (was: 433010)
Time Spent: 2.5h (was: 2h 20m)
[jira] [Work logged] (GRIFFIN-326) New implementation for Elasticsearch Data Connector (Batch)
[ https://issues.apache.org/jira/browse/GRIFFIN-326?focusedWorklogId=432989&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-432989 ]

ASF GitHub Bot logged work on GRIFFIN-326:
--
Author: ASF GitHub Bot
Created on: 14/May/20 05:45
Start Date: 14/May/20 05:45
Worklog Time Spent: 10m
Work Description: guoyuepeng commented on pull request #569:
URL: https://github.com/apache/griffin/pull/569#issuecomment-628400718

> Anyway to retrigger a build without a blank commit?

let me figure out how to trigger the build.

Issue Time Tracking
---
Worklog Id: (was: 432989)
Time Spent: 2h 20m (was: 2h 10m)
[jira] [Work logged] (GRIFFIN-326) New implementation for Elasticsearch Data Connector (Batch)
[ https://issues.apache.org/jira/browse/GRIFFIN-326?focusedWorklogId=431991&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-431991 ]

ASF GitHub Bot logged work on GRIFFIN-326:
--
Author: ASF GitHub Bot
Created on: 08/May/20 05:55
Start Date: 08/May/20 05:55
Worklog Time Spent: 10m
Work Description: wankunde commented on pull request #569:
URL: https://github.com/apache/griffin/pull/569#issuecomment-625645288

@chitralverma https://github.com/DefinitelyTyped/DefinitelyTyped/issues/43977 There seems to be a mismatch with npm packages. @guoyuepeng Can you help with this?

Issue Time Tracking
---
Worklog Id: (was: 431991)
Time Spent: 2h 10m (was: 2h)
[jira] [Work logged] (GRIFFIN-326) New implementation for Elasticsearch Data Connector (Batch)
[ https://issues.apache.org/jira/browse/GRIFFIN-326?focusedWorklogId=429861&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-429861 ]

ASF GitHub Bot logged work on GRIFFIN-326:
--
Author: ASF GitHub Bot
Created on: 02/May/20 18:04
Start Date: 02/May/20 18:04
Worklog Time Spent: 10m
Work Description: chitralverma commented on pull request #569:
URL: https://github.com/apache/griffin/pull/569#issuecomment-622992211

@wankunde the build has failed again, due to some error on the UI side. Not sure of the fix here.

Issue Time Tracking
---
Worklog Id: (was: 429861)
Time Spent: 2h (was: 1h 50m)
[jira] [Work logged] (GRIFFIN-326) New implementation for Elasticsearch Data Connector (Batch)
[ https://issues.apache.org/jira/browse/GRIFFIN-326?focusedWorklogId=421137&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-421137 ]

ASF GitHub Bot logged work on GRIFFIN-326:
--
Author: ASF GitHub Bot
Created on: 13/Apr/20 07:39
Start Date: 13/Apr/20 07:39
Worklog Time Spent: 10m
Work Description: DragonPrince1992 commented on issue #569: [GRIFFIN-326] New Data Connector for Elasticsearch
URL: https://github.com/apache/griffin/pull/569#issuecomment-612790960

Can I add you on WeChat? It's easier to communicate.

Issue Time Tracking
---
Time Spent: 1h 50m
[jira] [Work logged] (GRIFFIN-326) New implementation for Elasticsearch Data Connector (Batch)
[ https://issues.apache.org/jira/browse/GRIFFIN-326?focusedWorklogId=421124&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-421124 ]

ASF GitHub Bot logged work on GRIFFIN-326:
--
Author: ASF GitHub Bot
Created on: 13/Apr/20 07:18
Start Date: 13/Apr/20 07:18
Worklog Time Spent: 10m
Work Description: DragonPrince1992 commented on issue #569: [GRIFFIN-326] New Data Connector for Elasticsearch
URL: https://github.com/apache/griffin/pull/569#issuecomment-612784719

I can't get it. Can I fix this via configuration, under griffin-master\measure\src\main\resources\env-batch.json?

Issue Time Tracking
---
Time Spent: 1h 40m
[jira] [Work logged] (GRIFFIN-326) New implementation for Elasticsearch Data Connector (Batch)
[ https://issues.apache.org/jira/browse/GRIFFIN-326?focusedWorklogId=421119&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-421119 ]

ASF GitHub Bot logged work on GRIFFIN-326:
--
Author: ASF GitHub Bot
Created on: 13/Apr/20 07:09
Start Date: 13/Apr/20 07:09
Worklog Time Spent: 10m
Work Description: chitralverma commented on issue #569: [GRIFFIN-326] New Data Connector for Elasticsearch
URL: https://github.com/apache/griffin/pull/569#issuecomment-612782390

also, you have to use `es.nodes` not `es.hostname`

This PR has not been closed yet

Issue Time Tracking
---
Worklog Id: (was: 421119)
Time Spent: 1.5h (was: 1h 20m)
[jira] [Work logged] (GRIFFIN-326) New implementation for Elasticsearch Data Connector (Batch)
[ https://issues.apache.org/jira/browse/GRIFFIN-326?focusedWorklogId=421117&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-421117 ]

ASF GitHub Bot logged work on GRIFFIN-326:
--
Author: ASF GitHub Bot
Created on: 13/Apr/20 07:07
Start Date: 13/Apr/20 07:07
Worklog Time Spent: 10m
Work Description: DragonPrince1992 commented on issue #569: [GRIFFIN-326] New Data Connector for Elasticsearch
URL: https://github.com/apache/griffin/pull/569#issuecomment-612781759

I could not see healthy graphics; how do I solve it?

Issue Time Tracking
---
Time Spent: 1h 20m
[jira] [Work logged] (GRIFFIN-326) New implementation for Elasticsearch Data Connector (Batch)
[ https://issues.apache.org/jira/browse/GRIFFIN-326?focusedWorklogId=421110&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-421110 ]

ASF GitHub Bot logged work on GRIFFIN-326:
--
Author: ASF GitHub Bot
Created on: 13/Apr/20 06:58
Start Date: 13/Apr/20 06:58
Worklog Time Spent: 10m
Work Description: DragonPrince1992 commented on issue #569: [GRIFFIN-326] New Data Connector for Elasticsearch
URL: https://github.com/apache/griffin/pull/569#issuecomment-612779282

The ES hostname is configured correctly, but after the job completes, querying ES returns no data. What is going on?

Issue Time Tracking
---
Worklog Id: (was: 421110)
Time Spent: 1h (was: 50m)
[jira] [Work logged] (GRIFFIN-326) New implementation for Elasticsearch Data Connector (Batch)
[ https://issues.apache.org/jira/browse/GRIFFIN-326?focusedWorklogId=420898&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-420898 ]

ASF GitHub Bot logged work on GRIFFIN-326:
--
Author: ASF GitHub Bot
Created on: 12/Apr/20 07:06
Start Date: 12/Apr/20 07:06
Worklog Time Spent: 10m
Work Description: wankunde commented on issue #569: [GRIFFIN-326] New Data Connector for Elasticsearch
URL: https://github.com/apache/griffin/pull/569#issuecomment-612574030

@chitralverma @guoyuepeng `FileBasedDataConnectorTest` will generate some test data files in the `file://${getClass.getResource("/").getPath}` directory; I am not sure whether Travis has restrictions on this directory. If this problem occurs again, @chitralverma could put those files directly into the test resources directory.

@chitralverma could you update your PR with a small change and try again?

Issue Time Tracking
---
Worklog Id: (was: 420898)
Time Spent: 50m (was: 40m)
[jira] [Work logged] (GRIFFIN-326) New implementation for Elasticsearch Data Connector (Batch)
[ https://issues.apache.org/jira/browse/GRIFFIN-326?focusedWorklogId=420850&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-420850 ]

ASF GitHub Bot logged work on GRIFFIN-326:
--
Author: ASF GitHub Bot
Created on: 12/Apr/20 01:01
Start Date: 12/Apr/20 01:01
Worklog Time Spent: 10m
Work Description: guoyuepeng commented on issue #569: [GRIFFIN-326] New Data Connector for Elasticsearch
URL: https://github.com/apache/griffin/pull/569#issuecomment-612545643

No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself. Check the details on how to adjust your build configuration on: https://docs.travis-ci.com/user/common-build-problems/#build-times-out-because-no-output-was-received

@wankunde The same issue as before?

Issue Time Tracking
---
Worklog Id: (was: 420850)
Time Spent: 40m (was: 0.5h)
[jira] [Work logged] (GRIFFIN-326) New implementation for Elasticsearch Data Connector (Batch)
[ https://issues.apache.org/jira/browse/GRIFFIN-326?focusedWorklogId=420849&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-420849 ] ASF GitHub Bot logged work on GRIFFIN-326: -- Author: ASF GitHub Bot Created on: 12/Apr/20 00:58 Start Date: 12/Apr/20 00:58 Worklog Time Spent: 10m Work Description: guoyuepeng commented on issue #569: [GRIFFIN-326] New Data Connector for Elasticsearch URL: https://github.com/apache/griffin/pull/569#issuecomment-612545643 Not sure what happened on CI, let me check. Issue Time Tracking --- Worklog Id: (was: 420849) Time Spent: 0.5h (was: 20m)
[jira] [Work logged] (GRIFFIN-326) New implementation for Elasticsearch Data Connector (Batch)
[ https://issues.apache.org/jira/browse/GRIFFIN-326?focusedWorklogId=417480&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417480 ] ASF GitHub Bot logged work on GRIFFIN-326: -- Author: ASF GitHub Bot Created on: 07/Apr/20 06:45 Start Date: 07/Apr/20 06:45 Worklog Time Spent: 10m Work Description: chitralverma commented on issue #569: [GRIFFIN-326] New Data Connector for Elasticsearch URL: https://github.com/apache/griffin/pull/569#issuecomment-610206820 @wankunde @guoyuepeng Can you please review this. Thanks. Also, I think the build is stuck? Issue Time Tracking --- Worklog Id: (was: 417480) Time Spent: 20m (was: 10m)
[jira] [Work logged] (GRIFFIN-326) New implementation for Elasticsearch Data Connector (Batch)
[ https://issues.apache.org/jira/browse/GRIFFIN-326?focusedWorklogId=416685&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-416685 ] ASF GitHub Bot logged work on GRIFFIN-326: -- Author: ASF GitHub Bot Created on: 06/Apr/20 15:30 Start Date: 06/Apr/20 15:30 Worklog Time Spent: 10m Work Description: chitralverma commented on pull request #569: [GRIFFIN-326] New Data Connector for Elasticsearch URL: https://github.com/apache/griffin/pull/569

**What changes were proposed in this pull request?**
This ticket proposes the following changes,
- Deprecate the current implementation in favour of the direct implementation in the official [elasticsearch-hadoop](https://github.com/elastic/elasticsearch-hadoop/tree/master/spark/sql-20) library.
- This library is built on the DataSource API available in Spark 2.2.x+ and thus brings support for filter pushdown, column pruning, unified read and write, and additional optimizations.
- Many configuration options are available for ES connectivity, [check here](https://github.com/elastic/elasticsearch-hadoop/blob/master/mr/src/main/java/org/elasticsearch/hadoop/cfg/ConfigurationOptions.java).
- Any filters can be applied as expressions directly on the data frame and are pushed automatically to the source.

**Does this PR introduce any user-facing change?**
Yes. As mentioned above, the old connector has been deprecated and the config structure for the Elasticsearch data connector has changed.

**How was this patch tested?**
Griffin test suite and additional unit test cases.

Issue Time Tracking --- Worklog Id: (was: 416685) Remaining Estimate: 0h Time Spent: 10m