[ https://issues.apache.org/jira/browse/BAHIR-110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16082935#comment-16082935 ]
ASF GitHub Bot commented on BAHIR-110:
--------------------------------------

Github user emlaver commented on a diff in the pull request:

    https://github.com/apache/bahir/pull/45#discussion_r126801991

    --- Diff: sql-cloudant/README.md ---
    @@ -52,39 +51,71 @@ Here each subsequent configuration overrides the previous one. Thus, configurati

     ### Configuration in application.conf
    -Default values are defined in [here](cloudant-spark-sql/src/main/resources/application.conf).
    +Default values are defined in [here](src/main/resources/application.conf).

     ### Configuration on SparkConf

     Name | Default | Meaning
     --- |:---:| ---
    +cloudant.apiReceiver|"_all_docs"| API endpoint for RelationProvider when loading or saving data from Cloudant to DataFrames or SQL temporary tables. Select between "_all_docs" or "_changes" endpoint.
     cloudant.protocol|https|protocol to use to transfer data: http or https
    -cloudant.host||cloudant host url
    -cloudant.username||cloudant userid
    -cloudant.password||cloudant password
    +cloudant.host| |cloudant host url
    +cloudant.username| |cloudant userid
    +cloudant.password| |cloudant password
     cloudant.useQuery|false|By default, _all_docs endpoint is used if configuration 'view' and 'index' (see below) are not set. When useQuery is enabled, _find endpoint will be used in place of _all_docs when query condition is not on primary key field (_id), so that query predicates may be driven into datastore.
     cloudant.queryLimit|25|The maximum number of results returned when querying the _find endpoint.
     jsonstore.rdd.partitions|10|the number of partitions intent used to drive JsonStoreRDD loading query result in parallel. The actual number is calculated based on total rows returned and satisfying maxInPartition and minInPartition
     jsonstore.rdd.maxInPartition|-1|the max rows in a partition. -1 means unlimited
     jsonstore.rdd.minInPartition|10|the min rows in a partition.
     jsonstore.rdd.requestTimeout|900000| the request timeout in milliseconds
     bulkSize|200| the bulk save size
    -schemaSampleSize| "-1" | the sample size for RDD schema discovery. 1 means we are using only first document for schema discovery; -1 means all documents; 0 will be treated as 1; any number N means min(N, total) docs
    -createDBOnSave|"false"| whether to create a new database during save operation. If false, a database should already exist. If true, a new database will be created. If true, and a database with a provided name already exists, an error will be raised.
    +schemaSampleSize|-1| the sample size for RDD schema discovery. 1 means we are using only first document for schema discovery; -1 means all documents; 0 will be treated as 1; any number N means min(N, total) docs
    +createDBOnSave|false| whether to create a new database during save operation. If false, a database should already exist. If true, a new database will be created. If true, and a database with a provided name already exists, an error will be raised.
    +
    +The `cloudant.apiReceiver` option allows for _changes or _all_docs API endpoint to be called while loading Cloudant data into Spark DataFrames or SQL Tables,
    +or saving data from DataFrames or SQL Tables to a Cloudant database.
    +
    +**Note:** When using `_changes` API, please consider:
    +1. Results are partially ordered and may not be be presented in order in
    +which documents were updated.
    +2. In case of shards' unavailability, you may see duplicate results (changes that have been seen already)
    +3. Can use `selector` option to filter Cloudant docs during load
    +4. Supports a real snapshot of the database and represents it in a single point of time.
    +5. Only supports single threaded
    +
    +
    +When using `_all_docs` API:
    +1. Supports parallel reads (using offset and range)
    +2. Using partitions may not represent the true snapshot of a database. Some docs
    +   may be added or deleted in the database between loading data into different
    +   Spark partitions.
    +
    +Performance of `_changes` API is still better in most cases (even with single threaded support).
    +During several performance tests using 200 MB to 15 GB Cloudant databases, load time from Cloudant to Spark using
    +`_changes` feed was faster to complete every time compared to `_all_docs`.
    --- End diff --

    After our Slack discussion and @yanglei99 comments by e-mail, I've removed this in 5588c87.


> Replace use of _all_docs API with _changes API in all receivers
> ---------------------------------------------------------------
>
> Key: BAHIR-110
> URL: https://issues.apache.org/jira/browse/BAHIR-110
> Project: Bahir
> Issue Type: Improvement
> Reporter: Esteban Laver
> Original Estimate: 216h
> Remaining Estimate: 216h
>
> Today we use the _changes API for Spark streaming receiver and _all_docs API
> for non-streaming receiver. _all_docs API supports parallel reads (using
> offset and range) but performance of _changes API is still better in most
> cases (even with single threaded support).
> With this ticket we want to:
> a) re-implement all receivers using _changes API
> b) compare performance between the two implementations based on _changes and
> _all_docs
> Based on the results in b) we could decide to either
> - replace _all_docs implementation with _changes based implementation OR
> - allow customers to pick one (with a solid documentation about pros and
> cons)


--
This message was sent by Atlassian JIRA (v6.4.14#64029)
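
For readers following the `schemaSampleSize` row in the quoted diff, here is a minimal Python sketch of the rule as the table words it: -1 means all documents, 0 is treated as 1, and any other N means min(N, total). The function name `effective_sample_size` is hypothetical and not part of the connector. The table does not say what happens for negative values other than -1, so rejecting them here is an assumption.

```python
def effective_sample_size(schema_sample_size: int, total_docs: int) -> int:
    """Return how many documents would be sampled for schema discovery.

    Illustrative only: mirrors the wording of the schemaSampleSize table row,
    not the connector's actual implementation.
    """
    if schema_sample_size == -1:
        # -1 means all documents
        return total_docs
    if schema_sample_size < -1:
        # Assumption: the table does not define other negative values.
        raise ValueError("schemaSampleSize must be -1 or a non-negative integer")
    # 0 is treated as 1
    requested = 1 if schema_sample_size == 0 else schema_sample_size
    # any number N means min(N, total) docs
    return min(requested, total_docs)
```

For example, `effective_sample_size(1000, 500)` is capped at the 500 documents that exist, while `effective_sample_size(0, 500)` falls back to the first document only.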