[jira] [Commented] (BAHIR-110) Replace use of _all_docs API with _changes API in all receivers

ASF GitHub Bot (JIRA) Fri, 07 Jul 2017 06:53:30 -0700

    [ 
https://issues.apache.org/jira/browse/BAHIR-110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16078105#comment-16078105
 ]


ASF GitHub Bot commented on BAHIR-110:
--------------------------------------

Github user mayya-sharipova commented on a diff in the pull request:

    https://github.com/apache/bahir/pull/45#discussion_r126151630
  
    --- Diff: sql-cloudant/README.md ---
    @@ -52,39 +51,71 @@ Here each subsequent configuration overrides the 
previous one. Thus, configurati
     
     
     ### Configuration in application.conf
    -Default values are defined in 
[here](cloudant-spark-sql/src/main/resources/application.conf).
    +Default values are defined in [here](src/main/resources/application.conf).
     
     ### Configuration on SparkConf
     
     Name | Default | Meaning
     --- |:---:| ---
    +cloudant.apiReceiver|"_all_docs"| API endpoint for RelationProvider when 
loading or saving data from Cloudant to DataFrames or SQL temporary tables. 
Select between "_all_docs" or "_changes" endpoint.
     cloudant.protocol|https|protocol to use to transfer data: http or https
    -cloudant.host||cloudant host url
    -cloudant.username||cloudant userid
    -cloudant.password||cloudant password
    +cloudant.host| |cloudant host url
    +cloudant.username| |cloudant userid
    +cloudant.password| |cloudant password
     cloudant.useQuery|false|By default, _all_docs endpoint is used if 
configuration 'view' and 'index' (see below) are not set. When useQuery is 
enabled, _find endpoint will be used in place of _all_docs when query condition 
is not on primary key field (_id), so that query predicates may be driven into 
datastore. 
     cloudant.queryLimit|25|The maximum number of results returned when 
querying the _find endpoint.
     jsonstore.rdd.partitions|10|the number of partitions intent used to drive 
JsonStoreRDD loading query result in parallel. The actual number is calculated 
based on total rows returned and satisfying maxInPartition and minInPartition
     jsonstore.rdd.maxInPartition|-1|the max rows in a partition. -1 means 
unlimited
     jsonstore.rdd.minInPartition|10|the min rows in a partition.
     jsonstore.rdd.requestTimeout|900000| the request timeout in milliseconds
     bulkSize|200| the bulk save size
    -schemaSampleSize| "-1" | the sample size for RDD schema discovery. 1 means 
we are using only first document for schema discovery; -1 means all documents; 
0 will be treated as 1; any number N means min(N, total) docs 
    -createDBOnSave|"false"| whether to create a new database during save 
operation. If false, a database should already exist. If true, a new database 
will be created. If true, and a database with a provided name already exists, 
an error will be raised. 
    +schemaSampleSize|-1| the sample size for RDD schema discovery. 1 means we 
are using only first document for schema discovery; -1 means all documents; 0 
will be treated as 1; any number N means min(N, total) docs 
    +createDBOnSave|false| whether to create a new database during save 
operation. If false, a database should already exist. If true, a new database 
will be created. If true, and a database with a provided name already exists, 
an error will be raised. 
    +
    +The `cloudant.apiReceiver` option allows for _changes or _all_docs API 
endpoint to be called while loading Cloudant data into Spark DataFrames or SQL 
Tables,
    +or saving data from DataFrames or SQL Tables to a Cloudant database.
    +
    +**Note:** When using `_changes` API, please consider: 
    +1. Results are partially ordered and may not be presented in the order in 
    +which documents were updated.
    +2. If shards are unavailable, you may see duplicate results 
(changes that have been seen already).
    +3. The `selector` option can be used to filter Cloudant docs during load.
    +4. Provides a true snapshot of the database at a single 
point in time.
    +5. Only single-threaded reads are supported.
    +
    +
    +When using `_all_docs` API:
    +1. Supports parallel reads (using offset and range)
    +2. Using partitions may not represent the true snapshot of a database.  
Some docs
    +   may be added or deleted in the database between loading data into 
different 
    +   Spark partitions.
    +
    +Performance of `_changes` API is still better in most cases (even with 
single threaded support). 
    +During several performance tests using 200 MB to 15 GB Cloudant databases, 
load time from Cloudant to Spark using 
    +`_changes` feed was faster to complete every time compared to `_all_docs`.
    + 
    +See 
[CloudantChangesDFSuite](src/test/scala/org/apache/bahir/cloudant/CloudantChangesDFSuite.scala)
 
    +for examples of loading data into a Spark DataFrame with `_changes` API.
     
     ### Configuration on Spark SQL Temporary Table or DataFrame
     
     Besides all the configurations passed to a temporary table or dataframe 
through SparkConf, it is also possible to set the following configurations in 
temporary table or dataframe using OPTIONS: 
     
     Name | Default | Meaning
     --- |:---:| ---
    -database||cloudant database name
    -view||cloudant view w/o the database name. only used for load.
    -index||cloudant search index w/o the database name. only used for load 
data with less than or equal to 200 results.
    -path||cloudant: as database name if database is not present
    -schemaSampleSize|"-1"| the sample size used to discover the schema for 
this temp table. -1 scans all documents
     bulkSize|200| the bulk save size
    -createDBOnSave|"false"| whether to create a new database during save 
operation. If false, a database should already exist. If true, a new database 
will be created. If true, and a database with a provided name already exists, 
an error will be raised. 
    +createDBOnSave|false| whether to create a new database during save 
operation. If false, a database should already exist. If true, a new database 
will be created. If true, and a database with a provided name already exists, 
an error will be raised. 
    +database| | Cloudant database name
    +index| | Cloudant search index w/o the database name. only used for load 
data with less than or equal to 200 results.
    +path| | Cloudant: as database name if database is not present
    +schemaSampleSize|-1| the sample size used to discover the schema for this 
temp table. -1 scans all documents
    +selector|all documents| a selector written in Cloudant Query syntax, 
specifying conditions for selecting documents when the `cloudant.apiReceiver` 
option is set to `_changes`. Only documents satisfying the selector's 
conditions will be retrieved from Cloudant and loaded into Spark.
    +storageLevel|MEMORY_ONLY_SER| the storage level when persisting Spark data 
sets during load when `cloudant.apiReceiver` option equals `_changes`
    --- End diff --
    
    We should also list the other options here, or at least provide a link to 
them.
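
For readers following the diff, here is a minimal sketch of how the options it documents might be assembled before being handed to Spark. The key names (`cloudant.host`, `cloudant.username`, `cloudant.password`, `cloudant.apiReceiver`, `selector`, `storageLevel`) come from the README tables quoted above. The helper functions, the placeholder account values, the `{"type": "flight"}` selector, and the decision to reject `selector`/`storageLevel` for `_all_docs` are assumptions made for illustration, not part of the pull request.

```python
import json

# The two endpoints the README table lists for cloudant.apiReceiver.
VALID_RECEIVERS = ("_all_docs", "_changes")


def build_cloudant_conf(host, username, password, api_receiver="_all_docs"):
    """Hypothetical helper: SparkConf key/value pairs named in the README table."""
    if api_receiver not in VALID_RECEIVERS:
        raise ValueError(f"cloudant.apiReceiver must be one of {VALID_RECEIVERS}")
    return {
        "cloudant.host": host,
        "cloudant.username": username,
        "cloudant.password": password,
        "cloudant.apiReceiver": api_receiver,
    }


def build_load_options(api_receiver, selector=None, storage_level=None):
    """Hypothetical helper: per-DataFrame OPTIONS.

    The README describes selector and storageLevel only for the _changes
    receiver, so this sketch (an assumption) refuses them otherwise.
    """
    options = {}
    if api_receiver == "_changes":
        if selector is not None:
            # Cloudant Query selectors are JSON documents; pass them as a string.
            options["selector"] = json.dumps(selector)
        if storage_level is not None:
            options["storageLevel"] = storage_level
    elif selector is not None or storage_level is not None:
        raise ValueError("selector and storageLevel are documented for _changes only")
    return options


# Possible use with PySpark (not executed here; requires Spark plus the
# sql-cloudant package on the classpath, and the format name is assumed
# to be the one the Bahir README uses):
#
#   conf = build_cloudant_conf("ACCOUNT.cloudant.com", "USER", "PASS", "_changes")
#   opts = build_load_options("_changes", selector={"type": "flight"})
#   builder = SparkSession.builder
#   for key, value in conf.items():
#       builder = builder.config(key, value)
#   spark = builder.getOrCreate()
#   df = spark.read.format("org.apache.bahir.cloudant").options(**opts).load("my_db")
```

Keeping the SparkConf keys separate from the per-DataFrame options mirrors the two tables in the diff: the first applies to the whole session, the second to a single load.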


> Replace use of _all_docs API with _changes API in all receivers
> ---------------------------------------------------------------
>
>                 Key: BAHIR-110
>                 URL: https://issues.apache.org/jira/browse/BAHIR-110
>             Project: Bahir
>          Issue Type: Improvement
>            Reporter: Esteban Laver
>   Original Estimate: 216h
>  Remaining Estimate: 216h
>
> Today we use the _changes API for Spark streaming receiver and _all_docs API 
> for non-streaming receiver. _all_docs API supports parallel reads (using 
> offset and range) but performance of _changes API is still better in most 
> cases (even with single threaded support).
> With this ticket we want to:
> a) re-implement all receivers using _changes API
> b) compare performance between the two implementations based on _changes and 
> _all_docs
> Based on the results in b) we could decide to either
> - replace _all_docs implementation with _changes based implementation OR
> - allow customers to pick one (with a solid documentation about pros and 
> cons) 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
