[jira] [Commented] (BAHIR-100) Providing MQTT Spark Streaming to return encoded Byte[] message without corruption

2017-07-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BAHIR-100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16081553#comment-16081553
 ] 

ASF GitHub Bot commented on BAHIR-100:
--

Github user tedyu commented on a diff in the pull request:

https://github.com/apache/bahir/pull/47#discussion_r126583567
  
--- Diff: streaming-mqtt/README.md ---
@@ -52,12 +52,14 @@ this actor can be configured to handle failures, etc.
 
 val lines = MQTTUtils.createStream(ssc, brokerUrl, topic)
 val lines = MQTTUtils.createPairedStream(ssc, brokerUrl, topic)
+val lines = MQTTUtils.createPairedByteArrayStreamStream(ssc, 
brokerUrl, topic)
--- End diff --

Where is createPairedByteArrayStreamStream defined ?


> Providing MQTT Spark Streaming to return encoded Byte[] message without 
> corruption
> --
>
> Key: BAHIR-100
> URL: https://issues.apache.org/jira/browse/BAHIR-100
> Project: Bahir
>  Issue Type: New Feature
>  Components: Spark Streaming Connectors
>Reporter: Anntinu Josy
>Assignee: Anntinu Josy
>  Labels: mqtt, spark, streaming
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Now a days Network bandwidth is becoming a serious resource that need to be 
> conserver in IoT ecosystem, For this puropse we are using different byte[] 
> based encoding such as Protocol Buffer and flat Buffer, Once this encoded 
> message is converted into string the data becomes corrupted, So same byte[] 
> format need to be preserved when forwarded.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


BAHIR-100

2017-07-10 Thread Rosenstark, David
Hi I have sent PR for BAHIR-100 and it now passes. What are the next steps 
needed to get it merged?
-
Intel Electronics Ltd.

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.


[jira] [Commented] (BAHIR-110) Replace use of _all_docs API with _changes API in all receivers

2017-07-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BAHIR-110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16080456#comment-16080456
 ] 

ASF GitHub Bot commented on BAHIR-110:
--

Github user emlaver commented on a diff in the pull request:

https://github.com/apache/bahir/pull/45#discussion_r126446901
  
--- Diff: 
sql-cloudant/src/main/scala/org/apache/bahir/cloudant/CloudantConfig.scala ---
@@ -30,81 +28,83 @@ import org.apache.bahir.cloudant.common._
 */
 
 class CloudantConfig(val protocol: String, val host: String,
-val dbName: String, val indexName: String = null, val viewName: String 
= null)
+val dbName: String, val indexName: String, val viewName: String)
 (implicit val username: String, val password: String,
 val partitions: Int, val maxInPartition: Int, val minInPartition: Int,
 val requestTimeout: Long, val bulkSize: Int, val schemaSampleSize: Int,
-val createDBOnSave: Boolean, val selector: String, val useQuery: 
Boolean = false,
-val queryLimit: Int)
-extends Serializable{
+val createDBOnSave: Boolean, val apiReceiver: String,
+val useQuery: Boolean = false, val queryLimit: Int)
+extends Serializable {
 
-  private lazy val dbUrl = {protocol + "://" + host + "/" + dbName}
+  lazy val dbUrl: String = {protocol + "://" + host + "/" + dbName}
 
   val pkField = "_id"
-  val defaultIndex = "_all_docs" // "_changes" does not work for partition
+  val defaultIndex: String = apiReceiver
   val default_filter: String = "*:*"
 
-  def getContinuousChangesUrl(): String = {
-var url = dbUrl + 
"/_changes?include_docs=true=continuous=3000"
-if (selector != null) {
-  url = url + "=_selector"
-}
-url
-  }
-
-  def getSelector() : String = {
-selector
-  }
-
-  def getDbUrl(): String = {
+  def getDbUrl: String = {
 dbUrl
   }
 
-  def getSchemaSampleSize(): Int = {
+  def getSchemaSampleSize: Int = {
 schemaSampleSize
   }
 
-  def getCreateDBonSave(): Boolean = {
+  def getCreateDBonSave: Boolean = {
 createDBOnSave
   }
 
-  def getTotalUrl(url: String): String = {
-if (url.contains('?')) {
-  url + "=1"
-} else {
-  url + "?limit=1"
-}
-  }
-
-  def getDbname(): String = {
-dbName
-  }
-
-  def queryEnabled(): Boolean = {useQuery && indexName==null && 
viewName==null}
-
-  def allowPartition(queryUsed: Boolean): Boolean = {indexName==null && 
!queryUsed}
-
-  def getAllDocsUrl(limit: Int, excludeDDoc: Boolean = false): String = {
+  def getLastNum(result: JsValue): JsValue = (result \ "last_seq").get
 
+  /* Url containing limit for docs in a Cloudant database.
+  * If a view is not defined, use the _all_docs endpoint.
+  * @return url with one doc limit for retrieving total doc count
+  */
+  def getUrl(limit: Int, excludeDDoc: Boolean = false): String = {
 if (viewName == null) {
-  val baseUrl = (
-  if ( excludeDDoc) dbUrl + 
"/_all_docs?startkey=%22_design0/%22_docs=true"
-  else dbUrl + "/_all_docs?include_docs=true"
-  )
-  if (limit == JsonStoreConfigManager.ALL_DOCS_LIMIT) {
+  val baseUrl = {
+if (excludeDDoc) {
+  dbUrl + "/_all_docs?startkey=%22_design0/%22_docs=true"
--- End diff --

I'll open a new JIRA issue for this.


> Replace use of _all_docs API with _changes API in all receivers
> ---
>
> Key: BAHIR-110
> URL: https://issues.apache.org/jira/browse/BAHIR-110
> Project: Bahir
>  Issue Type: Improvement
>Reporter: Esteban Laver
>   Original Estimate: 216h
>  Remaining Estimate: 216h
>
> Today we use the _changes API for Spark streaming receiver and _all_docs API 
> for non-streaming receiver. _all_docs API supports parallel reads (using 
> offset and range) but performance of _changes API is still better in most 
> cases (even with single threaded support).
> With this ticket we want to:
> a) re-implement all receivers using _changes API
> b) compare performance between the two implementations based on _changes and 
> _all_docs
> Based on the results in b) we could decide to either
> - replace _all_docs implementation with _changes based implementation OR
> - allow customers to pick one (with a solid documentation about pros and 
> cons) 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BAHIR-110) Replace use of _all_docs API with _changes API in all receivers

2017-07-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BAHIR-110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16080452#comment-16080452
 ] 

ASF GitHub Bot commented on BAHIR-110:
--

Github user emlaver commented on a diff in the pull request:

https://github.com/apache/bahir/pull/45#discussion_r126444627
  
--- Diff: sql-cloudant/README.md ---
@@ -31,15 +31,14 @@ The `--packages` argument can also be used with 
`bin/spark-submit`.
 
 Submit a job in Python:
 
-spark-submit  --master local[4] --jars   
 
+spark-submit  --master local[4] --packages 
org.apache.bahir:spark-sql-cloudant_2.11:2.2.0-SNAPSHOT   
--- End diff --

Good point, this should probably be something like `--packages 
org.apache.bahir:spark-sql-cloudant_SCALA_VERSION:PACKAGE_VERSION`


> Replace use of _all_docs API with _changes API in all receivers
> ---
>
> Key: BAHIR-110
> URL: https://issues.apache.org/jira/browse/BAHIR-110
> Project: Bahir
>  Issue Type: Improvement
>Reporter: Esteban Laver
>   Original Estimate: 216h
>  Remaining Estimate: 216h
>
> Today we use the _changes API for Spark streaming receiver and _all_docs API 
> for non-streaming receiver. _all_docs API supports parallel reads (using 
> offset and range) but performance of _changes API is still better in most 
> cases (even with single threaded support).
> With this ticket we want to:
> a) re-implement all receivers using _changes API
> b) compare performance between the two implementations based on _changes and 
> _all_docs
> Based on the results in b) we could decide to either
> - replace _all_docs implementation with _changes based implementation OR
> - allow customers to pick one (with a solid documentation about pros and 
> cons) 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BAHIR-110) Replace use of _all_docs API with _changes API in all receivers

2017-07-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BAHIR-110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16080453#comment-16080453
 ] 

ASF GitHub Bot commented on BAHIR-110:
--

Github user emlaver commented on a diff in the pull request:

https://github.com/apache/bahir/pull/45#discussion_r126445660
  
--- Diff: 
sql-cloudant/src/main/scala/org/apache/bahir/cloudant/CloudantChangesConfig.scala
 ---
@@ -0,0 +1,64 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.bahir.cloudant
+
+import org.apache.spark.storage.StorageLevel
+
+class CloudantChangesConfig(protocol: String, host: String, dbName: String,
+indexName: String = null, viewName: String = 
null)
+   (username: String, password: String, 
partitions: Int,
+maxInPartition: Int, minInPartition: Int, 
requestTimeout: Long,
+bulkSize: Int, schemaSampleSize: Int,
+createDBOnSave: Boolean, apiReceiver: String, 
selector: String,
+storageLevel: StorageLevel, useQuery: Boolean, 
queryLimit: Int)
+  extends CloudantConfig(protocol, host, dbName, indexName, 
viewName)(username, password,
+partitions, maxInPartition, minInPartition, requestTimeout, bulkSize, 
schemaSampleSize,
+createDBOnSave, apiReceiver, useQuery, queryLimit) {
+
+  override val defaultIndex: String = apiReceiver
+
+  def getSelector : String = {
+if (selector != null && !selector.isEmpty) {
+  selector
+} else {
+  // Exclude design docs
+  "{ \"_id\": { \"$regex\": \"^(?!.*_design/)\" } }"
--- End diff --

I don't think it needs the `.*`.  It's probably worth having adding a test 
case to verify this.


> Replace use of _all_docs API with _changes API in all receivers
> ---
>
> Key: BAHIR-110
> URL: https://issues.apache.org/jira/browse/BAHIR-110
> Project: Bahir
>  Issue Type: Improvement
>Reporter: Esteban Laver
>   Original Estimate: 216h
>  Remaining Estimate: 216h
>
> Today we use the _changes API for Spark streaming receiver and _all_docs API 
> for non-streaming receiver. _all_docs API supports parallel reads (using 
> offset and range) but performance of _changes API is still better in most 
> cases (even with single threaded support).
> With this ticket we want to:
> a) re-implement all receivers using _changes API
> b) compare performance between the two implementations based on _changes and 
> _all_docs
> Based on the results in b) we could decide to either
> - replace _all_docs implementation with _changes based implementation OR
> - allow customers to pick one (with a solid documentation about pros and 
> cons) 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BAHIR-110) Replace use of _all_docs API with _changes API in all receivers

2017-07-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BAHIR-110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16080435#comment-16080435
 ] 

ASF GitHub Bot commented on BAHIR-110:
--

Github user ricellis commented on a diff in the pull request:

https://github.com/apache/bahir/pull/45#discussion_r126428690
  
--- Diff: 
sql-cloudant/src/main/scala/org/apache/bahir/cloudant/CloudantConfig.scala ---
@@ -30,81 +28,83 @@ import org.apache.bahir.cloudant.common._
 */
 
 class CloudantConfig(val protocol: String, val host: String,
-val dbName: String, val indexName: String = null, val viewName: String 
= null)
+val dbName: String, val indexName: String, val viewName: String)
 (implicit val username: String, val password: String,
 val partitions: Int, val maxInPartition: Int, val minInPartition: Int,
 val requestTimeout: Long, val bulkSize: Int, val schemaSampleSize: Int,
-val createDBOnSave: Boolean, val selector: String, val useQuery: 
Boolean = false,
-val queryLimit: Int)
-extends Serializable{
+val createDBOnSave: Boolean, val apiReceiver: String,
+val useQuery: Boolean = false, val queryLimit: Int)
+extends Serializable {
 
-  private lazy val dbUrl = {protocol + "://" + host + "/" + dbName}
+  lazy val dbUrl: String = {protocol + "://" + host + "/" + dbName}
 
   val pkField = "_id"
-  val defaultIndex = "_all_docs" // "_changes" does not work for partition
+  val defaultIndex: String = apiReceiver
   val default_filter: String = "*:*"
 
-  def getContinuousChangesUrl(): String = {
-var url = dbUrl + 
"/_changes?include_docs=true=continuous=3000"
-if (selector != null) {
-  url = url + "=_selector"
-}
-url
-  }
-
-  def getSelector() : String = {
-selector
-  }
-
-  def getDbUrl(): String = {
+  def getDbUrl: String = {
 dbUrl
   }
 
-  def getSchemaSampleSize(): Int = {
+  def getSchemaSampleSize: Int = {
 schemaSampleSize
   }
 
-  def getCreateDBonSave(): Boolean = {
+  def getCreateDBonSave: Boolean = {
 createDBOnSave
   }
 
-  def getTotalUrl(url: String): String = {
-if (url.contains('?')) {
-  url + "=1"
-} else {
-  url + "?limit=1"
-}
-  }
-
-  def getDbname(): String = {
-dbName
-  }
-
-  def queryEnabled(): Boolean = {useQuery && indexName==null && 
viewName==null}
-
-  def allowPartition(queryUsed: Boolean): Boolean = {indexName==null && 
!queryUsed}
-
-  def getAllDocsUrl(limit: Int, excludeDDoc: Boolean = false): String = {
+  def getLastNum(result: JsValue): JsValue = (result \ "last_seq").get
 
+  /* Url containing limit for docs in a Cloudant database.
+  * If a view is not defined, use the _all_docs endpoint.
+  * @return url with one doc limit for retrieving total doc count
+  */
+  def getUrl(limit: Int, excludeDDoc: Boolean = false): String = {
 if (viewName == null) {
-  val baseUrl = (
-  if ( excludeDDoc) dbUrl + 
"/_all_docs?startkey=%22_design0/%22_docs=true"
-  else dbUrl + "/_all_docs?include_docs=true"
-  )
-  if (limit == JsonStoreConfigManager.ALL_DOCS_LIMIT) {
+  val baseUrl = {
+if (excludeDDoc) {
+  dbUrl + "/_all_docs?startkey=%22_design0/%22_docs=true"
--- End diff --

See 
https://github.com/cloudant/java-cloudant/issues/344#issuecomment-276938689
This `startkey` makes the assumption that no document IDs will start with 
upper case letters. Possibly one for another issue.


> Replace use of _all_docs API with _changes API in all receivers
> ---
>
> Key: BAHIR-110
> URL: https://issues.apache.org/jira/browse/BAHIR-110
> Project: Bahir
>  Issue Type: Improvement
>Reporter: Esteban Laver
>   Original Estimate: 216h
>  Remaining Estimate: 216h
>
> Today we use the _changes API for Spark streaming receiver and _all_docs API 
> for non-streaming receiver. _all_docs API supports parallel reads (using 
> offset and range) but performance of _changes API is still better in most 
> cases (even with single threaded support).
> With this ticket we want to:
> a) re-implement all receivers using _changes API
> b) compare performance between the two implementations based on _changes and 
> _all_docs
> Based on the results in b) we could decide to either
> - replace _all_docs implementation with _changes based implementation OR
> - allow customers to pick one (with a solid documentation about pros and 
> cons) 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BAHIR-110) Replace use of _all_docs API with _changes API in all receivers

2017-07-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BAHIR-110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16080437#comment-16080437
 ] 

ASF GitHub Bot commented on BAHIR-110:
--

Github user ricellis commented on a diff in the pull request:

https://github.com/apache/bahir/pull/45#discussion_r126403065
  
--- Diff: sql-cloudant/README.md ---
@@ -52,39 +51,71 @@ Here each subsequent configuration overrides the 
previous one. Thus, configurati
 
 
 ### Configuration in application.conf
-Default values are defined in 
[here](cloudant-spark-sql/src/main/resources/application.conf).
+Default values are defined in [here](src/main/resources/application.conf).
 
 ### Configuration on SparkConf
 
 Name | Default | Meaning
 --- |:---:| ---
+cloudant.apiReceiver|"_all_docs"| API endpoint for RelationProvider when 
loading or saving data from Cloudant to DataFrames or SQL temporary tables. 
Select between "_all_docs" or "_changes" endpoint.
 cloudant.protocol|https|protocol to use to transfer data: http or https
-cloudant.host||cloudant host url
-cloudant.username||cloudant userid
-cloudant.password||cloudant password
+cloudant.host| |cloudant host url
+cloudant.username| |cloudant userid
+cloudant.password| |cloudant password
 cloudant.useQuery|false|By default, _all_docs endpoint is used if 
configuration 'view' and 'index' (see below) are not set. When useQuery is 
enabled, _find endpoint will be used in place of _all_docs when query condition 
is not on primary key field (_id), so that query predicates may be driven into 
datastore. 
 cloudant.queryLimit|25|The maximum number of results returned when 
querying the _find endpoint.
 jsonstore.rdd.partitions|10|the number of partitions intent used to drive 
JsonStoreRDD loading query result in parallel. The actual number is calculated 
based on total rows returned and satisfying maxInPartition and minInPartition
 jsonstore.rdd.maxInPartition|-1|the max rows in a partition. -1 means 
unlimited
 jsonstore.rdd.minInPartition|10|the min rows in a partition.
 jsonstore.rdd.requestTimeout|90| the request timeout in milliseconds
 bulkSize|200| the bulk save size
-schemaSampleSize| "-1" | the sample size for RDD schema discovery. 1 means 
we are using only first document for schema discovery; -1 means all documents; 
0 will be treated as 1; any number N means min(N, total) docs 
-createDBOnSave|"false"| whether to create a new database during save 
operation. If false, a database should already exist. If true, a new database 
will be created. If true, and a database with a provided name already exists, 
an error will be raised. 
+schemaSampleSize|-1| the sample size for RDD schema discovery. 1 means we 
are using only first document for schema discovery; -1 means all documents; 0 
will be treated as 1; any number N means min(N, total) docs 
+createDBOnSave|false| whether to create a new database during save 
operation. If false, a database should already exist. If true, a new database 
will be created. If true, and a database with a provided name already exists, 
an error will be raised. 
+
+The `cloudant.apiReceiver` option allows for _changes or _all_docs API 
endpoint to be called while loading Cloudant data into Spark DataFrames or SQL 
Tables,
--- End diff --

`allows` doesn't seem right here, maybe `sets` as in
>option sets the `_changes` or `_all_docs`

?


> Replace use of _all_docs API with _changes API in all receivers
> ---
>
> Key: BAHIR-110
> URL: https://issues.apache.org/jira/browse/BAHIR-110
> Project: Bahir
>  Issue Type: Improvement
>Reporter: Esteban Laver
>   Original Estimate: 216h
>  Remaining Estimate: 216h
>
> Today we use the _changes API for Spark streaming receiver and _all_docs API 
> for non-streaming receiver. _all_docs API supports parallel reads (using 
> offset and range) but performance of _changes API is still better in most 
> cases (even with single threaded support).
> With this ticket we want to:
> a) re-implement all receivers using _changes API
> b) compare performance between the two implementations based on _changes and 
> _all_docs
> Based on the results in b) we could decide to either
> - replace _all_docs implementation with _changes based implementation OR
> - allow customers to pick one (with a solid documentation about pros and 
> cons) 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BAHIR-110) Replace use of _all_docs API with _changes API in all receivers

2017-07-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BAHIR-110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16080440#comment-16080440
 ] 

ASF GitHub Bot commented on BAHIR-110:
--

Github user ricellis commented on a diff in the pull request:

https://github.com/apache/bahir/pull/45#discussion_r126401728
  
--- Diff: sql-cloudant/README.md ---
@@ -31,15 +31,14 @@ The `--packages` argument can also be used with 
`bin/spark-submit`.
 
 Submit a job in Python:
 
-spark-submit  --master local[4] --jars   
 
+spark-submit  --master local[4] --packages 
org.apache.bahir:spark-sql-cloudant_2.11:2.2.0-SNAPSHOT   
 
 Submit a job in Scala:
 
-   spark-submit --class "" --master local[4] --jars  
+   spark-submit --class "" --master local[4] --packages 
org.apache.bahir:spark-sql-cloudant_2.11:2.2.0-SNAPSHOT 
--- End diff --

`SNAPSHOT` again?


> Replace use of _all_docs API with _changes API in all receivers
> ---
>
> Key: BAHIR-110
> URL: https://issues.apache.org/jira/browse/BAHIR-110
> Project: Bahir
>  Issue Type: Improvement
>Reporter: Esteban Laver
>   Original Estimate: 216h
>  Remaining Estimate: 216h
>
> Today we use the _changes API for Spark streaming receiver and _all_docs API 
> for non-streaming receiver. _all_docs API supports parallel reads (using 
> offset and range) but performance of _changes API is still better in most 
> cases (even with single threaded support).
> With this ticket we want to:
> a) re-implement all receivers using _changes API
> b) compare performance between the two implementations based on _changes and 
> _all_docs
> Based on the results in b) we could decide to either
> - replace _all_docs implementation with _changes based implementation OR
> - allow customers to pick one (with a solid documentation about pros and 
> cons) 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BAHIR-110) Replace use of _all_docs API with _changes API in all receivers

2017-07-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BAHIR-110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16080439#comment-16080439
 ] 

ASF GitHub Bot commented on BAHIR-110:
--

Github user ricellis commented on a diff in the pull request:

https://github.com/apache/bahir/pull/45#discussion_r126402346
  
--- Diff: sql-cloudant/README.md ---
@@ -52,39 +51,71 @@ Here each subsequent configuration overrides the 
previous one. Thus, configurati
 
 
 ### Configuration in application.conf
-Default values are defined in 
[here](cloudant-spark-sql/src/main/resources/application.conf).
+Default values are defined in [here](src/main/resources/application.conf).
 
 ### Configuration on SparkConf
 
 Name | Default | Meaning
 --- |:---:| ---
+cloudant.apiReceiver|"_all_docs"| API endpoint for RelationProvider when 
loading or saving data from Cloudant to DataFrames or SQL temporary tables. 
Select between "_all_docs" or "_changes" endpoint.
--- End diff --

I would say
>Cloudant API endpoint to use for

or
>Select between the Cloudant `_all_docs` or `_changes` endpoint

Probably also worth a "see below for details|notes|further explanation" or 
similar to help people find the content that would help them choose the 
appropriate endpoint.


> Replace use of _all_docs API with _changes API in all receivers
> ---
>
> Key: BAHIR-110
> URL: https://issues.apache.org/jira/browse/BAHIR-110
> Project: Bahir
>  Issue Type: Improvement
>Reporter: Esteban Laver
>   Original Estimate: 216h
>  Remaining Estimate: 216h
>
> Today we use the _changes API for Spark streaming receiver and _all_docs API 
> for non-streaming receiver. _all_docs API supports parallel reads (using 
> offset and range) but performance of _changes API is still better in most 
> cases (even with single threaded support).
> With this ticket we want to:
> a) re-implement all receivers using _changes API
> b) compare performance between the two implementations based on _changes and 
> _all_docs
> Based on the results in b) we could decide to either
> - replace _all_docs implementation with _changes based implementation OR
> - allow customers to pick one (with a solid documentation about pros and 
> cons) 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BAHIR-110) Replace use of _all_docs API with _changes API in all receivers

2017-07-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BAHIR-110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16080433#comment-16080433
 ] 

ASF GitHub Bot commented on BAHIR-110:
--

Github user ricellis commented on a diff in the pull request:

https://github.com/apache/bahir/pull/45#discussion_r126402723
  
--- Diff: sql-cloudant/README.md ---
@@ -52,39 +51,71 @@ Here each subsequent configuration overrides the 
previous one. Thus, configurati
 
 
 ### Configuration in application.conf
-Default values are defined in 
[here](cloudant-spark-sql/src/main/resources/application.conf).
+Default values are defined in [here](src/main/resources/application.conf).
 
 ### Configuration on SparkConf
 
 Name | Default | Meaning
 --- |:---:| ---
+cloudant.apiReceiver|"_all_docs"| API endpoint for RelationProvider when 
loading or saving data from Cloudant to DataFrames or SQL temporary tables. 
Select between "_all_docs" or "_changes" endpoint.
 cloudant.protocol|https|protocol to use to transfer data: http or https
-cloudant.host||cloudant host url
-cloudant.username||cloudant userid
-cloudant.password||cloudant password
+cloudant.host| |cloudant host url
+cloudant.username| |cloudant userid
+cloudant.password| |cloudant password
 cloudant.useQuery|false|By default, _all_docs endpoint is used if 
configuration 'view' and 'index' (see below) are not set. When useQuery is 
enabled, _find endpoint will be used in place of _all_docs when query condition 
is not on primary key field (_id), so that query predicates may be driven into 
datastore. 
 cloudant.queryLimit|25|The maximum number of results returned when 
querying the _find endpoint.
 jsonstore.rdd.partitions|10|the number of partitions intent used to drive 
JsonStoreRDD loading query result in parallel. The actual number is calculated 
based on total rows returned and satisfying maxInPartition and minInPartition
 jsonstore.rdd.maxInPartition|-1|the max rows in a partition. -1 means 
unlimited
 jsonstore.rdd.minInPartition|10|the min rows in a partition.
 jsonstore.rdd.requestTimeout|90| the request timeout in milliseconds
 bulkSize|200| the bulk save size
-schemaSampleSize| "-1" | the sample size for RDD schema discovery. 1 means 
we are using only first document for schema discovery; -1 means all documents; 
0 will be treated as 1; any number N means min(N, total) docs 
-createDBOnSave|"false"| whether to create a new database during save 
operation. If false, a database should already exist. If true, a new database 
will be created. If true, and a database with a provided name already exists, 
an error will be raised. 
+schemaSampleSize|-1| the sample size for RDD schema discovery. 1 means we 
are using only first document for schema discovery; -1 means all documents; 0 
will be treated as 1; any number N means min(N, total) docs 
--- End diff --

using only **the** first document


> Replace use of _all_docs API with _changes API in all receivers
> ---
>
> Key: BAHIR-110
> URL: https://issues.apache.org/jira/browse/BAHIR-110
> Project: Bahir
>  Issue Type: Improvement
>Reporter: Esteban Laver
>   Original Estimate: 216h
>  Remaining Estimate: 216h
>
> Today we use the _changes API for Spark streaming receiver and _all_docs API 
> for non-streaming receiver. _all_docs API supports parallel reads (using 
> offset and range) but performance of _changes API is still better in most 
> cases (even with single threaded support).
> With this ticket we want to:
> a) re-implement all receivers using _changes API
> b) compare performance between the two implementations based on _changes and 
> _all_docs
> Based on the results in b) we could decide to either
> - replace _all_docs implementation with _changes based implementation OR
> - allow customers to pick one (with a solid documentation about pros and 
> cons) 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BAHIR-110) Replace use of _all_docs API with _changes API in all receivers

2017-07-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BAHIR-110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16080432#comment-16080432
 ] 

ASF GitHub Bot commented on BAHIR-110:
--

Github user ricellis commented on a diff in the pull request:

https://github.com/apache/bahir/pull/45#discussion_r126401691
  
--- Diff: sql-cloudant/README.md ---
@@ -31,15 +31,14 @@ The `--packages` argument can also be used with 
`bin/spark-submit`.
 
 Submit a job in Python:
 
-spark-submit  --master local[4] --jars   
 
+spark-submit  --master local[4] --packages 
org.apache.bahir:spark-sql-cloudant_2.11:2.2.0-SNAPSHOT   
--- End diff --

Should the example really point to a `SNAPSHOT` version?


> Replace use of _all_docs API with _changes API in all receivers
> ---
>
> Key: BAHIR-110
> URL: https://issues.apache.org/jira/browse/BAHIR-110
> Project: Bahir
>  Issue Type: Improvement
>Reporter: Esteban Laver
>   Original Estimate: 216h
>  Remaining Estimate: 216h
>
> Today we use the _changes API for Spark streaming receiver and _all_docs API 
> for non-streaming receiver. _all_docs API supports parallel reads (using 
> offset and range) but performance of _changes API is still better in most 
> cases (even with single threaded support).
> With this ticket we want to:
> a) re-implement all receivers using _changes API
> b) compare performance between the two implementations based on _changes and 
> _all_docs
> Based on the results in b) we could decide to either
> - replace _all_docs implementation with _changes based implementation OR
> - allow customers to pick one (with a solid documentation about pros and 
> cons) 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BAHIR-110) Replace use of _all_docs API with _changes API in all receivers

2017-07-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BAHIR-110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16080434#comment-16080434
 ] 

ASF GitHub Bot commented on BAHIR-110:
--

Github user ricellis commented on a diff in the pull request:

https://github.com/apache/bahir/pull/45#discussion_r126423468
  
--- Diff: 
sql-cloudant/src/main/scala/org/apache/bahir/cloudant/CloudantChangesConfig.scala
 ---
@@ -0,0 +1,64 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.bahir.cloudant
+
+import org.apache.spark.storage.StorageLevel
+
+class CloudantChangesConfig(protocol: String, host: String, dbName: String,
+indexName: String = null, viewName: String = 
null)
+   (username: String, password: String, 
partitions: Int,
+maxInPartition: Int, minInPartition: Int, 
requestTimeout: Long,
+bulkSize: Int, schemaSampleSize: Int,
+createDBOnSave: Boolean, apiReceiver: String, 
selector: String,
+storageLevel: StorageLevel, useQuery: Boolean, 
queryLimit: Int)
+  extends CloudantConfig(protocol, host, dbName, indexName, 
viewName)(username, password,
+partitions, maxInPartition, minInPartition, requestTimeout, bulkSize, 
schemaSampleSize,
+createDBOnSave, apiReceiver, useQuery, queryLimit) {
+
+  override val defaultIndex: String = apiReceiver
+
+  def getSelector : String = {
+if (selector != null && !selector.isEmpty) {
+  selector
+} else {
+  // Exclude design docs
+  "{ \"_id\": { \"$regex\": \"^(?!.*_design/)\" } }"
--- End diff --

Does this regex really need a `.*`? Doesn't it make it exclude a document 
with a name like `my_design`? I guess the trailing slash protects against that 
because the `/` will appear in a design doc, but not a normal doc, but someone 
could create a doc with `/` I'm not sure if it would be encoded at the point of 
the regex check. Maybe more straightforward just to have `^(?!_design/)`


> Replace use of _all_docs API with _changes API in all receivers
> ---
>
> Key: BAHIR-110
> URL: https://issues.apache.org/jira/browse/BAHIR-110
> Project: Bahir
>  Issue Type: Improvement
>Reporter: Esteban Laver
>   Original Estimate: 216h
>  Remaining Estimate: 216h
>
> Today we use the _changes API for Spark streaming receiver and _all_docs API 
> for non-streaming receiver. _all_docs API supports parallel reads (using 
> offset and range) but performance of _changes API is still better in most 
> cases (even with single threaded support).
> With this ticket we want to:
> a) re-implement all receivers using _changes API
> b) compare performance between the two implementations based on _changes and 
> _all_docs
> Based on the results in b) we could decide to either
> - replace _all_docs implementation with _changes based implementation OR
> - allow customers to pick one (with a solid documentation about pros and 
> cons) 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)