[ 
https://issues.apache.org/jira/browse/BAHIR-110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064865#comment-16064865
 ] 

ASF GitHub Bot commented on BAHIR-110:
--------------------------------------

Github user emlaver commented on the issue:

    https://github.com/apache/bahir/pull/45
  
    WIP: While running tests against databases larger than 500 MB, a 
`java.lang.OutOfMemoryError: Java heap space` error occurred (even with 
`--conf spark.driver.memory=10g`).  I believe this has to do with how 
the HTTP request is set up and called against the `_changes` API in 
[JsonStoreDataAccess.scala](https://github.com/apache/bahir/pull/45/files#diff-ab440bd537d48f7cf58cd9cf0ea143b1).
    The good news is that I've created a test that uses Spark streaming (using 
CloudantReceiver.java) to read all docs from a Cloudant database into a Spark 
DataFrame.  It should also work for a SQL temp table.  I ran this test without any 
Java heap errors against databases of 1 GB, 1.8 GB, and 14.2 GB.
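
To illustrate the buffering concern raised in the comment above, here is a minimal sketch (not the code from the pull request) of handling a changes feed one line at a time instead of loading the whole response into memory. It assumes a line-delimited feed such as `feed=continuous`, where each change is a JSON object on its own line and blank lines are heartbeats. The class name `ChangesFeedSketch` and the method `forEachChange` are hypothetical, and the sample input stands in for a real HTTP response body.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;

// Hypothetical helper, for illustration only.
class ChangesFeedSketch {

    /**
     * Reads a line-delimited changes feed from the given stream and passes
     * each non-blank line to the handler. Only one line is held in memory
     * at a time. Returns the number of lines handed to the handler.
     */
    static long forEachChange(InputStream body, Consumer<String> handler)
            throws IOException {
        long count = 0;
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(body, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue; // blank line: treated as a heartbeat and skipped
                }
                handler.accept(line);
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a response body; a real caller would pass the
        // HTTP connection's input stream instead.
        String sample = "{\"seq\":\"1-a\",\"id\":\"doc1\"}\n"
                + "\n"
                + "{\"seq\":\"2-b\",\"id\":\"doc2\"}\n";
        InputStream body = new ByteArrayInputStream(
                sample.getBytes(StandardCharsets.UTF_8));
        long handled = forEachChange(body, System.out::println);
        System.out.println("changes handled: " + handled);
    }
}
```

In a receiver, the handler could forward each line to Spark as it arrives, so memory use would depend on the size of a single change rather than on the size of the database.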


> Replace use of _all_docs API with _changes API in all receivers
> ---------------------------------------------------------------
>
>                 Key: BAHIR-110
>                 URL: https://issues.apache.org/jira/browse/BAHIR-110
>             Project: Bahir
>          Issue Type: Improvement
>            Reporter: Esteban Laver
>   Original Estimate: 216h
>  Remaining Estimate: 216h
>
> Today we use the _changes API for the Spark streaming receiver and the 
> _all_docs API for the non-streaming receiver. The _all_docs API supports 
> parallel reads (using offset and range), but the _changes API still performs 
> better in most cases (even with only single-threaded support).
> With this ticket we want to:
> a) re-implement all receivers using the _changes API
> b) compare the performance of the two implementations, based on _changes and 
> _all_docs
> Based on the results in b) we could decide to either
> - replace the _all_docs implementation with the _changes-based implementation OR
> - allow customers to pick one (with solid documentation of the pros and 
> cons) 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
