ASF GitHub Bot commented on BAHIR-110:

Github user mayya-sharipova commented on the issue:

    I am seeing the following unexpected behaviour:
    I have a database with 13 docs and 1 deleted doc. `df.count` returns 
`14`, which is incorrect. When I display the dataframe, the last record is 
the deleted doc, with a null `airportName`:
    |_deleted|_id|                _rev|airportName|
    |    null|DEL|1-67f14f8891a9f32...|      Delhi|
    |    null|JFK|1-ee8206c8e56a114...|   New York|
    |    null|SVO|1-7d18769b68f6099...|     Moscow|
    |    null|FRA|1-f358b62b0499340...|  Frankfurt|
    |    null|HKG|1-b040e40df5d0080...|  Hong Kong|
    |    null|CDG|1-8c51e401185272e...|      Paris|
    |    null|FCO|1-89431c8db8aa8e4...|       Rome|
    |    null|NRT|1-dce312ac1414110...|      Tokyo|
    |    null|LHR|1-303c622ad8380c9...|     London|
    |    null|BOM|2-a3f39a0741938c4...|    Mumbaii|
    |    null|YUL|1-19a9fe9cace23ec...|   Montreal|
    |    null|IKA|1-3dea74452ca86af...|     Tehran|
    |    null|SIN|1-67037272289432e...|  Singapore|
    |    true|SYD|2-1cc4f2c62db144a...|       null|
    We should NOT load any deleted documents into the dataframe. A user may have 
thousands or millions of deleted documents. We should load only undeleted docs, 
and the dataframe should NOT have a `"_deleted"` column.
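
A minimal sketch of the filtering requested above, in plain Python rather than the connector's actual code. It assumes the `_changes` feed is read with `include_docs=true` and that deleted rows carry `"deleted": true` (with `"_deleted": true` on the doc itself). The helper name `live_docs` and the sample `_rev` values are hypothetical, for illustration only.

```python
from typing import Any, Dict, Iterator


def live_docs(changes_response: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield only non-deleted documents from a parsed _changes response.

    Assumption: the feed was requested with include_docs=true, so each
    row carries a "doc" entry; deleted rows are flagged "deleted": true.
    """
    for row in changes_response.get("results", []):
        doc = row.get("doc")
        if doc is None:
            continue  # row without a document body, nothing to load
        if row.get("deleted") or doc.get("_deleted"):
            continue  # skip tombstones so they are never counted
        # Drop the marker so it can never become a dataframe column.
        yield {k: v for k, v in doc.items() if k != "_deleted"}


# Illustrative input only; the _rev values are made up.
sample = {
    "results": [
        {"id": "SIN",
         "doc": {"_id": "SIN", "_rev": "1-x", "airportName": "Singapore"}},
        {"id": "SYD", "deleted": True,
         "doc": {"_id": "SYD", "_rev": "2-y", "_deleted": True}},
    ]
}

docs = list(live_docs(sample))
print(len(docs))
```

With rows filtered this way before the dataframe is built, the count would cover only the undeleted docs and no `_deleted` column would be inferred.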
    Another error:
    Occasionally, when running the `CloudantDF.py` example, I get an error:
    File "/Cloudant/bahir/sql-cloudant/examples/python/CloudantDF.py", line 45, 
in <module>
        df.filter(df.airportName >= 'Moscow').select("_id",'airportName').show()
line 1020, in __getattr__
    AttributeError: 'DataFrame' object has no attribute 'airportName' 
    For this PR, we can disregard this error and investigate it further in 
follow-up PRs.

> Replace use of _all_docs API with _changes API in all receivers
> ---------------------------------------------------------------
>                 Key: BAHIR-110
>                 URL: https://issues.apache.org/jira/browse/BAHIR-110
>             Project: Bahir
>          Issue Type: Improvement
>            Reporter: Esteban Laver
>   Original Estimate: 216h
>  Remaining Estimate: 216h
> Today we use the _changes API for the Spark streaming receiver and the 
> _all_docs API for the non-streaming receiver. The _all_docs API supports 
> parallel reads (using offset and range), but the performance of the _changes 
> API is still better in most cases (even with only single-threaded support).
> With this ticket we want to:
> a) re-implement all receivers using _changes API
> b) compare performance between the two implementations based on _changes and 
> _all_docs
> Based on the results in b) we could decide to either
> - replace _all_docs implementation with _changes based implementation OR
> - allow customers to pick one (with solid documentation of the pros and 
> cons) 
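
The parallel reads mentioned in the description could be partitioned roughly as follows. This is only a sketch: it assumes "offset and range" corresponds to the `skip` and `limit` query parameters of `_all_docs`, and the helper name `all_docs_partitions` is hypothetical, not part of the connector.

```python
from typing import List, Tuple


def all_docs_partitions(total_rows: int, num_partitions: int) -> List[Tuple[int, int]]:
    """Split total_rows into (skip, limit) pairs, one per parallel reader."""
    if total_rows < 0 or num_partitions < 1:
        raise ValueError("total_rows must be >= 0 and num_partitions >= 1")
    base, extra = divmod(total_rows, num_partitions)
    parts = []
    skip = 0
    for i in range(num_partitions):
        # The first `extra` partitions take one additional row each.
        limit = base + (1 if i < extra else 0)
        if limit == 0:
            break  # fewer rows than partitions; stop early
        parts.append((skip, limit))
        skip += limit
    return parts


print(all_docs_partitions(14, 4))  # [(0, 4), (4, 4), (8, 3), (11, 3)]
```

Each pair would back one request of the form `_all_docs?skip=<skip>&limit=<limit>`, so the ranges cover every row exactly once.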

This message was sent by Atlassian JIRA
