[ https://issues.apache.org/jira/browse/CASSANDRA-15413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alex Petrov updated CASSANDRA-15413:
------------------------------------
    Bug Category: Parent values: Correctness(12982), Level 1 values: Recoverable Corruption / Loss(12986)
      Complexity: Challenging
     Component/s: Local/SSTable
   Discovered By: User Report
        Severity: Critical
        Assignee: Alex Petrov
          Status: Open  (was: Triage Needed)

> Missing results on reading large frozen text map
> ------------------------------------------------
>
>                 Key: CASSANDRA-15413
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15413
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Local/SSTable
>            Reporter: Tyler Codispoti
>            Assignee: Alex Petrov
>            Priority: Normal
>
> Cassandra version: 2.2.15
>
> I have been running into a case where, when fetching the results from a table
> with a frozen<map<text, text>>, if the number of results is greater than the
> fetch size (default 5000), we can end up with missing data.
>
> Side note: the table schema comes from using KairosDB, but we have isolated
> this issue to Cassandra itself. It looks like it can cause problems for users
> of KairosDB as well.
>
> Repro case, tested against a fresh install of Cassandra 2.2.15.
>
> 1. Create the table (cqlsh):
> {code:sql}
> CREATE KEYSPACE test
>   WITH REPLICATION = {
>     'class' : 'SimpleStrategy',
>     'replication_factor' : 1
>   };
>
> CREATE TABLE test.test (
>   name text,
>   tags frozen<map<text, text>>,
>   PRIMARY KEY (name, tags)
> ) WITH CLUSTERING ORDER BY (tags ASC);
> {code}
>
> 2. Insert the data (python3):
> {code:python}
> from cassandra.cluster import Cluster
>
> cluster = Cluster(['127.0.0.1'])
> session = cluster.connect('test')
>
> for i in range(0, 20000):
>     session.execute(
>         """
>         INSERT INTO test (name, tags)
>         VALUES (%s, %s)
>         """,
>         ("test_name", {'id': str(i)})
>     )
> {code}
>
> 3. Flush:
> {code:bash}
> nodetool flush
> {code}
>
> 4. Fetch the data (python3):
> {code:python}
> from cassandra.cluster import Cluster
>
> cluster = Cluster(['127.0.0.1'], control_connection_timeout=5000)
> session = cluster.connect('test')
> session.default_fetch_size = 5000
> session.default_timeout = 120
>
> count = 0
> rows = session.execute("select tags from test where name='test_name'")
> for row in rows:
>     count += 1
> print(count)
> {code}
>
> Result: 10111 (expected 20000)
>
> Changing the page size changes the result count. Some quick samples:
>
> ||default_fetch_size||count||
> |5000|10111|
> |1000|1830|
> |999|1840|
> |998|1850|
> |20000|20000|
> |100000|20000|
>
> In short, I cannot guarantee I will get all the results back unless the page
> size is greater than the number of rows.
>
> This seems to get worse with multiple SSTables (e.g. a nodetool flush between
> some of the insert batches). When using replication, the issue can get
> disgustingly bad, potentially giving a different result on each query.
>
> Interestingly, if we pad the values in the tag map ("id" in this repro case)
> so that the insertions happen in lexicographical order, there is no issue. I
> believe the issue also does not reproduce if "nodetool flush" is not called
> before querying.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
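[Editor's note] The reporter's final observation (padding the map values so that insertion order is lexicographic avoids the bug) can be checked without a Cassandra cluster. The sketch below is illustrative only and is not taken from the report: it shows that the repro's unpadded decimal ids ("0".."19999") are inserted out of lexicographic (text clustering) order, while zero-padded ids are not. The padding width of 5 via `zfill(5)` is an assumption chosen to cover 20000 values.

```python
# Hypothetical illustration (not part of the original report):
# text values are compared lexicographically, so "10" sorts before "2".
# The repro inserts ids in numeric order, which is NOT the clustering order.
unpadded = [str(i) for i in range(20000)]
assert sorted(unpadded) != unpadded  # insertion order differs from text order

# The workaround noted in the report: zero-pad the ids so that numeric
# insertion order and lexicographic order coincide (width 5 is an assumption).
padded = [str(i).zfill(5) for i in range(20000)]
assert sorted(padded) == padded  # insertion order now matches text order

print(unpadded[:3], padded[:3])
```

Running this prints `['0', '1', '2'] ['00000', '00001', '00002']`; the assertions confirm that only the padded ids arrive in the same order the text clustering column will store them.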