[ 
https://issues.apache.org/jira/browse/CASSANDRA-9304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15006279#comment-15006279
 ] 

Stefania commented on CASSANDRA-9304:
-------------------------------------

Thank you for the review. Here is the latest update:

bq. In {{write_rows_to_csv()}}, instead of using {{sys.exec_info()\[0\]}}, it 
seems like we could just use the caught exception (by doing {{except Exception, 
e:}}). Am I missing a reason why that won't work here?

No reason, I changed it, thanks.

{quote}
In {{get_ranges()}}, this line will result in two token ranges using the same 
dict, so the count of attempts and rows can become incorrect:
{code}
ranges[(previous, None)] =  
{code}
Just using {{.copy()}} on the dict should be fine.
{quote}

Fixed, thank you.

bq. In {{report_error()}}, the {{issubclass()}} check should be 
{{isinstance()}} instead.

Changed, thanks.

{quote}
If a child process dies, we don't (and can't) make any adjustments to the 
suceeded/failed count. Because the jobs that were queued to that process aren't 
resubmitted, this results in the copy process looping endlessly. I see two 
possible solutions:

# Track which token ranges we've submitted to each process. When a child 
process dies, re-submit any of those token ranges for which we have 0 rows.
# Error the entire copy process

I'm personally okay with option #2, so feel free to go with that if you don't 
want to do the work for #1.
{quote}

It's not easy to track which process is handling a range because we are using a 
shared queue where each process takes the next job whenever it is ready to 
accept another job. So I've opted for option 2 since if a process dies chances 
are the other processes will die too.

{quote}
I was able to find these bugs by editing the code to some simple fault 
injection in start_job():
{code}
if random.random() > 0.7:
   future = session.execute_async("SELECT * FROM badtable")
elif random.random() > 0.99:
    sys.exit(1)
else:
    future = session.execute_async(query)
{code}

I think it would be a good idea to expose something like this for testing 
through an environment variable. That would allow us to easily exercise these 
different error scenarios in dtests that are otherwise hard to replicate.
{quote}

Done. I've also created a 
[dtest|https://github.com/stef1927/cassandra-dtest/commit/a629352aa4735baeef42aae81c3bb098e45b77db]
 but due to the random nature of failure injection there isn't much to check 
other than checking that it doesn't time-out. Do you think this test is useful 
at all?

I've relaunched one more set of CI jobs, they are still running.

> COPY TO improvements
> --------------------
>
>                 Key: CASSANDRA-9304
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9304
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Jonathan Ellis
>            Assignee: Stefania
>            Priority: Minor
>              Labels: cqlsh
>             Fix For: 2.1.x, 2.2.x, 3.x
>
>
> COPY FROM has gotten a lot of love.  COPY TO not so much.  One obvious 
> improvement could be to parallelize reading and writing (write one page of 
> data while fetching the next).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to