[jira] [Commented] (CASSANDRA-9304) COPY TO improvements

Stefania (JIRA) Mon, 12 Oct 2015 18:35:51 -0700

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-9304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14954207#comment-14954207
 ]


Stefania commented on CASSANDRA-9304:
-------------------------------------

Thanks for your feedback [~thobbs] and sorry for the delay in addressing it.

bq. In the imports, StringIO is imported later. Also, to avoid flake8 warnings, 
add # noqa to the from io import StringIO line.

I removed {{StringIO.StringIO}}, I believe {{io.StringIO}} is equivalent and 
preferred to {{StringIO.StringIO}}? If so, {{# noqa}} doesn't seem to be 
required, the first {{E402}} is at line 117.


bq. The way that checks for unhandled options are performed is kind of strange. 
In perform_csv_import() they're checked immediately, but in 
perform_csv_export() it's deferred to ExportTask. Additionally, a lot of 
functions seem to modify the options inside an ImportExportConfig instance, 
which is somewhat unexpected behavior and seems likely to result in bugs in the 
future.

I was planning to add an {{ImportTask}} in CASSANDRA-9302 and to focus more on 
options in CASSANDRA-9303 but I've patched it up so that it is in a better 
shape. Let me know if I missed anything.

bq. In Python it's pretty unusual to use a class to only store mutable data, so 
I would replace the ImportExportConfig class with a function that returns a 
tuple (e.g. (csv_options, dialect_options, unrecognized_options)) or a dict.

Replaced with a function that returns {{csv_options, dialect_options, 
unrecognized_options}}


bq. Why do csv_options = config.csv_options if it's only used once a couple of 
lines later?

No longer applicable.

bq. As a matter of code style, avoid using tuple unpacking unless you're 
actually unpacking a tuple or the assignments are very trivial (e.g. a, b = 
None, None)

I cannot find the line you are referring to.

bq. ExportTask should inherit from object

Done

{quote} get_ranges():
There's no need to use a lambda sorting key on a list of tuples. Tuples are 
naturally sorted sanely.
Instead of del hosts\[:\], just say hosts = \[\]. You can also remove hosts = 
\[hostname\] above.
Instead of doing for entry in ring and then using entry\[0\] and entry\[1\], 
you can unpack the tuple in the loop: for token, replicas in ring:
The final ranges.append() call could use a comment for explanation. If you make 
the suggested changes to hosts, just use ranges\[-1\]\[1\] for the hosts 
(unless it's empty). This approach is a little more obvious for readers.
{quote}

Done

bq. In send_initial_work(), it's idiomatic in python to use _ for the name of 
an unused variable, so the loop should be: for _ in xrange(num_parallel_jobs).

Done

bq. In start_job(), it's not clear what would result in an IndexError and why 
we're ignoring it. Try to limit try/excepts to the minimum number of lines they 
need to contain, and add a comment explaining what's going on there.

Leftover from the initial patch, it's gone now.

bq. ExportSession should be a separate top-level class instead of a nested one. 
(It's pretty unusual to nest classes.) Also, it should inherit from object().

Done

bq. I would rename handle_result to attach_callbacks and rename return_result 
to write_rows_to_csv.

Done

{quote} The host selection code in get_session is over-complicated. How about 
something like:
{code}
new_hosts = [h for h in hosts if h not in self.hosts_to_sessions]
if new_hosts:
    host = new_hosts[0]
    # open Cluster, etc
else:
    host = min(hosts, key=lambda h: self.hosts_to_sessions[h].jobs)
    session = self.hosts_to_sessions[host]
    session.jobs += 1
    return session
{code}{quote}

Done, thank you. Do you think we need anything else to support very large 
clusters? For example should we close connections sooner rather than at the 
very end?

bq. Try to add a few more code comments explaining what's going on at a high 
level. Documenting method return values is also good (since there's no type 
information about that).

Done.

bq. You may want to put the csv-related classes and functions into separate 
modules under pylib.

I'll do this after you've had a chance to check all other points above.


I also had to fix a problem introduced by an upgrade to the python driver in 
order to run the round-trip tests, the same fix will be delivered by 
CASSANDRA-10507.

> COPY TO improvements
> --------------------
>
>                 Key: CASSANDRA-9304
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9304
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Stefania
>            Priority: Minor
>              Labels: cqlsh
>             Fix For: 3.x, 2.1.x, 2.2.x
>
>
> COPY FROM has gotten a lot of love.  COPY TO not so much.  One obvious 
> improvement could be to parallelize reading and writing (write one page of 
> data while fetching the next).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (CASSANDRA-9304) COPY TO improvements

Reply via email to