[
https://issues.apache.org/jira/browse/CASSANDRA-9304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945873#comment-14945873
]
Tyler Hobbs commented on CASSANDRA-9304:
----------------------------------------
Overall the patch is definitely a nice improvement! Most of my review comments
are about using more idiomatic python.
Imports:
* In the imports, StringIO is imported later. Also, to avoid flake8 warnings,
add {{# noqa}} to the {{from io import StringIO}} line.
ImportExportConfig:
* The way that checks for unhandled options are performed is kind of strange.
In perform_csv_import() they're checked immediately, but in
perform_csv_export() it's deferred to ExportTask. Additionally, a lot of
functions seem to modify the options inside an ImportExportConfig instance,
which is somewhat unexpected behavior and seems likely to result in bugs in the
future.
* In Python it's pretty unusual to use a class to only store mutable data, so I
would replace the ImportExportConfig class with a function that returns a tuple
(e.g. {{(csv_options, dialect_options, unrecognized_options)}}) or a dict.
perform_csv_import():
* Why do {{csv_options = config.csv_options}} if it's only used once a couple
of lines later?
* As a matter of code style, avoid using tuple unpacking unless you're
actually unpacking a tuple or the assignments are very trivial (e.g. {{a, b =
None, None}})
ExportTask:
* The class should inherit from object
* get_ranges():
** There's no need to use a lambda sorting key on a list of tuples. Tuples are
naturally sorted sanely.
** Instead of {{del hosts\[:\]}}, just say {{hosts = \[\]}}. You can also
remove {{hosts = \[hostname\]}} above.
** Instead of doing {{for entry in ring}} and then using {{entry\[0\]}} and
{{entry\[1\]}}, you can unpack the tuple in the loop: {{for token, replicas in
ring:}}
** The final {{ranges.append()}} call could use a comment for explanation. If
you make the suggested changes to {{hosts}}, just use {{ranges\[-1\]\[1\]}} for
the hosts (unless it's empty). This approach is a little more obvious for
readers.
* In {{send_initial_work()}}, it's idiomatic in python to use {{_}} for the
name of an unused variable, so the loop should be: {{for _ in
xrange(num_parallel_jobs)}}.
ExportProcess:
* In {{start_job()}}, it's not clear what would result in an IndexError and why
we're ignoring it. Try to limit try/excepts to the minimum number of lines
they need to contain, and add a comment explaining what's going on there.
* ExportSession should be a separate top-level class instead of a nested one.
(It's pretty unusual to nest classes.) Also, it should inherit from object().
* I would rename {{handle_result}} to {{attach_callbacks}} and rename
{{return_result}} to {{write_rows_to_csv}}.
* The host selection code in {{get_session}} is over-complicated. How about
something like:
{code}
new_hosts = [h for h in hosts if h not in self.hosts_to_sessions]
if new_hosts:
host = new_hosts[0]
# open Cluster, etc
else:
host = min(hosts, key=lambda h: self.hosts_to_sessions[h].jobs)
session = self.hosts_to_sessions[host]
session.jobs += 1
return session
{code}
General:
* Try to add a few more code comments explaining what's going on at a high
level. Documenting method return values is also good (since there's no type
information about that).
* You may want to put the csv-related classes and functions into separate
modules under {{pylib}}.
> COPY TO improvements
> --------------------
>
> Key: CASSANDRA-9304
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9304
> Project: Cassandra
> Issue Type: Improvement
> Components: Core
> Reporter: Jonathan Ellis
> Assignee: Stefania
> Priority: Minor
> Labels: cqlsh
> Fix For: 2.1.x
>
>
> COPY FROM has gotten a lot of love. COPY TO not so much. One obvious
> improvement could be to parallelize reading and writing (write one page of
> data while fetching the next).
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)