[jira] [Commented] (CASSANDRA-9304) COPY TO improvements

Tyler Hobbs (JIRA) Tue, 06 Oct 2015 14:39:43 -0700

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-9304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945873#comment-14945873
 ]


Tyler Hobbs commented on CASSANDRA-9304:
----------------------------------------

Overall the patch is definitely a nice improvement!  Most of my review comments 
are about using more idiomatic python.

Imports:
 * In the imports, StringIO is imported later.  Also, to avoid flake8 warnings, 
add {{# noqa}} to the {{from io import StringIO}} line.

ImportExportConfig:
* The way that checks for unhandled options are performed is kind of strange.  
In perform_csv_import() they're checked immediately, but in 
perform_csv_export() it's deferred to ExportTask.  Additionally, a lot of 
functions seem to modify the options inside an ImportExportConfig instance, 
which is somewhat unexpected behavior and seems likely to result in bugs in the 
future. 
* In Python it's pretty unusual to use a class to only store mutable data, so I 
would replace the ImportExportConfig class with a function that returns a tuple 
(e.g. {{(csv_options, dialect_options, unrecognized_options)}}) or a dict.

perform_csv_import():
 * Why do {{csv_options = config.csv_options}} if it's only used once a couple 
of lines later?
 * As a matter of code style, avoid using tuple unpacking unless you're 
actually unpacking a tuple or the assignments are very trivial (e.g. {{a, b = 
None, None}})
   
ExportTask:
* The class should inherit from object
* get_ranges():
** There's no need to use a lambda sorting key on a list of tuples.  Tuples are 
naturally sorted sanely.
** Instead of {{del hosts\[:\]}}, just say {{hosts = \[\]}}.  You can also 
remove {{hosts = \[hostname\]}} above.
** Instead of doing {{for entry in ring}} and then using {{entry\[0\]}} and 
{{entry\[1\]}}, you can unpack the tuple in the loop: {{for token, replicas in 
ring:}}
** The final {{ranges.append()}} call could use a comment for explanation.  If 
you make the suggested changes to {{hosts}}, just use {{ranges\[-1\]\[1\]}} for 
the hosts (unless it's empty).  This approach is a little more obvious for 
readers.
* In {{send_initial_work()}}, it's idiomatic in python to use {{_}} for the 
name of an unused variable, so the loop should be: {{for _ in 
xrange(num_parallel_jobs)}}.

ExportProcess:
* In {{start_job()}}, it's not clear what would result in an IndexError and why 
we're ignoring it.  Try to limit try/excepts to the minimum number of lines 
they need to contain, and add a comment explaining what's going on there.
* ExportSession should be a separate top-level class instead of a nested one.  
(It's pretty unusual to nest classes.)  Also, it should inherit from object().
* I would rename {{handle_result}} to {{attach_callbacks}} and rename 
{{return_result}} to {{write_rows_to_csv}}.
* The host selection code in {{get_session}} is over-complicated.  How about 
something like:

{code}
new_hosts = [h for h in hosts if h not in self.hosts_to_sessions]
if new_hosts:
    host = new_hosts[0]
    # open Cluster, etc
else: 
    host = min(hosts, key=lambda h: self.hosts_to_sessions[h].jobs)
    session = self.hosts_to_sessions[host]
    session.jobs += 1
    return session
{code}

General:
* Try to add a few more code comments explaining what's going on at a high 
level.  Documenting method return values is also good (since there's no type 
information about that).
* You may want to put the csv-related classes and functions into separate 
modules under {{pylib}}.


> COPY TO improvements
> --------------------
>
>                 Key: CASSANDRA-9304
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9304
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Stefania
>            Priority: Minor
>              Labels: cqlsh
>             Fix For: 2.1.x
>
>
> COPY FROM has gotten a lot of love.  COPY TO not so much.  One obvious 
> improvement could be to parallelize reading and writing (write one page of 
> data while fetching the next).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (CASSANDRA-9304) COPY TO improvements

Reply via email to