[kudu-CR] [tools] added insert-generated-rows into kudu tools

Alexey Serbin (Code Review) Wed, 28 Sep 2016 01:10:25 -0700

Alexey Serbin has posted comments on this change.

Change subject: [tools] added insert-generated-rows into kudu tools
......................................................................



Patch Set 12:

(20 comments)

http://gerrit.cloudera.org:8080/#/c/4412/11/src/kudu/tools/tool_action_test.cc
File src/kudu/tools/tool_action_test.cc:

PS11, Line 358:   }
              :   if (total_row_count != nullptr) {
              :     *total_row_count = accumulate(row_count.begin(), 
row_count.end(), 0UL);
              :   }
> Looks like it's still here?
Oops, my bad.  I'm surprised myself -- probably, I got distracted.  Let me 
remove it, though.


http://gerrit.cloudera.org:8080/#/c/4412/12/src/kudu/tools/tool_action_test.cc
File src/kudu/tools/tool_action_test.cc:

PS12, Line 21: options
> s/options/option ?
Done


Line 33: //     --num_threads=1 \
> don't you need master_addresses here as required params ? also applicable i
Good catch -- these examples became stale once I switched to the mandatory 
master(s) location.  Will update


PS12, Line 75: master server(s) location
> as per the code below there is no default_params for master server location
Yes, correct -- since changing to use master addresses as mandatory parameter, 
this no longer has default value.  Will update.


Line 80: //     bench_c_00_03
> Same as above. Also here and above, we could keep this name intuitive to re
Done


Line 119: using kudu::Schema;
> I think Schema is not being used anywhere.
Done


Line 131: using kudu::client::sp::shared_ptr;
> Actually a basic Qn here: why do we use this special shared_ptr  as opposed
It's just a typedef: it might be either std::shared_ptr or 
std::tr1::shared_ptr, check $REPO_ROOT/src/kudu/client/shared_ptr.h  That's 
because Kudu C++ client API uses this notation to be compatible with C++03.


PS12, Line 142: 1000000
> Nit: Not sure if you want to insert 1 million by default, if I am a curious
1M rows are inserted really fast: at my laptop it's just 5 seconds.  Also, the 
idea was to create some _load_ :)  With 1K rows there is no load at all.  
Besides, take into the consideration that the original insert-generated-rows 
tools was running with no limit at generated rows at all.

So, what do you think would be the right number here?


Line 194:   int64_t NextImpl() {
> Do we want to use uint64_t here ? Although Random.Next64() seems to be fill
OK, that makes sense.  Will update to uint64_t.


PS12, Line 215: != 0
> this check is prolly redundant ?
As I remember, Tidy Bot thinks it's better for readability: that's the reason I 
left it here.  I'll try to get rid of it -- let's see what it will say.


PS12, Line 285: case UNKNOWN_DATA:  // fall-through
> redundant ?
Yes, it seems so.  I copied the switch enumerators from some place and left 
this UNKNOWN_DATA as is.  Will remove.


Line 302:              FLAGS_buffer_flush_watermark_pct));
> I vaguely recall cpp guideline saying next spill being after 4 spaces I gue
QtCreator's editor does this sometimes.  Will fix.


PS12, Line 314: FLAGS_num_rows_per_thread
> consider using variable FLAGS_ directly to be consistent with other places.
I wanted to have a local variable since I access it in multiple places.


PS12, Line 315: num_rows_per_thread == 0
> given it's default value being 1M, what's this check for ?
Take a look at the comment for the '--num_rows_per_thread': 0 means no limit.


Line 324:   if (row_count != nullptr) {
> DCHECK or CHECK for this, unless we have a use case where it could actually
I don't see why DCHECK or CHECK is needed here.  One can pass nullptr as 
rowcount: that means there isn't any interest in getting the count.  It's a 
ubiquitous notation for pointer out parameters, which I want to have here, even 
if nobody uses it right now.


Line 403:   Stopwatch sw(Stopwatch::ALL_THREADS);
> Stupid Qn: Does ALL_THREADS here mean all threads on CPU or all threads gen
Threads in this process, so the monitoring/main thread will also be counted in.

I don't want LOG_TIMING, since I want to output that information in my own 
format, not what LOG_TIMING provides.


PS12, Line 440: optional scan afterwards
> I think this need not be made explicit since this is listed in optional par
I want this to be present to emphasize that there are two parts: the write and 
the read one, even if the latter is optional.  That's an essence of this test 
in a terse phrasing, otherwise it would be necessary to read through all the 
list of options to understand about the read part.


PS12, Line 452: kTableNameArg
> Can we not make this table_name optional and create one if the user doesn't
The original approach was to be able to use any table structure, not only some 
fixed pre-generated one which the test would create.  So, this test retrieves 
the table schema out of existing table it's pointed at.  That's because 
specifying custom table structure via parameters of this test would be 
cumbersome.
  
We can add this an option, with pairing auto-delete option. In case of 
auto-created table we can safely delete it if nobody asks to leave it alone 
after the test is completed.


PS12, Line 455:   
> Nit: Double space here.
Yes, here and elsewhere in the description for parameters: this is intentional 
(that's easier for reading, IMO).  Or we have some sort of style guide for 
having one space for those?


Line 457:           "the inserted rows." })
> Ideally, we would not want this behavior of test leaving the test data on t
I'm strongly opposed to auto-deleting any data by the test in its current 
implementation, since this test might be run in production environment.  
Nothing prevents a user pointing to already existing table with some data, and 
I don't want to delete any useful data along with the data generated by the 
test.

To me, using the approach you are advocating makes more sense in case of 
auto-created tables, because in that case we can have a guarantee the table was 
created by the test and not existed before.  I think we can use auto-generated 
table names for that and be safe.  I'll take a look at that approach.


-- 
To view, visit http://gerrit.cloudera.org:8080/4412
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I332927c0b928c9c4fb81a8e26f5c9ed7565299ad
Gerrit-PatchSet: 12
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin <aser...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <a...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <aser...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dral...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <din...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [tools] added insert-generated-rows into kudu tools

Reply via email to