[ 
https://issues.apache.org/jira/browse/HDFS-10754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427322#comment-15427322
 ] 

James Clampffer commented on HDFS-10754:
----------------------------------------

Nice set of tools.  If you happen to have perf comparisons with CLI hadoop and 
webhdfs you might want to put them in an attachment.  Others have raised valid 
concerns about wire compatibility in the past so it would be good to 
demonstrate that these tools are really worth implementing in C++.  I saw a 
project a year or so ago where someone wanted to have a script do some fairly 
simple FS operations.  I don't remember the details too well but it had to so 
with mirroring a directory structure + a few extra rules for what not to 
mirror.  The simple solution/prototype was to call the hadoop CLI from bash, 
but ~400ms JVM invocation time * 10s of thousands of files meant it really 
couldn't be used in production.  Had hdfs CLI tools with lower startup time 
existed the prototype would have worked just fine.

Judging from comments like "//Use GetOpt to read in the values" I'm guessing 
you're still working on this, but here's a few comments sooner rather than 
later.

0) Good to see really clear comments about how things work e.g. 
filesystem.cc:924.

1)  There's a good bit of code duplication in the config parsing code for each 
tool.  Factoring that out into a set of shared files makes each tool cleaner 
and also helps improve consistency in how they deal with configuration.  Same 
idea with the block of defines below.
{code}
#define SCHEME "hdfs"
#define HADOOP_CONF_DIR "HADOOP_CONF_DIR"
#define DEFAULT_HOST "localhost"
#define DEFAULT_PORT "8020"
#define DEFAULT_CONF_FILE "core-site.xml"
{code}

2) Very minor, disregard if you want, but you rather than including 
google/protobuf/io/coded_stream.h you could include 
google/protobuf/stubs/common.h if you just need something to declare 
ShutdownProtobufLibrary.  It should be a bit smaller so it'd help with compile 
times and brings in fewer symbols.

3) In places you call "optional<T>::value" you can also call 
"optional<T>::value_or" to get rid of some if blocks without making the code 
any harder to read (IMO).  Nested value_or calls do tend to make things hard to 
read.  The block below and the equivalent block for host are repeated in 
multiple places, perhaps make a function that takes the uri, options, and 
default and stick the logic there and share it between executables in the same 
was as the config stuff.
{code}
string port;
if(uri->get_port()){
  port = to_string(uri->get_port().value());
} else if (options.defaultFS.get_port()){
  port = to_string(options.defaultFS.get_port().value());
} else {
  port = DEFAULT_PORT;
}
{code}
can be:
{code}
string get_usable_port(const URI &uri, const Options &options) {
  if(uri->get_port()){
    return to_string(uri->get_port().value());
  }
  return to_string(options.defaultFS.get_port().value_or(DEFAULT_PORT));
}
{code}

4) One of the things that's getting more and more important is adding tests; 
admittedly I haven't been getting good CI coverage on my recent patches either. 
 Could you include some simple tests, even a bash script would do, where you 
run these tools?  One method would be to assume there's a cluster running and 
you're passed in something like $HADOOP_CLI (if you need to write anything), 
$NN_HOST, and $NN_PORT just so it's easy enough for people to run the tests.  
One of the major blockers we have in the way of merging this branch into trunk 
is getting really solid tests so if you think of a way to test these in the CI 
test environment that's ideal.

5) Use std::make_shared rather than passing a pointer into the constructor of 
shared_ptr.  make_shared will allocate the reference count variable 
contiguously with the rest of the class so you get less heap fragmentation and 
better locality.  It's also exception safe which is always a good thing.
{code}
std::shared_ptr<SetOwnerState> new_set_owner_state = 
std::shared_ptr<SetOwnerState>(new SetOwnerState (set_owner_state));
{code}
can be:
{code}
std::shared_ptr<SetOwnerState> new_set_owner_state = 
std::make_shared<SetOwnerState>(set_owner_state);
{code}

6) Not an issue in your current code, but in case anyone comes in later to 
modify it I'd get rid of pointer aliases like the one below.  You can still do 
the if(!whatever) on shared_ptrs, this is a case where constructing the 
shared_ptr with a pointer is a reasonable thing to do.
{code}
FileSystem *fs_raw = FileSystem::New(io_service, "", options);
if (!fs_raw) {
  cerr << "Could not create FileSystem object" << endl;
  return 1;
}
//Wrapping fs_raw into a unique pointer to guarantee deletion
shared_ptr<FileSystem> fs(fs_raw);
{code}

The code looks solid in terms of memory management and thread safety, let me 
know what you think of my comments and I can take a really close look at some 
of the tricky bits in the meantime.




> libhdfs++: Create tools directory and implement hdfs_cat, hdfs_chgrp, 
> hdfs_chown, and hdfs_chmod
> ------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-10754
>                 URL: https://issues.apache.org/jira/browse/HDFS-10754
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: hdfs-client
>            Reporter: Anatoli Shein
>            Assignee: Anatoli Shein
>         Attachments: HDFS-10754.HDFS-8707.000.patch, 
> HDFS-10754.HDFS-8707.001.patch, HDFS-10754.HDFS-8707.002.patch, 
> HDFS-10754.HDFS-8707.003.patch, HDFS-10754.HDFS-8707.004.patch, 
> HDFS-10754.HDFS-8707.005.patch, HDFS-10754.HDFS-8707.006.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to