[
https://issues.apache.org/jira/browse/HDFS-10785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16003564#comment-16003564
]
Anatoli Shein commented on HDFS-10785:
--------------------------------------
Thanks [~James C] for the review. I have addressed your comments as follows:
-Looks like the tools_common getOptions function pulls from hdfs-site.xml and
core-site.xml. Can you also make it let the environment's HADOOP_CONF_DIR
override those?
* The environment's HADOOP_CONF_DIR already overrides the default values (see
ConfigurationLoader::SetDefaultSearchPath()).
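For reference, a minimal sketch of how that override typically works
(illustrative names only, not the actual ConfigurationLoader code):
#include <cstdlib>
#include <string>
#include <vector>
// Build the config search path; HADOOP_CONF_DIR, when set, is consulted
// first, so its hdfs-site.xml/core-site.xml take precedence.
std::vector<std::string> defaultSearchPath() {
  std::vector<std::string> dirs;
  if (const char *env = std::getenv("HADOOP_CONF_DIR")) {
    dirs.push_back(env);
  }
  dirs.push_back("/etc/hadoop/conf"); // assumed stock location
  return dirs;
}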
-hdfs_get and hdfs_copyToLocal are really the same thing other than the help
message. Is there any reason to have two? hdfs_moveToLocal is also the same
thing other than doing the delete at the end. I think it'd be straightforward
to factor out the main copy loop into a function and call that from all 3; this
would make things a lot more maintainable.
* hdfs_get and hdfs_copyToLocal currently do the same thing; however, they
will diverge once we implement write functionality, since hdfs_get will also
be able to write to HDFS. I factored the main copy loop out into a readFile
function, which is used by hdfs_moveToLocal, hdfs_cat, hdfs_tail, hdfs_get,
and hdfs_copyToLocal.
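A sketch of the shape of that readFile helper, assuming a libhdfs++-style API
where FileSystem::Open yields a FileHandle and FileHandle::Read fills a
caller-supplied buffer (names approximated, not copied from the patch):
#include <cstdio>
#include <string>
#include "hdfspp/hdfspp.h"
static char input_buffer[4096]; // the patch uses a much larger global buffer
// Copy the contents of an HDFS file to dst (stdout, a local file, etc.).
void readFile(hdfs::FileSystem *fs, const std::string &path, std::FILE *dst) {
  hdfs::FileHandle *file = nullptr;
  hdfs::Status status = fs->Open(path, &file); // signature assumed
  if (!status.ok()) {
    std::fprintf(stderr, "Could not open %s\n", path.c_str());
    return;
  }
  size_t bytes_read = 0;
  do {
    status = file->Read(input_buffer, sizeof(input_buffer), &bytes_read);
    if (!status.ok()) {
      std::fprintf(stderr, "Error reading %s\n", path.c_str());
      break;
    }
    std::fwrite(input_buffer, 1, bytes_read, dst);
  } while (bytes_read > 0);
  delete file;
}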
-Is it worth making hdfs_ls have the option to be recursive? Seems like we
could get the same functionality with hdfs_find, can we share more code between
the two?
* hdfs_ls has a recursive option in order to stay compatible with the ls
command of the Java client
(https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-common/FileSystemShell.html#ls).
There are still more flags to add support for after this initial patch lands.
The recursive ls is implemented with a call to fs->Find, so that code is
already reused. I am not sure we can share more code between the ls and find
tools right now.
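Roughly, the dispatch looks like this (both signatures are approximated; the
actual Find/GetListing calls in the patch may differ):
// Recursive ls reuses Find; non-recursive ls lists a single directory.
std::vector<hdfs::StatInfo> entries;
hdfs::Status status =
    recursive ? fs->Find(path, "*", max_depth, &entries) // assumed signature
              : fs->GetListing(path, &entries);          // assumed signature
for (const hdfs::StatInfo &si : entries) {
  std::cout << si.str() << std::endl; // str() per the next point below
}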
-How about moving the printStatInfo function in tools_common.h to be a .str()
method on the StatInfo object in include/statinfo.h. That looks like it'd be
useful for general debugging and logging in client applications as well. Same
with printFsInfo, printContentSummary etc.
* This is done.
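The str() method amounts to something like the following (field names are
illustrative; the real StatInfo layout is in include/hdfspp/statinfo.h):
#include <sstream>
#include <string>
// Render a StatInfo as a single ls-style line for logging/debugging.
std::string StatInfo::str() const {
  std::stringstream ss;
  ss << (file_type == IS_DIR ? 'd' : '-') << ' '
     << owner << ' ' << group << ' '
     << length << ' ' << path; // illustrative subset of the fields
  return ss.str();
}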
-in hdfs_copyToLocal you don't need the break after the call to exit. Looks
like the same issue is in other tools. It'd be nice to generalize the options
parsing so you don't have that block of code in all the tools, not sure if
there's a good way to do it though.
usage();
exit(EXIT_SUCCESS);
break;
* I removed this extra break in all the tools. I don't currently see a good way
to generalize the GetOpt options parsing between the tools though.
-also in hdfs_copyToLocal either check that hdfs::Options isn't empty and call
exit with a warning, or fall back to a default one (seems reasonable).
Accessing optionals that haven't been set will throw some error, but it will
be confusing for end users.
* Solving HDFS-9539 should solve the options problem. Currently the default
values loaded are always empty, so the error message looks like: "Error
connecting to the cluster: defaultFS is empty. defaultFS of [] is not supported"
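In the meantime, a guard inside each tool's main() could be as simple as this
(member names assumed):
hdfs::Options options = *getOptions(); // the tools_common config helper
if (options.defaultFS.str().empty()) { // assumed member/accessor
  std::cerr << "Error: no defaultFS found in the loaded configuration; "
            << "check HADOOP_CONF_DIR" << std::endl;
  std::exit(EXIT_FAILURE);
}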
-More generally it seems like it'd make sense to stick the config parsing,
timeout limit changes, and FileSystem creation into a helper function in
tools_common so things can be changed in a single place.
* Done: moved everything into a doConnect function.
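Its shape is roughly the following (names approximated; the exact signature in
tools_common may differ):
#include <iostream>
#include <memory>
#include "hdfspp/hdfspp.h"
// One place for config parsing, timeout tweaks, and FileSystem creation.
std::shared_ptr<hdfs::FileSystem> doConnect(const std::string &host,
                                            const std::string &port) {
  hdfs::Options options = *getOptions();   // config parsing lives here
  hdfs::IoService *io_service = hdfs::IoService::New();
  std::shared_ptr<hdfs::FileSystem> fs(
      hdfs::FileSystem::New(io_service, "", options)); // signature assumed
  hdfs::Status status = fs->Connect(host, port);
  if (!status.ok()) {
    std::cerr << "Error connecting to the cluster: " << status.ToString()
              << std::endl;
    return nullptr;
  }
  return fs;
}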
-also in hdfs_copyToLocal: I'd increase BUF_SIZE to something a lot larger than
4096. We still don't have caching for DN connections so copying a large file is
going to have a lot of overhead just setting up connections to the DN. Consider
moving it to a global variable so it can live in bss/data rather than on the
stack; valgrind doesn't pick up on stack corruption well and it's possible
(though unlikely) that someone will set the system's max stack size to be
pretty small.
* Done. I increased it to 1 MB and made it global.
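i.e. something like (identifier names assumed):
// 1 MB buffer in bss/data rather than on the stack; fewer, larger reads
// also amortize the per-read DN connection setup cost.
#define BUF_SIZE 1048576
static char input_buffer[BUF_SIZE];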
-would be really nice if hdfs_tail let you specify how many lines to print.
Specifying how many bytes to print instead would still be helpful and a lot
less work.
* The Java client does not provide a way to specify the number of bytes/lines
to print
(https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-common/FileSystemShell.html#tail),
so I am not sure this improvement should be part of this patch.
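If we do add it later, a byte-count version could stat the file and read only
the tail, along these lines (API names approximated):
// Read the last n_bytes of the file: stat for the length, then one
// positioned read at (length - n_bytes).
hdfs::StatInfo stat;
hdfs::Status status = fs->GetFileInfo(path, stat); // assumed signature
off_t offset = stat.length > n_bytes ? (off_t)(stat.length - n_bytes) : 0;
size_t bytes_read = 0;
status = file->PositionRead(input_buffer, n_bytes, offset, &bytes_read); // assumed
std::fwrite(input_buffer, 1, bytes_read, stdout);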
-IMHO nitpicking coding style stuff to death during reviews is usually an
exercise in bikeshedding, but could you make sure tab spacing is consistent?
For example in hdfs_chmod.cpp you switch between 2 and 4 spaces. (turning .cpp
to .cc extensions for consistency would be nice too)
* I fixed this and the other spacing issues I noticed, and changed all .cpp
extensions to .cc.
Have you tested all of these tools using valgrind and large directory trees? We
are in a bad state right now since the minidfscluster tests can't be run under
valgrind (JIT compilation + GC blow things up) so memory bugs keep slipping in.
I'm going to file a jira to fork the process before starting the minidfscluster
so we can start getting coverage but I'm not sure I'll get to it for a couple
weeks.
* Yes, I tested them with large directory trees under valgrind. The only leak
I am seeing is the 'still reachable: 72,704 bytes' one, which is a known GCC 5
bug.
> libhdfs++: Implement the rest of the tools
> ------------------------------------------
>
> Key: HDFS-10785
> URL: https://issues.apache.org/jira/browse/HDFS-10785
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: hdfs-client
> Reporter: Anatoli Shein
> Assignee: Anatoli Shein
> Attachments: HDFS-10785.HDFS-8707.000.patch,
> HDFS-10785.HDFS-8707.001.patch, HDFS-10785.HDFS-8707.002.patch,
> HDFS-10785.HDFS-8707.003.patch
>
>