[ https://issues.apache.org/jira/browse/VCL-815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14315233#comment-14315233 ]

ASF subversion and git services commented on VCL-815:
-----------------------------------------------------

Commit 1658846 from [~arkurth] in branch 'vcl/trunk'
[ https://svn.apache.org/r1658846 ]

VCL-815
Reworked OS.pm::update_cluster. Each reservation now only creates a cluster_info 
file for itself. The firewall processing was also improved to be much more 
efficient.

Fixed a problem with the Windows code that creates the cluster_info file. The 
line endings were not Windows-style because the sed command in 
OS.pm::set_text_file_line_endings was failing. Sed under Cygwin can't handle a 
Windows-style file path if the "-i" switch is used. Added a 
get_cygwin_unix_file_path subroutine; it gets called first and its result is 
passed to set_text_file_line_endings.
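
For illustration, the underlying Cygwin behavior looks roughly like this. The 
file path, sed expression, and helper usage below are hypothetical, not the 
actual OS.pm/Windows.pm implementation; only the cygpath.exe conversion and the 
sed "-i" limitation come from the change described above.
{noformat}
#!/usr/bin/perl
# Sketch only -- path and commands are illustrative.
use strict;
use warnings;

my $windows_path = 'C:\\cygwin\\tmp\\cluster_info';

# sed -i under Cygwin fails when handed the Windows-style path directly, so
# convert it to a Cygwin/Unix-style path first via cygpath.exe:
chomp(my $unix_path = `cygpath -u '$windows_path'`);
# $unix_path is now something like /cygdrive/c/cygwin/tmp/cluster_info

# Force Windows-style (CRLF) line endings; '\r*$' keeps this idempotent:
system('sed', '-i', 's/\r*$/\r/', $unix_path) == 0
   or warn "failed to set line endings on $unix_path";
{noformat}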

Other
Cleaned up and renamed Windows.pm::get_cygwin_path --> 
get_cygwin_installation_directory_path. It was messy; there were notify 
messages obviously copied/pasted from another subroutine which made no sense. 
Its name had also become vague after the addition of get_cygwin_unix_file_path, 
since both subroutines call cygpath.exe.
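
For context, the two subroutines use cygpath.exe for different purposes. The 
snippet below only illustrates the distinction and is not the actual Windows.pm 
code:
{noformat}
# Illustration only -- not the actual Windows.pm code.
# Installation directory: ask cygpath.exe for the Windows path of the Cygwin root "/"
chomp(my $cygwin_root = `cygpath -w /`);   # e.g. C:\cygwin

# Unix-style file path: convert a Windows path so Cygwin tools such as sed accept it
chomp(my $unix_path = `cygpath -u 'C:\\cygwin\\tmp\\cluster_info'`);   # /cygdrive/c/cygwin/tmp/cluster_info
{noformat}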

> Problems with update_cluster in OS.pm
> -------------------------------------
>
>                 Key: VCL-815
>                 URL: https://issues.apache.org/jira/browse/VCL-815
>             Project: VCL
>          Issue Type: Bug
>          Components: vcld (backend)
>            Reporter: Andy Kurth
>            Assignee: Andy Kurth
>             Fix For: 2.4
>
>
> *Problem 1:*
> Due to OS.pm::update_cluster, the processing of the reserved state for each 
> computer is taking an extremely long time.  A fairly large cluster of 80 
> computers is taking over 20 minutes to complete, not including the time it 
> spends waiting for the user to click Connect and to actually connect to the 
> computer.  This is longer than it took to actually load the bare-metal 
> computers via xCAT.
> *Problem 2:*
> Every reservation that belongs to a cluster request first creates its own 
> cluster_info file in /tmp.  The file that each creates is identical.  Every 
> reservation then proceeds to copy the file it created to every other computer 
> in the cluster request via SCP.  So, for an 80-node cluster, cluster_info 
> files are needlessly copied back and forth 6,400 times.
> *Problem 3:*
> Each node does not copy the cluster_info file it created in /tmp to the 
> proper location on itself.  Instead, it sends it to itself via SCP.  There's 
> really no need for this because the file will end up in the proper location 
> when the other 79 nodes SCP it over.
> *Problem 4:*
> The subroutine is using the following to determine where to send the file 
> via SCP:
> {noformat}
> my $node_name = $request_data->{reservation}{$resid}{computer}{SHORTNAME};
> {noformat}
> If you're not familiar with the backend code, the remote computer's short 
> hostname is being used.  For example, if the FQDN is vm9.example.com, the 
> code is trying to send the file to "vm9".  This will only work if "vm9" 
> resolves to an address the management node can get to.  This will not work 
> for distributed management nodes residing in different locations.  
> Furthermore, with the move away from relying on /etc/hosts, whenever a 
> hostname is passed to run_scp_command, the computer's private IP address will 
> be used.  Examine the determine_remote_connection_target subroutine for more 
> information.  If it is absolutely necessary to SCP files back and forth, the 
> public IP address must be used.
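> A hedged sketch of the direction this points in -- the accessor name below 
> follows the DataStructure.pm naming style but is illustrative and has not 
> been verified against the module:
> {noformat}
> # Illustrative only -- accessor name is not verified.
> # Retrieve the computer's public IP address through the accessor layer rather
> # than using the unqualified short hostname ("vm9"):
> my $public_ip_address = $self->data->get_computer_public_ip_address();
> # The equivalent value for every other reservation in the cluster would be
> # retrieved the same way and passed to run_scp_command as the destination.
> {noformat}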
> *Problem 5:*
> The update_cluster subroutine is attempting to open the firewall on each 
> computer for incoming traffic from the other computers in the cluster.  This 
> is good, but the way it is implemented is very slow and inefficient.  It is 
> also not reliable due to the way the enable_firewall_port subroutines are 
> handling things.  Errors are being generated because the command being 
> generated is too large:
> {noformat}
> |10435|2626116|2726142|reserved|Linux.pm:enable_firewall_port|3828| ---- WARNING ----
> |10435|2626116|2726142|reserved|Linux.pm:enable_firewall_port|3828| 2015-02-04 16:28:26|10435|2626116|2726142|reserved|Linux.pm:enable_firewall_port|3828|failed to enable firewall port on blade1f5-5, protocol: tcp, port: any, scope: <omitted>, exit status: 1, command:
> |10435|2626116|2726142|reserved|Linux.pm:enable_firewall_port|3828| iptables -D INPUT 9 && iptables -D INPUT 8 && iptables -D INPUT 7 && iptables -D INPUT 6 && iptables -D INPUT 5 && iptables -D INPUT 4 && iptables -D INPUT 3 && iptables -D INPUT 2 && iptables -D INPUT 10 && iptables -D INPUT 1 && /sbin/iptables -v -I INPUT 1 -p tcp -j ACCEPT -s 10.25.0.241 -m state --state NEW,RELATED,ESTABLISHED &&
> ...much more omitted...
> |10435|2626116|2726142|reserved|Linux.pm:enable_firewall_port|3828| output:
> |10435|2626116|2726142|reserved|Linux.pm:enable_firewall_port|3828| iptables: Index of deletion too big.
> |10435|2626116|2726142|reserved|Linux.pm:enable_firewall_port|3828| ( 0) Linux.pm, enable_firewall_port (line: 3828)
> |10435|2626116|2726142|reserved|Linux.pm:enable_firewall_port|3828| (-1) OS.pm, update_cluster (line: 3909)
> |10435|2626116|2726142|reserved|Linux.pm:enable_firewall_port|3828| (-2) reserved.pm, process (line: 157)
> |10435|2626116|2726142|reserved|Linux.pm:enable_firewall_port|3828| (-3) vcld, make_new_child (line: 587)
> |10435|2626116|2726142|reserved|Linux.pm:enable_firewall_port|3828| (-4) vcld, main (line: 348)
> {noformat}
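> One possible direction -- purely a sketch, not the committed fix; the helper 
> name, arguments, and rule details are illustrative -- is to delete by rule 
> specification instead of by index and to issue each operation separately so a 
> single failure cannot abort the entire chain:
> {noformat}
> # Sketch only -- helper name, arguments, and IP addresses are illustrative.
> my @cluster_ip_addresses = ('10.25.0.241', '10.25.0.242');
> for my $ip_address (@cluster_ip_addresses) {
>    my $rule = "INPUT -p tcp -s $ip_address -m state --state NEW,RELATED,ESTABLISHED -j ACCEPT";
>    # Deleting by rule specification avoids "Index of deletion too big", and the
>    # semicolon keeps a failed delete from aborting the insert that follows:
>    run_ssh_command($computer_name, $identity_file_path, "/sbin/iptables -D $rule ; /sbin/iptables -I $rule");
> }
> {noformat}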
> *Problem 6:*
> The subroutine is calling $self->data->get_request_data() and then accessing 
> the hash data directly.  Example:
> {noformat}
> $request_data->{reservation}{$rid}{computer}{IPaddress}
> {noformat}
> The past few years have been spent trying to get rid of code which directly 
> accesses the hash constructed by get_request_info.  There are functions built 
> into DataStructure.pm which may be accessed via $self->data to retrieve this 
> information, including information about other reservations in the cluster.  
> The get_request_data function should only be used when absolutely necessary 
> -- such as from DataStructure.pm or if the entire hash is needed to create 
> another module object.
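> For comparison, the accessor-based form would look something like the 
> following (the accessor name follows the DataStructure.pm naming style but is 
> illustrative):
> {noformat}
> # Direct hash access (discouraged):
> #   $request_data->{reservation}{$rid}{computer}{IPaddress}
> # Accessor-based equivalent (illustrative name):
> my $ip_address = $self->data->get_computer_ip_address();
> {noformat}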
> *Problem 7:*
> The reserved.pm::process subroutine calculates the exact time it should time 
> out a reservation due to no initial connection by taking the time the user 
> clicked Connect and adding the initialconnecttimeout variable value to it.  
> The update_cluster subroutine gets run in the meantime.  As it is currently 
> written, update_cluster can take longer than the entire initialconnecttimeout 
> value.  When control is passed back to reserved.pm::process, the code 
> realizes the initial connection expire time is in the past, so connection 
> checking isn't done at all.
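> To illustrate the timing problem (the values below are hypothetical):
> {noformat}
> # Hypothetical values for illustration only.
> my $connect_click_time    = time();   # user clicks Connect
> my $initialconnecttimeout = 900;      # 15 minutes
> my $connect_expire_time   = $connect_click_time + $initialconnecttimeout;
> 
> # update_cluster then runs for 20+ minutes on a large cluster, so by the time
> # control returns to reserved.pm::process, time() > $connect_expire_time and
> # the initial connection check is skipped entirely.
> {noformat}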
> *Problem 8:*
> The code includes the following:
> {noformat}
> if ($image_OS_type =~ /linux/i) {
>    ...
> }
> elsif ($image_OS_type =~ /windows/i) {
>    ...
> }
> {noformat}
> The backend is architected so that these if/else statements should be avoided 
> whenever possible.  Sure, an if/else such as this is quick and easy, but the 
> more elegant way to handle it would be to add a commonly named subroutine to 
> the Linux and Windows modules that handles the OS-specific details.  Again, 
> work has been done over the past few years to rework code such as this.  I 
> see no reason to keep adding it back.
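> A sketch of the preferred pattern -- the subroutine name below is made up for 
> this example:
> {noformat}
> # OS.pm::update_cluster calls one commonly named method:
> $self->create_cluster_info_file();
> 
> # Linux.pm and Windows.pm each define their own create_cluster_info_file
> # containing the OS-specific details (destination path, line endings, etc.),
> # so OS.pm never needs to match on the image OS type.
> {noformat}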
> *Problem 9:*
> "$image_OS_type" - There are over 9,000 variables defined in the backend 
> code.  Variables intended to act as constants for an entire module are all 
> uppercase.  Nearly all other variables are entirely lowercase, except for a 
> few in some very old code.  I see no reason to mix case.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
