Re: Parsing TXT document and output to XML

2009-05-28 Thread Jim Gibson
On 5/27/09 Wed  May 27, 2009  3:27 PM, Stephen Reese rsre...@gmail.com
scribbled:

> List,
>
> I've been working on a method to parse a PDF or TXT document and
> output the results to XML over at Experts Exchange.
> http://www.experts-exchange.com/Programming/Languages/Scripting/Perl/Q_24439630.html
>
> You may view the attached document or if the mailing list doesn't
> allow here is a copy of the document I would like to parse:
> http://filedb.experts-exchange.com/incoming/2009/05_w22/143310/XenApp-Secure-Gateway-Server-VL0.txt
>
> Basically I would like to take the following code and modify it to
> parse a TXT instead of a PDF document:
>
> #!/usr/bin/perl
> use strict;
> use warnings;
> use Data::Dumper;
> use CAM::PDF;
>
> my $pdf = CAM::PDF->new('XenApp_WebInterface_Server_VL04.pdf');
> my $text;
> foreach (1..$pdf->numPages) {
>         $text .= $pdf->getPageText($_);
> }
>
> while($text =~ /Vulnerability Key:\s*
> (\S+)\s+STIG ID:\s*
> (\S+)\s+Release Number:\s*
> (\S+)\s+Status:\s*
> (\S+)\s+Short Name:\s*
> (\S+)\s+Long Name:\s*
> (\S+)\s+IA Controls:\s*
> (\S+)\s+Categories:\s*
> (\S+)\s+Effective Date:\s*
> (\S+)\s+Condition:\s*
> (\S+)\s+Policy:\s*
> (\S+)/g) {
>
> print "<Vuln>
> <Vulnerability_Key_>$1</Vulnerability_Key_>
> <STIG_ID_>$2</STIG_ID_>
> <Release_Number_>$3</Release_Number_>
> <Status_>$4</Status_>
> <Short_Name_>$5</Short_Name_>
> <Long_Name_>$6</Long_Name_>
> <IA_Controls_><IA_Control><ID>$7</ID></IA_Control></IA_Controls_>
> <Categories_>$8</Categories_>
> <Effective_Date_>$9</Effective_Date_>
> <Condition_><subitem><title>$10</title><data></data></subitem></Condition_>
> <Policy_>$11</Policy_>
> </Vuln>\n";
> }

You have two basic choices:

1. Read the whole file into a variable and use the regular expression as
above to match multiple lines, extract the information, and print it.

2. Read the file line-by-line, save the relevant data, and print the data
when you have a complete set or at the end of the program.
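
For comparison, approach 1 might look roughly like the sketch below. It is only an illustration: the sample text is abbreviated to three of the eleven fields, and the names slurp and extract_records are made up for this example. An in-memory filehandle stands in for the real file so the sketch runs on its own.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Abbreviated stand-in for the real report (three fields only).
my $sample = <<'END';
Vulnerability Key: V0018219

STIG ID: CTX0700
Release Number: 1
Vulnerability Key: V0018220
STIG ID: CTX0710
Release Number: 1
END

# Read everything from a filehandle into one string.
sub slurp {
    my ($fh) = @_;
    local $/;               # undefined separator: read to end of file
    my $text = <$fh>;
    return $text;
}

# Return one hash ref per record matched in $text.
# With /x, whitespace in the pattern is ignored, so a literal
# space is written as [ ].
sub extract_records {
    my ($text) = @_;
    my @records;
    while ( $text =~ m{ Vulnerability [ ] Key: \s* (\S+) \s+
                        STIG [ ] ID:           \s* (\S+) \s+
                        Release [ ] Number:    \s* (\S+) }gx ) {
        push @records, { key => $1, stig => $2, release => $3 };
    }
    return @records;
}

# A real script would open the .txt file here instead.
open( my $in, '<', \$sample ) or die("Can't open sample: $!");
my @found = extract_records( slurp($in) );
close($in);

print "$_->{key} $_->{stig} $_->{release}\n" for @found;
```

The trade-off is that the whole file sits in memory and a single missing field makes the pattern skip that record.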

A third choice if your data permits would be to set the input record
separator ($/) to the value that separates your records and read multiple
lines as a record. I don't think this will work in your case.
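
To show what that third choice involves, here is a rough sketch. It assumes the label 'Vulnerability Key:' itself is used as the separator, since the sample has no dedicated separator line, which is part of why it fits this data awkwardly. The name keys_by_record is invented for the example, and the text is read from an in-memory filehandle rather than the real file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split $text into records on $/ and return the first word of each
# record, which is the vulnerability key.
sub keys_by_record {
    my ($text) = @_;
    open( my $fh, '<', \$text ) or die("Can't open in-memory file: $!");

    local $/ = 'Vulnerability Key:';    # assumed record separator
    my @found;
    my $first = 1;
    while ( my $record = <$fh> ) {
        chomp($record);                 # strips the trailing separator
        if ($first) {                   # text before the first key
            $first = 0;
            next;
        }
        push @found, $1 if $record =~ m{ \A \s* (\S+) }x;
    }
    close($fh);
    return @found;
}

my $sample = "VL04\nPage 1 of 8\n"
           . "Vulnerability Key: V0018219\nSTIG ID: CTX0700\n"
           . "Vulnerability Key: V0018220\nSTIG ID: CTX0710\n";

print "$_\n" for keys_by_record($sample);
```

Each record read this way ends with the label of the next one, so the first chunk is only the page header and every later chunk starts with the key value rather than its label.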

Here is an example of approach 2:

#!/usr/local/bin/perl
use strict;
use warnings;

my @keys = (
    'Vulnerability Key',
    'STIG ID',
    'Release Number',
    'Status',
    'Short Name',
    'Long Name',
    'IA Controls',
    'Categories',
    'Effective Date',
    'Condition',
    'Policy'
);
my( %keys, %tags );
$keys{$_} = 1 for @keys;
$tags{$_} = $_ . '_' for @keys;
$tags{$_} =~ s/ /_/g for @keys;

my $file = 'XenApp Secure_Gateway_Server_VL04.txt';
open( my $fh, '<', $file) or die("Can't open $file: $!");

my %record = map { $_, '' } @keys;
while( my $line = <$fh> ) {
    chomp($line);
    if( $line =~ m{ \A (.+?) : \s* (\S+) }x ) {
        $record{$1} = $2 if $keys{$1};
        if( $1 eq $keys[$#keys] ) {
            print "<Vuln>\n";
            print "<$tags{$_}>$record{$_}</$tags{$_}>\n" for @keys;
            print "</Vuln>\n";
            %record = map { $_, '' } @keys;
        }
    }
}

... which produces for your input:

<Vuln>
<Vulnerability_Key_>V0018219</Vulnerability_Key_>
<STIG_ID_>CTX0700</STIG_ID_>
<Release_Number_>1</Release_Number_>
<Status_>Working</Status_>
<Short_Name_>Secure</Short_Name_>
<Long_Name_>Secure</Long_Name_>
<IA_Controls_>ECSC-1</IA_Controls_>
<Categories_>4.4</Categories_>
<Effective_Date_></Effective_Date_>
<Condition_></Condition_>
<Policy_>All</Policy_>
</Vuln>
...

You may want to add error checking for the case where some keys are missing.
Note that your regular expression will only extract the first word of the
value, and your data in some cases has more than that on a line. You can
change this by changing the RE to:

if( $line =~ m{ \A (.+?) : \s* (.*?) \s* \z }x ) {
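
As a hedged illustration of both points, the sketch below uses the changed RE to keep the whole value and reports labels that never appeared. The key list is abbreviated, and parse_lines and missing_keys are names made up for this example, not part of the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Abbreviated key list, for illustration only.
my @keys = ( 'Vulnerability Key', 'STIG ID', 'Status', 'Policy' );
my %is_key = map { $_ => 1 } @keys;

# Parse "Label: value" lines and return a hash of label => whole value.
sub parse_lines {
    my (@lines) = @_;
    my %record;
    for my $line (@lines) {
        next unless $line =~ m{ \A (.+?) : \s* (.*?) \s* \z }x;
        $record{$1} = $2 if $is_key{$1};
    }
    return %record;
}

# List the expected keys that never appeared in the record.
# An empty value still counts as present.
sub missing_keys {
    my (%record) = @_;
    return grep { !exists $record{$_} } @keys;
}

my %demo = parse_lines(
    'Vulnerability Key: V0018219 ',
    'Status: Working ',
    'Policy: All Policies ',
);
my @absent = missing_keys(%demo);
warn "Missing keys: @absent\n" if @absent;
print "$_ = $demo{$_}\n" for grep { exists $demo{$_} } @keys;
```

Using exists rather than a truth test keeps an empty field such as "Effective Date:" distinct from a field that was never seen.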




-- 
To unsubscribe, e-mail: beginners-unsubscr...@perl.org
For additional commands, e-mail: beginners-h...@perl.org
http://learn.perl.org/




Parsing TXT document and output to XML

2009-05-27 Thread Stephen Reese
List,

I've been working on a method to parse a PDF or TXT document and
output the results to XML over at Experts Exchange.
http://www.experts-exchange.com/Programming/Languages/Scripting/Perl/Q_24439630.html

You may view the attached document or if the mailing list doesn't
allow here is a copy of the document I would like to parse:
http://filedb.experts-exchange.com/incoming/2009/05_w22/143310/XenApp-Secure-Gateway-Server-VL0.txt

Basically I would like to take the following code and modify it to
parse a TXT instead of a PDF document:

#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
use CAM::PDF;

my $pdf = CAM::PDF->new('XenApp_WebInterface_Server_VL04.pdf');
my $text;
foreach (1..$pdf->numPages) {
        $text .= $pdf->getPageText($_);
}

while($text =~ /Vulnerability Key:\s*
(\S+)\s+STIG ID:\s*
(\S+)\s+Release Number:\s*
(\S+)\s+Status:\s*
(\S+)\s+Short Name:\s*
(\S+)\s+Long Name:\s*
(\S+)\s+IA Controls:\s*
(\S+)\s+Categories:\s*
(\S+)\s+Effective Date:\s*
(\S+)\s+Condition:\s*
(\S+)\s+Policy:\s*
(\S+)/g) {

print "<Vuln>
<Vulnerability_Key_>$1</Vulnerability_Key_>
<STIG_ID_>$2</STIG_ID_>
<Release_Number_>$3</Release_Number_>
<Status_>$4</Status_>
<Short_Name_>$5</Short_Name_>
<Long_Name_>$6</Long_Name_>
<IA_Controls_><IA_Control><ID>$7</ID></IA_Control></IA_Controls_>
<Categories_>$8</Categories_>
<Effective_Date_>$9</Effective_Date_>
<Condition_><subitem><title>$10</title><data></data></subitem></Condition_>
<Policy_>$11</Policy_>
</Vuln>\n";
}
VL04 
Page 1 of 8 


For Official Use Only 


When this document is printed, the document needs to be stamped top and bottom 
with the appropriate classification. 

VL04 -Vulnerabilities by Asset Property Element (for Vulnerability Maintainers) 

Vulnerability Key: V0018219 

STIG ID: CTX0700 
Release Number: 1 
Status: Working 
Short Name: Secure Gateway servers are not located in the DMZ. 
Long Name: Secure Gateway servers are not located in the DMZ or screened 
subnet. 
IA Controls: ECSC-1 Security Configuration Compliance 
Categories: 4.4 DMZ 

Effective Date: 
Condition: 

XenApp Secure Gateway Server (Target: XenApp Secure Gateway Server) 
Policy: All Policies 

MAC / 
Confidentiality 
Grid: 
I -Mission Critical II -Mission Support III -Administrative 
Classified 
Sensitive 
Public 
STIG ID: CTX0700 
Severity: Category II 
Long Name: Secure Gateway servers are not located in the DMZ or screened 
subnet. 
Vulnerability 
The Secure Gateway is an application that runs as a service on a server that is 
deployed in the 

Discussion: 
DMZ. The server running the Secure Gateway represents a single point of access 
to the secure, 
enterprise network. The Secure Gateway acts as an intermediary for every 
connection request 
originating from the Internet to the enterprise network. The Secure Gateway 
allows the tunneling of 
all ICA client traffic using SSL/TLS. The Secure Gateway manages the 
connectivity and encryption 
across the public Internet and hides the XenApp farm from potential intruders. 

Responsibility: Information Assurance Officer 

References: 
Department of Defense Instruction 8500.2 (DODI 8500.2) 

Checks: 
CTX0700 (Manual) 
Check with the Network reviewer or system administrator to obtain the external, 
internal, and DMZ 
IP addresses of the firewall. Once these IP addresses have been obtained, 
review the IP address 
configuration on Secure Gateway servers. Access the Secure Gateway server and 
type the 
following at the command prompt: 

C:\ipconfig /all 

1. If the IP address is on the same network as the DMZ firewall interface, this 
is not a finding. 
2. If the IP address is on the same internal network as the internal interface 
of the firewall, this is a 
finding. 
3. If the IP address is on the same network as the outside interface of the 
firewall, this is a finding. 
Fixes: 
CTX0700 (Manual) 
Place the Secure Gateway server in the DMZ or screened subnet. 

https://vms.disa.mil/VL04.aspx 
3/12/2009 



VL04 
Page 2 of 8 


Vulnerability Key: V0018220 
STIG ID: CTX0710 
Release Number: 1 

Status: Working 
Short Name: Secure Gateway certs are not DoD approved certs 
Long Name: Secure Gateway certificates are not DoD approved certificates. 
IA Controls: DCNR-1 Non-repudiation 
Categories: 1.2 PKI 
Effective Date: 

Condition: 

XenApp Secure Gateway Server (Target: XenApp Secure Gateway Server) 
Policy: All Policies 

MAC / 
Confidentiality 
Grid: 
I -Mission Critical II -Mission Support III -Administrative 
Classified 
Sensitive 
Public 
STIG ID: CTX0710 
Severity: Category II 
Long Name: Secure Gateway certificates are not DoD approved certificates. 


Vulnerability 
User sessions with Citrix Secure Gateway should be encrypted since transmitting 
data in plaintext 

Discussion: 
may be viewed as it travels through the network. User sessions may be initiated 
from ICA clients. 
To encrypt session data, the sending component, the client, applies ciphers to 
alter the data before 
transmitting it. The receiving component uses a key to decrypt the data, 
returning it to its original 
form. To 

Re: Parsing TXT document and output to XML

2009-05-27 Thread Stephen Reese

I've tried to modify the script but I'm all over the place. Should I
use a WHILE statement to read the FILE and then FOREACH to parse
each set of data? Or the other way around? Thanks

#!/usr/bin/perl
use strict;
use warnings;

open (FILE, 'XenApp_WebInterface_Server_VL04.txt');

while(<FILE>)
{
foreach($_ =~ /Vulnerability Key:\s*
(\S+)\s+STIG ID:\s*
(\S+)\s+Release Number:\s*
(\S+)\s+Status:\s*
(\S+)\s+Short Name:\s*
(\S+)\s+Long Name:\s*
(\S+)\s+IA Controls:\s*
(\S+)\s+Categories:\s*
(\S+)\s+Effective Date:\s*
(\S+)\s+Condition:\s*
(\S+)\s+Policy:\s*
(\S+)/g) {

print "<Vuln>
<Vulnerability_Key_>$1</Vulnerability_Key_>
<STIG_ID_>$2</STIG_ID_>
<Release_Number_>$3</Release_Number_>
<Status_>$4</Status_>
<Short_Name_>$5</Short_Name_>
<Long_Name_>$6</Long_Name_>
<IA_Controls_><IA_Control><ID>$7</ID></IA_Control></IA_Controls_>
<Categories_>$8</Categories_>
<Effective_Date_>$9</Effective_Date_>
<Condition_><subitem><title>$10</title><data></data></subitem></Condition_>
<Policy_>$11</Policy_>
</Vuln>\n";
}
}
