Re: Parsing TXT document and output to XML
On 5/27/09 Wed May 27, 2009 3:27 PM, Stephen Reese <rsre...@gmail.com> scribbled:

> List,
>
> I've been working on a method to parse a PDF or TXT document and output the results to XML over at Experts Exchange.
>
> http://www.experts-exchange.com/Programming/Languages/Scripting/Perl/Q_24439630.html
>
> You may view the attached document or, if the mailing list doesn't allow attachments, here is a copy of the document I would like to parse:
>
> http://filedb.experts-exchange.com/incoming/2009/05_w22/143310/XenApp-Secure-Gateway-Server-VL0.txt
>
> Basically I would like to take the following code and modify it to parse a TXT instead of a PDF document:
>
> #!/usr/bin/perl
> use strict;
> use warnings;
> use Data::Dumper;
> use CAM::PDF;
>
> my $pdf = CAM::PDF->new('XenApp_WebInterface_Server_VL04.pdf');
> my $text;
> foreach (1..$pdf->numPages) {
>     $text .= $pdf->getPageText($_);
> }
>
> while($text =~ /Vulnerability Key:\s* (\S+)\s+STIG ID:\s* (\S+)\s+Release Number:\s* (\S+)\s+Status:\s* (\S+)\s+Short Name:\s* (\S+)\s+Long Name:\s* (\S+)\s+IA Controls:\s* (\S+)\s+Categories:\s* (\S+)\s+Effective Date:\s* (\S+)\s+Condition:\s* (\S+)\s+Policy:\s* (\S+)/g) {
>     print "<Vuln>
> <Vulnerability_Key_>$1</Vulnerability_Key_>
> <STIG_ID_>$2</STIG_ID_>
> <Release_Number_>$3</Release_Number_>
> <Status_>$4</Status_>
> <Short_Name_>$5</Short_Name_>
> <Long_Name_>$6</Long_Name_>
> <IA_Controls_><IA_Control><ID>$7</ID></IA_Control></IA_Controls_>
> <Categories_>$8</Categories_>
> <Effective_Date_>$9</Effective_Date_>
> <Condition_><subitem><title>$10</title><data></data></subitem></Condition_>
> <Policy_>$11</Policy_>
> </Vuln>\n";
> }

You have two basic choices:

1. Read the whole file into a variable and use the regular expression as above to match multiple lines, extract the information, and print it.

2. Read the file line-by-line, save the relevant data, and print the data when you have a complete set or at the end of the program.

A third choice, if your data permits, would be to set the input record separator ($/) to the value that separates your records and read multiple lines as a record. I don't think this will work in your case.
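Approach 1 might look roughly like the sketch below. It is only an illustration under some assumptions: the helper names slurp_file and extract_records are made up for this example, only three of the eleven fields are matched to keep it short, and the in-memory sample assumes one "Name: value" field per line, which may not be exactly how your TXT file is laid out.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the whole file as one string.
sub slurp_file {
    my ($path) = @_;
    open( my $fh, '<', $path ) or die "Can't open $path: $!";
    local $/;    # undefined record separator: the next read returns everything
    my $text = <$fh>;
    close($fh);
    return $text;
}

# Hypothetical helper: scan $text for records and return a list of hash refs.
# Only three fields are matched here; extend the pattern for the rest.
sub extract_records {
    my ($text) = @_;
    my @records;
    while ( $text =~ /Vulnerability Key:\s*(\S+)\s+STIG ID:\s*(\S+)\s+Release Number:\s*(\S+)/g ) {
        push @records, { key => $1, stig => $2, release => $3 };
    }
    return @records;
}

# Demo on an in-memory sample (assumed layout). For a real file use:
#   my $text = slurp_file('your_file.txt');
my $sample = <<'END';
Vulnerability Key: V0018219
STIG ID: CTX0700
Release Number: 1
Status: Working
Vulnerability Key: V0018220
STIG ID: CTX0710
Release Number: 1
END

for my $r ( extract_records($sample) ) {
    print "<Vuln>\n";
    print "<Vulnerability_Key_>$r->{key}</Vulnerability_Key_>\n";
    print "<STIG_ID_>$r->{stig}</STIG_ID_>\n";
    print "<Release_Number_>$r->{release}</Release_Number_>\n";
    print "</Vuln>\n";
}
```

The point of reading the whole file first is that the pattern spans several lines, so it can only match against a string that contains all of them at once.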
Here is an example of approach 2:

#!/usr/local/bin/perl
use strict;
use warnings;

# Field names in the order they appear in each record
my @keys = (
    'Vulnerability Key', 'STIG ID', 'Release Number', 'Status',
    'Short Name', 'Long Name', 'IA Controls', 'Categories',
    'Effective Date', 'Condition', 'Policy'
);
my( %keys, %tags );
$keys{$_} = 1 for @keys;           # lookup table of known field names
$tags{$_} = $_ . '_' for @keys;    # tag name: field name plus trailing '_'
$tags{$_} =~ s/ /_/g for @keys;    # ... with spaces turned into underscores

my $file = 'XenApp Secure_Gateway_Server_VL04.txt';
open( my $fh, '<', $file) or die("Can't open $file: $!");
my %record = map { $_, '' } @keys;
while( my $line = <$fh> ) {
    chomp($line);
    if( $line =~ m{ \A (.+?) : \s* (\S+) }x ) {
        $record{$1} = $2 if $keys{$1};
        # The last key ('Policy') completes a record: print it and reset
        if( $1 eq $keys[$#keys] ) {
            print "<Vuln>\n";
            print "<$tags{$_}>$record{$_}</$tags{$_}>\n" for @keys;
            print "</Vuln>\n";
            %record = map { $_, '' } @keys;
        }
    }
}

... which produces for your input:

<Vuln>
<Vulnerability_Key_>V0018219</Vulnerability_Key_>
<STIG_ID_>CTX0700</STIG_ID_>
<Release_Number_>1</Release_Number_>
<Status_>Working</Status_>
<Short_Name_>Secure</Short_Name_>
<Long_Name_>Secure</Long_Name_>
<IA_Controls_>ECSC-1</IA_Controls_>
<Categories_>4.4</Categories_>
<Effective_Date_></Effective_Date_>
<Condition_></Condition_>
<Policy_>All</Policy_>
</Vuln>
...

You may want to add error checking for the case where some keys are missing.

Note that your regular expression will only extract the first word of the value, and your data in some cases has more than that on a line. You can change this by changing the RE to:

if( $line =~ m{ \A (.+?) : \s* (.*?) \s* \z }x ) {

--
To unsubscribe, e-mail: beginners-unsubscr...@perl.org
For additional commands, e-mail: beginners-h...@perl.org
http://learn.perl.org/
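Following up on the reply above, the two suggestions in it (capturing the whole value and checking for missing keys) could be combined roughly as sketched here. This is an illustration, not the code from the reply: records_to_xml and xml_escape are hypothetical helper names, the key list is shortened, the XML escaping step is an addition that the thread does not discuss, and the sample lines assume one "Name: value" field per line.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Shortened key list for illustration; the last key closes a record.
my @keys = ( 'Vulnerability Key', 'STIG ID', 'Status', 'Short Name', 'Policy' );
my %is_key = map { $_ => 1 } @keys;

# Hypothetical helper: escape characters that are special in XML text.
sub xml_escape {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    return $s;
}

# Hypothetical helper: turn a list of lines into XML strings, one per record.
sub records_to_xml {
    my (@lines) = @_;
    my ( @out, %record );
    for my $line (@lines) {
        chomp $line;
        # Whole-value form of the regex: capture everything after the colon
        next unless $line =~ m{ \A (.+?) : \s* (.*?) \s* \z }x;
        my ( $name, $value ) = ( $1, $2 );
        next unless $is_key{$name};
        $record{$name} = $value;
        next unless $name eq $keys[-1];

        # Error check: report keys that never appeared in this record
        my @missing = grep { !defined $record{$_} } @keys;
        warn "Record is missing: @missing\n" if @missing;

        my $xml = "<Vuln>\n";
        for my $k (@keys) {
            ( my $tag = $k ) =~ s/ /_/g;    # same tag naming as the reply
            $tag .= '_';
            my $v = defined $record{$k} ? $record{$k} : '';
            $xml .= "<$tag>" . xml_escape($v) . "</$tag>\n";
        }
        $xml .= "</Vuln>\n";
        push @out, $xml;
        %record = ();
    }
    return @out;
}

# Demo with in-memory lines; with a file, pass the lines read from <$fh>.
my @sample = (
    "Vulnerability Key: V0018219\n",
    "STIG ID: CTX0700\n",
    "Status: Working\n",
    "Short Name: Secure Gateway servers are not located in the DMZ.\n",
    "Policy: All Policies\n",
);
print records_to_xml(@sample);
```

Starting each record from an empty hash, rather than from empty strings, is what lets the check tell a key that never appeared apart from one that appeared with an empty value, such as "Effective Date:".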
Parsing TXT document and output to XML
List,

I've been working on a method to parse a PDF or TXT document and output the results to XML over at Experts Exchange.

http://www.experts-exchange.com/Programming/Languages/Scripting/Perl/Q_24439630.html

You may view the attached document or, if the mailing list doesn't allow attachments, here is a copy of the document I would like to parse:

http://filedb.experts-exchange.com/incoming/2009/05_w22/143310/XenApp-Secure-Gateway-Server-VL0.txt

Basically I would like to take the following code and modify it to parse a TXT instead of a PDF document:

#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
use CAM::PDF;

my $pdf = CAM::PDF->new('XenApp_WebInterface_Server_VL04.pdf');
my $text;
foreach (1..$pdf->numPages) {
    $text .= $pdf->getPageText($_);
}

while($text =~ /Vulnerability Key:\s* (\S+)\s+STIG ID:\s* (\S+)\s+Release Number:\s* (\S+)\s+Status:\s* (\S+)\s+Short Name:\s* (\S+)\s+Long Name:\s* (\S+)\s+IA Controls:\s* (\S+)\s+Categories:\s* (\S+)\s+Effective Date:\s* (\S+)\s+Condition:\s* (\S+)\s+Policy:\s* (\S+)/g) {
    print "<Vuln>
<Vulnerability_Key_>$1</Vulnerability_Key_>
<STIG_ID_>$2</STIG_ID_>
<Release_Number_>$3</Release_Number_>
<Status_>$4</Status_>
<Short_Name_>$5</Short_Name_>
<Long_Name_>$6</Long_Name_>
<IA_Controls_><IA_Control><ID>$7</ID></IA_Control></IA_Controls_>
<Categories_>$8</Categories_>
<Effective_Date_>$9</Effective_Date_>
<Condition_><subitem><title>$10</title><data></data></subitem></Condition_>
<Policy_>$11</Policy_>
</Vuln>\n";
}

VL04 Page 1 of 8 For Official Use Only When this document is printed, the document needs to be stamped top and bottom with the appropriate classification. VL04 -Vulnerabilities by Asset Property Element (for Vulnerability Maintainers) Vulnerability Key: V0018219 STIG ID: CTX0700 Release Number: 1 Status: Working Short Name: Secure Gateway servers are not located in the DMZ. Long Name: Secure Gateway servers are not located in the DMZ or screened subnet.
IA Controls: ECSC-1 Security Configuration Compliance Categories: 4.4 DMZ Effective Date: Condition: XenApp Secure Gateway Server (Target: XenApp Secure Gateway Server) Policy: All Policies MAC / Confidentiality Grid: I -Mission Critical II -Mission Support III -Administrative Classified Sensitive Public STIG ID: CTX0700 Severity: Category II Long Name: Secure Gateway servers are not located in the DMZ or screened subnet. Vulnerability The Secure Gateway is an application that runs as a service on a server that is deployed in the Discussion: DMZ. The server running the Secure Gateway represents a single point of access to the secure, enterprise network. The Secure Gateway acts as an intermediary for every connection request originating from the Internet to the enterprise network. The Secure Gateway allows the tunneling of all ICA client traffic using SSL/TLS. The Secure Gateway manages the connectivity and encryption across the public Internet and hides the XenApp farm from potential intruders. Responsibility: Information Assurance Officer References: Department of Defense Instruction 8500.2 (DODI 8500.2) Checks: CTX0700 (Manual) Check with the Network reviewer or system administrator to obtain the external, internal, and DMZ IP addresses of the firewall. Once these IP addresses have been obtained, review the IP address configuration on Secure Gateway servers. Access the Secure Gateway server and type the following at the command prompt: C:\ipconfig /all 1. If the IP address is on the same network as the DMZ firewall interface, this is not a finding. 2. If the IP address is on the same internal network as the internal interface of the firewall, this is a finding. 3. If the IP address is on the same network as the outside interface of the firewall, this is a finding. Fixes: CTX0700 (Manual) Place the Secure Gateway server in the DMZ or screened subnet. 
https://vms.disa.mil/VL04.aspx 3/12/2009 VL04 Page 2 of 8 Vulnerability Key: V0018220 STIG ID: CTX0710 Release Number: 1 Status: Working Short Name: Secure Gateway certs are not DoD approved certs Long Name: Secure Gateway certificates are not DoD approved certificates. IA Controls: DCNR-1 Non-repudiation Categories: 1.2 PKI Effective Date: Condition: XenApp Secure Gateway Server (Target: XenApp Secure Gateway Server) Policy: All Policies MAC / Confidentiality Grid: I -Mission Critical II -Mission Support III -Administrative Classified Sensitive Public STIG ID: CTX0710 Severity: Category II Long Name: Secure Gateway certificates are not DoD approved certificates. Vulnerability User sessions with Citrix Secure Gateway should be encrypted since transmitting data in plaintext Discussion: may be viewed as it travels through the network. User sessions may be initiated from ICA clients. To encrypt session data, the sending component, the client, applies ciphers to alter the data before transmitting it. The receiving component uses a key to decrypt the data, returning it to its original form. To
Re: Parsing TXT document and output to XML
On Wed, May 27, 2009 at 6:27 PM, Stephen Reese <rsre...@gmail.com> wrote:

> List,
>
> I've been working on a method to parse a PDF or TXT document and output the results to XML over at Experts Exchange.
>
> http://www.experts-exchange.com/Programming/Languages/Scripting/Perl/Q_24439630.html
>
> You may view the attached document or, if the mailing list doesn't allow attachments, here is a copy of the document I would like to parse:
>
> http://filedb.experts-exchange.com/incoming/2009/05_w22/143310/XenApp-Secure-Gateway-Server-VL0.txt
>
> Basically I would like to take the following code and modify it to parse a TXT instead of a PDF document:
>
> #!/usr/bin/perl
> use strict;
> use warnings;
> use Data::Dumper;
> use CAM::PDF;
>
> my $pdf = CAM::PDF->new('XenApp_WebInterface_Server_VL04.pdf');
> my $text;
> foreach (1..$pdf->numPages) {
>     $text .= $pdf->getPageText($_);
> }
>
> while($text =~ /Vulnerability Key:\s* (\S+)\s+STIG ID:\s* (\S+)\s+Release Number:\s* (\S+)\s+Status:\s* (\S+)\s+Short Name:\s* (\S+)\s+Long Name:\s* (\S+)\s+IA Controls:\s* (\S+)\s+Categories:\s* (\S+)\s+Effective Date:\s* (\S+)\s+Condition:\s* (\S+)\s+Policy:\s* (\S+)/g) {
>     print "<Vuln>
> <Vulnerability_Key_>$1</Vulnerability_Key_>
> <STIG_ID_>$2</STIG_ID_>
> <Release_Number_>$3</Release_Number_>
> <Status_>$4</Status_>
> <Short_Name_>$5</Short_Name_>
> <Long_Name_>$6</Long_Name_>
> <IA_Controls_><IA_Control><ID>$7</ID></IA_Control></IA_Controls_>
> <Categories_>$8</Categories_>
> <Effective_Date_>$9</Effective_Date_>
> <Condition_><subitem><title>$10</title><data></data></subitem></Condition_>
> <Policy_>$11</Policy_>
> </Vuln>\n";
> }

I've tried to modify the script but I'm all over the place. Should I use a WHILE statement to open the FILE and then FOREACH to parse each set of data? Or the other way around?

Thanks

#!/usr/bin/perl
use strict;
use warnings;

open (FILE, 'XenApp_WebInterface_Server_VL04.txt');
while(<FILE>) {
    foreach($_ =~ /Vulnerability Key:\s* (\S+)\s+STIG ID:\s* (\S+)\s+Release Number:\s* (\S+)\s+Status:\s* (\S+)\s+Short Name:\s* (\S+)\s+Long Name:\s* (\S+)\s+IA Controls:\s* (\S+)\s+Categories:\s* (\S+)\s+Effective Date:\s* (\S+)\s+Condition:\s* (\S+)\s+Policy:\s* (\S+)/g) {
        print "<Vuln>
<Vulnerability_Key_>$1</Vulnerability_Key_>
<STIG_ID_>$2</STIG_ID_>
<Release_Number_>$3</Release_Number_>
<Status_>$4</Status_>
<Short_Name_>$5</Short_Name_>
<Long_Name_>$6</Long_Name_>
<IA_Controls_><IA_Control><ID>$7</ID></IA_Control></IA_Controls_>
<Categories_>$8</Categories_>
<Effective_Date_>$9</Effective_Date_>
<Condition_><subitem><title>$10</title><data></data></subitem></Condition_>
<Policy_>$11</Policy_>
</Vuln>\n";
    }
}