Hi,
I am trying to read selected fields from an XML file with R using the XML
package. So far I have learned the basics of this package by going
through the manual, examples, and tutorial (www.omegahat.org/RSXML).
The problem is that I get stuck when it comes to more complex XML
files. I am a novice in both R and XML, and was wondering if someone
could help me out here.

Here is my XML file. I am only interested in the <protein_group> nodes,
so I have omitted most of the information from the two preceding nodes
(protein_summary_header, proteinprophet_details).

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl"
href="http://localhost/ISB/data/interact-LFA1_C18_PME5R1.prot.xsl"?>
<protein_summary xmlns="http://regis-web.systemsbiology.net/protXML"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://sashimi.sourceforge.net/schema_revision/protXML/protXML_v6.xsd"
summary_xml="interact-LFA1_C18_PME5R1.prot.xml">
<protein_summary_header reference_database="EColi_decoy_v3.0.fasta">
<program_details analysis="proteinprophet">
<proteinprophet_details  occam_flag="Y" run_options="XML">
<protein_group group_number="1" probability="1.0000">
       <protein protein_name="sp|P00004|CYC_HORSE"
n_indistinguishable_proteins="1" probability="1.0000"
percent_coverage="46.7"
unique_stripped_peptides="EDLIAYLK+EETLMEYLENPK+KTGQAPGFTYTDANK+TEREDLIAYLK+TGPNLHGLFGR+TGQAPGFTYTDANK"
group_sibling_id="a" total_number_peptides="226"
pct_spectrum_ids="2.54" confidence="1.00">
          <parameter name="prot_length" value="107"/>
          <annotation protein_description="Cytochrome c OS=Equus  
caballus GN=CYCS PE=1 SV=2"/>
          <peptide peptide_sequence="KTGQAPGFTYTDANK" charge="2"  
initial_probability="0.9989" nsp_adjusted_probability="0.9998"  
peptide_group_designator="a" weight="1.00"  
is_nondegenerate_evidence="Y" n_enzymatic_termini="2"  
n_sibling_peptides="8.50" n_sibling_peptides_bin="6" n_instances="10"  
exp_tot_instances="9.94" is_contributing_evidence="Y"  
calc_neutral_pep_mass="1597.7737">
          </peptide>
          <peptide peptide_sequence="TGQAPGFTYTDANK" charge="2"  
initial_probability="0.9989" nsp_adjusted_probability="0.9998"  
weight="1.00" is_nondegenerate_evidence="Y" n_enzymatic_termini="2"  
n_sibling_peptides="8.50" n_sibling_peptides_bin="6" n_instances="90"  
exp_tot_instances="89.82" is_contributing_evidence="Y"  
calc_neutral_pep_mass="1469.6786">
          </peptide>
          <peptide peptide_sequence="KTGQAPGFTYTDANK" charge="3"  
initial_probability="0.9990" nsp_adjusted_probability="0.9998"  
peptide_group_designator="a" weight="1.00"  
is_nondegenerate_evidence="Y" n_enzymatic_termini="2"  
n_sibling_peptides="8.50" n_sibling_peptides_bin="6" n_instances="10"  
exp_tot_instances="9.89" is_contributing_evidence="Y"  
calc_neutral_pep_mass="1597.7737">
          </peptide>
       </protein>
</protein_group>
<protein_group group_number="2" probability="1.0000">
       <protein protein_name="sp|P00350|6PGD_ECOLI"
n_indistinguishable_proteins="1" probability="1.0000"
percent_coverage="32.1"
unique_stripped_peptides="AGAGTDAAIDSLKPYLDK+EAYELVAPILTK+EFVESLETPR+EKTEEVIAENPGK+GDIIIDGGNTFFQDTIR+GPSIMPGGQK+GYTVSIFNR+IAAVAEDGEPCVTYIGADGAGHYVK+IVSYAQGFSQLR+QIADDYQQALR+TEEVIAENPGK+VLSGPQAQPAGDK"
group_sibling_id="a"
total_number_peptides="32" pct_spectrum_ids="0.36" confidence="1.00">
          <parameter name="prot_length" value="474"/>
          <annotation protein_description="6-phosphogluconate deh ...


I did the following:
 > doc <- xmlRoot(xmlTreeParse("myfile.xml"))
 > xmlApply(doc, names)
$protein_summary_header
   program_details
"program_details"

$dataset_derivation
list()

$protein_group
   protein
"protein"

$protein_group
   protein
"protein"

[In fact, $protein_group appears a couple of hundred times.]

So, I want to create a data frame containing selected information
from each $protein_group, as follows:

group_number  protein_name          probability  peptide_sequence  initial_probability  n_instances
1             sp|P00004|CYC_HORSE   1.0000       KTGQAPGFTYTDANK   0.9989               10
1             sp|P00004|CYC_HORSE   1.0000       TGQAPGFTYTDANK    0.9989               90
1             sp|P00004|CYC_HORSE   1.0000       KTGQAPGFTYTDANK   0.9990               10
2             sp|P00350|6PGD_ECOLI  1.0000       NAPGTYCMR         0.9349               8
2             sp|P00350|6PGD_ECOLI  1.0000       TGAHPGPMK         0.9124               2

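As I understand it, the values in columns 4, 5 and 6 are attributes of
the peptide nodes nested inside each protein_group (under its protein
child). So for each $protein_group, I need to retrieve some of its
descendants along with attributes of the group itself.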
I would greatly appreciate any help.
Thank you very much,
Alex
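
For reference, a minimal sketch of one way the table above might be
built. It is not a confirmed solution for the real file. It assumes the
document is parsed into internal nodes with xmlParse() and queried with
XPath through getNodeSet(); the prefix "p" bound to the default
namespace, the tiny inline sample standing in for myfile.xml, and the
choice to take "probability" from the protein node rather than from
protein_group are all assumptions made for illustration.

```r
library(XML)

# Tiny stand-in for the real file (assumption: only the attributes of
# interest are kept; the real document has many more).
xml_text <- '<?xml version="1.0" encoding="UTF-8"?>
<protein_summary xmlns="http://regis-web.systemsbiology.net/protXML">
  <protein_group group_number="1" probability="1.0000">
    <protein protein_name="sp|P00004|CYC_HORSE" probability="1.0000">
      <peptide peptide_sequence="KTGQAPGFTYTDANK"
               initial_probability="0.9989" n_instances="10"/>
      <peptide peptide_sequence="TGQAPGFTYTDANK"
               initial_probability="0.9989" n_instances="90"/>
    </protein>
  </protein_group>
</protein_summary>'

# For the real data this would be: doc <- xmlParse("myfile.xml")
doc <- xmlParse(xml_text, asText = TRUE)

# The document declares a default namespace, so XPath needs a prefix
# bound to that URI ("p" is an arbitrary, hypothetical choice).
ns <- c(p = "http://regis-web.systemsbiology.net/protXML")

# One node per peptide; each will become one row of the data frame.
peptides <- getNodeSet(doc, "//p:protein_group/p:protein/p:peptide",
                       namespaces = ns)

rows <- lapply(peptides, function(pep) {
  prot <- xmlParent(pep)    # enclosing <protein>
  grp  <- xmlParent(prot)   # enclosing <protein_group>
  data.frame(
    group_number        = xmlGetAttr(grp,  "group_number"),
    protein_name        = xmlGetAttr(prot, "protein_name"),
    probability         = xmlGetAttr(prot, "probability"),
    peptide_sequence    = xmlGetAttr(pep,  "peptide_sequence"),
    initial_probability = as.numeric(xmlGetAttr(pep, "initial_probability")),
    n_instances         = as.integer(xmlGetAttr(pep, "n_instances")),
    stringsAsFactors    = FALSE
  )
})

result <- do.call(rbind, rows)
print(result)
```

Walking up from each peptide with xmlParent() keeps the group and
protein attributes paired with the right peptide without nested loops.
With a couple of hundred groups, collecting the attributes into vectors
first and calling data.frame() once would likely be faster.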