Thanks, Dr. Lang. I used xmlEventParse() with the branches concept as you suggested; it works, and the memory issue is gone. I can now query large XML files from within R.

But there is another problem: it is too slow. A simple query had not finished after 1.5 hours, even though the number of relevant records is very limited. The whole XML file has more than 500 thousand similarly structured records, and the parser has to go through all of them to find the matches.

Attached is part of the XML file, with two records. I am trying to retrieve the content of the <moleculeName> nodes from <molecule> records whose <name> nodes bear specific gene names. Is it possible to first locate records based on node content (the xmlValue) rather than node names (which are the same in all records), and then parse the matching XML record locally?

Would a query based on XPath be faster in this case? I understand that the XML package has a facility for XPath-based queries, getNodeSet(), but that requires reading the whole XML tree into memory first, which is not feasible for my large XML file. Or can I call XML::XPath statements using your R-Perl interface package?

Any suggestions or thoughts? Thank you!
Weijun
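For concreteness, here is a minimal sketch of the kind of branch handler described above, not the exact code that was run. It assumes a hypothetical <molecules> root element (the excerpt does not show the real root), a small temporary file standing in for the large one, and illustrative names (targets, make_molecule_handler). The handler looks only at the direct children of each <molecule> record. It reads just the leading text of each <name>, because xmlValue() on the whole <name> node would also pick up the text inside the nested <prov> element.

```r
library(XML)

# Tiny stand-in for the real file: two trimmed records inside a hypothetical
# <molecules> root element (an assumption; the real root is not shown).
sample_xml <- c(
  "<molecules>",
  "<molecule>",
  "<name>SKD1<prov><im><imid>20</imid></im></prov></name>",
  "<name>Vps4b<prov><im><imid>20</imid></im></prov></name>",
  "<interaction><moleculeRef>89434</moleculeRef><moleculeName>SBP1</moleculeName></interaction>",
  "<interaction><moleculeRef>17953</moleculeRef><moleculeName>mVps2</moleculeName></interaction>",
  "</molecule>",
  "<molecule>",
  "<name>RAP1GDS1<prov><im><imid>30</imid></im></prov></name>",
  "<interaction><moleculeRef>116280</moleculeRef><moleculeName>RAC1</moleculeName></interaction>",
  "</molecule>",
  "</molecules>")
xml_file <- tempfile(fileext = ".xml")
writeLines(sample_xml, xml_file)

# Gene names to search for (illustrative).
targets <- c("SKD1")

# Closure that collects partner names for each matching <molecule> record.
make_molecule_handler <- function(targets) {
  hits <- list()
  molecule <- function(node) {
    kids <- xmlChildren(node)
    # Leading text child of each <name>, skipping the nested <prov> content.
    nms <- vapply(kids[names(kids) == "name"], function(n) {
      ch <- xmlChildren(n)
      if (length(ch) == 0) NA_character_ else xmlValue(ch[[1]])
    }, character(1))
    found <- intersect(nms, targets)
    if (length(found) > 0) {
      partners <- vapply(kids[names(kids) == "interaction"],
                         function(i) xmlValue(i[["moleculeName"]]),
                         character(1))
      for (g in found) hits[[g]] <<- c(hits[[g]], unname(partners))
    }
    NULL
  }
  list(molecule = molecule, result = function() hits)
}

h <- make_molecule_handler(targets)
# Only the <molecule> branch is handled; no other SAX handlers are registered.
invisible(xmlEventParse(xml_file, handlers = list(), branches = h["molecule"]))
res <- h$result()
```

After the parse, res is a list keyed by the matched gene name. The sketch makes no claim about speed on the full file; it only shows the filtering step done inside the branch function.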
Part of my XML file:

<molecule>
<prov><im><imid>20</imid></im></prov><moleculeID>119043</moleculeID>
<moleculeType>protein<prov><im><imid>20</imid></im></prov></moleculeType>
<organismID>10090<prov><im><imid>20</imid></im></prov></organismID>
<id><prov><im><imid>20</imid></im></prov><idType>GI</idType><idValue>6677981</idValue></id>
<name>SKD1<prov><im><imid>20</imid></im></prov></name>
<name>Vps4b<prov><im><imid>20</imid></im></prov></name>
<name>8030489C12Rik<prov><im><imid>20</imid></im></prov></name>
<description><distribution><value>Mouse homologue of yeast Vacuolar protein sorting 4 (Vps4); Suppressor of potassium transport defect 1. Member of mammalian class E Vps proteins involved in endosomal transport; AAA-type ATPase.<prov><im><imid>20</imid></im></prov></value><value>Mouse homologue of yeast Vacuolar protein sorting 4 (Vps4); Suppressor of potassium transport defect 1. Member of mammalian class E Vps proteins involved in endosomal transport; AAA-type ATPase.<prov><im><imid>20</imid></im></prov></value></distribution></description>
<orthologue>
<method><methodID>337974</methodID><methodName>miClust80</methodName></method>
</orthologue>
<variant>
<prov><im><imid>20</imid></im></prov><variantID>0</variantID>
</variant>
<interaction><interactionRef>201581</interactionRef><moleculeRef>89434</moleculeRef><moleculeName>SBP1</moleculeName>
<selfVariantRef>0</selfVariantRef><partnerVariantRef>0</partnerVariantRef></interaction>
<interaction><interactionRef>201582</interactionRef><moleculeRef>17953</moleculeRef><moleculeName>mVps2</moleculeName>
<selfVariantRef>0</selfVariantRef><partnerVariantRef>0</partnerVariantRef></interaction>
</molecule>
<molecule>
<prov><im><imid>30</imid></im></prov><moleculeID>116226</moleculeID>
<moleculeType>protein<prov><im><imid>30</imid></im></prov></moleculeType>
<organismID>9606<prov><im><imid>30</imid></im></prov></organismID>
<id><prov><im><imid>30</imid></im></prov><idType>HGNC</idType><idValue>9859</idValue></id>
<name>RAP1GDS1<prov><im><imid>30</imid></im></prov></name>
<name>GDS1<prov><im><imid>30</imid></im></prov></name>
<name>MGC118859<prov><im><imid>30</imid></im></prov></name>
<name>MGC118861<prov><im><imid>30</imid></im></prov></name>
<variant>
<prov><im><imid>30</imid></im></prov><variantID>0</variantID>
</variant>
<interaction><interactionRef>93569</interactionRef><moleculeRef>116280</moleculeRef><moleculeName>RAC1</moleculeName>
<selfVariantRef>0</selfVariantRef><partnerVariantRef>0</partnerVariantRef></interaction>
<interaction><interactionRef>104132</interactionRef><moleculeRef>103040</moleculeRef><moleculeName>RHOA</moleculeName>
<selfVariantRef>0</selfVariantRef><partnerVariantRef>0</partnerVariantRef></interaction>
<interaction><interactionRef>121818</interactionRef><moleculeRef>74726</moleculeRef><moleculeName>MBIP</moleculeName>
<selfVariantRef>0</selfVariantRef><partnerVariantRef>0</partnerVariantRef></interaction>
</molecule>

--- Duncan Temple Lang <[EMAIL PROTECTED]> wrote:

>
> Well, as you mention at the end of the mail,
> several people have given you suggestions about
> how to solve the problem using different approaches.
> You might search on the Web for how to install a 64 bit
> version of libxml2?
> Using xmlTreeParse(, useInternalNodes = TRUE) is an approach
> to reducing the memory consumption as is using the handlers
> argument. And if size is really the issue, you should consider
> the SAX model which is very memory efficient and made available
> via the xmlEventParse() function in the XML package.
> And it even provides the concepts of branches to provide a
> hybrid of SAX and DOM-style parsing together.
>
> However, to solve the problem of the xmlMemDisplay
> symbol not being found, you can look for where
> that is used and remove it. It is in
>    src/DocParse.c
> in the routine RS_XML_MemoryShow(). You can remove the line
>    xmlMemDisplay(stderr)
> or indeed the entire routine. Then re-install and
> reload the package.
>
> D.
>
>
> Luo Weijun wrote:
> > Hello Dr. Lang and all,
> > I posted this message in R-help mail list, but haven't
> > solved my problem so far. Therefore, could you help me
> > look at it?
> > I have loading problem with XML_1.9 under 64 bit
> > R2.3.1 for Mac OS X, which I got from
> > http://R.research.att.com/. XML_1.9 works fine under
> > 32 bit R2.5.0. I thought that could be installation
> > problem, and I tried install.packages or biocLite,
> > every time the package installed fine, except some
> > warning messages below:
> > ld64 warning: in /usr/lib/libxml2.dylib, file does not contain requested architecture
> > ld64 warning: in /usr/lib/libz.dylib, file does not contain requested architecture
> > ld64 warning: in /usr/lib/libiconv.dylib, file does not contain requested architecture
> > ld64 warning: in /usr/lib/libz.dylib, file does not contain requested architecture
> > ld64 warning: in /usr/lib/libxml2.dylib, file does not contain requested architecture
> >
> > Here is the error messages I got, when XML is loaded:
> >> library(XML)
> > Error in dyn.load(x, as.logical(local), as.logical(now)) :
> >   unable to load shared library '/usr/local/lib64/R/library/XML/libs/XML.so':
> >   dlopen(/usr/local/lib64/R/library/XML/libs/XML.so, 6): Symbol not found: _xmlMemDisplay
> >   Referenced from: /usr/local/lib64/R/library/XML/libs/XML.so
> >   Expected in: flat namespace
> > Error: .onLoad failed in 'loadNamespace' for 'XML'
> > Error: package/namespace load failed for 'XML'
> >
> > Session information
> >> sessionInfo()
> > Version 2.3.1 Patched (2006-06-27 r38447)
> > powerpc64-apple-darwin8.7.0
> >
> > attached base packages:
> > [1] "methods" "stats" "graphics" "grDevices" "utils" "datasets"
> > [7] "base"
> >
> > Prof Brian Ripley also suggested that this could be
> > that I don't have a 64-bit version of libxml2
> > installed. Where I get it and where/how to install it,
> > if that's the problem?
> > The reason I need to use R64 is that I have memory
> > limitation issue with R 32 bit version when I load
> > some very large XML trees (the data file is about
> > 800M). And Martin suggested me to use 'handler'
> > argument of xmlTreeParse, tried 'handler' with
> > useInternalNodes=T, but I still got this memory
> > problem with R 32 bit version. Please tell me what I
> > can do now. Thank you so much!
> > Weijun
> >
> > ______________________________________________
> > [email protected] mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
