I'm not familiar with the content models. I'm yet to get my head around how many content models there are.
Worst case sounds like a program consuming the XCC API: 1. Obtain list of XML URIs; 2. Pull each XML file; 3. Read content model info, adding unique content model IDs to some list.

This is part of a potential migration that will move these files out of one ML instance and put them into another, where each XML instance must validate against a content model. It may be best to incorporate this activity then; yet, I was hoping to collect the content models in advance. This is some pre-migration work to find out just how much fun I'm going to have :)

-Brent

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Danny Sokolsky
Sent: Thursday, December 18, 2008 2:50 PM
To: General Mark Logic Developer Discussion
Subject: RE: [MarkLogic Dev General] How best to identify all referenced content models?

Hi Brent,

The XQuery Data Model discards the DOCTYPE declarations, as they are only hints to the parser, so they are gone from XML documents in the database (which are already parsed). So you cannot query the DOCTYPE in an XML document. If you wanted to query type information, you could convert the DTD to a schema and then load the schema, which is XML and therefore queryable.

Another idea is to enrich your documents upon loading and place the PUBLIC name in the document somewhere. For example, you can add it as an attribute on the root element of the document like this:

    <my-root PUBLIC="-//W3C//DTD MathML 2.0//EN">

If you have a range index on the PUBLIC attribute, then you can do a cts:uris with a range query that specifies the attribute value you are interested in. Something like:

    cts:uris("", (),
      cts:element-attribute-range-query(xs:QName("my-root"), xs:QName("PUBLIC"),
        "=", "-//W3C//DTD MathML 2.0//EN"))

This probably will not help your current situation though...

The only other idea I have is if you know of something about each document in a given DTD that is unique.
If so, you can then write a query that searches for that content and use it to constrain your URI search.

-Danny

From: [email protected] [mailto:[email protected]] On Behalf Of Hartwig, Brent (CL Tech Sv)
Sent: Thursday, December 18, 2008 10:38 AM
To: General Mark Logic Developer Discussion
Subject: RE: [MarkLogic Dev General] How best to identify all referenced content models?

Hi, Danny,

Yes, the URI lexicon is enabled. That has told me how many XML files I have. Other than limiting processing to the XML files, I'm not sure how this helps. Maybe I'm under-utilizing this? I do not have any applicable range indexes in place today.

Regarding DTDs and namespaces, you can have XML files conforming to DTDs where namespaces are not defined. You only get the doctype declaration before the root node; i.e.:

    <!DOCTYPE mathml PUBLIC "-//W3C//DTD MathML 2.0//EN" "mathml2/mathml2.dtd">
    <mathml [no namespace atts]>...</>

In the above case, I'd like to retrieve the public ID of "-//W3C//DTD MathML 2.0//EN" and the system ID of "mathml2/mathml2.dtd", or simply the entire doctype declaration.

Thank you.

-Brent

________________________________________
From: [email protected] [mailto:[email protected]] On Behalf Of Danny Sokolsky
Sent: Thursday, December 18, 2008 1:12 PM
To: General Mark Logic Developer Discussion
Subject: RE: [MarkLogic Dev General] How best to identify all referenced content models?

Hi Brent,

Do you have the URI lexicon enabled for your database? It might help. Also, do you have any range indexes on the elements or attributes in question? Then you can do range queries or lexicon lookups on those values.

I am not sure what you mean by your concern about DTDs and namespaces. Perhaps if you gave a sample XML snippet or two showing what the "public and system IDs" look like, that might help with a more specific answer.
-Danny

From: [email protected] [mailto:[email protected]] On Behalf Of Hartwig, Brent (CL Tech Sv)
Sent: Thursday, December 18, 2008 9:48 AM
To: General Mark Logic Developer Discussion
Subject: [MarkLogic Dev General] How best to identify all referenced content models?

Hello and Happy Holidays,

I'm trying to identify the public and system IDs of all content models our XML files reference. The XML files are in ML 3.2 and may conform to a DTD or XML Schema. Given the number of XML files, I would prefer not to take the I/O hit for each file. I am also interested in the reverse: URIs of XML files not defining the content model. I see many namespace-related functions but am concerned this will not help the majority of files conforming to DTDs.

Any ideas? Thank you in advance.

-Brent

_______________________________________________
General mailing list
[email protected]
http://xqzone.com/mailman/listinfo/general
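
A minimal sketch of step 3 of the worst-case approach from the most recent message in the thread (read the content model info from each file and collect the unique IDs). It is written in Python rather than against the XCC API, and it assumes the serialized XML text still carries its DOCTYPE declaration, for example the original source files on disk, since per Danny's reply the declaration is not retained once a document is parsed into the database. The names `parse_doctype` and `collect_content_models` are hypothetical, and the regular expression only covers the common `PUBLIC`/`SYSTEM` forms; it is not a full XML prolog parser.

```python
import re
from typing import Dict, List, NamedTuple, Optional, Set, Tuple


class Doctype(NamedTuple):
    root: str
    public_id: Optional[str]
    system_id: Optional[str]


# A single- or double-quoted literal; each use adds two capture groups.
_QUOTED = r'(?:"([^"]*)"|\'([^\']*)\')'

# <!DOCTYPE root PUBLIC "pub" "sys">  or  <!DOCTYPE root SYSTEM "sys">
_DOCTYPE = re.compile(
    r'<!DOCTYPE\s+([^\s>\[]+)'
    r'(?:\s+PUBLIC\s+' + _QUOTED + r'(?:\s+' + _QUOTED + r')?'
    r'|\s+SYSTEM\s+' + _QUOTED + r')?'
)


def parse_doctype(xml_text: str) -> Optional[Doctype]:
    """Return the first DOCTYPE declaration found, or None if there is none."""
    match = _DOCTYPE.search(xml_text)
    if match is None:
        return None
    root, pub_d, pub_s, sys_d, sys_s, only_d, only_s = match.groups()
    public_id = pub_d if pub_d is not None else pub_s
    system_id = next(
        (v for v in (sys_d, sys_s, only_d, only_s) if v is not None), None
    )
    return Doctype(root, public_id, system_id)


def collect_content_models(
    docs: Dict[str, str]
) -> Tuple[Set[Tuple[Optional[str], Optional[str]]], List[str]]:
    """Map of URI -> XML text in; unique (public, system) pairs and
    the URIs that declare no DOCTYPE out."""
    models: Set[Tuple[Optional[str], Optional[str]]] = set()
    undeclared: List[str] = []
    for uri, text in docs.items():
        doctype = parse_doctype(text)
        if doctype is None:
            undeclared.append(uri)
        else:
            models.add((doctype.public_id, doctype.system_id))
    return models, undeclared
```

The second return value covers the reverse question from the original post (URIs of files that do not define a content model). Files that reference an XML Schema instead of a DTD would also land in that list, so they would need a separate check.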
