RE: [MarkLogic Dev General] How best to identify all referenced content models?

Hartwig, Brent (CL Tech Sv) Thu, 18 Dec 2008 12:57:26 -0800

I'm not familiar with the content models. I have yet to get my head around how 
many content models there are.


Worst case sounds like a program consuming the XCC API:

1. Obtain list of XML URIs;

2. Pull each XML file;

3. Read content model info, adding unique content model IDs to some list.
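A minimal sketch of step 3, under stated assumptions: steps 1 and 2 (listing URIs and pulling each file over XCC) are replaced by a hypothetical in-memory mapping of URI to XML text, and the function names are illustrative. Because the DOCTYPE does not survive loading (see the reply below), the sketch reads only the XML Schema hints that remain on the root element (xsi:schemaLocation and xsi:noNamespaceSchemaLocation). It also records URIs that carry no hint at all.

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"


def schema_hints(xml_text):
    """Return the schema locations declared on the root element, if any."""
    root = ET.fromstring(xml_text)
    hints = []
    no_ns = root.get(f"{{{XSI}}}noNamespaceSchemaLocation")
    if no_ns:
        hints.append(no_ns)
    # xsi:schemaLocation holds whitespace-separated (namespace, location) pairs;
    # keep only the locations.
    pairs = (root.get(f"{{{XSI}}}schemaLocation") or "").split()
    hints.extend(pairs[1::2])
    return hints


def collect_content_models(documents):
    """documents: URI -> XML text, a hypothetical stand-in for steps 1 and 2."""
    models = set()
    unidentified = []
    for uri, xml_text in documents.items():
        hints = schema_hints(xml_text)
        if hints:
            models.update(hints)
        else:
            unidentified.append(uri)
    return sorted(models), sorted(unidentified)


# Made-up sample documents, for illustration only.
sample = {
    "/a.xml": '<a xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
              'xsi:noNamespaceSchemaLocation="a.xsd"/>',
    "/b.xml": '<b xmlns="urn:b" '
              'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
              'xsi:schemaLocation="urn:b b.xsd"/>',
    "/c.xml": "<mathml/>",
}
models, unidentified = collect_content_models(sample)
```

Documents that conform only to a DTD end up in the unidentified list here, which is the "reverse" case raised in the original question.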

This is part of a potential migration that will move these files out of one ML 
instance and put them into another, where each XML instance must validate 
against a content model. It may be best to incorporate this activity then; yet, 
I was hoping to collect the content models in advance. This is some 
pre-migration work to find out just how much fun I'm going to have :)

-Brent

-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of Danny Sokolsky
Sent: Thursday, December 18, 2008 2:50 PM
To: General Mark Logic Developer Discussion
Subject: RE: [MarkLogic Dev General] How best to identify all 
referencedcontentmodels?

Hi Brent,

The XQuery Data Model discards the DOCTYPE declarations, as they are only hints 
to the parser, and are therefore gone for XML documents in the database (they 
are already parsed).  So you cannot query the DOCTYPE in an XML document.   If 
you wanted to query type information, you could convert the DTD to a schema and 
then load the schema, which is XML and therefore queryable.  Another idea is to 
enrich your documents upon loading and place the PUBLIC name in the document 
somewhere.  For example, you can add it as an attribute on the root element of 
the document like this:

<my-root PUBLIC="-//W3C//DTD MathML 2.0//EN"> 
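One way to do that enrichment before loading, sketched in Python under stated assumptions: the source file is available as text with its DOCTYPE still intact, the function name is hypothetical, and a simple regular expression is good enough to find the declaration (it would be fooled by a DOCTYPE inside a comment, and it ignores declarations with only a SYSTEM identifier).

```python
import re
import xml.etree.ElementTree as ET

# Matches <!DOCTYPE name PUBLIC "public-id" "system-id"
DOCTYPE_PUBLIC = re.compile(
    r'<!DOCTYPE\s+\S+\s+PUBLIC\s+(["\'])(?P<public>.*?)\1\s+(["\'])(?P<system>.*?)\3'
)


def add_public_attribute(xml_text):
    """Copy the DOCTYPE public ID onto the root element as a PUBLIC attribute."""
    match = DOCTYPE_PUBLIC.search(xml_text)
    # ElementTree skips the DOCTYPE declaration while parsing.
    root = ET.fromstring(xml_text)
    if match:
        root.set("PUBLIC", match.group("public"))
    return ET.tostring(root, encoding="unicode")


source = (
    '<!DOCTYPE mathml PUBLIC "-//W3C//DTD MathML 2.0//EN" "mathml2/mathml2.dtd">\n'
    "<mathml><mi>x</mi></mathml>"
)
enriched = add_public_attribute(source)
```

The enriched text is what would then be loaded into the database; the system ID is captured by the same pattern and could be stored in a second attribute in the same way.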

If you have a range index on the PUBLIC attribute, then you can do a cts:uris 
with a range query that specifies the attribute value you are interested in.  
Something like:

cts:uris("", (), cts:element-attribute-range-query(xs:QName("my-root"), 
xs:QName("PUBLIC"), "=", 
              "-//W3C//DTD MathML 2.0//EN"))

This probably will not help your current situation though...

The only other idea I have applies if you know of something unique to every 
document that conforms to a given DTD.  If so, you can write a query that 
searches for that content and use it to constrain your URI search.

-Danny


From: [email protected] 
[mailto:[email protected]] On Behalf Of Hartwig, Brent 
(CL Tech Sv)
Sent: Thursday, December 18, 2008 10:38 AM
To: General Mark Logic Developer Discussion
Subject: RE: [MarkLogic Dev General] How best to identify all 
referencedcontentmodels?

Hi, Danny,
 
Yes, the URI lexicon is enabled. That has told me how many XML files I have. 
Other than limiting processing to the XML files, I'm not sure how this helps. 
Maybe I'm under-utilizing this? I do not have any applicable range indexes in 
place today.
 
Regarding DTDs and namespaces, you can have XML files conforming to DTDs where 
namespaces are not defined. You only get the doctype declaration before the 
root node; i.e.:
 
<!DOCTYPE mathml PUBLIC "-//W3C//DTD MathML 2.0//EN" "mathml2/mathml2.dtd"> 
<mathml [no namespace atts]>...</>
 
In the above case, I'd like to retrieve public ID of "-//W3C//DTD MathML 
2.0//EN" and system ID of "mathml2/mathml2.dtd", or simply the entire doctype 
declaration.
 
Thank you.
 
-Brent

________________________________________
From: [email protected] 
[mailto:[email protected]] On Behalf Of Danny Sokolsky
Sent: Thursday, December 18, 2008 1:12 PM
To: General Mark Logic Developer Discussion
Subject: RE: [MarkLogic Dev General] How best to identify all referenced 
contentmodels?

Hi Brent,

Do you have the URI lexicon enabled for your database?  It might help.  Also, 
do you have any range indexes on the element or attributes in question?  Then 
you can do range queries or lexicon lookups on those values.  I am not sure 
what you mean by your concern about DTDs and namespaces.  Perhaps if you gave a 
sample XML snippet or two showing what the "public and system IDs" look like, 
that might help with a more specific answer.

-Danny 

From: [email protected] 
[mailto:[email protected]] On Behalf Of Hartwig, Brent 
(CL Tech Sv)
Sent: Thursday, December 18, 2008 9:48 AM
To: General Mark Logic Developer Discussion
Subject: [MarkLogic Dev General] How best to identify all referenced 
contentmodels?

Hello and Happy Holidays,
 
I'm trying to identify the public and system IDs of all content models our XML 
files reference. The XML files are in ML 3.2 and may conform to a DTD or XML 
Schema. Given the number of XML files, I would prefer not to take the I/O hit for 
each file. I am also interested in the reverse: URIs of XML files not defining 
the content model. I see many namespace-related functions but am concerned this 
will not help the majority of files conforming to DTDs. Any ideas? Thank you in 
advance.
 
-Brent
_______________________________________________
General mailing list
[email protected]
http://xqzone.com/mailman/listinfo/general