This is, in the general case, quite tricky.
First off, it's a chicken-and-egg problem ... how do you detect which schema?
If the schema is declared in the input then good ... but if not ... you need
rules to find it.
Next, a schema *rarely* gives enough information on how to generate usable
documents.
You can generate "valid" documents given a schema, but usually not "usable"
ones; try some tools like Oxygen and see what they do.  The problem is that
schema is designed to detect invalid input and reject it, not to define what
"reasonably useful" input is.
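To make that concrete: validation is a one-way check.  It can tell you a
document doesn't conform, but it hands back errors, not a recipe for building
a conforming document.  A minimal sketch (the document URI is made up):

    xquery version "1.0-ml";
    (: xdmp:validate reports problems against the in-scope schemas;
       an empty xdmp:validation-errors element means "valid" ...
       it still says nothing about how to *construct* a valid document :)
    let $doc := doc("/articles/example.xml")
    return xdmp:validate($doc, "strict")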

For example, a typical markup schema may allow   (p|br|table|image|reference)*
A valid instance is completely empty.  But that's not usable.
That's an edge case, but it's real.
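In XSD terms that content model looks something like this (a sketch; the
wrapper element name is invented):

    <!-- every child is optional and repeatable,
         so an empty <section/> is perfectly valid -->
    <xs:element name="section">
      <xs:complexType>
        <xs:choice minOccurs="0" maxOccurs="unbounded">
          <xs:element ref="p"/>
          <xs:element ref="br"/>
          <xs:element ref="table"/>
          <xs:element ref="image"/>
          <xs:element ref="reference"/>
        </xs:choice>
      </xs:complexType>
    </xs:element>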
A tangibly similar problem comes up designing JSON-to-XML conversions.
If you only have the XML schema ... it's difficult or impossible to generate
the XML you *want* from just the data and the schema.  You need out-of-band
information (code) ...
(You can see examples of this in our JSON library:
http://docs.marklogic.com/json:transform-from-json )
The "custom" configuration is a set of rules attempting to produce "decent"
XML from JSON, or vice versa, and it does make use of schema ... but only for
atomic types ...
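A sketch of what those out-of-band rules look like in practice (the option
key used here is an assumption ... inspect json:config("custom") for the
real set):

    xquery version "1.0-ml";
    import module namespace json = "http://marklogic.com/xdmp/json"
      at "/MarkLogic/json/json.xqy";

    (: the config map carries the "rules"; the schema can't express them :)
    let $config := json:config("custom")
    let $_ := map:put($config, "array-element-names", xs:QName("tag"))
    return json:transform-from-json('{"title":"Hello","tag":["a","b"]}', $config)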

This paper goes into much more depth on the issues
http://www.balisage.net/Proceedings/vol7/html/Lee01/BalisageVol7-Lee01.html

It's very tricky ... and the direction taken in that approach requires
annotated schemas to give hints, and input documents corresponding very
closely to the desired output.
  
Another problem is that even if schema had every bit of information you need,
it's horrendously difficult to parse and make use of.
ML has *some* schema-query ability, but not in a general way.
It's designed to start with an XML node already read in; then you can query
its schema structure ... but you can't (easily) start with just a schema and
ask "what kinds of things go here?" ...
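Roughly like this (a hedged sketch, assuming the sc: schema-components
builtins; the document URI is made up):

    xquery version "1.0-ml";
    (: start from a node in hand, then ask for its schema type ...
       there's no comparably easy way to start from a bare schema and
       enumerate what may appear at a given spot :)
    let $node := doc("/articles/example.xml")/*
    return sc:name(sc:type($node))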

My opinion ... this direction will seem wonderful at first but will end up a
nightmare and a failure.
I suggest instead using some kind of out-of-band information ... like an XSLT
or XQuery or other "templating" kind of technology, hand-made for each schema
you want to use, and designed for the kind of input you'll be getting.
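A hand-made constructor per schema can be quite small.  A sketch (every name
here is invented):

    xquery version "1.0-ml";
    (: one little module per schema: form fields in,
       exactly the XML you want out :)
    declare function local:news-item($fields as map:map) as element(news-item) {
      <news-item>
        <headline>{ map:get($fields, "headline") }</headline>
        <byline>{ map:get($fields, "author") }</byline>
        <body><p>{ map:get($fields, "body") }</p></body>
      </news-item>
    };

    let $fields := map:map()
    let $_ := (map:put($fields, "headline", "Schemas are hard"),
               map:put($fields, "author", "A. Editor"),
               map:put($fields, "body", "Some text."))
    return local:news-item($fields)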
done generically without
having to manually create the mappings ... but its not only really difficult 
... well , its impossible ...
Impossible in the sense that what everyone I know (and am assuming you too) 
*really* want is
to produce a specific "nice" version for the output.   Not just any conforming 
output,
but one that is structured and maps things the way you want ... When I first 
started on projects like this
I didn't fully appreciate that the problem of "nice" is not only undefinable 
but when you try to define it,
typically self-conflicting.  That is I find "This time I want arrays turned 
into nested elements" but "This other time,
I want arrays flattened into a single element" and "Sometimes, if the document 
has only 1 element make it an attribute" ...
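The same input, two equally "nice" outputs (hypothetical names; which one you
want depends on the day):

    { "authors" : ["Smith", "Jones"] }

    <!-- nested -->     <authors><author>Smith</author><author>Jones</author></authors>
    <!-- flattened -->  <authors>Smith Jones</authors>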
I know this very well first-hand, from trying to achieve that magic goal for
many years.
You can ask anyone using the ML JSON library (which is really a simplified,
restricted form of this same problem): what seems "obvious" actually isn't ...
making one desired format work tends to break others, and it's very tedious
to dig down and discover why ...


Even if you don't care (or can choose not to care) what the exact output
looks like, you're still going to have a hard time generically mapping input
to a document corresponding to a schema ... unless the document structure and
all its possibilities are very well known in advance.
If you know your data precisely, and you can identify precisely which fields
map to which structure, then you can do it.

But alas ... that's the problem.  If you can do that, you don't need schema;
you need a transformation/mapping tool (XSLT, XQuery, something).  Schema
won't help with the mapping problem at all, because it has no information in
it about meaning, nor any kind of cross-reference to input data that isn't
already in that schema's format.

There is one way out ... and it's generally not acceptable to most.
That is a schema which is extremely free-form.  You still won't need the
schema, but if it looks something like

   <field name="fieldname">data</field>
   ......

(think CSV)
 
then you can map nearly anything to it automatically ... you just won't get 
much value from it.
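A sketch of that automatic mapping (nothing here is specific to any real
schema):

    xquery version "1.0-ml";
    (: flatten any key/value input into the free-form markup above ...
       it works on nearly anything, which is exactly why it carries
       so little value :)
    declare function local:to-fields($input as map:map) as element(field)* {
      for $key in map:keys($input)
      return <field name="{$key}">{ map:get($input, $key) }</field>
    };

    let $m := map:map()
    let $_ := (map:put($m, "title", "Hello"), map:put($m, "author", "Smith"))
    return local:to-fields($m)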


 



-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of Wanczowski, Andrew
Sent: Thursday, June 05, 2014 5:41 PM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Generating XML from Schemas

I am looking to generate documents that are not coming from external sources.
For example, we would have editors filling out web forms and then posting them
to ML. At that time I would like to detect what type of content they are trying
to create and then generate a document using the appropriate schema.

-Drew

On 6/5/14 4:38 PM, "Michael Blakeley" <[email protected]> wrote:

>Can you expand on the need to "generate XML documents"? If you've got 
>NewsML etc coming in from external sources, what sort of documents are 
>"generated within the system"?
>
>-- Mike
>
>On 5 Jun 2014, at 06:58 , Wanczowski, Andrew 
><[email protected]> wrote:
>
>> Hello All,
>> 
>> I am currently looking to build a metadata store for various 
>>documents and want them to remain in their native schemas. Content 
>>will be generated within the system and come from external systems. 
>>Some examples are the PRISM, NewsML and IPTC/XMP Schemas.
>> 
>> I have been investigating ways to generate XML documents to be stored 
>>in MarkLogic. The great thing about MarkLogic is that you can have 
>>multiple schemas or schemaless documents in your database. However, 
>>this becomes challenging when you want your content to originate in 
>>MarkLogic or MarkLogic applications to control full CRUD of the 
>>documents. I am looking for something scalable where we would only 
>>have to manage one library for all CRUD functions. The current 
>>approach would be to have a library module for each schema which will 
>>handle all CRUD and serialization/de-serialization. This becomes a 
>>maintenance headache.
>> 
>> The desired features would be:
>> - A single library module to handle document generation
>> - Generate a document based on an XML Schema
>> - Create, Update and Partial Update should be supported
>> - Values should be populated based on user's input from another XML
>>   document or JSON document
>> - Input mappings should be configurable from both XML and JSON
>> - Serialization/de-serialization of XML and JSON for API usage or web
>>   form usage
>> 
>> ExistDB has a way to generate an instance from an XML Schema.
>>Documentation can be found at
>>http://en.wikibooks.org/wiki/XQuery/XML_Schema_to_Instance . But this 
>>does not do all the features desired.
>> 
>> Any input would be extremely helpful!
>> 
>> Thanks
>> Drew

_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general
