Hi all,
Here's my situation:
* I have a mixture of XML, SGML, and simple mark-up, and can't rely on
the file extensions to tell me what's what.
* I have a working RecordLoader install, a MarkLogic database, and a file
staging area where folks are collecting the content.
* The process of collecting files and creating new ones is ongoing;
I have no idea how many files we'll have, but it will almost certainly be
several hundred thousand.
* Many files have SGML doctype declarations, but not all. A few have
XML doctype declarations.
* Most are not UTF-8, but some declare themselves as UTF-8 in the doctype
declaration even though they aren't.
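To make the mislabelled-encoding case concrete, here is a minimal Python sketch of how such a file could be spotted. It is only an illustration of the problem, not something RecordLoader does; the function names are hypothetical, and the "declares UTF-8" check is a crude search of the first 512 bytes.

```python
def is_really_utf8(raw: bytes) -> bool:
    """Return True only if the bytes decode cleanly as strict UTF-8."""
    try:
        raw.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


def declares_utf8(raw: bytes) -> bool:
    """Crude check (assumption): look for a UTF-8 label near the top of the file."""
    head = raw[:512].lower()
    return b"utf-8" in head or b"utf8" in head


def mislabelled(raw: bytes) -> bool:
    """True when a file claims UTF-8 but its bytes are not valid UTF-8."""
    return declares_utf8(raw) and not is_really_utf8(raw)
```

Note that pure ASCII content passes the strict decode, so only files containing non-ASCII bytes can be flagged this way.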
What I'd like to do:
* I need to ingest it all into the MarkLogic database.
* I want to clear the database and repeat the ingest on a regular (TBD)
basis.
* I do not want to pre-process the content prior to RecordLoader if at
all possible.
* I want to handle it according to the following logic:
  * Handle incoming content as non-UTF-8.
  * If the incoming file is binary, do not load it.
  * If the incoming file has a doctype declaration, XML or SGML, handle it
    by converting it to XML, removing problematic processing instructions, and
    pre-empting MarkLogic from turning SGML singletons into nested XML nodes
    (via default stack-level repair) by instead turning them into properly
    tagged XML singletons. That is, I want
      <date><year year="2006"><month month="1"><day day="1"></date>
    to become
      <date><year year="2006" /><month month="1" /><day day="1" /></date>
    and not
      <date><year year="2006"><month month="1"><day day="1"></day></month></year></date>
  * Unless incoming content declares its namespace, load all content to the
    empty namespace.
  * Else if the incoming file has a top-level node, treat it as XML.
  * Else, ingest it as text.
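To pin down the branching above, here is a rough Python sketch of the decision logic and of the singleton conversion. It is not RecordLoader or MarkLogic code. Everything in it is an assumption made for illustration: the function names, the NUL-byte test for "binary", the 4 KB sample size, and the idea that the names of the empty (singleton) elements are known in advance, for example from the DTD.

```python
import re

DOCTYPE_RE = re.compile(rb"<!DOCTYPE\s", re.IGNORECASE)
# Optional XML declaration, then an element start tag as the first content.
ROOT_RE = re.compile(rb"\s*(?:<\?xml[^>]*\?>\s*)?<[A-Za-z_][\w.\-:]*[\s/>]")


def looks_binary(raw: bytes, sample: int = 4096) -> bool:
    """Heuristic (assumption): a NUL byte near the start means binary."""
    return b"\x00" in raw[:sample]


def classify(raw: bytes) -> str:
    """Return 'skip', 'convert', 'xml', or 'text' for one file's bytes."""
    if looks_binary(raw):
        return "skip"      # binary: do not load
    if DOCTYPE_RE.search(raw[:4096]):
        return "convert"   # has a doctype: convert to XML first
    if ROOT_RE.match(raw):
        return "xml"       # starts with a top-level element
    return "text"          # anything else is ingested as text


def close_singletons(markup: str, empty_names) -> str:
    """Rewrite <name ...> as <name ... /> for the given empty element names."""
    if not empty_names:
        return markup
    names = "|".join(re.escape(n) for n in sorted(empty_names))
    # The lookbehind skips tags that are already self-closed.
    pattern = re.compile(rf"<({names})((?:\s[^<>]*?)?)\s*(?<!/)>")
    return pattern.sub(r"<\1\2 />", markup)
```

Applied to the date example with `{"year", "month", "day"}` as the empty element names, `close_singletons` yields the self-closed form I'm after. The NUL-byte heuristic would also reject UTF-16 text, so it is only a starting point.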
And finally, the question:
Can I use RecordLoader to do this without pre-processing? I'm having a
hard time wrapping my head around the processing paradigm of CONTENT_FACTORY
via RecordLoader. Is XQuery-based content handling via CONTENT_FACTORY going to
fire after MarkLogic Server has already handled incoming SGML singletons?
And if this is possible, does it put too much burden on RecordLoader (i.e., is
it scalable and repeatable on a regular basis)?
Thank you,
Paul
Paul Lewon
Production Technology, Global Production & Manufacturing Services
Cengage Learning
27500 Drake Rd. Farmington Hills, MI 48331
[email protected] | www.cengage.com