Thanks Andreas for the thorough response... I'll explain a little more about exactly what I'm modifying, and a little about my situation, which might shed more light.
I am somewhat misusing the document format, I believe, to achieve my goal. Instead of producing a -single- document where we use some basic replacement of text to achieve a final document, booklet or otherwise, I am embedding multiple templates within a master template. The idea is that the XML input can define which letter template is expected to be used for that portion of XML (using case statements).

Each template could have any amount of data in its payload, which may extend the page size further than I could predict [i.e., if someone built a template with value-of="myreallybigparagraph" and replaced it with 10 pages of text].

Instead of producing one document, I produce one document which consists of multiple different documents. When printing, each individual document is tracked via the barcode that I print, in which case I need to (after FOP has decided what fits into what page) rewrite the barcode placeholders with the correct configuration for the Document/Page number. That is why my current work exists in the Intermediate Format phase.

So - is it possible, given my scenario, to apply what you've said (or any similar templating feature) at the FO or XSLT or Saxon phase?

If this makes _any_ sense at all? :)

Thanks
Martin.

-----Original Message-----
From: Andreas Delmelle [mailto:[EMAIL PROTECTED]
Sent: Wednesday, 4 June 2008 7:32 PM
To: [email protected]
Subject: Re: Editing Text in the Intermediate Format

On Jun 4, 2008, at 03:42, Martin Edge wrote:

> My latest problem is the Intermediate file is 800Mb.. and when loading into
> XmlDocument's Load method.. I run out of memory..
>
> Guess I'll have to try and figure out how the heck you read/modify files of
> this size in c#
>
> Fun fun!
>
> Any tips? :)

IIUC, XmlDocument is roughly the DOM implementation in C#.
As Jeremias and Chris already hinted, you might fare better for files that size by going the SAX route to read them, as this could avoid having the entire tree representation of the document in memory all at once (depending on how soon you start piping through the results).

Then again, to /modify/ any XML file and obtain a new XML result, XSLT is virtually always the weapon of choice. Since you're using Saxon, you even have the option of using XSLT 2.0, which could open up some interesting possibilities.

You could also use a classic string- or regex-based search-and-replace, but with XSLT, you at least get the assurance that the output will always be well-formed XML without worrying too much about the details.

First, create a very basic stylesheet that merely copies the input XML. Very roughly, in 'classic' XSLT 1.0:

---
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" version="1.0" encoding="utf-8" indent="no"/>

  <xsl:template match="/">
    <xsl:apply-templates />
  </xsl:template>

  <xsl:template match="node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()" />
    </xsl:copy>
  </xsl:template>

  <xsl:template match="text()">
    <xsl:value-of select="normalize-space(.)" />
  </xsl:template>

  <xsl:template match="@*">
    <xsl:copy-of select="." />
  </xsl:template>

</xsl:stylesheet>
---

Then, start adding templates and logic only for those nodes you want to see changed.

I deliberately added the text() matching template, since this will have an undesired effect on the <space> elements in the area tree. You probably want to change that to something like:

---
<xsl:template match="text()">
  <xsl:choose>
    <xsl:when test="parent::space or parent::char"><xsl:value-of select="."
    /></xsl:when>
    <xsl:when test="parent::word"><xsl:value-of select="normalize-space(.)" /></xsl:when>
  </xsl:choose>
</xsl:template>
---

This will ignore any text() nodes from the input if they are not children of <word>, <space> or <char>, as well as remove leading and trailing white-space from the <word> elements (if any). White-space characters inside <space> or <char> elements are preserved.

Alternatively, you could split that up in three separate matching templates, to achieve the same result:

---
<xsl:template match="text()[parent::space or parent::char]">
  <xsl:value-of select="." />
</xsl:template>

<xsl:template match="text()[parent::word]">
  <xsl:value-of select="normalize-space(.)" />
</xsl:template>

<xsl:template match="text()" />
---

To handle the simple substitution for <word> elements you could add something like:

---
<xsl:template match="word">
  <xsl:copy>
    <xsl:apply-templates select="@*" />
    <xsl:choose>
      <xsl:when test=". = '*1234567890MLQIS*'">*0000001441*</xsl:when>
      <xsl:otherwise><xsl:apply-templates /></xsl:otherwise>
    </xsl:choose>
  </xsl:copy>
</xsl:template>
---

With a little imagination, you can even make the substitution more flexible (by means of xsl:params, or by defining a mapping of such substitutions in another XML file and assigning the root node of that document to an xsl:variable using XSLT's document() function).

For the example you gave with the string that is split up over multiple <text> elements, I'm not sure how I'd handle that... I suspect these are already divided in the input FO, where you have something like:

<block>*MDA*<page-number />-<page-number-citation ref-id="..." />-<page-number-citation ref-id="..." />

Hence the separate <text> elements in the area tree.

If you were to add explicit "id" properties to the page-number-citation, like:

<block>*MDA*<page-number id="pn-1" />-<page-number-citation id="pnc-1" ref-id="..." />-<page-number-citation id="pnc-2" ref-id="..."
/>

These would turn up in the area tree XML as a 'prod-id' attribute on the corresponding <text> element. Maybe you can use that to identify and process the nodes in question.

It could even turn out to be easier if it is guaranteed that the entire sequence will always be in its own block. In that case, add the id to the fo:block, so you can identify the <block> element in the area tree by means of its 'prod-id', and write a template that merges the lineArea/text descendants for those blocks into one, making the value of each attribute equal to the aggregate of that attribute over all the merged <text> elements:

- 'ipd' or 'ipda' would be the sum();
- 'offset' would be the min();
- <text> elements with different 'baseline' are probably better not merged;
- 'bap', or border-and-padding, will require some work since the attribute value needs to be decoded into four numeric values.

Still not sure how the white-space will be handled when it is inside a <word> element, but it seems to be worth a try.

Good luck!

Cheers

Andreas

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
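
For the 800Mb file, the SAX route mentioned earlier in this thread might look roughly like the sketch below. It is written in Python only to keep it short and runnable; in C#, the streaming XmlReader/XmlWriter pair would presumably play the same role. The names WordRewriter, rewrite() and the substitutions dictionary are made up for illustration, and the sketch assumes that <word> elements in the area tree hold only text and carry no namespace.

```python
import xml.sax
from xml.sax.saxutils import XMLGenerator


class WordRewriter(xml.sax.ContentHandler):
    """Copy SAX events to `out`, swapping the text of matching <word> elements.

    Hypothetical helper: only the text of the <word> currently being read is
    buffered, so memory use does not grow with the size of the input file.
    """

    def __init__(self, out, substitutions):
        super().__init__()
        self._gen = XMLGenerator(out, encoding="utf-8", short_empty_elements=True)
        self._subs = substitutions
        self._buffer = None  # list of text chunks while inside a <word>, else None

    def startDocument(self):
        self._gen.startDocument()

    def endDocument(self):
        self._gen.endDocument()

    def startElement(self, name, attrs):
        self._gen.startElement(name, attrs)
        if name == "word":
            self._buffer = []

    def characters(self, content):
        # The parser may deliver one text node in several chunks, so the text
        # of a <word> is collected first and compared only at its end tag.
        if self._buffer is not None:
            self._buffer.append(content)
        else:
            self._gen.characters(content)

    def endElement(self, name):
        if name == "word" and self._buffer is not None:
            text = "".join(self._buffer)
            # Write the replacement if one is defined, else the original text.
            self._gen.characters(self._subs.get(text, text))
            self._buffer = None
        self._gen.endElement(name)


def rewrite(source, out, substitutions):
    """Stream `source` (file name or file object) to the text stream `out`."""
    xml.sax.parse(source, WordRewriter(out, substitutions))
```

Called with an input file name, an output stream opened in text mode with UTF-8 encoding, and a dictionary such as {"*1234567890MLQIS*": "*0000001441*"}, this copies everything else through unchanged. It does not attempt the harder case of a barcode string split over several <text> elements.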
