RE: Editing Text in the Intermediate Format

Martin Edge Wed, 04 Jun 2008 03:37:18 -0700

Thanks Andreas for the thorough response... 

I'll highlight a little more about what I'm exactly modifying and a little
bit about my situation which might shed more light.

I believe I am somewhat misusing the document format to achieve my goal.
Instead of producing a -single- document where some basic text replacement
yields the final document (booklet or otherwise), I am embedding multiple
templates within a master template. The idea is that the XML input defines
which letter template is to be used for that portion of the XML (using case
statements).

Each template could have any amount of data in its payload, which may run to
more pages than I can predict. [i.e., if someone built a template with
value-of="myreallybigparagraph" and replaced it with 10 pages of text]

So rather than one self-contained document, I produce one document which
consists of multiple different documents.

When printing, each individual document is tracked via the barcode that I
print, so I need to (after FOP has decided what fits onto which page)
rewrite the barcode placeholders with the correct configuration for the
Document/Page number. That is why my current work happens in the
Intermediate Format phase.

So - is it possible given my scenario to apply what you've said (or any
similar templating feature) at the FO or XSLT or Saxon phase?

If this makes _any_ sense at all? :)

Thanks 
Martin.

-----Original Message-----
From: Andreas Delmelle [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, 4 June 2008 7:32 PM
To: [email protected]
Subject: Re: Editing Text in the Intermediate Format

On Jun 4, 2008, at 03:42, Martin Edge wrote:

> My latest problem is the Intermediate file is 800Mb.. and when  
> loading into
> XmlDocument's Load method.. I run out of memory..
>
> Guess I'll have to try and figure out how the heck you read/modify  
> files of
> this size in c#
>
> Fun fun!
>
> Any tips? :)

IIUC, XmlDocument is roughly the DOM implementation in C#. As
Jeremias and Chris already hinted, you might fare better for files
that size by going the SAX route to read them, as this could avoid
having the entire tree representation of the document in memory all
at once (depending on how soon you start piping through the
results).

Then again, to /modify/ any XML file and obtain a new XML result,  
XSLT is virtually always the weapon of choice. Since you're using  
Saxon, you even have the option of using XSLT 2.0, which could open  
up some interesting possibilities.

You could also use a classic string- or regex-based
search-and-replace, but with XSLT, you at least get the assurance that
the output will always be well-formed XML without worrying too much
about the details.

First, create a very basic stylesheet that merely copies the input  
XML. Very roughly, in 'classic' XSLT 1.0:

---
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:output method="xml" version="1.0" encoding="utf-8" indent="no"/>

<xsl:template match="/">
   <xsl:apply-templates />
</xsl:template>

<xsl:template match="node()">
   <xsl:copy>
     <xsl:apply-templates select="@* | node()" />
   </xsl:copy>
</xsl:template>

<xsl:template match="text()">
   <xsl:value-of select="normalize-space(.)" />
</xsl:template>

<xsl:template match="@*">
   <xsl:copy-of select="." />
</xsl:template>

</xsl:stylesheet>
---

Then, start adding templates and logic only for those nodes you want  
to see changed. I deliberately added the text() matching template,  
since this will have an undesired effect on the <space> elements in  
the area tree. You probably want to change that to something like:

---
<xsl:template match="text()">
   <xsl:choose>
     <xsl:when test="parent::space or parent::char">
       <xsl:value-of select="." />
     </xsl:when>
     <xsl:when test="parent::word">
       <xsl:value-of select="normalize-space(.)" />
     </xsl:when>
   </xsl:choose>
</xsl:template>
---

This will ignore any text() nodes from the input if they are not  
children of <word>, <space> or <char>, as well as remove leading and  
trailing white-space from the <word> elements (if any). White-space  
characters inside <space> or <char> elements are preserved.

Alternatively, you could split that up into three separate matching
templates, to achieve the same result:

---
<xsl:template match="text()[parent::space or parent::char]">
   <xsl:value-of select="." />
</xsl:template>

<xsl:template match="text()[parent::word]">
   <xsl:value-of select="normalize-space(.)" />
</xsl:template>

<xsl:template match="text()" />
---

To handle the simple substitution for <word> elements you could add  
something like:

---
<xsl:template match="word">
   <xsl:copy>
     <xsl:apply-templates select="@*" />
     <xsl:choose>
       <xsl:when test=". = '*1234567890MLQIS*'">*0000001441*</xsl:when>
       <xsl:otherwise><xsl:apply-templates /></xsl:otherwise>
     </xsl:choose>
   </xsl:copy>
</xsl:template>
---

With a little imagination, you can even make the substitution more  
flexible (by means of xsl:params, or by defining a mapping of such  
substitutions in another XML file and assigning the root node of that  
document to an xsl:variable using XSLT's document() function)
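
A minimal sketch of the mapping-file variant, as a complete stylesheet.
The parameter name 'map-uri', the file name 'barcode-map.xml' and the
<substitutions>/<entry> layout are all made up for the illustration;
only the two barcode strings come from the example above.

---
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:output method="xml" version="1.0" encoding="utf-8" indent="no"/>

<!-- Hypothetical parameter: location of the mapping file. A relative
     URI is resolved against the location of this stylesheet. -->
<xsl:param name="map-uri" select="'barcode-map.xml'" />

<!-- Assumed layout of the mapping file:
     <substitutions>
       <entry from="*1234567890MLQIS*" to="*0000001441*" />
     </substitutions>
-->
<xsl:variable name="map" select="document($map-uri)" />

<!-- Copy everything that is not handled below -->
<xsl:template match="@* | node()">
   <xsl:copy>
     <xsl:apply-templates select="@* | node()" />
   </xsl:copy>
</xsl:template>

<xsl:template match="word">
   <xsl:variable name="old" select="string(.)" />
   <!-- The replacement for this word, or an empty node-set -->
   <xsl:variable name="new"
       select="$map/substitutions/entry[@from = $old]/@to" />
   <xsl:copy>
     <xsl:apply-templates select="@*" />
     <xsl:choose>
       <xsl:when test="$new"><xsl:value-of select="$new" /></xsl:when>
       <xsl:otherwise><xsl:apply-templates /></xsl:otherwise>
     </xsl:choose>
   </xsl:copy>
</xsl:template>

</xsl:stylesheet>
---

The text() templates shown earlier are left out here to keep the
sketch short; they can be added to it unchanged. Passing a different
value for 'map-uri' when invoking the processor selects another
mapping file without touching the stylesheet.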

For the example you gave with the string that is split up over  
multiple <text> elements, I'm not sure how I'd handle that...
I suspect these are already divided in the input FO, where you have  
something like:

<block>*MDA*<page-number />-<page-number-citation ref-id="..." />-<page-number-citation ref-id="..." /></block>

Hence why we get separate <text> elements in the area tree.

If you were to add explicit "id" properties to the
page-number-citation, like:

<block>*MDA*<page-number id="pn-1" />-<page-number-citation
id="pnc-1" ref-id="..." />-<page-number-citation id="pnc-2"
ref-id="..." />

These would turn up in the area tree XML as a 'prod-id' attribute on
the corresponding <text> element. Maybe you can use that to identify
and process the nodes in question. It could even turn out to be
easier if it is guaranteed that the entire sequence will always be in
its own block. In that case, add the id to the fo:block, so you can
identify the <block> element in the area tree by means of its
'prod-id', and write a template that merges the lineArea/text
descendants for those blocks into one. The value of each attribute on
the merged element would be the aggregate of that attribute over all
the merged <text> elements: 'ipd' or 'ipda' would be the sum();
'offset' would be the min(). <text> elements with different
'baseline' are probably better not merged, and 'bap', or
border-and-padding, will require some work since the attribute value
needs to be decoded into four numeric values.
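
A rough, untested sketch of such a merging template. It needs XSLT 2.0
(so Saxon is fine) because of min() and the parameter reference in the
match pattern. The parameter name 'barcode-id' and its default value
are made up, and the nesting block//lineArea/text with <word>,
<space> and <char> children is an assumption about the area tree XML
that should be checked against your actual file. 'ipda' and 'bap' are
simply taken over from the first <text> element here.

---
<?xml version="1.0"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:output method="xml" version="1.0" encoding="utf-8" indent="no"/>

<!-- Hypothetical parameter: the id that was put on the fo:block -->
<xsl:param name="barcode-id" select="'barcode'" />

<!-- Copy everything that is not handled below -->
<xsl:template match="@* | node()">
   <xsl:copy>
     <xsl:apply-templates select="@* | node()" />
   </xsl:copy>
</xsl:template>

<!-- Only lines in the marked block that consist solely of <text>
     elements sharing one 'baseline' value are merged -->
<xsl:template match="block[@prod-id = $barcode-id]//lineArea
                     [text][not(*[not(self::text)])]
                     [count(distinct-values(text/@baseline)) le 1]">
   <xsl:variable name="texts" select="text" />
   <xsl:copy>
     <xsl:apply-templates select="@*" />
     <text>
       <!-- Start from the attributes of the first <text>, then
            overwrite the ones that have to be aggregated -->
       <xsl:copy-of select="$texts[1]/@*" />
       <xsl:if test="$texts/@ipd">
         <xsl:attribute name="ipd" select="sum($texts/@ipd)" />
       </xsl:if>
       <xsl:if test="$texts/@offset">
         <xsl:attribute name="offset" select="min($texts/@offset)" />
       </xsl:if>
       <word>
         <xsl:copy-of select="$texts[1]/word[1]/@*" />
         <!-- Concatenate the pieces in document order -->
         <xsl:value-of select="$texts/(word | space | char)"
             separator="" />
       </word>
     </text>
   </xsl:copy>
</xsl:template>

</xsl:stylesheet>
---

Once the pieces are merged into a single <word>, the simple
substitution template for <word> elements can be applied to the
result in a second pass.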

Still not sure how the white-space will be handled when it is inside  
a <word> element, but it seems to be worth a try.

Good luck!

Cheers

Andreas

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
