RE: REReplace and RegExp

James Ang Tue, 23 Apr 2002 11:39:27 -0700

Here are the URLs for the parsers I mentioned:

Java-based (AND C++,Perl,COM) for the Anti-MS camp:
http://xml.apache.org/


Strictly COM-based, for those following the Dark M$ path:
http://www.microsoft.com/xml/

The Apache XML Xalan project supports XPath in its latest iteration. MS
XML 4.0 supports XPath as well.
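
As a rough illustration of what XPath-style access looks like, here is a
minimal sketch in Python rather than CFML. It uses the standard library's
xml.etree.ElementTree, which supports only a limited XPath subset, so it
stands in for Xalan or MS XML and is not their API. The XML is a trimmed
copy of the citation document quoted later in this thread.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the citation document from this thread.
CITATION_XML = """<?xml version="1.0"?>
<citation>
  <accno>60793</accno>
  <agents>
    <agent primary="NO" aid="157" acutter="O469">Oldenburg, Claes Thure, b.1929</agent>
    <agent aid="2481" primary="NO" acutter="B847">Bruggen, Coosje van, b.1942</agent>
    <agent primary="YES" aid="9387" acutter="B343">Berger, Patrick, b. 1947</agent>
  </agents>
</citation>"""

root = ET.fromstring(CITATION_XML)

# Path expression: every <agent> under <agents>, in document order.
agent_ids = [agent.get("aid") for agent in root.findall("./agents/agent")]

# Attribute predicate: the order of attributes inside the tag is irrelevant.
primary = root.find("./agents/agent[@primary='YES']")

print(agent_ids)
print(primary.get("aid"), primary.text)
```

Because the parser exposes attributes by name, the "attributes in any
order" problem disappears once a real XML parser is in use.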

Both of these implementations are pretty big and can carry a performance
hit at run time. In my opinion, they would not scale properly if used on
a page that gets hit a lot. Both implementations have memory leak issues
on certain configurations (I have seen an early Apache parser die after
parsing and transforming 10,000 to 50,000 documents on a particular JVM
that I can't remember off the top of my head). The MS XML stuff is
notorious for memory leaks under certain patterns of usage. I am sure
the leaks have been reduced or eliminated in the latest iteration.

I think you may have to batch the processing and store it somewhere.
Have the batch run every so often to refresh the data.

I doubt the Oracle dudes would do better than the Apache or the
Microsoft XML teams. These two implementations are the best out there. I
would not be surprised if Oracle used or incorporated a portion of the
Apache parser.

Good luck! :)

----------------------------
James Ang
Senior Programmer
MedSeek, Inc.
[EMAIL PROTECTED]


-----Original Message-----
From: Troy Simpson [mailto:[EMAIL PROTECTED]] 
Sent: Tuesday, April 23, 2002 11:07 AM
To: CF-Talk
Subject: RE: REReplace and RegExp


James,

Thanks again.  Great food for thought.

Is there a performance hit with either of the XML Parsers you suggested?
If so, what has been your experience? Do these XML Parsers use something
called "XPath" to get the data from the XML Document?

Background:
I have created a pre-joined result set of all the records (<CITATION>)
and stored them in an Oracle9i table (a.k.a. Materialized View). This
basically takes all the records and joins them together into a
denormalized spreadsheet to eliminate the expensive joins at runtime.
One of the columns in my table contains an XML document compiled from
all the tables and columns used for TEXT indexing and searching.

Like I said, the documents are stored in an Oracle9i database table
column as an XML document of type XMLType (clob or Character Large
Object).  There is some functionality in the database to parse the XML
Document but as of now, I do not know how elaborate or expansive these
functions are.  Ideally, I am trying to make the Oracle9i Database do
the work for me before returning the result set to the ColdFusion
Application Server for processing.  So far I've been able to get a list
of <agent> tags for all my <agents> in a <Citation>.

I wonder whether the XML functionality in the Oracle9i database is as
good as that of the two parsers you suggested?

I also have the option to store the values in separate columns. The
problem I am running into is that multiple <agent> tags can exist for
each <CITATION>, and I only want one record in the table (a.k.a.
Materialized View) for each <CITATION>.

XML Document
--------------------------------------
<?xml version="1.0"?>
<citation>
  <accno>60793</accno>
  <title>Parc Andre-Citroen</title>
  <settitle sid="929">Paris: Parks and Gardens, 1615-1992</settitle>
  <callnumber>lGxFR B343a A53ed05</callnumber>
  <agents>
    <agent primary="NO" aid="157" acutter="O469">Oldenburg, Claes Thure,
b.1929</agent>
    <agent aid="2481" primary="NO" acutter="B847">Bruggen, Coosje van,
b.1942</agent>
    <agent primary="YES" aid="9387" acutter="B343">Berger, Patrick, b.
1947</agent>
    <agent primary="NO" aid="9388" acutter="V668">Viguier,
Jean-Paul</agent>
    <agent acutter="J474" primary="NO" aid="9389">Jodry,
Jean-Francois</agent>
    <agent primary="NO" aid="9390" acutter="C585">Clement,
Gilles</agent>
    <agent primary="NO" aid="9391">Provost, Alain</agent>
  </agents>
</citation>
-------------------------------------
Thanks
Troy

------------------------------------------
Troy Simpson
Applications Analyst/Programmer - MCSE, OCP DBA
North Carolina State University Libraries
Campus Box 7111 | Raleigh | North Carolina
ph.919.515.3855 | fax.919.513.3330
[EMAIL PROTECTED]

-----Original Message-----
From: James Ang [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, April 23, 2002 1:14 PM
To: CF-Talk
Subject: Re: REReplace and RegExp


Troy,

What you need is a 2-part parser. There isn't an easy way unless you
decide to use MS XML Parser or the Apache.org Java parser to parse the
XML.

If you decide not to use the Apache or the MS XML parsers, here is how
your tag parser would work:

Step 1: Retrieve a start tag one at a time:
<agent([[:space:]]*|[[:space:]]+[^>]*)>

Step 2: Retrieve the individual attributes of the tag retrieved in step
1

Step 3: Perform transformation of the attributes in step 2.

Step 4: Perform transformation of the tag in Step 1

Step 5: Place the transformed string back into the XML input stream OR
place the transformed string into your output stream.

Step 6: If not end of file/stream, go to Step 1.
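
The steps above can be sketched roughly as follows. This is a Python
illustration, not the attached CFAS 5 code, and it makes two assumptions:
it matches a whole <agent>...</agent> element at once (rather than only
the start tag from Step 1), and it emits the results.cfm?c=aid&q=...
link format requested earlier in this thread. The names AGENT_RE,
ATTR_RE, agent_to_anchor and agents_to_anchors are hypothetical.

```python
import re

# Step 1: one <agent ...>text</agent> element at a time.
# The optional group must start with whitespace, so <agents> is not matched.
AGENT_RE = re.compile(r'<agent(\s[^>]*)?>([^<]*)</agent>', re.IGNORECASE)

# Step 2: individual name="value" pairs inside the start tag, in any order.
ATTR_RE = re.compile(r'([A-Za-z_][\w.-]*)\s*=\s*"([^"]*)"')


def agent_to_anchor(match):
    """Steps 2-4: read the attributes, then rebuild the element as an <a> tag."""
    attrs = {name.lower(): value
             for name, value in ATTR_RE.findall(match.group(1) or "")}
    aid = attrs.get("aid")
    if aid is None:
        # Assumption: an <agent> without an aid is left untouched.
        return match.group(0)
    # Collapse line breaks that mail wrapping left inside the element text.
    text = " ".join(match.group(2).split())
    return '<a href="results.cfm?c=aid&q=%s">%s</a>' % (aid, text)


def agents_to_anchors(fragment):
    """Steps 5-6: substitute every match and return the transformed string."""
    return AGENT_RE.sub(agent_to_anchor, fragment)


sample = ('<agent primary="NO" aid="157" acutter="O469">Oldenburg, Claes\n'
          'Thure, b.1929</agent> <agent aid="2481" acutter="B847"\n'
          'primary="NO">Bruggen, Coosje van, b.1942</agent>')
print(agents_to_anchors(sample))
```

Splitting the work into an element pattern and a separate attribute
pattern is what makes the attribute order irrelevant: the second pattern
collects whatever attributes are present before the link is built.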

Attached is some sample code that might help. It is meant for CFAS 5. I
wrote it thinking that it would solve your problem until I re-read your
posting. :P The attached file should provide some insight, I hope. :)
For the code to work in CFAS 4.5.x, you will need to convert the UDF to
Custom Tags.

Good luck. :)

Back to *real* work. (This list is too much fun.)

James Ang
Senior Programmer
MedSeek, Inc.
[EMAIL PROTECTED]


----- Original Message -----
From: "Troy Simpson" <[EMAIL PROTECTED]>
To: "CF-Talk" <[EMAIL PROTECTED]>
Sent: Tuesday, April 23, 2002 8:04 AM
Subject: RE: REReplace and RegExp


James,

Thanks for the response.  It has given me other ideas about how to
approach this.

It appears that the solution you provided only replaces the tags, which
is part of the desired solution.  I also need to obtain the values of
the attributes and put them in different attributes for the <a> tag.
The real kicker is that the attributes in the <agent> tag can be in any
order.  For example:

From this:
<agent primary="NO" aid="157" acutter="O469">Oldenburg, Claes Thure,
b.1929</agent>
<agent aid="2481" acutter="B847" primary="NO">Bruggen, Coosje van,
b.1942</agent>

To this:
<a href="results.cfm?c=aid&q=157">Oldenburg, Claes Thure, b.1929</a> <a
href="results.cfm?c=aid&q=2481">Bruggen, Coosje van, b.1942</a>

So far, I've come up with this, which is not complete (it still assumes
the attributes arrive in a fixed order):
  <cfset agentList = REReplaceNoCase(
    agentList,
    '<agent[[:space:]]+primary="(YES|NO)"[[:space:]]+aid="([0-9]*)"'
      & '[[:space:]]+acutter="([a-z0-9]*)">([^<]*)</agent>',
    '<a href="results.cfm?c=aid&q=\2">\4</a>',
    "ALL")>

Background:
I am using ColdFusion 4.51 on a Windows2000/IIS5 server.  The <agent>
tags come from an XML document that looks like this:

<?xml version="1.0"?>
<citation>
  <accno>60793</accno>
  <title>Parc Andre-Citroen</title>
  <settitle sid="929">Paris: Parks and Gardens, 1615-1992</settitle>
  <callnumber>lGxFR B343a A53ed05</callnumber>
  <agents>
    <agent primary="NO" aid="157" acutter="O469">Oldenburg, Claes Thure,
b.1929</agent>
    <agent aid="2481" primary="NO" acutter="B847">Bruggen, Coosje van,
b.1942</agent>
    <agent primary="YES" aid="9387" acutter="B343">Berger, Patrick, b.
1947</agent>
    <agent primary="NO" aid="9388" acutter="V668">Viguier,
Jean-Paul</agent>
    <agent acutter="J474" primary="NO" aid="9389">Jodry,
Jean-Francois</agent>
    <agent primary="NO" aid="9390" acutter="C585">Clement,
Gilles</agent>
    <agent primary="NO" aid="9391">Provost, Alain</agent>
  </agents>
</citation>


Thanks,
Troy

------------------------------------------
Troy Simpson
Applications Analyst/Programmer - MCSE, OCP DBA
North Carolina State University Libraries
Campus Box 7111 | Raleigh | North Carolina
ph.919.515.3855 | fax.919.513.3330
[EMAIL PROTECTED]

-----Original Message-----
From: James Ang [mailto:[EMAIL PROTECTED]]
Sent: Monday, April 22, 2002 5:45 PM
To: CF-Talk
Subject: RE: REReplace and RegExp


Try this:

REReplaceNoCase(agents, "(</?)agent([[:space:]]*>|[[:space:]]+[^>]*>)",
"\1a\2", "ALL")

I have tested this code on CFAS 5 on WinXP.
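
For readers without a ColdFusion server at hand, the same substitution
can be tried in Python. This is only an equivalent sketch: Python's \s
stands in for the POSIX class [[:space:]], re.IGNORECASE stands in for
the NoCase variant, and the names TAG_RE and rename_agent_tags are
hypothetical.

```python
import re

# Same pattern as the REReplaceNoCase call above: group 1 is "<" or "</",
# group 2 is everything after the tag name up to and including ">".
TAG_RE = re.compile(r'(</?)agent(\s*>|\s+[^>]*>)', re.IGNORECASE)


def rename_agent_tags(text):
    """Rename <agent ...> and </agent> to <a ...> and </a>, keeping attributes."""
    # \g<1> is used instead of \1 so the literal "a" cannot be misread.
    return TAG_RE.sub(r'\g<1>a\g<2>', text)


print(rename_agent_tags('<agent primary="NO" aid="157">Oldenburg</agent>'))
```

Note that this only renames the tag; the original attributes are carried
over unchanged, and the enclosing <agents> element is not matched.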

James Ang
Senior Programmer
MedSeek, Inc.


-----Original Message-----
From: Troy Simpson [mailto:[EMAIL PROTECTED]]
Sent: Monday, April 22, 2002 2:15 PM
To: CF-Talk
Subject: REReplace and RegExp


Dear CF-Talkers:

I have a string in the following format (I've added carriage returns for
readability):

<cfset agents =
'<agent primary="NO" aid="157" acutter="O469">Oldenburg, Claes Thure,
b.1929</agent>' & '<agent primary="NO" aid="2481"
acutter="B847">Bruggen, Coosje van, b.1942</agent>' & '<agent
primary="YES" aid="9387" acutter="B343">Berger, Patrick, b.
1947</agent>'
>

I want to process the string to look like this (replace the AGENT tags
with ANCHOR tags):

<cfset agents =
'<a href="results.cfm?c=aid&q=157">Oldenburg, Claes Thure, b.1929</a>' &
'<a href="results.cfm?c=aid&q=2481">Bruggen, Coosje van, b.1942</a>' &
'<a href="results.cfm?c=aid&q=9387">Berger, Patrick, b. 1947</a>'
>

I have partly accomplished this with the code below, but it still needs
some work and I have become a little lost.

<cfset agents = REReplaceNoCase(
    agents,
    '<agent[[:space:]]+primary="(YES|NO)"[[:space:]]+aid="([0-9]*)"'
      & '[[:space:]]+acutter="([a-z0-9]*)">([^<]*)</agent>',
    ' <a href="results.cfm?c=aid&q=\2">\4</a> ',
    "ALL")>

**Another problem is that the AGENT attributes can be in any order, which
really throws a wrench into things.

Does anyone have any advice on how to approach this?
I would really appreciate it.

Thanks,
Troy


------------------------------------------
Troy Simpson
Applications Analyst/Programmer - MCSE, OCP-DBA
North Carolina State University Libraries
Campus Box 7111 | Raleigh | North Carolina
ph.919.515.3855 | fax.919.513.3330
[EMAIL PROTECTED]





