[cfaussie] Re: Pull apart a html table

Robin Hilliard Mon, 24 Nov 2008 12:20:32 -0800

Hi Dale,

Here's a function I've used in a few projects and presentations (I
spoke about screen scraping most recently at WOTP). It returns a
two-dimensional array with one entry per match; each entry holds the
whole match followed by its sub expressions:


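<cffunction access="private" returntype="array" name="REScrape">
        <cfargument type="string" name="regex" required="true">
        <cfargument type="string" name="source" required="true">

        <cfscript>

                var resultIndex = 1;
                var result = REFind(regex, source, 1, true);
                var matches = arrayNew(1);
                var terms = 0;

                // Each pass handles one match, then searches again from just past it.
                while (result.pos[1] neq 0) {
                        terms = arrayNew(1);

                        // terms[1] is the whole match; the rest are the sub expressions.
                        for (resultIndex = 1; resultIndex le arrayLen(result.pos); resultIndex++) {
                                // A sub expression that took no part in the match has pos 0,
                                // which mid() rejects, so store an empty string for it.
                                if (result.pos[resultIndex] gt 0)
                                        arrayAppend(terms, mid(source, result.pos[resultIndex], result.len[resultIndex]));
                                else
                                        arrayAppend(terms, "");
                        }

                        arrayAppend(matches, terms);

                        // Advance at least one character so an empty match cannot loop forever.
                        result = REFind(regex, source, result.pos[1] + max(result.len[1], 1), true);
                }

                return matches;

        </cfscript>

</cffunction>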
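
As a rough usage sketch (untested here): this assumes `html` holds the page source as in your reFind snippet, that the code runs inside the component that defines REScrape (it is declared private), and that the `<td>` pattern fits your cells. The variable names `rows`, `cells`, `r` and `c` are just for illustration.

```cfml
<cfscript>
    // Assumption: html holds the page source to scrape.
    // One entry per table row: rows[r][1] is the whole <tr>...</tr> match,
    // rows[r][2] is the text captured by (.*?), i.e. the row contents.
    rows = REScrape("<tr[^>]*>(.*?)</tr>", html);

    for (r = 1; r le arrayLen(rows); r++) {
        // Run the same function over the row contents to split out the cells.
        cells = REScrape("<td[^>]*>(.*?)</td>", rows[r][2]);

        for (c = 1; c le arrayLen(cells); c++) {
            // cells[c][2] is the text between <td> and </td>.
            writeOutput(HTMLEditFormat(trim(cells[c][2])) & "<br>");
        }
    }
</cfscript>
```

Note that REFind is case sensitive, so upper-case tags such as `<TR>` would not match these patterns; swapping REFind for REFindNoCase inside the function is one way around that.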

Cheers,
Robin
                
        ROBIN HILLIARD
Chief Executive Officer
[EMAIL PROTECTED]

RocketBoots Pty Ltd
Level 11
189 Kent Street
Sydney NSW 2001
Australia
Phone +61 2 9323 2507
Facsimile +61 2 9323 2501
Mobile +61 418 414 341
www.rocketboots.com.au  
                


On 24/11/2008, at 4:59 PM, Dale Fraser wrote:

> I think I have it sorted, I didn’t realise that reFind only returned  
> the first occurrence, doh!
>
> Regards
> Dale Fraser
> http://learncf.com
> http://flexcf.com
>
>
> From: [email protected] [mailto:[EMAIL PROTECTED]  
> On Behalf Of Blair McKenzie
> Sent: Monday, 24 November 2008 4:51 PM
> To: [email protected]
> Subject: [cfaussie] Re: Pull apart a html table
>
> You could add an XML declaration and parse it into an XML object.  
> You'll still have to find the table in the HTML though.
>
> Blair
> On Mon, Nov 24, 2008 at 4:45 PM, Steve Onnis  
> <[EMAIL PROTECTED]> wrote:
>
> Would it be easier for you to convert it to a CSV format and process
> it from there?
>
>
>
> <cfscript>
>
>        function TableToCSV () {
>                var table = arguments[1];
>
>                table = REReplaceNoCase(table, "[^[:print:]]", "", "ALL");
>                table = replaceNoCase(table, "</tr><tr>", chr(10), "ALL");
>                table = replaceNoCase(table, "</td><td>", """,""", "ALL");
>                table = replaceNoCase(table, "<td>", """", "ALL");
>                table = replaceNoCase(table, "</td>", """", "ALL");
>                table = REReplaceNoCase(table, "<(table|tbody|thead|tfoot|tr)([^>]*)>", "", "ALL");
>                table = REReplaceNoCase(table, "</(table|tbody|thead|tfoot|tr)([^>]*)>", "", "ALL");
>                return table;
>                }
>
> </cfscript>
>
> <cfsavecontent variable="table">
> <table>
> <tr>
>        <td>Cell 1.1</td>
>        <td>Cell 1.2</td>
>        <td>Cell 1.3</td>
>        <td>Cell 1.4</td>
> </tr>
> <tr>
>        <td>Cell 2.1</td>
>        <td>Cell 2.2</td>
>        <td>Cell 2.3</td>
>        <td>Cell 2.4</td>
> </tr>
> </table>
> </cfsavecontent>
> <cfoutput>
> <pre>#HTMLEditFormat(TableToCSV(table))#</pre>
> </cfoutput>
>
>
>
> ________________________________
>
> From: [email protected] [mailto:[EMAIL PROTECTED]  
> On Behalf Of Dale Fraser
> Sent: Monday, 24 November 2008 4:32 PM
> To: [email protected]
> Subject: [cfaussie] Re: Pull apart a html table
>
>
> I just need to get the content out. I know there is a fixed format to
> the tables: each row has three cells, and I need to extract the info
> from each cell and populate a database.
>
>
>
> I've been playing with regex to get all the rows to start with, but
> I'm having trouble. I have:
>
>
>
> <cfset result = reFind("<tr[^>]*>(.*?)</tr>", html, 1, true) />
>
> <cfdump var="#result#" />
>
>
>
> But it only returns 2 elements in the array and there are hundreds of
> rows.
>
>
>
> Regards
>
> Dale Fraser
> http://learncf.com
> http://flexcf.com
>
>
>
>
>
> From: [email protected] [mailto:[EMAIL PROTECTED]  
> On Behalf Of Steve Onnis
> Sent: Monday, 24 November 2008 4:24 PM
> To: [email protected]
> Subject: [cfaussie] Re: Pull apart a html table
>
>
>
> what are you wanting to do with them?
>
>
>
> ________________________________
>
> From: [email protected] [mailto:[EMAIL PROTECTED]  
> On Behalf Of Dale Fraser
> Sent: Monday, 24 November 2008 4:19 PM
> To: [email protected]
> Subject: [cfaussie] Pull apart a html table
>
> Is there an easy way to pull apart an html table?
>
>
>
> I have a heap of html where I need to loop through the html and get a
> specific table and then loop over the rows and columns.
>
>
>
> I could write all that code, but I feel like I would be reinventing
> the wheel. Is this something that could be done with a regex, or is
> that outside its scope?
>
>
>
> Regards
>
> Dale Fraser
> http://learncf.com
> http://flexcf.com
>
>
>
>
>


