On Sun, Dec 16, 2012 at 12:10 AM, Paul Mena <[email protected]> wrote:
> I'm a Ruby Newbie trying to write a program to process thousands of HTML
> files, extracting pertinent text and inserting it into a MySQL database.
> Ruby seems ideally suited to the task in general, and I've already used
> Nokogiri to extract comment text.  What I need to do next is to print -
> and then ultimately delete or strip - the text between "pre" tags.
>
> Picture some html like this:
>
> <html>
> <head>
> <title>My Title</title>
> </head>
> <body>
> <h1>My Heading</h1>
> <strong>From:</strong>Me<br>
> <strong>Date:</strong> Wed Dec 05 2012 - 18:17:49 EST
> <!-- body="start" -->
> <p>
> text line 1
> <br>
> text line 2
> <br>
> text line 3
> <br>
> <p><pre>
> very important text
> more important text
> would you believe even more important text?
> </pre>
> <p><!-- body="end" -->
> </body>
> </html>
>
> I basically need to do 2 things: 1) to print only the text between the 2
> "pre" tags, and then 2) to print all of the non-tagged text between the
> "body" comments - minus the text between the "pre" tags.  I've been
> messing with this for a couple of hours - unsuccessfully - but I'm still
> convinced that this is the right tool for the job.

If you need to do more HTML and XML manipulation, learning XPath is a
good investment!  You can look here for a start:
http://www.w3schools.com/Xpath/default.asp

_One_ way to achieve what you want:

require 'nokogiri'

text = <<HTML
<html>
<head>
<title>My Title</title>
</head>
<body>
<h1>My Heading</h1>
<strong>From:</strong>Me<br>
<strong>Date:</strong> Wed Dec 05 2012 - 18:17:49 EST
<!-- body="start" -->
<p>
text line 1
<br>
text line 2
<br>
text line 3
<br>
<p><pre>
very important text
more important text
would you believe even more important text?
</pre>
<p><!-- body="end" -->
</body>
</html>
HTML

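# Parse the HTML string into a document tree
dom = Nokogiri.HTML(text)

# 1) only the text inside <pre> elements
puts dom.xpath('/html/body//pre/text()').map(&:to_s)

puts '---'

# 2) every other text node below <body>, skipping anything inside <pre>
puts dom.xpath('/html/body//text()[not(ancestor::pre)]').map(&:to_s)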

You can also process nodes individually if you replace ".map..." with
".each" and a block which receives the node and does something with
it.

Kind regards

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
