On Apr 26, 2010, at 7:54 AM, Ashley Sheridan <a...@ashleysheridan.co.uk> wrote:

On Mon, 2010-04-26 at 07:58 -0400, Phpster wrote:

On Apr 26, 2010, at 7:23 AM, Ashley Sheridan
<a...@ashleysheridan.co.uk> wrote:

> On Mon, 2010-04-26 at 13:20 +0200, Peter Lind wrote:
>> On 26 April 2010 12:52, Ashley Sheridan <a...@ashleysheridan.co.uk>
>> wrote:
>>> I've been thinking about this problem for a little while, and the
>>> thing is, I can think of ways of doing it, but they're not very
>>> nice, and I don't think they're going to be fast.
>>> Basically, I have a load of HTML formatted content in a database
>>> that gets displayed onto the site. It's part of a rudimentary CMS.
>>> Currently, the titles for each article are displayed on a page, and
>>> each title links to the full article. However, that leaves me with a
>>> page which is essentially a list of links, and that's not ideal for
>>> SEO. What I wanted to do to enhance the page is to have a short
>>> excerpt of x number of words/characters beneath each article title.
>>> The idea being that search engines will find the page as more than a
>>> link farm, and visitors won't have to just rely on the title alone
>>> for the content.
>>> Here's the rub though. As the content is in HTML form, I can't just
>>> grab the first 100 characters and display them as that could leave
>>> an open tag without a closing one, potentially breaking the page. I
>>> could use strip_tags on the 100-character excerpt, but what if the
>>> excerpt itself broke a tag in half (i.e. <acronym title="something">
>>> could become <acron )
>>> The only solutions I can see are:
>>>     * retrieve the entire article, perform a strip_tags and then
>>>       take the excerpt
>>>     * use a regex inside of mysql to pull out only the text
>>> The thing is, neither of these seems particularly pretty, and I am
>>> sure there's a better way, but it's too early in the week for my
>>> brain to be fully functional I think!
>>> Does anyone have any ideas about what I could do, or do you think
>>> I'm seeing problems where there are none?
>> Use htmltidy or htmlpurifier to clean up things. I.e. grab the amount
>> of content you want, then use one of the tools to repair and clean
>> the html.
>> Regards
>> Peter
>> --
>> <hype>
>> WWW: http://plphp.dk / http://plind.dk
>> LinkedIn: http://www.linkedin.com/in/plind
>> Flickr: http://www.flickr.com/photos/fake51
>> BeWelcome: Fake51
>> Couchsurfing: Fake51
>> </hype>
> Would that work on content that stopped mid-tag? Assuming the original
> copy is:
> <p>This is some sentence, with an <abbr title="Abbreviation">abbr</abbr>
> in the middle of it.</p>
> If I was asking for only the first 50 characters, I'd get this:
> <p>This is some sentence, with an <abbr title="Abb
> Would either htmltidy or htmlpurifier be able to handle that? I don't
> mind whether it tries to repair the tag or remove it completely, as
> long as it does something to it.
> Thanks,
> Ash
> http://www.ashleysheridan.co.uk
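
For reference, here is a minimal sketch of the first option from the original post: run strip_tags over the whole article, then take the excerpt from the resulting plain text, so a tag can never be cut in half. The function name make_excerpt, the 100-character default and the trailing "..." are illustrative assumptions, not anything proposed in the thread, and the sketch assumes the stored content is valid UTF-8.

```php
<?php
// Hypothetical helper; the name and the 100-character default are assumptions.
function make_excerpt(string $html, int $maxChars = 100): string
{
    // Strip tags from the complete article so no tag can be cut in half.
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES, 'UTF-8');
    // Collapse runs of whitespace left behind by the removed markup.
    $text = trim(preg_replace('/\s+/u', ' ', $text));

    // The /u modifier makes "." count UTF-8 characters rather than bytes.
    if (preg_match('/^.{0,' . $maxChars . '}$/us', $text)) {
        return htmlspecialchars($text, ENT_QUOTES, 'UTF-8'); // fits as-is
    }

    // Longest prefix within the limit that ends right before a space.
    if (!preg_match('/^.{1,' . $maxChars . '}(?= )/us', $text, $m)) {
        // No space in range (one very long word): fall back to a hard cut.
        preg_match('/^.{0,' . $maxChars . '}/us', $text, $m);
    }
    return htmlspecialchars($m[0], ENT_QUOTES, 'UTF-8') . '...';
}
```

Called on the sample paragraph above with a limit of 30, this should return "This is some sentence, with an..." with no markup left in it. One known limitation: strip_tags removes tags without inserting spaces, so adjacent block elements such as "<p>One</p><p>Two</p>" come out as "OneTwo".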

When looking at the performance side of things, couldn't you add
another column to the table and do this work to tidy / strip tags
during the insert going forward?

Existing data would need a one-time script to clean / tidy it. You
could run this on a nightly cron (depending on how much data there is)
until the new column is filled with clean data.


Sent from my iPod
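
The extra-column suggestion above could look roughly like the following sketch. Everything named here is an assumption made for illustration: the articles table with id, title, body and excerpt columns, the use of PDO, and the helper names plain_excerpt, save_article and backfill_excerpts. It also assumes body is never NULL and that the new excerpt column starts out NULL for existing rows.

```php
<?php
// Hypothetical schema: articles(id, title, body, excerpt). None of these
// names come from the thread; adjust them to the real CMS tables.

// Deliberately simple plain-text excerpt; swap in any excerpt routine.
function plain_excerpt(string $html, int $maxChars = 100): string
{
    $text = trim(preg_replace('/\s+/', ' ', strip_tags($html)));
    return substr($text, 0, $maxChars); // byte-based cut, fine for ASCII text
}

// Going forward: compute the excerpt once, when the article is saved.
function save_article(PDO $db, string $title, string $body): void
{
    $stmt = $db->prepare(
        'INSERT INTO articles (title, body, excerpt) VALUES (?, ?, ?)'
    );
    $stmt->execute([$title, $body, plain_excerpt($body)]);
}

// One-time or nightly job: fill the column for rows that predate it.
// Returns the number of rows processed, so a cron job can stop at 0.
function backfill_excerpts(PDO $db, int $batchSize = 500): int
{
    $rows = $db->query(
        'SELECT id, body FROM articles WHERE excerpt IS NULL LIMIT ' . $batchSize
    )->fetchAll(PDO::FETCH_ASSOC);

    $update = $db->prepare('UPDATE articles SET excerpt = ? WHERE id = ?');
    foreach ($rows as $row) {
        $update->execute([plain_excerpt($row['body']), $row['id']]);
    }
    return count($rows);
}
```

With the column filled, the listing page only needs to select title and excerpt, so the tag stripping cost is paid once per article rather than on every page view.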

That's not a bad idea actually, I hadn't thought of it! I'm kicking myself now, because it's such an obvious solution!


I always prefer simple solutions! It keeps things easy!


Sent from my iPod
