> I was plucking a site which had been formatted in ms word 
> which kept giving the error in the subject. In order to
> get the pluck to work I had to download the site to my
> machine and run it through this sed script:
> s/<![^>]*>//
> (Actually I had to do it twice because I'm not the world's 
> leading expert in sed)

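Try putting a g at the end of the substitution.[0]  That makes
the search and replace global, so every match on a line is
replaced instead of just the first one.  That is probably why
you had to run it twice.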
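
To show the difference, here is a small sketch run against a
made-up sample line.  The sample text and the two function names
are mine, purely for illustration; they are not taken from the
site you were plucking.

```shell
# Hypothetical sample line with two MS Word pseudo directives on it.
sample='<p><![if !supportLists]>1.<![endif]>First item</p>'

# Without g: only the first match on each line is removed.
strip_first() { sed 's/<![^>]*>//'; }

# With g: every match on each line is removed.
strip_all() { sed 's/<![^>]*>//g'; }

printf '%s\n' "$sample" | strip_first
printf '%s\n' "$sample" | strip_all
```

The first command should leave the <![endif]> behind and the
second should remove both.  Note that sed works a line at a time,
so a directive that is split across two lines would not be caught
either way.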

> The problem is msword embeds loads of pseudo directives in
> the form of (something like) <![if ...]>....<![endif]>
> and Plucker is choking on these.

Hmm...  I've tried searching the web to determine whether
those are legal or not.  As best I can tell, they're not.
The only things allowed after a "<![" are marked-section
keywords such as IGNORE, INCLUDE, and CDATA.  So that would
make it an MSWord problem, and one of the uglier ones I've
seen.  *shudder*

> Is this an MSword problem or a Plucker problem? Either way I 
> suspect it will have to be fixed in Plucker 8^(.

I suspect that it won't be fixed in Plucker, and you'll have
to run a post-spidering command to fix the HTML before you
parse it.  You might want to look into Demoroniser [1], a
tool which will make MS-HTML into something more like HTML.
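
If you would rather stay with sed, a post-spidering step could
look roughly like the sketch below.  The function name
clean_ms_html and the choice of file extensions are my own
assumptions, and I have narrowed your pattern to "<![" so that
DOCTYPE declarations and ordinary comments are left alone.
Treat it as a starting point, not a tested fix for your site.

```shell
# clean_ms_html DIR: strip "<![...]>" pseudo directives from every
# .htm/.html file under DIR, rewriting each file in place.
# Assumes file names do not contain newlines.
clean_ms_html() {
  find "$1" -type f \( -name '*.htm' -o -name '*.html' \) |
  while IFS= read -r file; do
    # Matching "<![" rather than "<!" leaves DOCTYPE and comments alone.
    sed 's/<!\[[^>]*>//g' "$file" > "$file.tmp" && mv "$file.tmp" "$file"
  done
}
```

You would run it on the directory you downloaded the site into,
then pluck the cleaned local copy.  Like your original script, it
only catches directives that begin and end on the same line.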

Later,
Blake.

[0] Disclaimer: I'm not the world's leading expert in sed either.
[1] http://www.fourmilab.ch/webtools/demoroniser/
