extracting hrefs

Parker Thompson Thu, 03 Jul 2003 10:57:39 -0700

Hello,

I am trying to figure out whether POI's HDF stuff will do what I need and 
am hoping someone here has some experience/insight.


Background: I'm working on a web crawler in Java, and we're hoping to be
able to get links out of Word documents (among others).  Our primary
concern is coverage: we want to get everything.  Efficiency matters too,
though to a lesser degree.

My basic question, and I apologize that it's not more specific (I blame it
on the scant javadocs), is whether the HDF stuff is well-suited for this
at all, and even if it is, whether it might be overkill.  For example, it
seems like the Java equivalent of 'strings <file>' and a regexp might be
good enough, but this might miss things like relative links.
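
To make the 'strings <file>' idea concrete, here is a rough sketch of what
I have in mind.  It is only an illustration, not anything from POI: the
class and method names (StringsLinkScanner, printableRuns, extractUrls) are
made up, the minimum run length of 4 is an arbitrary choice, and it assumes
a modern JDK.  It only looks at runs of 8-bit printable ASCII, so it would
not see text stored as 16-bit characters, and it only matches absolute
http, https and ftp URLs, which is exactly the relative-link gap mentioned
above.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: a strings(1)-style scan followed by a URL regexp.
class StringsLinkScanner {
    // Absolute URLs only; a relative link has no scheme to anchor on.
    private static final Pattern URL =
        Pattern.compile("(?:https?|ftp)://[^\\s\"<>]+");

    /** Collects runs of at least minLen printable ASCII bytes. */
    static List<String> printableRuns(byte[] data, int minLen) {
        List<String> runs = new ArrayList<>();
        int start = -1; // index where the current run began, or -1 if none
        for (int i = 0; i <= data.length; i++) {
            // Bytes above 0x7e are negative in Java, so they fail this test.
            boolean printable =
                i < data.length && data[i] >= 0x20 && data[i] <= 0x7e;
            if (printable) {
                if (start < 0) start = i;
            } else if (start >= 0) {
                if (i - start >= minLen) {
                    runs.add(new String(data, start, i - start,
                                        StandardCharsets.US_ASCII));
                }
                start = -1;
            }
        }
        return runs;
    }

    /** Returns every absolute URL found in the printable runs, in order. */
    static List<String> extractUrls(byte[] data) {
        List<String> urls = new ArrayList<>();
        for (String run : printableRuns(data, 4)) {
            Matcher m = URL.matcher(run);
            while (m.find()) {
                urls.add(m.group());
            }
        }
        return urls;
    }
}
```

Usage would be something like reading the whole file into a byte array
(for example with java.nio.file.Files.readAllBytes) and passing it to
extractUrls.  Duplicates are not removed, and trailing punctuation inside
a run may end up attached to a URL.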

In the best case, I'd have a class (or classes) that lets me fetch an
array of all URIs in a Word doc, which I could then iterate through.

Thanks in advance for any suggestions,

pt.
-- 
Parker Thompson
The Internet Archive
510.541.0125



