FYI, I was making this harder than it needed to be. I'm posting my
solution here for anyone who wants it, and I'd welcome any suggestions
for improvement:

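import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

// Assumed package for WordDocument (POI's HDF extractor code).
import org.apache.poi.hdf.extractor.WordDocument;

public class PoiDocParser {
    public static void main(String[] args) {
        Writer out = new StringWriter();
        try {
            // Dump the document's text; HYPERLINK field codes show up in it.
            WordDocument w = new WordDocument("/home/parkert/test.doc");
            w.writeAllText(out);
        } catch (IOException e) {
            e.printStackTrace();
            return; // nothing to scan if the document could not be read
        }

        String page = out.toString();
        char quote = '"';

        // Each link appears in the text as: HYPERLINK "target"
        int currentPos = page.indexOf("HYPERLINK");
        while (currentPos >= 0) {
            int linkStart = page.indexOf(quote, currentPos) + 1;
            int linkEnd = page.indexOf(quote, linkStart);
            if (linkStart == 0 || linkEnd < 0) {
                break; // no complete quoted target after this HYPERLINK
            }
            String hyperlink = page.substring(linkStart, linkEnd);
            System.out.println("link: '" + hyperlink + "'");
            currentPos = page.indexOf("HYPERLINK", linkEnd + 1);
        }
    }
}
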
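Since what I originally asked for was a way to get all the links back as
a collection, here is a rough sketch of the same scan done with
java.util.regex and returned as a list. The class name HyperlinkFields
and its extract() method are made up for illustration and are not part of
POI. It assumes the text has already been pulled out of the document as a
String, and that each link looks like HYPERLINK "target".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of POI: collects the quoted target of each
// HYPERLINK field found in text already extracted from a Word document.
class HyperlinkFields {
    // HYPERLINK, optional whitespace, then a double-quoted target.
    private static final Pattern FIELD =
            Pattern.compile("HYPERLINK\\s*\"([^\"]*)\"");

    static List<String> extract(String text) {
        List<String> links = new ArrayList<>();
        Matcher m = FIELD.matcher(text);
        while (m.find()) {
            links.add(m.group(1)); // the text between the quotes
        }
        return links;
    }
}
```

Calling HyperlinkFields.extract(page) on the string built above would
give the links as a List<String>. Note that this pattern is stricter than
the indexOf loop: it only matches when the opening quote comes right
after HYPERLINK (plus whitespace), so a field with anything between the
keyword and the quoted target would be skipped.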
later,
pt.
--
Parker Thompson
The Internet Archive
510.541.0125
On Thu, 3 Jul 2003, Parker Thompson wrote:
|Hello,
|
|I am trying to figure out whether POI's HDF stuff will do what I need and
|am hoping someone here has some experience/insight.
|
|Background: I'm working on a web crawler in java and we're hoping to be
|able to get links out of word documents (among others). Our primary
|concern is coverage, we want to get everything, but we are also concerned
|about efficiency to a lesser degree.
|
|My basic question, and I apologize that it's not more specific (I blame it
|on the scant javadocs), is whether the hdf stuff is well-suited for this
|at all, and even if it is, whether it might be overkill. For example, it
|seems like the java equivalent of 'strings <file>' and a regexp might be
|good enough, but this might miss things like relative links.
|
|In the best-case I'd have a class/classes that allowed me to fetch an
|array of all URIs in a word doc, which I could then iterate through.
|
|Thanks in advance for any suggestions,
|
|pt.
|