Re: extracting hrefs

Parker Thompson Thu, 03 Jul 2003 17:07:35 -0700

FYI, I was making this harder than it needed to be.  I'm posting my
solution here for anyone who wants it, and I'd welcome any suggestions
for improvement:


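import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

// Assumed package for POI's HDF extractor; adjust to match your POI version.
import org.apache.poi.hdf.extractor.WordDocument;

public class PoiDocParser {

        public static void main(String[] args){

                Writer out = new StringWriter();

                try{
                        // Dump the document's text into an in-memory buffer.
                        WordDocument w = new WordDocument("/home/parkert/test.doc");
                        w.writeAllText(out);
                }catch(IOException e){
                        e.printStackTrace();
                }

                String page = out.toString();
                char quote = '"';

                // This relies on links appearing in the text as: HYPERLINK "target"
                int currentPos = page.indexOf("HYPERLINK");
                while(currentPos >= 0){

                        int linkStart = page.indexOf(quote, currentPos) + 1;
                        int linkEnd = page.indexOf(quote, linkStart);
                        if(linkStart == 0 || linkEnd < 0){
                                break; // no complete quoted target left
                        }

                        String hyperlink = page.substring(linkStart, linkEnd);
                        System.out.println("link: '" + hyperlink + "'");

                        currentPos = page.indexOf("HYPERLINK", linkEnd + 1);
                }
        }
}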
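
For anyone who wants the links back as a list rather than printed, here
is a minimal sketch of the same scan pulled out into a method that takes
the already-extracted text.  The class and method names
(HyperlinkFieldScanner, extractLinks) are made up for illustration and
are not part of POI; it assumes, as the code above does, that each link
shows up in the text as HYPERLINK "target".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a POI class: scans plain text for HYPERLINK "target" runs.
final class HyperlinkFieldScanner {

    private static final String MARKER = "HYPERLINK";

    // Returns the quoted target following each HYPERLINK marker, in document order.
    static List<String> extractLinks(String text) {
        List<String> links = new ArrayList<>();
        int pos = text.indexOf(MARKER);
        while (pos >= 0) {
            int open = text.indexOf('"', pos + MARKER.length());
            if (open < 0) {
                break; // marker with no opening quote after it
            }
            int close = text.indexOf('"', open + 1);
            if (close < 0) {
                break; // unterminated target; stop rather than guess
            }
            links.add(text.substring(open + 1, close));
            pos = text.indexOf(MARKER, close + 1);
        }
        return links;
    }

    public static void main(String[] args) {
        String sample = "see HYPERLINK \"http://www.archive.org/\" and HYPERLINK \"docs/faq.html\"";
        for (String link : extractLinks(sample)) {
            System.out.println("link: '" + link + "'");
        }
    }
}
```

Keeping the scan separate from the POI call means it can be tried on any
string, and the caller decides what to do with relative targets.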

later,

pt.
-- 
Parker Thompson
The Internet Archive
510.541.0125

On Thu, 3 Jul 2003, Parker Thompson wrote:

|Hello,
|
|I am trying to figure out whether POI's HDF stuff will do what I need and 
|am hoping someone here has some experience/insight.
|
|Background: I'm working on a web crawler in Java and we're hoping to be
|able to get links out of Word documents (among others).  Our primary
|concern is coverage (we want to get everything), but we are also
|concerned about efficiency, to a lesser degree.
|
|My basic question, and I apologize that it's not more specific (I blame it
|on the scant javadocs), is whether the HDF stuff is well suited for this
|at all, and even if it is, whether it might be overkill.  For example, it
|seems like the Java equivalent of 'strings <file>' and a regexp might be
|good enough, but this might miss things like relative links.
|
|In the best-case I'd have a class/classes that allowed me to fetch an
|array of all URIs in a word doc, which I could then iterate through.
|
|Thanks in advance for any suggestions,
|
|pt.
|
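
For comparison, the "strings plus regexp" idea from the quoted message
could look roughly like the sketch below.  Everything in it is an
assumption for illustration: the class name RawUrlScanner, the choice to
decode the bytes as ISO-8859-1, and the URL pattern itself.  It only
finds absolute http/https/ftp URLs stored as single-byte text, so it
misses relative links, as the original message suspected, and it also
misses anything the file stores as two-byte characters.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: treats a file's raw bytes as text and greps for absolute URLs.
final class RawUrlScanner {

    // Scheme, then a run of characters that are legal in a URI (illustrative pattern).
    private static final Pattern URL = Pattern.compile(
            "(?:https?|ftp)://[A-Za-z0-9\\-._~:/?#\\[\\]@!$&'()*+,;=%]+");

    static List<String> findUrls(byte[] data) {
        // ISO-8859-1 maps every byte to exactly one char, so binary data decodes losslessly.
        String text = new String(data, StandardCharsets.ISO_8859_1);
        List<String> urls = new ArrayList<>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }

    public static void main(String[] args) throws java.io.IOException {
        // Pass the path to a document as the first argument.
        byte[] data = java.nio.file.Files.readAllBytes(java.nio.file.Paths.get(args[0]));
        for (String url : findUrls(data)) {
            System.out.println(url);
        }
    }
}
```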

