[iText-questions] Extract URL

anishhp Thu, 21 Jun 2007 06:29:50 -0700

I am having trouble extracting embedded URLs and explicit URLs from a PDF.
To read embedded URLs, I modified PdfReader.java to check whether each string
token starts with "http" (or "www") and, if it does, record the string:
==================================================


case PRTokeniser.TK_STRING:
    PdfString str = new PdfString(tokens.getStringValue(), null).setHexWriting(tokens.isHexString());
    //System.out.println("PDF Embedded URLs");

    if (strings != null)
    {
        strings.add(str);
        String content_url = str.toString();
        if (content_url.length() >= 10)
        {
            //System.out.println("content = " + content_url);
            content_url = content_url.toLowerCase();

            if (content_url.charAt(0) == 'h' && content_url.charAt(1) == 't' &&
                content_url.charAt(2) == 't' && content_url.charAt(3) == 'p')
            {
                itext_buffer.append(url_count+","+" ").append(str).append(" - page no - "+pagenumber).append("\n").toString();
                url_count = url_count + 1;
            }

            if (content_url.charAt(0) == 'w' && content_url.charAt(1) == 'w' &&
                content_url.charAt(2) == 'w')
            {
                //System.out.println(str.toString());
                itext_buffer.append(url_count+","+" ").append(str).append(" - page no - "+pagenumber).append("\n").toString();
                url_count = url_count + 1;
            }
        } //if
    } //if

    return str;

case PRTokeniser.TK_NAME:

    if (tokens.getStringValue() != null)
    {
        if (tokens.getStringValue().equals("Page"))
            pagenumber = pagenumber + 1;
    }

    return new PdfName(tokens.getStringValue(), false);

====================================================================
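The prefix test above can be written more compactly. The following is only a sketch: the class name UrlPrefix and the exact prefix list ("http://", "https://", "www.") are assumptions for illustration, not part of iText, and they are slightly stricter than the "http" / "www" character checks in the modified tokeniser code.

```java
// Hypothetical helper, not an iText class.
final class UrlPrefix {
    // Assumed prefixes; adjust to whatever should count as a URL.
    private static final String[] PREFIXES = {"http://", "https://", "www."};

    private UrlPrefix() {}

    /** Returns true if s starts with one of PREFIXES, ignoring case. */
    static boolean looksLikeUrl(String s) {
        if (s == null) {
            return false;
        }
        for (String prefix : PREFIXES) {
            // regionMatches with ignoreCase=true needs no lowercased copy and
            // returns false (instead of throwing) when s is shorter than prefix.
            if (s.regionMatches(true, 0, prefix, 0, prefix.length())) {
                return true;
            }
        }
        return false;
    }
}
```

With such a helper, the two charAt chains and the length check could collapse into a single call such as UrlPrefix.looksLikeUrl(str.toString()).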
To extract explicit URLs, I read the document page by page and extract the
URLs from each page's content:
=========================================================
for(int k=1;k<=pages;k++)
{
    //urls_buffer.append(sb);
    sb = new StringBuffer();
    arraydata = pdfreader.getPageContent(k);
    if(arraydata == null)
        return;
    if(arraydata != null)
        str = new String(arraydata);
    //System.out.println(str.toString());
    Paragraph paragraph = new Paragraph(str.toString());
    document.add(paragraph);

        for(i=0;i= 8)
        {
            if((string.indexOf('.') != -1)&&((string.indexOf('h') != -1)||(string.indexOf('w') != -1)))
            {
                //System.out.println("URLs "+string);
                        for(int j=0;j= (j+3))
                        {
                            if((string.charAt(j+1) == 't')&&(string.charAt(j+2) == 't')&&(string.charAt(j+3) == 'p'))
                            {
                                string = string.substring(j);
                                //System.out.println(string);
                                urls_buffer.append(urls_count+","+" ").append(string).append(" - page no - "+k).append("\n").toString();
                                urls_count = urls_count + 1;

                                //System.out.println("Writing pDF");
                                //Paragraph paragraph = new Paragraph(string.toString());
                                //Anchor anchor1 = new Anchor("website (external reference)", FontFactory.getFont(FontFactory.HELVETICA, 12, Font.UNDERLINE, new Color(0, 0, 255)));

                                break;
                            }
                        }//end if
                    }//end if
                    else
                    {
                        if(string.charAt(j) == 'w')
                        {
                            if(string.length() >= (j+2))
                            {
                                if((string.charAt(j+1) == 'w')&&(string.charAt(j+2) == 'w'))
                                {
                                    string = string.substring(j);
                                    //System.out.println(string);
                                    urls_buffer.append(urls_count+","+" ").append(string).append(" - page no - "+k).append("\n").toString();
                                    urls_count = urls_count + 1;
                                    break;
                                }
                            }//end if
                        }//end if
                    }
                }//end for
            }//end if
        }//end if
    }// end while
    //System.out.println("No of Words in "+k+" Page "+words_count);
}//end for
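
The character-by-character search for "http" and "www" in the page loop could also be done with a regular expression. This is a minimal sketch using only java.util.regex; the class name TextUrlScanner and the pattern itself are assumptions, and the pattern is deliberately loose (it does not trim trailing punctuation). Note also that getPageContent returns the raw content stream, where text may be split across several string operands or use a font-specific encoding, so a scan like this may miss URLs that are visible on the page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not an iText class.
final class TextUrlScanner {
    // Assumed pattern: "http://", "https://" or "www." (any case), followed by
    // characters up to whitespace, a quote, an angle bracket or a parenthesis.
    // Parentheses are excluded because they delimit strings in content streams.
    private static final Pattern URL =
            Pattern.compile("(?i)\\b(?:https?://|www\\.)[^\\s<>\"()]+");

    private TextUrlScanner() {}

    /** Returns every URL-like substring of text, in order of appearance. */
    static List<String> findUrls(String text) {
        List<String> urls = new ArrayList<String>();
        if (text == null) {
            return urls;
        }
        Matcher m = URL.matcher(text);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }
}
```

Inside the page loop, the result of TextUrlScanner.findUrls(new String(arraydata)) could be appended to urls_buffer together with the page number k.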


My problem is that I cannot get the URLs out of the PDF in sequence. The
program first reads the whole document through PdfReader, extracts the
embedded URLs, and stores them in a string. It then reads the PDF page by page
and extracts the explicit URLs.

Can you suggest a better solution to this problem? How can I extract both
kinds of URL together?
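
One way to picture "both kinds together" is a single loop over the pages that collects embedded and explicit URLs for page 1, then for page 2, and so on, into one list. The sketch below shows only that ordering logic. PageSource, Hit and UrlCollector are hypothetical names, not iText types, and how linkUris and textUrls would be filled from the PDF is left open (in PDF, link targets are typically stored as URI actions in a page's annotations, which would allow reading them per page instead of in the tokeniser).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these types exist in iText.
final class UrlCollector {

    /** Assumed abstraction over whatever reads the PDF. Pages are 1-based. */
    interface PageSource {
        int pageCount();
        List<String> linkUris(int page);  // embedded URLs of that page
        List<String> textUrls(int page);  // explicit URLs found in that page's text
    }

    /** One URL together with the page it was found on and its kind. */
    static final class Hit {
        final int page;
        final String kind;
        final String url;

        Hit(int page, String kind, String url) {
            this.page = page;
            this.kind = kind;
            this.url = url;
        }

        public String toString() {
            return page + " " + kind + " " + url;
        }
    }

    private UrlCollector() {}

    /** Visits each page once and appends both kinds of URL in page order. */
    static List<Hit> collect(PageSource src) {
        List<Hit> hits = new ArrayList<Hit>();
        for (int p = 1; p <= src.pageCount(); p++) {
            for (String u : src.linkUris(p)) {
                hits.add(new Hit(p, "embedded", u));
            }
            for (String u : src.textUrls(p)) {
                hits.add(new Hit(p, "explicit", u));
            }
        }
        return hits;
    }
}
```

Because both kinds go into the same list while the page counter advances, no separate whole-document pass is needed and the page number no longer has to be inferred from name tokens.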

Regards
Anish