I am having trouble extracting embedded URLs and explicit URLs. To read embedded URLs, I modified PdfReader.java to check whether each string starts with "http" (or "www"); if it matches, the string is recorded:
==================================================
case PRTokeniser.TK_STRING:
    PdfString str = new PdfString(tokens.getStringValue(), null).setHexWriting(tokens.isHexString());
    //System.out.println("PDF Embedded URLs");
    if (strings != null)
    {
        strings.add(str);
        String content_url = str.toString();
        if (content_url.length() >= 10)
        {
            //System.out.println("content = " + content_url);
            content_url = content_url.toLowerCase();
            // record strings that start with "http" or "www", together with the page counter
            if (content_url.startsWith("http") || content_url.startsWith("www"))
            {
                itext_buffer.append(url_count + ", ").append(str).append(" - page no - " + pagenumber).append("\n");
                url_count = url_count + 1;
            }
        } //if
    } //if
    return str;
case PRTokeniser.TK_NAME:
    // count a page each time the name "Page" is read
    if (tokens.getStringValue() != null)
    {
        if (tokens.getStringValue().equals("Page"))
            pagenumber = pagenumber + 1;
    }
    return new PdfName(tokens.getStringValue(), false);
====================================================================
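The prefix check above can be written without comparing characters one by one. A minimal sketch, assuming a hypothetical helper class `UrlPrefix` (not part of iText) and the same rule as the modified reader, i.e. at least 10 characters and an "http" or "www" prefix in any case:

```java
final class UrlPrefix {
    private UrlPrefix() {}

    // True if s starts with prefix, ignoring case; false if s is shorter than prefix.
    static boolean startsWithIgnoreCase(String s, String prefix) {
        return s.regionMatches(true, 0, prefix, 0, prefix.length());
    }

    // Same rule as in the post: length >= 10 and an "http" or "www" prefix.
    static boolean looksLikeUrl(String s) {
        if (s == null || s.length() < 10) {
            return false;
        }
        return startsWithIgnoreCase(s, "http") || startsWithIgnoreCase(s, "www");
    }
}
```

Because `regionMatches` ignores case itself, the original string does not have to be lower-cased before it is tested.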
To extract explicit URLs, I read the document page by page and extract the URLs from each page's content:
=========================================================
for (int k = 1; k <= pages; k++)
{
    //urls_buffer.append(sb);
    sb = new StringBuffer();
    arraydata = pdfreader.getPageContent(k);
    if (arraydata == null)
        return;
    str = new String(arraydata);
    //System.out.println(str.toString());
    Paragraph paragraph = new Paragraph(str.toString());
    document.add(paragraph);
    // [The archive dropped the code between "for(i=0;i" and "= 8)". Judging by the
    // closing "end while" comment, a while loop puts each word of the page into 'string'.]
    while (...)
    {
        if (string.length() >= 8)
        {
            if ((string.indexOf('.') != -1) && ((string.indexOf('h') != -1) || (string.indexOf('w') != -1)))
            {
                //System.out.println("URLs "+string);
                for (int j = 0; j < string.length(); j++)
                {
                    // startsWith(prefix, j) cannot read past the end of the string
                    if (string.startsWith("http", j) || string.startsWith("www", j))
                    {
                        string = string.substring(j);
                        //System.out.println(string);
                        urls_buffer.append(urls_count + ", ").append(string).append(" - page no - " + k).append("\n");
                        urls_count = urls_count + 1;
                        //System.out.println("Writing pDF");
                        //Paragraph paragraph = new Paragraph(string.toString());
                        //Anchor anchor1 = new Anchor("website (external reference)", FontFactory.getFont(FontFactory.HELVETICA, 12, Font.UNDERLINE, new Color(0, 0, 255)));
                        break;
                    }
                }//end for
            }//end if
        }//end if
    }// end while
    //System.out.println("No of Words in "+k+" Page "+words_count);
}//end for
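
The word-by-word scan above could also be done with a regular expression over the whole page text. This is only a sketch: the class `PageUrlScanner` and the pattern are assumptions made for illustration, not iText API, and the pattern is deliberately simple ("http://", "https://" or "www." followed by characters that are not whitespace, brackets or quotes):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PageUrlScanner {
    // Assumed pattern; parentheses are excluded because PDF literal strings are wrapped in them.
    private static final Pattern URL =
            Pattern.compile("(?:https?://|www\\.)[^\\s()<>\"]+", Pattern.CASE_INSENSITIVE);

    private PageUrlScanner() {}

    // Returns the URLs found in pageText, in the order in which they appear.
    static List<String> findUrls(String pageText) {
        List<String> urls = new ArrayList<String>();
        Matcher m = URL.matcher(pageText);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }
}
```

Inside the page loop this would be called once per page, for example `PageUrlScanner.findUrls(new String(arraydata))`, with `k` recorded as the page number for every hit.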
My problem is that I am not able to get the URLs in the PDF in sequence. First the whole document is read using PdfReader, and the embedded URLs are extracted and stored in a String. Then the PDF is read page by page and the explicit URLs are extracted.
Can you suggest a better solution to this problem? How can I extract both kinds of URL together?
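
To make the goal concrete: one possible direction is to keep each URL as a small record with its page number instead of appending to two separate buffers, and to merge the two lists by page at the end. A rough sketch under that assumption follows; the class `UrlHit`, its fields and the "embedded"/"explicit" labels are hypothetical names chosen for illustration, and the counter is assumed to start at 1:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class UrlHit {
    final int page;
    final String kind; // "embedded" or "explicit" (labels chosen for this sketch)
    final String url;

    UrlHit(int page, String kind, String url) {
        this.page = page;
        this.kind = kind;
        this.url = url;
    }

    // Merges both lists and orders the hits by page. List.sort is stable,
    // so hits on the same page keep their relative order (embedded first).
    static List<UrlHit> mergeByPage(List<UrlHit> embedded, List<UrlHit> explicit) {
        List<UrlHit> all = new ArrayList<UrlHit>(embedded);
        all.addAll(explicit);
        all.sort(Comparator.comparingInt((UrlHit h) -> h.page));
        return all;
    }

    // Formats the hits like the buffers in the post: "1, <url> - page no - <page>".
    static String format(List<UrlHit> hits) {
        StringBuilder sb = new StringBuilder();
        int count = 1;
        for (UrlHit h : hits) {
            sb.append(count).append(", ").append(h.url)
              .append(" - page no - ").append(h.page).append("\n");
            count++;
        }
        return sb.toString();
    }
}
```

This only addresses the ordering of the output; it does not change how either kind of URL is found.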
Regards
Anish
--
View this message in context:
http://www.nabble.com/Extract-URL-tf3958556.html#a11232488
Sent from the iText - General mailing list archive at Nabble.com.
