I am not a PDF guru, but I have amassed quite a bit of information on
the topic. I have been asking the PDF mavens of the world what the
issues are with parsing PDF, and reading up on the subject to get a
better understanding. I have contributed all of this to the mailing
lists, but coding it is not something I would feel comfortable doing at
this point. It might be best for a coordinated Lucene-Nutch-PDFBox
development effort to produce some good code for this. I am trying to
get some dialog going.
Here is some code that another interested developer asked me to debug.
It uses PDFBox to extract tabular data from a PDF, and it seems to have
some bugs that I am trying to figure out.
try
{
    // Imports needed: java.io.FileOutputStream, java.io.OutputStreamWriter,
    // java.io.Writer, java.util.List, plus the PDFBox classes PDDocument,
    // PDDocumentOutline, PDOutlineItem and PDFTextStripperByArea.
    int i = 1;
    Writer output = null;
    PDDocument document = PDDocument.load( "53 Nostro Ofc Cofc Daily Position_AUS.pdf" );
    PDDocumentOutline root = document.getDocumentCatalog().getDocumentOutline();
    // NOTE: root (no outline) and item (empty outline) can be null; neither is checked here.
    PDOutlineItem item = root.getFirstChild();
    PDOutlineItem item1 = item.getNextSibling();
    // Write the text between each top-level bookmark and its next sibling to simple<i>.txt.
    while( item1 != null )
    {
        System.out.println( "Item:" + item.getTitle() );
        System.out.println( "Item1:" + item1.getTitle() );
        output = new OutputStreamWriter( new FileOutputStream( "simple" + i + ".txt" ) );
        // NOTE: no region is ever added to this stripper (the size printed below
        // should be 0), so the by-area stripper may have nothing to extract.
        // The plain PDFTextStripper that was commented out here is worth retrying.
        PDFTextStripperByArea stripper = new PDFTextStripperByArea();
        List reg = stripper.getRegions();
        System.out.println( reg.size() );
        //stripper.setSortByPosition(true);
        stripper.setLineSeparator( "\n" );
        stripper.setWordSeparator( " " );
        stripper.setPageSeparator( "\n\n\n\n" );
        stripper.setStartBookmark( item );
        stripper.setEndBookmark( item1 );
        stripper.writeText( document, output );
        output.close(); // was missing: every file except the last was left open
        i++;
        item = item.getNextSibling();
        item1 = item1.getNextSibling();
    }
    // Last top-level bookmark: use its last child, if any, as the end bookmark.
    PDOutlineItem child = item.getFirstChild();
    PDOutlineItem child1 = null; // was new PDOutlineItem(), an empty item when there are no children
    while( child != null )
    {
        child1 = child;
        child = child.getNextSibling();
    }
    System.out.println( "Item:" + item.getTitle() );
    System.out.println( "Item1:" + ( child1 == null ? null : child1.getTitle() ) );
    output = new OutputStreamWriter( new FileOutputStream( "simple" + i + ".txt" ) );
    PDFTextStripperByArea stripper = new PDFTextStripperByArea();
    // Was printing an unused boolean flag instead of the separator itself.
    System.out.println( "The word separator is '" + stripper.getWordSeparator() + "'" );
    //stripper.setSortByPosition(true);
    stripper.setLineSeparator( "\n" );
    stripper.setPageSeparator( "\n\n\n\n" );
    stripper.setWordSeparator( " " );
    stripper.setStartBookmark( item );
    if( child1 != null )
    {
        stripper.setEndBookmark( child1 );
    }
    // Dropped setShouldSeparateByBeads(shouldSeparateByBeads()): it set the flag to its own value.
    stripper.writeText( document, output );
    output.close();
    document.close();
}
catch(Exception ex)
{
    ex.printStackTrace(); // keep the stack trace; println(ex) showed only the message
}
}
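
To make the bookmark-walking logic easier to check apart from PDFBox, here
is a minimal sketch of the pairing the loop above is doing: each top-level
bookmark is the start of a range and its next sibling is the end. It works
on plain title strings only. The names BookmarkRanges, Range and pair are
made up for illustration and are not PDFBox API, and treating a null end as
"to the end of the document" is my assumption (the code above instead ends
the last range at the last child bookmark).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BookmarkRanges {
    /** One extraction range. A null end is assumed to mean "to end of document". */
    static final class Range {
        final String start;
        final String end;

        Range(String start, String end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return start + " -> " + (end == null ? "(end of document)" : end);
        }
    }

    /** Pairs each title with its next sibling; the last title gets a null end. */
    static List<Range> pair(List<String> titles) {
        List<Range> ranges = new ArrayList<>();
        for (int i = 0; i < titles.size(); i++) {
            String end = (i + 1 < titles.size()) ? titles.get(i + 1) : null;
            ranges.add(new Range(titles.get(i), end));
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Hypothetical bookmark titles, standing in for PDOutlineItem.getTitle().
        List<String> titles = Arrays.asList("Summary", "Positions", "Totals");
        for (Range r : pair(titles)) {
            System.out.println(r);
        }
    }
}
```

Building the list of ranges first and only then running a stripper once per
range would also remove the copy of the stripper setup that sits after the
loop, and an empty outline would simply yield no ranges.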
-----Original Message-----
From: Jérôme Charron [mailto:[EMAIL PROTECTED]
Sent: Thursday, March 02, 2006 3:42 AM
To: [email protected]; [EMAIL PROTECTED]
Subject: Re: Nutch Parsing PDFs, and general PDF extraction
> This is something google does very well, and something nutch must
> match to compete.
Richard, it seems you are a real pdf guru, so any code contribution to
nutch is welcome.
;-)
Regards
Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers