Hi Andreas,
Thanks for your reply. What I am looking for, though, is to process the
formatting information of the document further, rather than simply
extract the text. So basically, what I am trying to do is let the
stripper object know which document it is processing when the writeText
method is not called. Do you have any idea how to do this?
Best,
Felix
Andreas Lehmkühler wrote:
Hi,
Shen Wang wrote:
Hi guys,
I have run into a strange problem that I can't get to work. Here is the
code:
import java.io.File;
import java.io.IOException;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFText2HTML;
import org.apache.pdfbox.util.PDFTextStripper;

public class PDF_Title {

    public PDF_Title() {
    }

    public static void main( String[] args ) throws IOException {
        if ( args.length != 1 ) {
            System.out.println( "bad input" );
        }
        String pdfFileName = args[ 0 ];
        PDDocument document = PDDocument.load( pdfFileName );
        PDFTextStripper stripper = null;
        stripper = new PDFText2HTML("UTF-8");
        List pages = document.getDocumentCatalog().getAllPages();
        stripper.processPages(pages);
    }
}
The problem is in the last line. If I leave the argument of processPages
blank, Eclipse reminds me that a list of pages is required and asks me
to fill it in. However, when I pass "pages", Eclipse tells me that the
method processPages from the type PDFTextStripper is not visible and
still refuses to compile. According to the Javadoc, though, processPages
is simply a method of PDFTextStripper that takes a list of pages. Could
you help me figure out where I made the mistake? Thanks.
Try using stripper.writeText(document, outputStream) instead of
stripper.processPages(..).
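
The "not visible" message suggests that processPages is declared
protected, so it can only be reached from a subclass (or the same
package), while writeText is the public entry point. Below is a minimal
sketch of that arrangement in plain Java. It is not PDFBox code:
FakeDocument, BaseStripper, FormatCollector, processRun and collectFonts
are hypothetical stand-ins, and I am assuming the real writeText stores
the document before it walks the pages. Because everything here sits in
one file and one package, the sketch cannot reproduce the compile error
itself; it only shows the route that works.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class StripperSketch {

    // Hypothetical stand-in for a document: a name plus "font:text" runs.
    static class FakeDocument {
        final String name;
        final List<String> runs;

        FakeDocument(String name, List<String> runs) {
            this.name = name;
            this.runs = runs;
        }
    }

    // Hypothetical stand-in for the stripper base class.
    static class BaseStripper {
        protected FakeDocument document; // only set by writeText
        protected Writer output;

        // Public entry point: remembers the document and writer,
        // then runs the protected processing loop.
        public void writeText(FakeDocument doc, Writer out) throws IOException {
            document = doc;
            output = out;
            processPages(doc.runs);
        }

        // Protected: meant for subclasses, not for client code.
        protected void processPages(List<String> runs) throws IOException {
            for (String run : runs) {
                processRun(run);
            }
        }

        // Hook called once per run; by default it writes the text part.
        protected void processRun(String run) throws IOException {
            output.write(run.substring(run.indexOf(':') + 1));
        }
    }

    // Subclass that records format information instead of writing text.
    static class FormatCollector extends BaseStripper {
        final List<String> fonts = new ArrayList<>();

        @Override
        protected void processRun(String run) {
            // document is non-null here because writeText set it first
            fonts.add(document.name + ":" + run.substring(0, run.indexOf(':')));
        }
    }

    // Drives the collector through the public method with a throwaway writer.
    static List<String> collectFonts(FakeDocument doc) {
        FormatCollector collector = new FormatCollector();
        try {
            collector.writeText(doc, new StringWriter());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return collector.fonts;
    }

    public static void main(String[] args) {
        FakeDocument doc = new FakeDocument(
                "demo.pdf", Arrays.asList("Helvetica-Bold:Title", "Times:Body"));
        System.out.println(collectFonts(doc));
    }
}
```

The point of the design is that the subclass never calls processPages
itself. It goes through writeText with a writer whose content is simply
ignored, so the base class still knows which document it is working on
while the overridden hook collects whatever format details are needed.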
BR
Andreas Lehmkühler