Hi Patric and Chiyean,

Thanks for your replies; they definitely help. I didn't get back to you earlier because I couldn't find an internet connection for the past few days.

Chiyean: I have checked the ExtractText.java example; in fact, I did that before I asked the question. It's just that the PDDocument parameter seems to be used only by the writeText method. The PDFTextStripper object may still have no idea which document it is processing when other methods are called.

Patric: Thanks for reminding me to trace back through the parent classes. But I still have some problems. For example, without the ExtractText.java example I would never have figured out what the "encoding" parameter is or what its options are; the javadoc only says that it is a String. Another example is the processStream method: one of its parameters is a COSStream, but I have no idea what that is. It extends COSDictionary, a class that "represents a dictionary where name/value pairs reside". But the documentation never says how a COSStream or a dictionary relates to a PDF file, and among all the methods of COSStream and COSDictionary I don't see any that tell these objects which PDF file is being processed.

I feel I must be missing something, but I don't know what, and it leaves me confused about what is going on. How do you figure out how those things (like COSStream, COSDictionary, encoding, PDResources, keys...) correspond to a PDF file? Do I need to go through the PDF file format documentation to understand that? Please help me out. Thanks.
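To make the "encoding" question concrete, here is a minimal sketch of how I currently read it. It assumes the string is simply a Java charset name that ends up in an OutputStreamWriter, which is my guess from the ExtractText.java example and not something the javadoc states. The class EncodingSketch and the helper encode are hypothetical names of my own, and the sketch uses only the standard library, no PDFBox classes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;

class EncodingSketch {

    // Hypothetical helper: writes text through a Writer configured with the
    // given charset name, the way I assume the "encoding" string is used.
    public static byte[] encode(String text, String encoding) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        // Charset.forName throws an unchecked exception for an unknown name.
        try (Writer writer = new OutputStreamWriter(bytes, Charset.forName(encoding))) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        String sample = "caf\u00e9"; // one non-ASCII character
        for (String encoding : new String[] {"ISO-8859-1", "UTF-8", "UTF-16BE"}) {
            // The same text yields a different number of bytes per charset.
            System.out.println(encoding
                    + " supported=" + Charset.isSupported(encoding)
                    + " bytes=" + encode(sample, encoding).length);
        }
    }
}
```

If that reading is right, the valid options would be whatever charset names the running JVM supports, which Charset.availableCharsets() lists.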

Best,

Felix


Omar Chiyean wrote:
Hi Patric, have you seen the examples
in the distribution?

Check org.apache.pdfbox.ExtractText.java
It shows how to use this class.

What I can say is that you need a PDDocument Handler.
Check the example; it would be very helpful.

Cheers...
