from:"Alin Mazilu"

Re: Regarding pdf data extraction

2014-03-03 Thread Alin Mazilu

I don't think that class can help you... All you need is the
PDFTextStripper class...

On Mon, Mar 3, 2014 at 7:15 PM, Divya Muttineni wrote:

> I am trying to convert the tabular data from a pdf file to a text (.txt) file.
> In one of the articles I came across
> org.apache.pdfbox.pdfviewer.PDFPageDrawer.
>
> Can you please help me extend this and override the strokepath()
> method?
>
>
> Thank you,
> Divya
>

Re: Need JBIG2 test image

2014-03-12 Thread Alin Mazilu

I have scanned police accident reports that contain people's names, addresses
and phone numbers. I had a problem printing these files with pdfbox
and had to improvise by running a command-line print utility as a
Process. I could maybe give you one if you agree not to release it to the
public.

Alin

On Wed, Mar 12, 2014 at 1:19 PM, Tilman Hausherr wrote:

> Hello all,
>
> I'd need a PDF with JBIG2 encoding that can be distributed. So it should
> not have anything on it that is copyrighted, i.e. artwork or a real text.
> Just some random lines or a lorem ipsum text. The image should be black &
> white, i.e. not have other elements in it that have a color like a
> watermark. Some unserviced Xerox copiers might produce such images, or some
> software from Adobe, IRIS etc. If you have such a file, send it to me,
> tilman at snafu dot de, not to the list.
>
> I want to use this PDF for a unit test that checks whether the PDF is
> decoded with the JBIG2 plugin. A fail would be an empty image. This way we
> check that the JBIG2 plugin is properly attached.
>
> Tilman
>
>

Problem With MergeUtility

2014-03-13 Thread Alin Mazilu

Hello guys,


Has anyone had any problem with this? Any idea why it happens? What would
be a good value for pushBackSize so this does not happen? Thanks!


Partial stack trace:


org.apache.pdfbox.exceptions.WrappedIOException: Could not push back 72940
bytes in order to reparse stream. Try increasing push back buffer using
system property org.apache.pdfbox.baseParser.pushBackSize
    at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:546)
    at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:566)
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:187)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1071)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1038)
    at org.apache.pdfbox.util.PDFMergerUtility.mergeDocuments(PDFMergerUtility.java:186)
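
The exception message names a system property, so one option is to raise it before any PDF is loaded. The following is only a sketch: the property name comes from the message above, while the class name, the helper and the sizing rule (the next power of two above the reported byte count) are assumptions, and the parser is assumed to read the property when it is constructed.

```java
class PushBackSizeExample {
    // Property name copied from the exception message.
    static final String PUSH_BACK_PROPERTY = "org.apache.pdfbox.baseParser.pushBackSize";

    /**
     * Sets the property to a power of two strictly greater than requiredBytes
     * and returns the value that was set. The sizing rule is an assumption.
     */
    public static int raisePushBackSize(int requiredBytes) {
        int size = Integer.highestOneBit(requiredBytes) << 1;
        System.setProperty(PUSH_BACK_PROPERTY, Integer.toString(size));
        return size;
    }

    public static void main(String[] args) {
        // 72940 is the byte count reported in the stack trace.
        int size = raisePushBackSize(72940);
        System.out.println(PUSH_BACK_PROPERTY + "=" + size);
    }
}
```

The same value can be passed on the command line with the standard JVM flag -Dorg.apache.pdfbox.baseParser.pushBackSize=<bytes>, which avoids any ordering concerns inside the program.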

Re: Problem With MergeUtility

2014-03-13 Thread Alin Mazilu

Where? Here's the code that causes that:

PDFMergerUtility util = new PDFMergerUtility();

for (File file : set) {
    try {
        if (file.exists()) {
            util.addSource(file);
        }
    } catch (Exception e) {
        // log e
    }
}
util.setDestinationFileName(...);

util.mergeDocuments();


On Thu, Mar 13, 2014 at 11:27 AM, Maruan Sahyoun wrote:

> Hi,
>
> not a direct answer to your question but could you try
> PDDocument.loadNonSeq instead?
>
> BR
> Maruan Sahyoun
>
> > On 13.03.2014 at 16:16, Alin Mazilu wrote:
> >
> > Hello guys,
> >
> >
> > Has anyone had any problem with this? Any idea why it happens? What would
> > be a good value for pushBackSize so this does not happen? Thanks!
> >
> >
> > Partial stack trace:
> >
> >
> > org.apache.pdfbox.exceptions.WrappedIOException: Could not push back 72940
> > bytes in order to reparse stream. Try increasing push back buffer using
> > system property org.apache.pdfbox.baseParser.pushBackSize
> >     at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:546)
> >     at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:566)
> >     at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:187)
> >     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1071)
> >     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1038)
> >     at org.apache.pdfbox.util.PDFMergerUtility.mergeDocuments(PDFMergerUtility.java:186)
>

Re: Problem With MergeUtility

2014-03-13 Thread Alin Mazilu

Ok, I will try. In my opinion it would be useful if the instance
variables were protected rather than private; that way the class could be
extended as needed, like PDFTextStripper. In my situation I would only have
to override mergeDocuments(). Anyway, I will try it.

Thank you,

Alin


On Thu, Mar 13, 2014 at 12:52 PM, Timo Boehme wrote:

> Hi,
>
> as far as I remember PDFMergerUtility is one of the last utilities not
> supporting loadNonSeq currently.
>
> As a workaround get the source of PDFMergerUtility, change PDDocument.load
> to PDDocument.loadNonSeq (you may provide null as buffer parameter).
>
>
> Best,
> Timo
>
>
> On 13.03.2014 at 16:46, Alin Mazilu wrote:
>
>> Where? Here's the code that causes that:
>>
>> PDFMergerUtility util = new PDFMergerUtility();
>>
>> for (File file : set) {
>>     try {
>>         if (file.exists()) {
>>             util.addSource(file);
>>         }
>>     } catch (Exception e) {
>>         // log e
>>     }
>> }
>> util.setDestinationFileName(...);
>>
>> util.mergeDocuments();
>>
>>
>> On Thu, Mar 13, 2014 at 11:27 AM, Maruan Sahyoun wrote:
>>
>>  Hi,
>>>
>>> not a direct answer to your question but could you try
>>> PDDocument.loadNonSeq instead?
>>>
>>> BR
>>> Maruan Sahyoun
>>>
>>> On 13.03.2014 at 16:16, Alin Mazilu wrote:
>>>>
>>>> Hello guys,
>>>>
>>>>
>>>> Has anyone had any problem with this? Any idea why it happens? What
>>>> would
>>>> be a good value for pushBackSize so this does not happen? Thanks!
>>>>
>>>>
>>>> Partial stack trace:
>>>>
>>>>
>>>> org.apache.pdfbox.exceptions.WrappedIOException: Could not push back 72940
>>>> bytes in order to reparse stream. Try increasing push back buffer using
>>>> system property org.apache.pdfbox.baseParser.pushBackSize
>>>>     at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:546)
>>>>     at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:566)
>>>>     at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:187)
>>>>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1071)
>>>>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1038)
>>>>     at org.apache.pdfbox.util.PDFMergerUtility.mergeDocuments(PDFMergerUtility.java:186)
>>>
>>>
>>
>
> --
>
>  Timo Boehme
>  OntoChem GmbH
>  H.-Damerow-Str. 4
>  06120 Halle/Saale
>  T: +49 345 4780474
>  F: +49 345 4780471
>  timo.boe...@ontochem.com
>
> _
>
>  OntoChem GmbH
>  Geschäftsführer: Dr. Lutz Weber
>  Sitz: Halle / Saale
>  Registergericht: Stendal
>  Registernummer: HRB 215461
> _
>
>

Re: Problem With MergeUtility

2014-03-13 Thread Alin Mazilu

I know that. No problem.
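
The trade-off discussed in the reply quoted below can be illustrated with a small sketch: the fields stay private, and a single protected method serves as the extension point. Everything here is hypothetical (SketchMerger, NonSeqSketchMerger and load are made-up names that only mimic the shape of the discussion); it is not the PDFBox class and not a proposal for its API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in for a merge utility; not the PDFBox class. */
class SketchMerger {
    // Private: subclasses cannot come to depend on how sources are stored.
    private final List<String> sources = new ArrayList<String>();

    public void addSource(String name) {
        sources.add(name);
    }

    // Read-only view offered to subclasses instead of the field itself.
    protected List<String> getSources() {
        return Collections.unmodifiableList(sources);
    }

    // Narrow extension point: how a single source is loaded.
    protected String load(String name) {
        return "load(" + name + ")";
    }

    // The merge loop itself stays closed to modification.
    public final String mergeDocuments() {
        StringBuilder merged = new StringBuilder();
        for (String name : sources) {
            if (merged.length() > 0) {
                merged.append('+');
            }
            merged.append(load(name));
        }
        return merged.toString();
    }

    public static void main(String[] args) {
        SketchMerger merger = new NonSeqSketchMerger();
        merger.addSource("a.pdf");
        merger.addSource("b.pdf");
        System.out.println(merger.mergeDocuments());
    }
}

/** Subclass that swaps only the loading step. */
class NonSeqSketchMerger extends SketchMerger {
    @Override
    protected String load(String name) {
        return "loadNonSeq(" + name + ")";
    }
}
```

With this shape the storage of the sources can change later without breaking subclasses, while the one step that needed replacing is still overridable.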


On Thu, Mar 13, 2014 at 2:23 PM, John Hewson  wrote:

> Hi Alin
>
> Thanks for your fix.
>
> >  it would be useful if it had the instance
> > variables protected rather than private, that way the class could be
> > extended as needed, like PDFTextStripper.
>
> The problem with making fields protected is that it exposes internal
> implementation details,
> making them part of the public API. This prevents us from making internal
> changes in the
> future without introducing breaking changes to the public API.
>
> In the case of PDFTextStripper, there is a strong use case for using a
> protected field,
> because overriding it is the primary mechanism for custom text extraction.
>
> Cheers
>
> -- John
>
> On 13 Mar 2014, at 10:40, Alin Mazilu  wrote:
>
> > Ok, I will try. In my opinion it would be useful if it had the instance
> > variables protected rather than private, that way the class could be
> > extended as needed, like PDFTextStripper. In my situation I would only have
> > to override mergeDocuments(). Anyway, I will try it.
> >
> > Thank you,
> >
> > Alin
> >
> >
> > On Thu, Mar 13, 2014 at 12:52 PM, Timo Boehme wrote:
> >
> >> Hi,
> >>
> >> as far as I remember PDFMergerUtility is one of the last utilities not
> >> supporting loadNonSeq currently.
> >>
> >> As a workaround get the source of PDFMergerUtility, change PDDocument.load
> >> to PDDocument.loadNonSeq (you may provide null as buffer parameter).
> >>
> >>
> >> Best,
> >> Timo
> >>
> >>
> >> On 13.03.2014 at 16:46, Alin Mazilu wrote:
> >>
> >>> Where? Here's the code that causes that:
> >>>
> >>> PDFMergerUtility util = new PDFMergerUtility();
> >>>
> >>> for (File file : set) {
> >>>     try {
> >>>         if (file.exists()) {
> >>>             util.addSource(file);
> >>>         }
> >>>     } catch (Exception e) {
> >>>         // log e
> >>>     }
> >>> }
> >>> util.setDestinationFileName(...);
> >>>
> >>> util.mergeDocuments();
> >>>
> >>>
> >>> On Thu, Mar 13, 2014 at 11:27 AM, Maruan Sahyoun <sahy...@fileaffairs.de> wrote:
> >>>
> >>> Hi,
> >>>>
> >>>> not a direct answer to your question but could you try
> >>>> PDDocument.loadNonSeq instead?
> >>>>
> >>>> BR
> >>>> Maruan Sahyoun
> >>>>
> >>>> On 13.03.2014 at 16:16, Alin Mazilu wrote:
> >>>>>
> >>>>> Hello guys,
> >>>>>
> >>>>>
> >>>>> Has anyone had any problem with this? Any idea why it happens? What
> >>>>> would
> >>>>> be a good value for pushBackSize so this does not happen? Thanks!
> >>>>>
> >>>>>
> >>>>> Partial stack trace:
> >>>>>
> >>>>>
> >>>>> org.apache.pdfbox.exceptions.WrappedIOException: Could not push back 72940
> >>>>> bytes in order to reparse stream. Try increasing push back buffer using
> >>>>> system property org.apache.pdfbox.baseParser.pushBackSize
> >>>>>     at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:546)
> >>>>>     at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:566)
> >>>>>     at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:187)
> >>>>>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1071)
> >>>>>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1038)
> >>>>>     at org.apache.pdfbox.util.PDFMergerUtility.mergeDocuments(PDFMergerUtility.java:186)
> >>>>
> >>>>
> >>>
> >>
> >> --
> >>
> >> Timo Boehme
> >> OntoChem GmbH
> >> H.-Damerow-Str. 4
> >> 06120 Halle/Saale
> >> T: +49 345 4780474
> >> F: +49 345 4780471
> >> timo.boe...@ontochem.com
> >>
> >> _
> >>
> >> OntoChem GmbH
> >> Geschäftsführer: Dr. Lutz Weber
> >> Sitz: Halle / Saale
> >> Registergericht: Stendal
> >> Registernummer: HRB 215461
> >> _
> >>
> >>
>
>

Re: PDFTextPositions

2014-04-02 Thread Alin Mazilu

You have to extend the PDFTextStripper class and override the
processTextPosition(...) method. From there the logic is up to you. You
can also override the writePage() method to grab the charactersByArticle
Vector and then look for your words in there by iterating over
it. Basically in both cases you will grab all TextPosition objects and
figure out your position and height/width from there.

~Alin

On Wed, Apr 2, 2014 at 6:32 PM, Sireesha Chilakamarri <
sireesha.chary...@gmail.com> wrote:

> Hi,
>
> I would like to Search and Obtain Text Position (X/Y/Width/height) for the
> searched Text.
>
> Suppose text "Hello_World" appears at different location and on different
> pages on the PDF document, I would like to see its X/Y/Width/Height for
> every occurrence.
>
> How do I achieve this?
>
> Thank you,
> Sireesha
>

Re: PDFTextPositions

2014-04-02 Thread Alin Mazilu

Not that I know of. PDFBox provides mostly low-level access to the PDF
format. The only relatively easy way to do it would be to keep the
TextPosition objects and also grab the text output of the PDFTextStripper.
Then you can search the output (a String) for the position of the word you
are looking for and get the position on the PDF page from the corresponding
TextPosition objects. Other than that... I can think of other ways, but they
would take longer to implement. Sorry, I would write a sample, but I'm not
at my desk right now.

Alin
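
The approach described above can be sketched without PDFBox on the classpath. In this sketch Glyph is a hypothetical stand-in for the values one would copy out of each TextPosition, and the searchable string is built from the glyphs themselves so that string indexes line up with list indexes. It also assumes one character per glyph and that a matched word sits on a single line; real stripper output contains inserted spaces and line separators that would need extra bookkeeping.

```java
import java.util.ArrayList;
import java.util.List;

class WordLocator {
    /** Hypothetical stand-in for the values copied out of a TextPosition. */
    static final class Glyph {
        final char ch;
        final float x, y, width, height;

        Glyph(char ch, float x, float y, float width, float height) {
            this.ch = ch;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    /** Lays a string out on one line with a fixed advance; only used to build demo data. */
    public static List<Glyph> glyphsOf(String s, float startX, float y, float advance, float height) {
        List<Glyph> glyphs = new ArrayList<Glyph>();
        for (int i = 0; i < s.length(); i++) {
            glyphs.add(new Glyph(s.charAt(i), startX + i * advance, y, advance, height));
        }
        return glyphs;
    }

    /** Returns one box {x, y, width, height} per occurrence of word. */
    public static List<float[]> locate(List<Glyph> glyphs, String word) {
        StringBuilder text = new StringBuilder();
        for (Glyph g : glyphs) {
            text.append(g.ch);
        }
        List<float[]> boxes = new ArrayList<float[]>();
        if (word.isEmpty()) {
            return boxes;
        }
        for (int i = text.indexOf(word); i >= 0; i = text.indexOf(word, i + 1)) {
            Glyph first = glyphs.get(i);
            Glyph last = glyphs.get(i + word.length() - 1);
            float height = 0f;
            for (int k = i; k < i + word.length(); k++) {
                height = Math.max(height, glyphs.get(k).height);
            }
            // Width spans from the first glyph's left edge to the last glyph's right edge.
            boxes.add(new float[] { first.x, first.y, last.x + last.width - first.x, height });
        }
        return boxes;
    }

    public static void main(String[] args) {
        for (float[] b : locate(glyphsOf("ab ab", 0f, 100f, 10f, 12f), "ab")) {
            System.out.println(b[0] + " " + b[1] + " " + b[2] + " " + b[3]);
        }
    }
}
```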


On Wed, Apr 2, 2014 at 7:01 PM, Sireesha Chilakamarri <
sireesha.chary...@gmail.com> wrote:

> Hi Allin,
>
> I am able to run the PrintTextLocations example. This gives me the
> locations details for every characters.
>
> Is there an easier way to get coordinates for a Word as a whole, instead of
> all its characters?
>
> To Search for Text, I used a method prescribed in
>
> http://www.programming-free.com/2012/11/simple-word-search-in-pdf-files-using.html
> .
>
> Is there an easier way to Search for Text as well?
>
> Are there no direct APIs?
>
> Thank you,
> Sireesha
>
>
> On Wed, Apr 2, 2014 at 3:55 PM, Alin Mazilu  wrote:
>
> > You have to extend the PDFTextStripper class and override the
> > processTextPosition(...) method. From there the logic depends on you. You
> > can also override the writePage() method to grab the charactersByArticle
> > Vector and then you would look for your words in there by iterating over
> > it. Basically in both cases you will grab all TextPosition objects and
> > figure out your position and height/width from there.
> >
> > ~Alin
> >
> >
> > On Wed, Apr 2, 2014 at 6:32 PM, Sireesha Chilakamarri <
> > sireesha.chary...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > I would like to Search and Obtain Text Position (X/Y/Width/height) for the
> > > searched Text.
> > >
> > > Suppose text "Hello_World" appears at different location and on different
> > > pages on the PDF document, I would like to see its X/Y/Width/Height for
> > > every occurrence.
> > >
> > > How do I achieve this?
> > >
> > > Thank you,
> > > Sireesha
> > >
> >
>

Re: PDF file characters x and y coordinates

2014-05-16 Thread Alin Mazilu

I process about 2000 PDF files daily and I have never had an issue with the
coordinates. One piece of advice though: write your own
TextPositionComparator.

~Alin
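
As a rough idea of what a custom comparator could look like, here is a sketch that orders glyphs by line and then from left to right. It does not implement any PDFBox interface: Glyph is a hypothetical stand-in for a text position, y is assumed to grow downward, and lines are approximated by bucketing y with an assumed line height, so glyphs that straddle a bucket boundary can be split onto different lines.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

class ReadingOrder {
    /** Hypothetical stand-in for a text position. */
    static final class Glyph {
        final String text;
        final float x, y;

        Glyph(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Orders by line bucket (top to bottom), then by x (left to right). */
    public static Comparator<Glyph> byLineThenX(final float lineHeight) {
        return new Comparator<Glyph>() {
            @Override
            public int compare(Glyph a, Glyph b) {
                int lineA = (int) Math.floor(a.y / lineHeight);
                int lineB = (int) Math.floor(b.y / lineHeight);
                if (lineA != lineB) {
                    return Integer.compare(lineA, lineB);
                }
                return Float.compare(a.x, b.x);
            }
        };
    }

    /** Sorts a copy of the glyphs and concatenates their text. */
    public static String sortAndJoin(List<Glyph> glyphs, float lineHeight) {
        List<Glyph> sorted = new ArrayList<Glyph>(glyphs);
        Collections.sort(sorted, byLineThenX(lineHeight));
        StringBuilder out = new StringBuilder();
        for (Glyph g : sorted) {
            out.append(g.text);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = Arrays.asList(
                new Glyph("W", 32f, 700.4f),
                new Glyph("H", 0f, 701f),
                new Glyph("a", 5f, 10f));
        System.out.println(sortAndJoin(glyphs, 12f));
    }
}
```

Bucketing keeps the ordering transitive, which a "within tolerance" comparison on raw y values would not guarantee.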

On Fri, May 16, 2014 at 8:39 AM, Simer P  wrote:

> I just needed to confirm this with you guys.
>
> Can the X and Y coordinates returned in the
> processTextPosition(TextPosition text) ever be incorrect ?
>
> Because it doesn't really matter in what order the text is extracted ... if
> the x and y coordinates are accurate then I can rearrange the characters
> based on the applications requirements.
>
> So can the X and Y coordinates ever be wrong ?
>
> Cheers
>

Re: Problem with processTextPosition

2014-05-17 Thread Alin Mazilu

What are the x and y coordinates of H and W?

Alin Mazilu
SKE GlobalTech, LLC
3250 West Market St. Suite 307D
Fairlawn, OH 44333

Sent from my Galaxy S3
On May 17, 2014 2:42 AM, "DImuthu Upeksha" 
wrote:

> Hi all,
>
> I was trying to manually feed text position objects to the
> processTextPosition method in the PDFTextStripper class. I created a
> subclass of PDFTextStripper and overrode the processStream method. In
> processStream I manually created two text position objects for the
> characters "W" and "H". At the end I passed them to processTextPosition
>
> processTextPosition(textPosition1);
> processTextPosition(textPosition2);
>
> Then I tested it using
>
> PDFTextStripper ocrStripper = new PDFOCRTextStripper();
> PDDocument document = PDDocument.load("some pdf file");
> String data = ocrStripper.getText(document);
> System.out.println(data);
>
> Output was : H W
>
> Then I changed the sequence of passing TextPosition objects in [1]
>
> processTextPosition(textPosition2);
> processTextPosition(textPosition1);
>
> Output was : WH
>
> --
>
> As far as I understood processTextPosition works with the text
> position metadata like x and y co-ordinates of the input text. It
> should not depend on the order of the input sequence. But in this case it
> seems like the processTextPosition method works according to the order of
> input.
> Ex. If I input W first, it prints W first without considering its
> actual position.
>
> Is this the normal behaviour? Or am I missing something here?
>
> [1] https://gist.github.com/DImuthuUpe/5dcfa9758f017794c649
> --
> Regards
>
> W.Dimuthu Upeksha
> Undergraduate
>
> Department of Computer Science And Engineering
>
> University of Moratuwa, Sri Lanka
>

Re: Problem with processTextPosition

2014-05-17 Thread Alin Mazilu

Hello,

I commented on the gist. You have to call setSortByPosition(true) in the
constructor right after super(). Be careful with your coordinate system.
When you do textPosition1.getY() you get 792, not 0. I don't remember
exactly where, but there is a class that uses the lower left corner of the
page as the origin (0,0), not the upper left corner as is natural.

I hope that helps.

Alin

PS Is the OCR going to be pure Java or will you be writing it in another
language and using native calls?


On Sat, May 17, 2014 at 8:13 AM, DImuthu Upeksha  wrote:

> Hi Alin,
>
> You can find my source code from here
> https://gist.github.com/DImuthuUpe/5dcfa9758f017794c649
> As you can see I set
> X-offset : 0 and Y-offset : 0 for "H"
> X-offset : 32 and Y-offset : 0 for "W"
> in Text Matrices. Is that enough? Is there other way to set X,Y
> co-ordinates?
>
>
> On Sat, May 17, 2014 at 12:18 PM, Alin Mazilu  wrote:
> > What are the x and y coordinates of H and W?
> >
> > Alin Mazilu
> > SKE GlobalTech, LLC
> > 3250 West Market St. Suite 307D
> > Fairlawn, OH 44333
> >
> > Sent from my Galaxy S3
> > On May 17, 2014 2:42 AM, "DImuthu Upeksha" 
> > wrote:
> >
> >> Hi all,
> >>
> >> I was trying to manually feed text position objects to the
> >> processTextPosition method in the PDFTextStripper class. I created a
> >> subclass of PDFTextStripper and overrode the processStream method. In
> >> processStream I manually created two text position objects for the
> >> characters "W" and "H". At the end I passed them to processTextPosition
> >>
> >> processTextPosition(textPosition1);
> >> processTextPosition(textPosition2);
> >>
> >> Then I tested it using
> >>
> >> PDFTextStripper ocrStripper = new PDFOCRTextStripper();
> >> PDDocument document = PDDocument.load("some pdf file");
> >> String data = ocrStripper.getText(document);
> >> System.out.println(data);
> >>
> >> Output was : H W
> >>
> >> Then I changed the sequence of passing TextPosition objects in [1]
> >>
> >> processTextPosition(textPosition2);
> >> processTextPosition(textPosition1);
> >>
> >> Output was : WH
> >>
> >> --
> >>
> >> As far as I understood processTextPosition works with the text
> >> position metadata like x and y co-ordinates of the input text. It
> >> should not depend on the order of the input sequence. But in case It
> >> seems like processTextPosition method works according to order of
> >> input.
> >> Ex. If I input W first, it prints W first without considering it's
> >> actual position.
> >>
> >> Is this the normal behaviour? Or am I missing something here?
> >>
> >> [1] https://gist.github.com/DImuthuUpe/5dcfa9758f017794c649
> >> --
> >> Regards
> >>
> >> W.Dimuthu Upeksha
> >> Undergraduate
> >>
> >> Department of Computer Science And Engineering
> >>
> >> University of Moratuwa, Sri Lanka
> >>
>
>
>
> --
> Regards
>
> W.Dimuthu Upeksha
> Undergraduate
>
> Department of Computer Science And Engineering
>
> University of Moratuwa, Sri Lanka
>

Re: [DISCUSS] Switch to java 1.6

2013-04-28 Thread Alin Mazilu

Hello,

I got one: JavaFX. I use PDFBox in projects that use JavaFX 1.7/1.8.

Alin


On Sun, Apr 28, 2013 at 1:35 PM, Andreas Lehmkuehler wrote:

> Hi,
>
> there was already a discussion about switching to java 1.6. As this is a
> very
> important topic I'd like to move the discussion to a separate thread.
>
> There are a lot of good reasons to switch to java 1.6 and until now
> everybody
> agrees to do the switch.
>
> Is there anybody who has at least one good reason not to go on and switch
> to
> java 1.6?
>
> BR
> Andreas Lehmkühler
>

Re: [DISCUSS] Switch to java 1.6

2013-04-30 Thread Alin Mazilu

JavaFX became part of the main Java download in version 1.7 and it will
have the same version number as Java. I am using PDFBox 1.7.1 in all my
projects at the moment. My initial response was because I misread the
"switching" to java 1.6 part, and I thought that future versions of PDFBox
would not work on any other version of Java. I am going to have to make
something like a PDF plugin for the JavaFX WebView control and a PDF viewer
based on JavaFX technology, and I got scared, because I really like PDFBox
and I don't want to change to another library. It turns out that I can
breathe normally now... :))

On Tue, Apr 30, 2013 at 1:03 PM, Thomas Chojecki  wrote:

>
> Quoting Alin Mazilu:
>
>  Hello,
>>
> Hi,
>
>
>  I got one: JavaFX. I use PDFBox in projects that use JavaFX 1.7/1.8.
>>
> I tried to find this JavaFX version to see what Java version it needs, but I
> can't figure out where to download it.
> Wikipedia [1] does not list such a version. Can you please provide more
> detailed information or test your project with a JRE 1.6 or higher?
>
> The next big question is, do you use the latest pdfbox version in your
> project? If there are no problems you can stay at the 1.8.1 version.
>
> So please give us more details.
>
>
> Best Regards
> Thomas
>
>
> [1] http://en.wikipedia.org/wiki/JavaFX
>
>

PDF Text Highlight

2013-07-26 Thread Alin Mazilu

Hello all,

I have a bit of a situation on my hands. Here it is: I have a bunch of PDF
files sitting in a folder somewhere. What I have to do is search all of
them for certain names and highlight those names with a yellow marker-like
background and then I have to send all PDFs to a printer.

I have done the searching and text extraction and the printing, but for the
life of me, I can't figure out how to do the highlighting. What makes it
even harder is that I have hundreds of these PDFs per day and human
interaction is out of the question. It has to be a push of a button.

Any ideas? I appreciate it.

Alin Mazilu

Re: PDF Text Highlight

2013-07-27 Thread Alin Mazilu

Thank you very much! It does work. The only thing is that you have to
use yellowStream.getCOSObject() instead of yellowStream in your last line.
Also, the PDPageContentStream.fillRect(x, y, w, h) method uses the bottom
left corner of the page as the origin (0,0), which is the PDF default,
while the TextPosition coordinates from text extraction are measured from
the upper left corner. But that's not a problem as it's fixable with simple
arithmetic.

Thank you so much for your help. It would have taken me a long time to
figure it out on my own, if ever.

Alin Mazilu
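
The "simple arithmetic" amounts to flipping the y axis against the page height. A minimal sketch, assuming the search step yields the top edge of the box measured from the top of the page; the class and method names are made up, and 792 points is the height of a US Letter page.

```java
class RectFlip {
    /**
     * Converts the top edge of a box, measured downward from the top of the
     * page, into the y of its lower edge measured upward from the bottom.
     */
    public static float toLowerLeftY(float pageHeight, float topY, float boxHeight) {
        return pageHeight - topY - boxHeight;
    }

    public static void main(String[] args) {
        float pageHeight = 792f; // US Letter, in points
        // A 12 pt high box whose top edge is 100 pt below the top of the page.
        System.out.println(toLowerLeftY(pageHeight, 100f, 12f));
    }
}
```

The x coordinate and the width need no conversion, since both systems measure x from the left edge.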


On Fri, Jul 26, 2013 at 6:19 PM, Fred Hansen  wrote:

> Caveat: I've not tried this; nor anything like it. I am answering because
> figuring out how to do it was a challenge.
>
> Presumably your program has variables 'page' and 'document' where the
> rectangle goes and variables llx, lly, w, and h delimiting the rectangle.
>
> Here's some code that might work.  (UNTESTED)
>
> // first construct a stream that draws a yellow rectangle
> //  at the desired coordinates, but on a temporary page
> PDPage tempPage = new PDPage();
> PDPageContentStream tempStream =
>     new PDPageContentStream(document, tempPage);
> tempStream.setNonStrokingColor(255, 255, 0);  // yellow in RGB (0,255,255 is cyan)
> tempStream.fillRect(llx, lly, w, h);          // where to put rect
> tempStream.close();
>
> // now get a handle on the stream (I hope it is not an array)
> PDStream yellowStream = tempPage.getContents();
>
> // get the contents of the page
> COSDictionary dict = page.getCOSDictionary();
> COSBase pageStream = dict.getDictionaryObject("Contents");
>
> // make sure the contents are a COSArray
> COSArray pageStreamArray;
> if (pageStream instanceof COSStream) {
>     pageStreamArray = new COSArray();
>     pageStreamArray.add(pageStream);
>     dict.setItem("Contents", pageStreamArray);
> } else {
>     pageStreamArray = (COSArray) pageStream;
> }
>
> // now we add yellowStream at the front of page.getContents()
> //   (in front so text is later drawn on top of it)
> pageStreamArray.add(0, yellowStream );
>
>   --
>  *From:* Alin Mazilu 
> *To:* dev@pdfbox.apache.org
> *Sent:* Friday, July 26, 2013 12:33 PM
> *Subject:* PDF Text Highlight
>
> Hello all,
>
> I have a bit of a situation on my hands. Here it is: I have a bunch of PDF
> files sitting in a folder somewhere. What I have to do is search all of
> them for certain names and highlight those names with a yellow marker-like
> background and then I have to send all PDFs to a printer.
>
> I have done the searching and text extraction and the printing, but for the
> life of me, I can't figure out how to do the highlighting. What makes it
> even harder is that I have hundreds of these PDFs per day and human
> interaction is out of the question. It has to be a push of a button.
>
> Any ideas? I appreciate it.
>
> Alin Mazilu
>
>
>

Re: PDFTextStripper's writeLine() must be protected!

2013-11-15 Thread Alin Mazilu

Hello,

I would venture to guess that if you need to override that method you
probably need to do something more complicated than just finding out where
a line starts and where it ends. Because if you just need to get the
beginning and end of each line, you can call setLineSeparator() and all
the setXxxStart() and setXxxEnd() methods and then grab the "output", which
is protected, so you have access to it. If you set the line separator, the
paragraph start and end, the page start and end, etc., you can easily make
out where the lines start and end.

Perhaps if you gave a little more detail about what it is you are trying to
accomplish, my help could be a little more meaningful. I've been using
pdfbox for a long time in quite a few projects and I have never had the
need to override writeLine. The library is quite well thought out.

Regards,

Alin
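
To illustrate the marker idea: if the line separator is set to a distinctive token, the extracted text can be cut into lines afterwards with plain string handling. This sketch covers only that second step. The marker value, the class name and the sample string are assumptions, and the string is a stand-in for whatever the stripper would produce with that separator configured.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class LineSplitter {
    // Hypothetical marker; any token that cannot occur in the document text will do.
    static final String LINE_MARK = "<<EOL>>";

    /** Splits marker-delimited stripper output into its non-empty lines. */
    public static List<String> splitLines(String output) {
        List<String> lines = new ArrayList<String>();
        for (String line : output.split(Pattern.quote(LINE_MARK))) {
            if (!line.isEmpty()) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Stand-in for text extracted with LINE_MARK as the line separator.
        String output = "first line" + LINE_MARK + "second line" + LINE_MARK;
        for (String line : splitLines(output)) {
            System.out.println("[" + line + "]");
        }
    }
}
```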

On Fri, Nov 15, 2013 at 9:14 PM, Edson Alves Pereira wrote:

> Hello guys, I was just trying to extend PDFTextStripper to capture the
> whole line of a page from a simple PDF and I ran into a problem: the
> method writeLine() is private, which makes it impossible for me to tell where
> a line ends without going down to the TextPosition and PDF objects.
>
> Could it be protected?
>
> Regards,
> Edson
>

Error printing...

2014-01-22 Thread Alin Mazilu

Hello all,

I am printing some PDFs and I am getting this:

Jan 22, 2014 12:07:47 PM org.apache.pdfbox.filter.JBIG2Filter decode
SEVERE: Can't find an ImageIO plugin to decode the JBIG2 encoded datastream.
Jan 22, 2014 12:07:47 PM org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap getRGBImage
SEVERE: Something went wrong ... the pixelmap doesn't contain any data.
Jan 22, 2014 12:07:47 PM org.apache.pdfbox.util.operator.pagedrawer.Invoke process
WARNING: getRGBImage returned NULL
Jan 22, 2014 12:07:47 PM org.apache.pdfbox.filter.JBIG2Filter decode
SEVERE: Can't find an ImageIO plugin to decode the JBIG2 encoded datastream.
Jan 22, 2014 12:07:47 PM org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap getRGBImage
SEVERE: Something went wrong ... the pixelmap doesn't contain any data.
Jan 22, 2014 12:07:47 PM org.apache.pdfbox.util.operator.pagedrawer.Invoke process
WARNING: getRGBImage returned NULL
Jan 22, 2014 12:07:47 PM org.apache.pdfbox.filter.JBIG2Filter decode
SEVERE: Can't find an ImageIO plugin to decode the JBIG2 encoded datastream.
Jan 22, 2014 12:07:47 PM org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap getRGBImage
SEVERE: Something went wrong ... the pixelmap doesn't contain any data.
Jan 22, 2014 12:07:47 PM org.apache.pdfbox.util.operator.pagedrawer.Invoke process
WARNING: getRGBImage returned NULL

Is there a quick way to fix this? Is there a JBIG2 plugin? I really need to
fix it today or I'm in trouble. :)

Thank you,

Alin

Re: Error printing...

2014-01-22 Thread Alin Mazilu

Thank you for your quick responses, but the application is a self-contained
JavaFX application packaged with its own JRE and is independent of the JRE
installed on the OS. So I think I need to package the JAI libraries, but I
have no idea how :D Any thoughts?

Thank you,

Alin
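
ImageIO discovers reader plugins through service-provider entries in JARs on the classpath, so for a self-contained application it may be enough to bundle the plugin JAR with the application's other JARs. Whether that worked can be checked at startup with the JDK alone. In this sketch the class and method names are made up, and the format name "JBIG2" is an assumption about what the plugin registers.

```java
import javax.imageio.ImageIO;

class ImageReaderCheck {
    /** Returns true if ImageIO currently has a reader registered for the format name. */
    public static boolean hasReaderFor(String formatName) {
        // Re-scan the classpath so plugin JARs added by the packager are picked up.
        ImageIO.scanForPlugins();
        return ImageIO.getImageReadersByFormatName(formatName).hasNext();
    }

    public static void main(String[] args) {
        // "JBIG2" is assumed to be the name the plugin registers.
        System.out.println("JBIG2 reader available: " + hasReaderFor("JBIG2"));
        System.out.println("PNG reader available: " + hasReaderFor("png"));
    }
}
```

Logging the result once at startup makes a missing plugin visible before the first print job fails.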


On Wed, Jan 22, 2014 at 1:48 PM, John Hewson  wrote:

> Yes, there is. Simply Google "JBIG2 plugin" and follow the first link, it
> will be called "jbig2-imageio".
>
> -- John
>
> On 22 Jan 2014, at 09:16, Alin Mazilu  wrote:
>
> > Hello all,
> >
> > I am printing some PDFs and I am getting this:
> >
> > Jan 22, 2014 12:07:47 PM org.apache.pdfbox.filter.JBIG2Filter decode
> > SEVERE: Can't find an ImageIO plugin to decode the JBIG2 encoded datastream.
> > Jan 22, 2014 12:07:47 PM org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap getRGBImage
> > SEVERE: Something went wrong ... the pixelmap doesn't contain any data.
> > Jan 22, 2014 12:07:47 PM org.apache.pdfbox.util.operator.pagedrawer.Invoke process
> > WARNING: getRGBImage returned NULL
> > Jan 22, 2014 12:07:47 PM org.apache.pdfbox.filter.JBIG2Filter decode
> > SEVERE: Can't find an ImageIO plugin to decode the JBIG2 encoded datastream.
> > Jan 22, 2014 12:07:47 PM org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap getRGBImage
> > SEVERE: Something went wrong ... the pixelmap doesn't contain any data.
> > Jan 22, 2014 12:07:47 PM org.apache.pdfbox.util.operator.pagedrawer.Invoke process
> > WARNING: getRGBImage returned NULL
> > Jan 22, 2014 12:07:47 PM org.apache.pdfbox.filter.JBIG2Filter decode
> > SEVERE: Can't find an ImageIO plugin to decode the JBIG2 encoded datastream.
> > Jan 22, 2014 12:07:47 PM org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap getRGBImage
> > SEVERE: Something went wrong ... the pixelmap doesn't contain any data.
> > Jan 22, 2014 12:07:47 PM org.apache.pdfbox.util.operator.pagedrawer.Invoke process
> > WARNING: getRGBImage returned NULL
> >
> > Is there a quick way to fix this? Is there a JBIG2 plugin? I really need to
> > fix it today or I'm in trouble. :)
> >
> > Thank you,
> >
> > Alin
>
>

Re: Html to Pdf

2014-09-05 Thread Alin Mazilu

Since we are suggesting alternatives, I use iText for converting HTML into
PDF. Here is an example:
http://www.rgagnon.com/javadetails/java-html-to-pdf-using-itext.html

Hope that helps,

Alin

On Fri, Sep 5, 2014 at 1:50 PM, John Hewson  wrote:

> Rendering HTML is very complex, you basically need to use a modified web
> browser.
>
> You might want to try PhantomJS http://phantomjs.org/screen-capture.html
> which can produce PDFs.
>
> -- John
>
> On 5 Sep 2014, at 01:08, Emre Türker  wrote:
>
> > Hi,
> >
> >
> >
> > I want to export HTML text to PDF using PDFBox. How can I do it?
> >
> > Please help me.
> >
> >
> >
> > Emre.
> >
>
>
