[jira] [Updated] (PDFBOX-2450) RenderingHints of PageDrawer should be customizable

2014-10-23 Thread Pei-Tang Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pei-Tang Huang updated PDFBOX-2450:
---
Description: 
RenderingHints defaults are hard-coded in 
[PageDrawer|https://github.com/apache/pdfbox/blob/54037862e4c55ab45eb8aecc44b79afbfbcd8dd9/pdfbox/src/main/java/org/apache/pdfbox/rendering/PageDrawer.java#L143-L151]
 now.

However, the defaults are not valid for every situation. For example, I 
have a PDF file containing barcodes, which will be printed on a 
low-resolution dot matrix printer. The {{RenderingHints.VALUE_ANTIALIAS_ON}} 
default confuses all of our barcode readers!

A dirty workaround I use is to extend {{PDFPrinter}} and use reflection to set the 
{{protected}} but {{final}} {{super.renderer}} field with the following extension:

{code:java}
@Override
public void renderPageToGraphics(int pageIndex, Graphics2D graphics, float scale)
        throws IOException {
    // Proxy graphics: applies the hints immediately and suppresses all
    // subsequent setRenderingHint requests.
    Graphics2D hintsAppliedGraphics = new HintsAppliedGraphics2D(graphics, hints);

    super.renderPageToGraphics(pageIndex, hintsAppliedGraphics, scale);
}
{code}

It would be nice if there were a more elegant way to specify {{RenderingHints}}.
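
To illustrate why the anti-aliasing default matters for barcodes, here is a minimal sketch using only java.awt (no PDFBox involved). It draws bars whose edges fall between pixel boundaries and counts pixels that end up neither pure black nor pure white. The class name, bar geometry and image size are assumptions chosen for illustration, not taken from the reported PDF.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.geom.Rectangle2D;
import java.awt.image.BufferedImage;

class AntialiasBarcodeDemo {

    /** Draws bars with fractional edges and counts pixels that are neither pure black nor pure white. */
    static int countGrayPixels(boolean antialias) {
        BufferedImage image = new BufferedImage(64, 16, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = image.createGraphics();
        g.setRenderingHint(RenderingHints.KEY_ANTIALIASING,
                antialias ? RenderingHints.VALUE_ANTIALIAS_ON : RenderingHints.VALUE_ANTIALIAS_OFF);
        g.setColor(Color.WHITE);
        g.fillRect(0, 0, 64, 16);
        g.setColor(Color.BLACK);
        for (int i = 0; i < 8; i++) {
            // Bar edges deliberately fall between pixel boundaries.
            g.fill(new Rectangle2D.Double(4 + i * 7.3, 0, 2.5, 16));
        }
        g.dispose();

        int gray = 0;
        for (int y = 0; y < image.getHeight(); y++) {
            for (int x = 0; x < image.getWidth(); x++) {
                int rgb = image.getRGB(x, y) & 0xFFFFFF;
                if (rgb != 0x000000 && rgb != 0xFFFFFF) {
                    gray++;
                }
            }
        }
        return gray;
    }

    public static void main(String[] args) {
        System.out.println("gray pixels, anti-aliasing on:  " + countGrayPixels(true));
        System.out.println("gray pixels, anti-aliasing off: " + countGrayPixels(false));
    }
}
```

With anti-aliasing off, every pixel is either fully inked or fully blank; with it on, bar edges are blended into intermediate shades, which a low-resolution device may reproduce poorly.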

  was:
RenderingHints defaults are hard-coded in 
[PageDrawer|https://github.com/apache/pdfbox/blob/54037862e4c55ab45eb8aecc44b79afbfbcd8dd9/pdfbox/src/main/java/org/apache/pdfbox/rendering/PageDrawer.java#L143-L151]
 now.

However, the defaults are not valid for every situation. For example, I 
have a PDF file containing barcodes, which will be printed on a 
low-resolution dot matrix printer. The {{RenderingHints.VALUE_ANTIALIAS_ON}} 
default confuses all of our barcode readers!

A dirty workaround I use is to extend {{PDFPrinter}} and use reflection to set the 
{{protected}} but {{final}} {{super.renderer}} field with the following extension:

{code:java}
@Override
public void renderPageToGraphics(int pageIndex, Graphics2D graphics, float scale)
        throws IOException {
    Graphics2D hintsAppliedGraphics = new HintsAppliedGraphics2D(graphics, hints);

    super.renderPageToGraphics(pageIndex, hintsAppliedGraphics, scale);
}
{code}

Here {{HintsAppliedGraphics2D}} is a proxy for the original {{Graphics2D}}: it 
applies the given {{RenderingHints}} immediately and suppresses all 
subsequent {{setRenderingHints}} / {{setRenderingHint}} requests from 
{{PageDrawer}}.

I would appreciate a more elegant way to specify {{RenderingHints}}.


> RenderingHints of PageDrawer should be customizable
> ---
>
> Key: PDFBOX-2450
> URL: https://issues.apache.org/jira/browse/PDFBOX-2450
> Project: PDFBox
>  Issue Type: Improvement
>  Components: Rendering
>Affects Versions: 2.0.0
>Reporter: Pei-Tang Huang
>Priority: Minor
>  Labels: PageDrawer, RenderingHints,



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PDFBOX-2450) RenderingHints of PageDrawer should be customizable

2014-10-23 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14182500#comment-14182500
 ] 

John Hewson commented on PDFBOX-2450:
-

It's not clear where the problem is. Are you saying that setting the anti-alias 
hint to on causes issues with the dot matrix printer driver? If the driver 
doesn't support anti-aliasing then it should ignore the hint.

Could you attach a sample PDF? (More > Attach Files). If I can examine how the 
barcodes are embedded then it might be clearer what the best solution is.

> RenderingHints of PageDrawer should be customizable
> ---
>
> Key: PDFBOX-2450
> URL: https://issues.apache.org/jira/browse/PDFBOX-2450
> Project: PDFBox
>  Issue Type: Improvement
>  Components: Rendering
>Affects Versions: 2.0.0
>Reporter: Pei-Tang Huang
>Priority: Minor
>  Labels: PageDrawer, RenderingHints,





[jira] [Created] (PDFBOX-2450) RenderingHints of PageDrawer should be customizable

2014-10-23 Thread Pei-Tang Huang (JIRA)
Pei-Tang Huang created PDFBOX-2450:
--

 Summary: RenderingHints of PageDrawer should be customizable
 Key: PDFBOX-2450
 URL: https://issues.apache.org/jira/browse/PDFBOX-2450
 Project: PDFBox
  Issue Type: Improvement
  Components: Rendering
Affects Versions: 2.0.0
Reporter: Pei-Tang Huang
Priority: Minor


RenderingHints defaults are hard-coded in 
[PageDrawer|https://github.com/apache/pdfbox/blob/54037862e4c55ab45eb8aecc44b79afbfbcd8dd9/pdfbox/src/main/java/org/apache/pdfbox/rendering/PageDrawer.java#L143-L151]
 now.

However, the defaults are not valid for every situation. For example, I 
have a PDF file containing barcodes, which will be printed on a 
low-resolution dot matrix printer. The {{RenderingHints.VALUE_ANTIALIAS_ON}} 
default confuses all of our barcode readers!

A dirty workaround I use is to extend {{PDFPrinter}} and use reflection to set the 
{{protected}} but {{final}} {{super.renderer}} field with the following extension:

{code:java}
@Override
public void renderPageToGraphics(int pageIndex, Graphics2D graphics, float scale)
        throws IOException {
    Graphics2D hintsAppliedGraphics = new HintsAppliedGraphics2D(graphics, hints);

    super.renderPageToGraphics(pageIndex, hintsAppliedGraphics, scale);
}
{code}

Here {{HintsAppliedGraphics2D}} is a proxy for the original {{Graphics2D}}: it 
applies the given {{RenderingHints}} immediately and suppresses all 
subsequent {{setRenderingHints}} / {{setRenderingHint}} requests from 
{{PageDrawer}}.

I would appreciate a more elegant way to specify {{RenderingHints}}.





[jira] [Commented] (PDFBOX-2447) "Cannot save a document which has been closed" when encrypting

2014-10-23 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14182491#comment-14182491
 ] 

Tilman Hausherr commented on PDFBOX-2447:
-

With your change, it would no longer do what the javadoc says: "This will get 
the document CATALOG. This is guaranteed to not return null"
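
As a rough illustration of what a "guaranteed to not return null" getter can look like, here is a minimal sketch. The classes below are hypothetical stand-ins, not the PDFBox implementation; the point is only that a missing value is created on first access rather than reported as null.

```java
class LazyCatalogSketch {

    /** Hypothetical stand-in for a document catalog; not the PDFBox class. */
    static final class Catalog { }

    private Catalog catalog; // may be unset until first requested

    /** Never returns null: a missing catalog is created on first access. */
    Catalog getCatalog() {
        if (catalog == null) {
            catalog = new Catalog();
        }
        return catalog;
    }
}
```

A caller relying on this contract can skip its own null check, which is why a patch that lets the getter return null changes the documented behaviour.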

> "Cannot save a document which has been closed" when encrypting
> --
>
> Key: PDFBOX-2447
> URL: https://issues.apache.org/jira/browse/PDFBOX-2447
> Project: PDFBox
>  Issue Type: Bug
>  Components: PDModel
>Affects Versions: 2.0.0
> Environment: java7 deb7
>Reporter: Ralf Hauser
> Attachments: patch2447.txt, patch2447a.txt, patch2447b.txt
>
>
> InputStream content = ...;
> int keyLength = 256;
> AccessPermission ap = new AccessPermission();
> StandardProtectionPolicy spp = new StandardProtectionPolicy(symmPw, symmPw, ap);
> spp.setEncryptionKeyLength(keyLength);
> document.protect(spp);
> ByteArrayOutputStream baos = new ByteArrayOutputStream();
> document.save(baos);
> In save(), the above-mentioned exception is thrown (it wasn't with the 
> 2013-11 snapshot).





[jira] [Updated] (PDFBOX-2447) "Cannot save a document which has been closed" when encrypting

2014-10-23 Thread Ralf Hauser (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ralf Hauser updated PDFBOX-2447:

Attachment: patch2447b.txt

Fine, order changed in patch2447b.txt.

Please commit.

> "Cannot save a document which has been closed" when encrypting
> --
>
> Key: PDFBOX-2447
> URL: https://issues.apache.org/jira/browse/PDFBOX-2447
> Project: PDFBox
>  Issue Type: Bug
>  Components: PDModel
>Affects Versions: 2.0.0
> Environment: java7 deb7
>Reporter: Ralf Hauser
> Attachments: patch2447.txt, patch2447a.txt, patch2447b.txt





[jira] [Updated] (PDFBOX-2449) Character missing in text extraction

2014-10-23 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-2449:

Description: 
The attached file produces this text extraction:

1.8.6:
For safe! clean! abundant water?in our homes!
rivers! lakes! and streams?is one of our

1.8.7:
For safe! clean! abundant water?in our homes!
rivers! lakes! and streams?is one of our

1.8.8:
For safe! clean! abundant water?n our homes!
rivers! lakes! and streams?s one of our

2.0:
For safe! clean! abundant water–in our homes!
rivers! lakes! and streams–is one of our

AR:
For safe! clean! abundant water–in our homes!
rivers! lakes! and streams–is one of our

So the "i" has been lost in the 1.8.8 version. (2.0 and AR have a character 
that is invisible when viewing this issue, but can be seen when editing)
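
When comparing outputs across versions like this, a small helper that reports the first differing position can make the regression easier to pin down. The sketch below is plain Java with hypothetical names; the sample strings are copied from the 1.8.7 and 1.8.8 outputs above.

```java
class ExtractionDiff {

    /** Returns the index of the first differing character, or -1 if the strings are equal. */
    static int firstDifference(String a, String b) {
        int n = Math.min(a.length(), b.length());
        for (int i = 0; i < n; i++) {
            if (a.charAt(i) != b.charAt(i)) {
                return i;
            }
        }
        // Equal up to the shorter length: either identical or one is a prefix of the other.
        return a.length() == b.length() ? -1 : n;
    }

    public static void main(String[] args) {
        String v187 = "For safe! clean! abundant water?in our homes!";
        String v188 = "For safe! clean! abundant water?n our homes!";
        int at = firstDifference(v187, v188);
        System.out.println("first difference at index " + at
                + ": '" + v187.charAt(at) + "' vs '" + v188.charAt(at) + "'");
    }
}
```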

  was:
The attached file produces this text extraction:

1.8.6:
For safe! clean! abundant water?in our homes!
rivers! lakes! and streams?is one of our

1.8.7:
For safe! clean! abundant water?in our homes!
rivers! lakes! and streams?is one of our

1.8.8:
For safe! clean! abundant water?n our homes!
rivers! lakes! and streams?s one of our

2.0:
For safe! clean! abundant water–in our homes!
rivers! lakes! and streams–is one of our

AR:
For safe! clean! abundant water–in our homes!
rivers! lakes! and streams–is one of our

So the "i" has been lost in the 1.8.8 version.


> Character missing in text extraction
> 
>
> Key: PDFBOX-2449
> URL: https://issues.apache.org/jira/browse/PDFBOX-2449
> Project: PDFBox
>  Issue Type: Bug
>Affects Versions: 1.8.8
>Reporter: Tilman Hausherr
> Attachments: 267739.pdf





[jira] [Commented] (PDFBOX-2439) ArrayIndexOutOfBoundsException in multithreaded system

2014-10-23 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181958#comment-14181958
 ] 

John Hewson commented on PDFBOX-2439:
-

Good point, that's a bug, it should be.
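
For readers following along, a read-through cache that is actually populated on a miss, and that is safe to call from several threads, could look roughly like the sketch below. This is a hypothetical illustration, not the FontBox GlyphTable code; the class name, the loader callback and the load counter are all assumptions.

```java
import java.util.function.IntFunction;

class GlyphCacheSketch<G> {
    private final Object[] cache;          // one slot per glyph ID
    private final IntFunction<G> loader;   // hypothetical: parses one glyph from the font data
    private int loads;                     // how many times the loader actually ran

    GlyphCacheSketch(int numGlyphs, IntFunction<G> loader) {
        this.cache = new Object[numGlyphs];
        this.loader = loader;
    }

    /** Returns the glyph for gid, loading and caching it on first use; null if gid is out of range. */
    @SuppressWarnings("unchecked")
    synchronized G getGlyph(int gid) {
        if (gid < 0 || gid >= cache.length) {
            return null;
        }
        if (cache[gid] == null) {
            cache[gid] = loader.apply(gid); // the population step
            loads++;
        }
        return (G) cache[gid];
    }

    synchronized int loadCount() {
        return loads;
    }
}
```

Synchronizing the whole lookup is the simplest choice: it keeps two threads from reading the shared font data for the same glyph at the same time, at the cost of some contention.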

> ArrayIndexOutOfBoundsException in multithreaded system
> --
>
> Key: PDFBOX-2439
> URL: https://issues.apache.org/jira/browse/PDFBOX-2439
> Project: PDFBox
>  Issue Type: Bug
>  Components: FontBox
>Affects Versions: 2.0.0
>Reporter: simon steiner
>Priority: Critical
> Fix For: 2.0.0
>
> Attachments: TestPDFBox.java
>
>
> When it loads a replacement font from the OS, I sometimes get:
> Exception in thread "Thread-27" java.lang.ArrayIndexOutOfBoundsException: 40036
>   at org.apache.fontbox.ttf.GlyfSimpleDescript.readFlags(GlyfSimpleDescript.java:197)
>   at org.apache.fontbox.ttf.GlyfSimpleDescript.<init>(GlyfSimpleDescript.java:78)
>   at org.apache.fontbox.ttf.GlyphData.initData(GlyphData.java:58)
>   at org.apache.fontbox.ttf.GlyphTable.getGlyph(GlyphTable.java:161)
>   at org.apache.pdfbox.rendering.font.TTFGlyph2D.getPathForGID(TTFGlyph2D.java:140)
>   at org.apache.pdfbox.rendering.font.TTFGlyph2D.getPathForCharacterCode(TTFGlyph2D.java:92)
>   at org.apache.pdfbox.rendering.PageDrawer.drawGlyph2D(PageDrawer.java:392)
>   at org.apache.pdfbox.rendering.PageDrawer.showGlyph(PageDrawer.java:372)
>   at org.apache.pdfbox.contentstream.PDFStreamEngine.showText(PDFStreamEngine.java:411)
>   at org.apache.pdfbox.rendering.PageDrawer.showText(PageDrawer.java:346)
>   at org.apache.pdfbox.contentstream.PDFStreamEngine.showTextStrings(PDFStreamEngine.java:322)
>   at org.apache.pdfbox.contentstream.operator.text.ShowTextAdjusted.process(ShowTextAdjusted.java:38)
>   at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:482)
>   at org.apache.pdfbox.contentstream.PDFStreamEngine.processSubStream(PDFStreamEngine.java:236)
>   at org.apache.pdfbox.contentstream.PDFStreamEngine.processSubStream(PDFStreamEngine.java:201)
>   at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:155)
>   at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:177)
>   at org.apache.pdfbox.rendering.PDFRenderer.renderPage(PDFRenderer.java:228)
>   at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:160)
>   at org.apache.pdfbox.rendering.PDFRenderer.renderImageWithDPI(PDFRenderer.java:109)





[jira] [Assigned] (PDFBOX-2448) ligatures and some glyphs missing

2014-10-23 Thread John Hewson (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Hewson reassigned PDFBOX-2448:
---

Assignee: John Hewson

> ligatures and some glyphs missing
> -
>
> Key: PDFBOX-2448
> URL: https://issues.apache.org/jira/browse/PDFBOX-2448
> Project: PDFBox
>  Issue Type: Bug
>  Components: FontBox
>Affects Versions: 2.0.0
>Reporter: Tilman Hausherr
>Assignee: John Hewson
>  Labels: type1
> Fix For: 2.0.0
>
> Attachments: 115258.pdf, 1152581.jpg
>
>
> Ligatures are missing in the attached file ("filter", "identification"), as are 
> some glyphs (on the first page, below "BAT Observations", a glyph is missing after 
> the italic "T").





[jira] [Updated] (PDFBOX-2448) ligatures and some glyphs missing

2014-10-23 Thread John Hewson (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Hewson updated PDFBOX-2448:

Fix Version/s: 2.0.0

> ligatures and some glyphs missing
> -
>
> Key: PDFBOX-2448
> URL: https://issues.apache.org/jira/browse/PDFBOX-2448
> Project: PDFBox
>  Issue Type: Bug
>  Components: FontBox
>Affects Versions: 2.0.0
>Reporter: Tilman Hausherr
>  Labels: type1
> Fix For: 2.0.0
>
> Attachments: 115258.pdf, 1152581.jpg





[jira] [Updated] (PDFBOX-2449) Character missing in text extraction

2014-10-23 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-2449:

Attachment: 267739.pdf

> Character missing in text extraction
> 
>
> Key: PDFBOX-2449
> URL: https://issues.apache.org/jira/browse/PDFBOX-2449
> Project: PDFBox
>  Issue Type: Bug
>Affects Versions: 1.8.8
>Reporter: Tilman Hausherr
> Attachments: 267739.pdf





[jira] [Created] (PDFBOX-2449) Character missing in text extraction

2014-10-23 Thread Tilman Hausherr (JIRA)
Tilman Hausherr created PDFBOX-2449:
---

 Summary: Character missing in text extraction
 Key: PDFBOX-2449
 URL: https://issues.apache.org/jira/browse/PDFBOX-2449
 Project: PDFBox
  Issue Type: Bug
Affects Versions: 1.8.8
Reporter: Tilman Hausherr


The attached file produces this text extraction:

1.8.6:
For safe! clean! abundant water?in our homes!
rivers! lakes! and streams?is one of our

1.8.7:
For safe! clean! abundant water?in our homes!
rivers! lakes! and streams?is one of our

1.8.8:
For safe! clean! abundant water?n our homes!
rivers! lakes! and streams?s one of our

2.0:
For safe! clean! abundant water–in our homes!
rivers! lakes! and streams–is one of our

AR:
For safe! clean! abundant water–in our homes!
rivers! lakes! and streams–is one of our

So the "i" has been lost in the 1.8.8 version.





[jira] [Commented] (PDFBOX-2439) ArrayIndexOutOfBoundsException in multithreaded system

2014-10-23 Thread simon steiner (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181938#comment-14181938
 ] 

simon steiner commented on PDFBOX-2439:
---

I couldn't see a place where GlyphTable.cache is populated.

> ArrayIndexOutOfBoundsException in multithreaded system
> --
>
> Key: PDFBOX-2439
> URL: https://issues.apache.org/jira/browse/PDFBOX-2439
> Project: PDFBox
>  Issue Type: Bug
>  Components: FontBox
>Affects Versions: 2.0.0
>Reporter: simon steiner
>Priority: Critical
> Fix For: 2.0.0
>
> Attachments: TestPDFBox.java





[jira] [Updated] (PDFBOX-2448) ligatures and some glyphs missing

2014-10-23 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-2448:

Attachment: 1152581.jpg
115258.pdf

> ligatures and some glyphs missing
> -
>
> Key: PDFBOX-2448
> URL: https://issues.apache.org/jira/browse/PDFBOX-2448
> Project: PDFBox
>  Issue Type: Bug
>  Components: FontBox
>Affects Versions: 2.0.0
>Reporter: Tilman Hausherr
>  Labels: type1
> Attachments: 115258.pdf, 1152581.jpg





[jira] [Created] (PDFBOX-2448) ligatures and some glyphs missing

2014-10-23 Thread Tilman Hausherr (JIRA)
Tilman Hausherr created PDFBOX-2448:
---

 Summary: ligatures and some glyphs missing
 Key: PDFBOX-2448
 URL: https://issues.apache.org/jira/browse/PDFBOX-2448
 Project: PDFBox
  Issue Type: Bug
  Components: FontBox
Affects Versions: 2.0.0
Reporter: Tilman Hausherr


Ligatures are missing in the attached file ("filter", "identification"), as are 
some glyphs (on the first page, below "BAT Observations", a glyph is missing after 
the italic "T").





[jira] [Commented] (PDFBOX-2439) ArrayIndexOutOfBoundsException in multithreaded system

2014-10-23 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181917#comment-14181917
 ] 

John Hewson commented on PDFBOX-2439:
-

Thanks, but please attach patches as files, not comments; they cannot be copied 
and pasted from comments correctly due to JIRA markup. I can't apply the patch from 
your comment.

{quote}Is GlyphTable.cache used?{quote}
Yes, by GlyphTable#getGlyph

{quote}
I also get
Exception LCMS error 13: Couldn't link the profiles
{quote}

Please open a new issue for that, it's related to ICC Profiles.

> ArrayIndexOutOfBoundsException in multithreaded system
> --
>
> Key: PDFBOX-2439
> URL: https://issues.apache.org/jira/browse/PDFBOX-2439
> Project: PDFBox
>  Issue Type: Bug
>  Components: FontBox
>Affects Versions: 2.0.0
>Reporter: simon steiner
>Priority: Critical
> Fix For: 2.0.0
>
> Attachments: TestPDFBox.java





[jira] [Updated] (PDFBOX-2409) got the wrong result from Arabic text extraction

2014-10-23 Thread John Hewson (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Hewson updated PDFBOX-2409:

Attachment: TextEdit-Arial.png

I'm able to get the same rendering if I use TextEdit on my Mac with Arial as 
the font; I've attached a screenshot. Looking at the text, I think that the 
diacritics are being placed on the wrong character, one character too late 
(i.e. to the left). What do you think? This could well be a PDFBox text 
encoding issue with combining diacritics.
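
To show why an off-by-one in mark placement changes the result, here is a minimal sketch using java.text.Normalizer. It relies on the Unicode rule that a combining mark attaches to the base character before it. The Latin sample strings are an assumption chosen for readability; they are not taken from the attached Arabic PDF.

```java
import java.text.Normalizer;

class CombiningMarkDemo {

    /** Composes base characters with the combining marks that follow them (Unicode NFC). */
    static String compose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String correct = "e\u0301x"; // e, COMBINING ACUTE ACCENT, x
        String shifted = "ex\u0301"; // same mark emitted one position too late
        System.out.println(compose(correct).length());
        System.out.println(compose(shifted).length());
    }
}
```

In the first string the accent belongs to the "e"; in the second it lands on the "x", so the same characters in a slightly different order produce visibly different text.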

> got the wrong result from Arabic text extraction
> 
>
> Key: PDFBOX-2409
> URL: https://issues.apache.org/jira/browse/PDFBOX-2409
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 1.8.7, 2.0.0
> Environment: Ubuntu 14.04 64bit
> java version "1.8.0_20"
>Reporter: EugenePig
>Assignee: John Hewson
> Attachments: THESSALONIANS.pdf, THESSALONIANS.txt, 
> THESSALONIANS_win7_firefox.jpg, TextEdit-Arial.png, jahewson.mac.png
>
>
> java -jar pdfbox-app-1.8.7.jar ExtractText -sort -encoding UTF-8 
> THESSALONIANS.pdf
> java -jar pdfbox-app-2.0.0-SNAPSHOT.jar ExtractText -sort -encoding UTF-8 
> THESSALONIANS.pdf
> Please compare THESSALONIANS.txt.jpg with THESSALONIANS.pdf. There are a lot 
> of differences. I just marked a few differences with red circles.





[jira] [Comment Edited] (PDFBOX-2447) "Cannot save a document which has been closed" when encrypting

2014-10-23 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181763#comment-14181763
 ] 

John Hewson edited comment on PDFBOX-2447 at 10/23/14 6:50 PM:
---

Some feedback: please don't use backwards conditionals in patches, e.g. in the 
PDFBox source we don't write:

{{if (null == getDocumentCatalog())}}

but always:

{{if (getDocumentCatalog() == null)}}

The same applies to {{null != document}}.
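
For illustration, the two forms side by side in a small sketch. The class and method names are hypothetical; both methods behave identically, so the difference is purely one of house style.

```java
class NullCheckStyle {

    /** Discouraged in the feedback above: the constant comes first ("backwards" conditional). */
    static boolean isMissingBackwards(Object catalog) {
        return null == catalog;
    }

    /** Preferred in the feedback above: the value under test comes first. */
    static boolean isMissing(Object catalog) {
        return catalog == null;
    }
}
```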


was (Author: jahewson):
Some feedback: please don't use backwards conditionals in patches, e.g. in the 
PDFBox source we don't write:

{{if (null == getDocumentCatalog())}}

but always:

{{if (getDocumentCatalog() == null)}}

> "Cannot save a document which has been closed" when encrypting
> --
>
> Key: PDFBOX-2447
> URL: https://issues.apache.org/jira/browse/PDFBOX-2447
> Project: PDFBox
>  Issue Type: Bug
>  Components: PDModel
>Affects Versions: 2.0.0
> Environment: java7 deb7
>Reporter: Ralf Hauser
> Attachments: patch2447.txt, patch2447a.txt





[jira] [Comment Edited] (PDFBOX-2447) "Cannot save a document which has been closed" when encrypting

2014-10-23 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181763#comment-14181763
 ] 

John Hewson edited comment on PDFBOX-2447 at 10/23/14 6:49 PM:
---

Some feedback: please don't use backwards conditionals in patches, e.g. in the 
PDFBox source we don't write:

{{if (null == getDocumentCatalog())}}

but always:

{{if (getDocumentCatalog() == null)}}


was (Author: jahewson):
Some feedback: Please don't use backwards conditionals in patches, e.g. in the 
PDFBox source we don't write:

{{if (null == getDocumentCatalog())}}

but always:

{{if (getDocumentCatalog() == null)}}

> "Cannot save a document which has been closed" when encrypting
> --
>
> Key: PDFBOX-2447
> URL: https://issues.apache.org/jira/browse/PDFBOX-2447
> Project: PDFBox
>  Issue Type: Bug
>  Components: PDModel
>Affects Versions: 2.0.0
> Environment: java7 deb7
>Reporter: Ralf Hauser
> Attachments: patch2447.txt, patch2447a.txt





[jira] [Commented] (PDFBOX-2447) "Cannot save a document which has been closed" when encrypting

2014-10-23 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181763#comment-14181763
 ] 

John Hewson commented on PDFBOX-2447:
-

Some feedback: Please don't use backwards conditionals in patches, e.g. in the 
PDFBox source we don't write:

{{if (null == getDocumentCatalog())}}

but always:

{{if (getDocumentCatalog() == null)}}

> "Cannot save a document which has been closed" when encrypting
> --
>
> Key: PDFBOX-2447
> URL: https://issues.apache.org/jira/browse/PDFBOX-2447
> Project: PDFBox
>  Issue Type: Bug
>  Components: PDModel
>Affects Versions: 2.0.0
> Environment: java7 deb7
>Reporter: Ralf Hauser
> Attachments: patch2447.txt, patch2447a.txt
>
>
> InputStream content = ...;
> int keyLength = 256;
> AccessPermission ap = new AccessPermission();
> StandardProtectionPolicy spp = new 
> StandardProtectionPolicy(
> symmPw, symmPw, ap);
> spp.setEncryptionKeyLength(keyLength);
> document.protect(spp);
> ByteArrayOutputStream baos = new ByteArrayOutputStream();
> document.save(baos);
> in the save() the above mentioned exception is thrown (wasn't with the 
> 2013-11 snapshot)





[jira] [Updated] (PDFBOX-1372) NullPointerException with loadDescriptorDictionary

2014-10-23 Thread John Hewson (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Hewson updated PDFBOX-1372:

Fix Version/s: 2.0.0

> NullPointerException with loadDescriptorDictionary
> --
>
> Key: PDFBOX-1372
> URL: https://issues.apache.org/jira/browse/PDFBOX-1372
> Project: PDFBox
>  Issue Type: Bug
>  Components: PDModel
>Affects Versions: 1.6.0, 2.0.0
> Environment: Windows 7, WebLogic 10.3.2, jdk160_14_R27.6.5-32.
>Reporter: wentao
>Assignee: John Hewson
> Fix For: 2.0.0
>
> Attachments: CODE128.TTF, FRE3OF9X.TTF, index128.jsp
>
>
> I downloaded a ttf from http://www.jtbarton.com/Barcodes/code128.ttf and 
> tried to use this with pdfbox 1.6.0 in my jsp.
> it returns below error
> java.lang.NullPointerException
>   at 
> org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.loadDescriptorDictionary(PDTrueTypeFont.java:339)
>   at 
> org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.loadTTF(PDTrueTypeFont.java:164)
>   at 
> org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.loadTTF(PDTrueTypeFont.java:140)
>   at 
> org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.loadTTF(PDTrueTypeFont.java:127)





[jira] [Assigned] (PDFBOX-1372) NullPointerException with loadDescriptorDictionary

2014-10-23 Thread John Hewson (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Hewson reassigned PDFBOX-1372:
---

Assignee: John Hewson

> NullPointerException with loadDescriptorDictionary
> --
>
> Key: PDFBOX-1372
> URL: https://issues.apache.org/jira/browse/PDFBOX-1372
> Project: PDFBox
>  Issue Type: Bug
>  Components: PDModel
>Affects Versions: 1.6.0, 2.0.0
> Environment: Windows 7, WebLogic 10.3.2, jdk160_14_R27.6.5-32.
>Reporter: wentao
>Assignee: John Hewson
> Fix For: 2.0.0
>
> Attachments: CODE128.TTF, FRE3OF9X.TTF, index128.jsp
>
>
> I downloaded a ttf from http://www.jtbarton.com/Barcodes/code128.ttf and 
> tried to use this with pdfbox 1.6.0 in my jsp.
> it returns below error
> java.lang.NullPointerException
>   at 
> org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.loadDescriptorDictionary(PDTrueTypeFont.java:339)
>   at 
> org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.loadTTF(PDTrueTypeFont.java:164)
>   at 
> org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.loadTTF(PDTrueTypeFont.java:140)
>   at 
> org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.loadTTF(PDTrueTypeFont.java:127)





Re: 2.0

2014-10-23 Thread John Hewson
That’s very good news!

-- John

> On 23 Oct 2014, at 11:40, Tilman Hausherr wrote:
> 
> This is now obsolete, thanks to Andreas having resolved PDFBOX-2250.
> 
> Tilman
> 
> On 23.10.2014 at 09:33, John Hewson wrote:
>> Do we have a JIRA issue for these, or shall I create one?
>> 
>> -- John
>> 
>> On 14 Oct 2014, at 09:18, Tilman Hausherr wrote:
>> 
>>> Here are some:
>>> 
>>> 055/055794.pdf
>>> 082/082463.pdf
>>> 108/108362.pdf
>>> 113/113223.pdf
>>> 115/115458.pdf
>>> 115/115463.pdf
>>> 122/122393.pdf
>>> 129/129416.pdf
>>> 133/133423.pdf
>>> 148/148020.pdf
>>> 152/152012.pdf
>>> 161/161466.pdf
>>> 
>>> to be found here:
>>> http://digitalcorpora.org/corp/nps/files/govdocs1/zipfiles/
>>> 
>>> Tilman
>>> 
>>> On 14.10.2014 at 21:06, John Hewson wrote:
>>>> Unless somebody provides us with a list of those files, then I think this
>>>> is an unreasonable request. As long as we continue to leave the old parser
>>>> in PDFBox, we won’t get the bug reports which we need to fix the new
>>>> parser, and the situation will never resolve itself. Falling back to the
>>>> old parser is just as bad - we won’t get bug reports.
>>>> 
>>>> -- John
>>>> 
>>>> On 14 Oct 2014, at 07:39, Tilman Hausherr wrote:
>>>> 
>>>>> I prefer that the "old" parser not be removed, because there are many
>>>>> files that can only be parsed by the old parser. This came out in a
>>>>> large-scale test with TIKA.
>>>>> 
>>>>> The best idea (in my current opinion) is to use the nonSeq parser first,
>>>>> and the old parser if there is an exception.
>>>>> 
>>>>> Tilman
>>>>> 
>>>>> On 14.10.2014 at 09:45, Timo Boehme wrote:
>>>>>> Hi,
>>>>>> 
>>>>>> On 14.10.2014 at 07:22, John Hewson wrote:
>>>>>>> Hi,
>>>>>>>>> John Hewson <j...@jahewson.com> wrote on 10 October 2014 at 20:05:
>>>>>>>>> 
>>>>>>>>> - Parsing (Andreas?)
>>>>>>>> I guess we won't get a complete new parser in 2.0, but I try to
>>>>>>>> improve the XRef and the COSStream stuff
>>>>>>> It would be great if we could get rid of the old parser and switch to
>>>>>>> the non-sequential parser, WDYT?
>>>>>> I would also propose to completely remove the old parser. That way we
>>>>>> are more flexible in parsing streams etc. since parts of the
>>>>>> non-sequential parser are a compromise to work side-by-side with the old
>>>>>> parser.
>>>>>> Possibly there are a small number of functions for which the old parser
>>>>>> is still needed - e.g. signing?
>>>>>> 
>>>>>> Best,
>>>>>> Timo



Re: 2.0

2014-10-23 Thread Tilman Hausherr

This is now obsolete, thanks to Andreas having resolved PDFBOX-2250.

Tilman

On 23.10.2014 at 09:33, John Hewson wrote:

> Do we have a JIRA issue for these, or shall I create one?
> 
> -- John
> 
> On 14 Oct 2014, at 09:18, Tilman Hausherr <thaush...@t-online.de> wrote:
> 
>> Here are some:
>> 
>> 055/055794.pdf
>> 082/082463.pdf
>> 108/108362.pdf
>> 113/113223.pdf
>> 115/115458.pdf
>> 115/115463.pdf
>> 122/122393.pdf
>> 129/129416.pdf
>> 133/133423.pdf
>> 148/148020.pdf
>> 152/152012.pdf
>> 161/161466.pdf
>> 
>> to be found here:
>> http://digitalcorpora.org/corp/nps/files/govdocs1/zipfiles/
>> 
>> Tilman
>> 
>> On 14.10.2014 at 21:06, John Hewson wrote:
>> 
>>> Unless somebody provides us with a list of those files, then I think this is an
>>> unreasonable request. As long as we continue to leave the old parser in PDFBox,
>>> we won’t get the bug reports which we need to fix the new parser, and the
>>> situation will never resolve itself. Falling back to the old parser is just as
>>> bad - we won’t get bug reports.
>>> 
>>> -- John
>>> 
>>> On 14 Oct 2014, at 07:39, Tilman Hausherr <thaush...@t-online.de> wrote:
>>> 
>>>> I prefer that the "old" parser not be removed, because there are many files
>>>> that can only be parsed by the old parser. This came out in a large-scale
>>>> test with TIKA.
>>>> 
>>>> The best idea (in my current opinion) is to use the nonSeq parser first, and
>>>> the old parser if there is an exception.
>>>> 
>>>> Tilman
>>>> 
>>>> On 14.10.2014 at 09:45, Timo Boehme wrote:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> On 14.10.2014 at 07:22, John Hewson wrote:
>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>>>> John Hewson <j...@jahewson.com> wrote on 10 October 2014 at 20:05:
>>>>>>>> 
>>>>>>>> - Parsing (Andreas?)
>>>>>>> 
>>>>>>> I guess we won't get a complete new parser in 2.0, but I try to improve the XRef
>>>>>>> and the COSStream stuff
>>>>>> 
>>>>>> It would be great if we could get rid of the old parser and switch to the
>>>>>> non-sequential parser, WDYT?
>>>>> 
>>>>> I would also propose to completely remove the old parser. That way we are more
>>>>> flexible in parsing streams etc. since parts of the non-sequential parser are a
>>>>> compromise to work side-by-side with the old parser.
>>>>> Possibly there are a small number of functions for which the old parser is
>>>>> still needed - e.g. signing?
>>>>> 
>>>>> Best,
>>>>> Timo
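
The fallback idea raised in this thread (try the non-sequential parser first, and use the old parser only if the first one throws) could be sketched roughly as follows. This is only an illustration under stated assumptions: DocumentParser and FallbackParser are hypothetical names, not PDFBox classes. The sketch attaches the first failure to the second one, which touches on the concern voiced above that a silent fallback would hide bugs in the new parser.

```java
import java.io.IOException;

// Hypothetical parser abstraction; PDFBox's real parser classes are not used here.
interface DocumentParser<T>
{
    T parse(byte[] input) throws IOException;
}

// Tries the primary parser first and falls back to the secondary one on IOException.
final class FallbackParser<T> implements DocumentParser<T>
{
    private final DocumentParser<T> primary;
    private final DocumentParser<T> fallback;

    FallbackParser(DocumentParser<T> primary, DocumentParser<T> fallback)
    {
        this.primary = primary;
        this.fallback = fallback;
    }

    @Override
    public T parse(byte[] input) throws IOException
    {
        try
        {
            return primary.parse(input);
        }
        catch (IOException primaryFailure)
        {
            try
            {
                return fallback.parse(input);
            }
            catch (IOException fallbackFailure)
            {
                // Keep the primary failure attached so it can still be reported.
                fallbackFailure.addSuppressed(primaryFailure);
                throw fallbackFailure;
            }
        }
    }
}
```

When the fallback succeeds, the primary failure is dropped in this sketch; a caller who wants bug reports for the new parser would need to log it at that point.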








Re: 2.0

2014-10-23 Thread John Hewson
Do we have a JIRA issue for these, or shall I create one?

-- John

On 14 Oct 2014, at 09:18, Tilman Hausherr <thaush...@t-online.de> wrote:

> Here are some:
> 
> 055/055794.pdf
> 082/082463.pdf
> 108/108362.pdf
> 113/113223.pdf
> 115/115458.pdf
> 115/115463.pdf
> 122/122393.pdf
> 129/129416.pdf
> 133/133423.pdf
> 148/148020.pdf
> 152/152012.pdf
> 161/161466.pdf
> 
> to be found here:
> http://digitalcorpora.org/corp/nps/files/govdocs1/zipfiles/
> 
> Tilman
> 
> On 14.10.2014 at 21:06, John Hewson wrote:
>> Unless somebody provides us with a list of those files, then I think this is
>> an unreasonable request. As long as we continue to leave the old parser in
>> PDFBox, we won’t get the bug reports which we need to fix the new parser,
>> and the situation will never resolve itself. Falling back to the old parser
>> is just as bad - we won’t get bug reports.
>> 
>> -- John
>> 
>> On 14 Oct 2014, at 07:39, Tilman Hausherr wrote:
>> 
>>> I prefer that the "old" parser not be removed, because there are many files
>>> that can only be parsed by the old parser. This came out in a large-scale
>>> test with TIKA.
>>> 
>>> The best idea (in my current opinion) is to use the nonSeq parser first,
>>> and the old parser if there is an exception.
>>> 
>>> Tilman
>>> 
>>> On 14.10.2014 at 09:45, Timo Boehme wrote:
>>>> Hi,
>>>> 
>>>> On 14.10.2014 at 07:22, John Hewson wrote:
>>>>> Hi,
>>>>>>> John Hewson <j...@jahewson.com> wrote on 10 October 2014 at 20:05:
>>>>>>> 
>>>>>>> - Parsing (Andreas?)
>>>>>> I guess we won't get a complete new parser in 2.0, but I try to improve
>>>>>> the XRef and the COSStream stuff
>>>>> It would be great if we could get rid of the old parser and switch to the
>>>>> non-sequential parser, WDYT?
>>>> I would also propose to completely remove the old parser. That way we are
>>>> more flexible in parsing streams etc. since parts of the non-sequential
>>>> parser are a compromise to work side-by-side with the old parser.
>>>> Possibly there are a small number of functions for which the old parser is
>>>> still needed - e.g. signing?
>>>> 
>>>> Best,
>>>> Timo



[jira] [Closed] (PDFBOX-1038) Strange signs after pdftohtml parsing.

2014-10-23 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler closed PDFBOX-1038.
--
   Resolution: Fixed
Fix Version/s: 1.6.0
 Assignee: Andreas Lehmkühler

Works fine at least since 1.6.0, except for a small part of the text which 
can't be extracted due to a missing mapping. Acrobat provides a similar result.

> Strange signs after pdftohtml parsing.
> --
>
> Key: PDFBOX-1038
> URL: https://issues.apache.org/jira/browse/PDFBOX-1038
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 1.5.0
> Environment: windows vista
>Reporter: Funfel
>Assignee: Andreas Lehmkühler
> Fix For: 1.6.0
>
> Attachments: pg0007.html, pg0007.pdf
>
>
> After parsing a PDF to HTML I get strange signs where there are supposed to be 
> ordinary letters (not Chinese or Japanese). I've noticed that the font 
> description for them is UniversPro-Roman-Identity-H. 
> How can I get it generated properly?





[jira] [Closed] (PDFBOX-1066) There is no functionlaity of reading the text line by line with its input field

2014-10-23 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler closed PDFBOX-1066.
--
Resolution: Not a Problem
  Assignee: Andreas Lehmkühler

PDFs aren't organized in lines. So, if you want to read a pdf line by line you 
have to extract the whole text first. It should be easy to process that result 
line by line without PDFBox.
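
As a minimal sketch of the second step described above, assuming the whole text has already been extracted into a String (for example the value returned by the text stripper's getText()), the result can be split into lines with the standard library alone. LineSplitter and toLines are hypothetical names used only for this illustration.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.List;
import java.util.stream.Collectors;

final class LineSplitter
{
    // Splits already-extracted text into lines; handles \n, \r and \r\n terminators.
    static List<String> toLines(String extractedText)
    {
        return new BufferedReader(new StringReader(extractedText))
                .lines()
                .collect(Collectors.toList());
    }
}
```

Each returned line can then be processed on its own. Matching lines to form fields is a separate problem that this sketch does not attempt.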


> There is no functionlaity of reading the text line by line with its input 
> field
> ---
>
> Key: PDFBOX-1066
> URL: https://issues.apache.org/jira/browse/PDFBOX-1066
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 0.7.3
> Environment: Windows
>Reporter: Nishant
>Assignee: Andreas Lehmkühler
>  Labels: patch
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> I am trying to read the PDF text along with its input types, such as 
> text fields and checkboxes. What I found is that TextStripper parses the whole 
> document and returns the string in getText(). And using Acroform.getfields I 
> am able to get all fields. 
> But I have a particular requirement of reading the text together with its 
> input type. Do we have any class/method which can resolve this issue? 
> It's very urgent.





[jira] [Commented] (PDFBOX-1372) NullPointerException with loadDescriptorDictionary

2014-10-23 Thread JIRA

[ 
https://issues.apache.org/jira/browse/PDFBOX-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181645#comment-14181645
 ] 

Andreas Lehmkühler commented on PDFBOX-1372:


CODE128.TTF works fine using the current trunk. 

FRE3OF9X.TTF produces the following exception:

{code}
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 85
at 
org.apache.pdfbox.pdmodel.font.PDTrueTypeFontEmbedder.createFontDescriptor(PDTrueTypeFontEmbedder.java:304)
at 
org.apache.pdfbox.pdmodel.font.PDTrueTypeFontEmbedder.<init>(PDTrueTypeFontEmbedder.java:89)
at 
org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:178)
at 
org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.loadTTF(PDTrueTypeFont.java:78)
at 
org.apache.pdfbox.examples.pdmodel.HelloWorldTTF.doIt(HelloWorldTTF.java:59)
at 
org.apache.pdfbox.examples.pdmodel.HelloWorldTTF.main(HelloWorldTTF.java:99)
{code}

> NullPointerException with loadDescriptorDictionary
> --
>
> Key: PDFBOX-1372
> URL: https://issues.apache.org/jira/browse/PDFBOX-1372
> Project: PDFBox
>  Issue Type: Bug
>  Components: PDModel
>Affects Versions: 1.6.0, 2.0.0
> Environment: Windows 7, WebLogic 10.3.2, jdk160_14_R27.6.5-32.
>Reporter: wentao
> Attachments: CODE128.TTF, FRE3OF9X.TTF, index128.jsp
>
>
> I downloaded a ttf from http://www.jtbarton.com/Barcodes/code128.ttf and 
> tried to use this with pdfbox 1.6.0 in my jsp.
> it returns below error
> java.lang.NullPointerException
>   at 
> org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.loadDescriptorDictionary(PDTrueTypeFont.java:339)
>   at 
> org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.loadTTF(PDTrueTypeFont.java:164)
>   at 
> org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.loadTTF(PDTrueTypeFont.java:140)
>   at 
> org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.loadTTF(PDTrueTypeFont.java:127)





[jira] [Updated] (PDFBOX-1372) NullPointerException with loadDescriptorDictionary

2014-10-23 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler updated PDFBOX-1372:
---
Affects Version/s: 2.0.0

> NullPointerException with loadDescriptorDictionary
> --
>
> Key: PDFBOX-1372
> URL: https://issues.apache.org/jira/browse/PDFBOX-1372
> Project: PDFBox
>  Issue Type: Bug
>  Components: PDModel
>Affects Versions: 1.6.0, 2.0.0
> Environment: Windows 7, WebLogic 10.3.2, jdk160_14_R27.6.5-32.
>Reporter: wentao
> Attachments: CODE128.TTF, FRE3OF9X.TTF, index128.jsp
>
>
> I downloaded a ttf from http://www.jtbarton.com/Barcodes/code128.ttf and 
> tried to use this with pdfbox 1.6.0 in my jsp.
> it returns below error
> java.lang.NullPointerException
>   at 
> org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.loadDescriptorDictionary(PDTrueTypeFont.java:339)
>   at 
> org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.loadTTF(PDTrueTypeFont.java:164)
>   at 
> org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.loadTTF(PDTrueTypeFont.java:140)
>   at 
> org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.loadTTF(PDTrueTypeFont.java:127)





[jira] [Resolved] (PDFBOX-1273) java.io.IOException: Error: Unknown annotation type null

2014-10-23 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler resolved PDFBOX-1273.

   Resolution: Fixed
Fix Version/s: 2.0.0
   1.8.8
 Assignee: Andreas Lehmkühler

I've fixed the issue as proposed.

Thanks for the contribution and sorry for the delay!

> java.io.IOException: Error: Unknown annotation type null
> 
>
> Key: PDFBOX-1273
> URL: https://issues.apache.org/jira/browse/PDFBOX-1273
> Project: PDFBox
>  Issue Type: Bug
>  Components: PDModel
>Affects Versions: 1.7.0, 1.8.7, 2.0.0
>Reporter: William
>Assignee: Andreas Lehmkühler
>Priority: Minor
> Fix For: 1.8.8, 2.0.0
>
> Attachments: PDPageQuickFix.patch
>
>
> Hi,
> I've come across the following exception on a very small number of documents:
> org.apache.tika.exception.TikaException: Unable to extract PDF content
> at org.apache.pdfbox.tika.PDF2XHTML.process(PDF2XHTML.java:80) 
> ~[extractor.jar:na]
> at org.apache.pdfbox.tika.PDFParser.parse(PDFParser.java:116) 
> ~[extractor.jar:na]
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) 
> ~[extractor.jar:na]
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) 
> ~[extractor.jar:na]
> at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) 
> ~[extractor.jar:na]
> Caused by: java.io.IOException: Error: Unknown annotation type null
> at 
> org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation.createAnnotation(PDAnnotation.java:165)
>  ~[extractor.jar:na]
> at org.apache.pdfbox.pdmodel.PDPage.getAnnotations(PDPage.java:785) 
> ~[extractor.jar:na]
> at org.apache.pdfbox.tika.PDF2XHTML.endPage(PDF2XHTML.java:142) 
> ~[extractor.jar:na]
> at 
> org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:450) 
> ~[extractor.jar:na]
> at 
> org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:372) 
> ~[extractor.jar:na]
> at 
> org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:328) 
> ~[extractor.jar:na]
> at org.apache.pdfbox.tika.PDF2XHTML.process(PDF2XHTML.java:63) 
> ~[extractor.jar:na]
> Here are a few examples:
> http://www.jdsupra.com/documents/01ece854-a961-4184-8de7-f6d5311d6a48.pdf
> http://www.jdsupra.com/documents/0aabecb4-094a-40e4-a507-8b49ecb90a3e.pdf
> http://www.jdsupra.com/documents/0d74ccf8-2d57-487d-88c2-98eee26f8236.pdf
> Thanks





[jira] [Updated] (PDFBOX-1273) java.io.IOException: Error: Unknown annotation type null

2014-10-23 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler updated PDFBOX-1273:
---
Affects Version/s: 2.0.0
   1.8.7

> java.io.IOException: Error: Unknown annotation type null
> 
>
> Key: PDFBOX-1273
> URL: https://issues.apache.org/jira/browse/PDFBOX-1273
> Project: PDFBox
>  Issue Type: Bug
>  Components: PDModel
>Affects Versions: 1.7.0, 1.8.7, 2.0.0
>Reporter: William
>Priority: Minor
> Attachments: PDPageQuickFix.patch
>
>
> Hi,
> I've come across the following exception on a very small number of documents:
> org.apache.tika.exception.TikaException: Unable to extract PDF content
> at org.apache.pdfbox.tika.PDF2XHTML.process(PDF2XHTML.java:80) 
> ~[extractor.jar:na]
> at org.apache.pdfbox.tika.PDFParser.parse(PDFParser.java:116) 
> ~[extractor.jar:na]
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) 
> ~[extractor.jar:na]
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) 
> ~[extractor.jar:na]
> at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) 
> ~[extractor.jar:na]
> Caused by: java.io.IOException: Error: Unknown annotation type null
> at 
> org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation.createAnnotation(PDAnnotation.java:165)
>  ~[extractor.jar:na]
> at org.apache.pdfbox.pdmodel.PDPage.getAnnotations(PDPage.java:785) 
> ~[extractor.jar:na]
> at org.apache.pdfbox.tika.PDF2XHTML.endPage(PDF2XHTML.java:142) 
> ~[extractor.jar:na]
> at 
> org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:450) 
> ~[extractor.jar:na]
> at 
> org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:372) 
> ~[extractor.jar:na]
> at 
> org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:328) 
> ~[extractor.jar:na]
> at org.apache.pdfbox.tika.PDF2XHTML.process(PDF2XHTML.java:63) 
> ~[extractor.jar:na]
> Here are a few examples:
> http://www.jdsupra.com/documents/01ece854-a961-4184-8de7-f6d5311d6a48.pdf
> http://www.jdsupra.com/documents/0aabecb4-094a-40e4-a507-8b49ecb90a3e.pdf
> http://www.jdsupra.com/documents/0d74ccf8-2d57-487d-88c2-98eee26f8236.pdf
> Thanks





[jira] [Commented] (PDFBOX-1273) java.io.IOException: Error: Unknown annotation type null

2014-10-23 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181624#comment-14181624
 ] 

ASF subversion and git services commented on PDFBOX-1273:
-

Commit 1633900 from [~lehmi] in branch 'pdfbox/branches/1.8'
[ https://svn.apache.org/r1633900 ]

PDFBOX-1273: skip null references within an annotation array to avoid 
IOException as proposed by William
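
The commit message describes skipping null references while walking the annotation array. A rough, library-independent sketch of that pattern is shown below; it is not the committed PDFBox change, and NullSkippingExample and withoutNulls are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class NullSkippingExample
{
    // Returns the non-null entries in order instead of failing on the first null.
    static <T> List<T> withoutNulls(List<T> entries)
    {
        List<T> result = new ArrayList<>();
        for (T entry : entries)
        {
            if (entry == null)
            {
                continue; // skip the null reference rather than throwing
            }
            result.add(entry);
        }
        return result;
    }
}
```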

> java.io.IOException: Error: Unknown annotation type null
> 
>
> Key: PDFBOX-1273
> URL: https://issues.apache.org/jira/browse/PDFBOX-1273
> Project: PDFBox
>  Issue Type: Bug
>  Components: PDModel
>Affects Versions: 1.7.0, 1.8.7, 2.0.0
>Reporter: William
>Priority: Minor
> Attachments: PDPageQuickFix.patch
>
>
> Hi,
> I've come across the following exception on a very small number of documents:
> org.apache.tika.exception.TikaException: Unable to extract PDF content
> at org.apache.pdfbox.tika.PDF2XHTML.process(PDF2XHTML.java:80) 
> ~[extractor.jar:na]
> at org.apache.pdfbox.tika.PDFParser.parse(PDFParser.java:116) 
> ~[extractor.jar:na]
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) 
> ~[extractor.jar:na]
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) 
> ~[extractor.jar:na]
> at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) 
> ~[extractor.jar:na]
> Caused by: java.io.IOException: Error: Unknown annotation type null
> at 
> org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation.createAnnotation(PDAnnotation.java:165)
>  ~[extractor.jar:na]
> at org.apache.pdfbox.pdmodel.PDPage.getAnnotations(PDPage.java:785) 
> ~[extractor.jar:na]
> at org.apache.pdfbox.tika.PDF2XHTML.endPage(PDF2XHTML.java:142) 
> ~[extractor.jar:na]
> at 
> org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:450) 
> ~[extractor.jar:na]
> at 
> org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:372) 
> ~[extractor.jar:na]
> at 
> org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:328) 
> ~[extractor.jar:na]
> at org.apache.pdfbox.tika.PDF2XHTML.process(PDF2XHTML.java:63) 
> ~[extractor.jar:na]
> Here are a few examples:
> http://www.jdsupra.com/documents/01ece854-a961-4184-8de7-f6d5311d6a48.pdf
> http://www.jdsupra.com/documents/0aabecb4-094a-40e4-a507-8b49ecb90a3e.pdf
> http://www.jdsupra.com/documents/0d74ccf8-2d57-487d-88c2-98eee26f8236.pdf
> Thanks





[jira] [Commented] (PDFBOX-1273) java.io.IOException: Error: Unknown annotation type null

2014-10-23 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181620#comment-14181620
 ] 

ASF subversion and git services commented on PDFBOX-1273:
-

Commit 1633897 from [~lehmi] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1633897 ]

PDFBOX-1273: skip null references within an annotation array to avoid 
IOException as proposed by William

> java.io.IOException: Error: Unknown annotation type null
> 
>
> Key: PDFBOX-1273
> URL: https://issues.apache.org/jira/browse/PDFBOX-1273
> Project: PDFBox
>  Issue Type: Bug
>  Components: PDModel
>Affects Versions: 1.7.0
>Reporter: William
>Priority: Minor
> Attachments: PDPageQuickFix.patch
>
>
> Hi,
> I've come across the following exception on a very small number of documents:
> org.apache.tika.exception.TikaException: Unable to extract PDF content
> at org.apache.pdfbox.tika.PDF2XHTML.process(PDF2XHTML.java:80) 
> ~[extractor.jar:na]
> at org.apache.pdfbox.tika.PDFParser.parse(PDFParser.java:116) 
> ~[extractor.jar:na]
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) 
> ~[extractor.jar:na]
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) 
> ~[extractor.jar:na]
> at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) 
> ~[extractor.jar:na]
> Caused by: java.io.IOException: Error: Unknown annotation type null
> at 
> org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation.createAnnotation(PDAnnotation.java:165)
>  ~[extractor.jar:na]
> at org.apache.pdfbox.pdmodel.PDPage.getAnnotations(PDPage.java:785) 
> ~[extractor.jar:na]
> at org.apache.pdfbox.tika.PDF2XHTML.endPage(PDF2XHTML.java:142) 
> ~[extractor.jar:na]
> at 
> org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:450) 
> ~[extractor.jar:na]
> at 
> org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:372) 
> ~[extractor.jar:na]
> at 
> org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:328) 
> ~[extractor.jar:na]
> at org.apache.pdfbox.tika.PDF2XHTML.process(PDF2XHTML.java:63) 
> ~[extractor.jar:na]
> Here are a few examples:
> http://www.jdsupra.com/documents/01ece854-a961-4184-8de7-f6d5311d6a48.pdf
> http://www.jdsupra.com/documents/0aabecb4-094a-40e4-a507-8b49ecb90a3e.pdf
> http://www.jdsupra.com/documents/0d74ccf8-2d57-487d-88c2-98eee26f8236.pdf
> Thanks





[jira] [Closed] (PDFBOX-1222) PDFs created with idealsoftware.com's VPE are all wrong

2014-10-23 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler closed PDFBOX-1222.
--
   Resolution: Fixed
Fix Version/s: 1.7.0
 Assignee: Andreas Lehmkühler

The text extraction works fine since PDFBox 1.7.0. The "Comparison method 
violates its general contract" exception no longer appears as of 1.7.0 either.


> PDFs created with idealsoftware.com's VPE are all wrong
> ---
>
> Key: PDFBOX-1222
> URL: https://issues.apache.org/jira/browse/PDFBOX-1222
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 1.6.0
>Reporter: Radek
>Assignee: Andreas Lehmkühler
> Fix For: 1.7.0
>
> Attachments: rtf.pdf
>
>
> Follow the steps:
> 1. Download the example pdf I'll attach. It's the same as "example rich text 
> format" pdf from idealsoftware.com but with text extraction protection 
> disabled.
> 2a. java -jar pdfbox-app-1.6.0.jar ExtractText -sort rtf.pdf extr.txt
> Actual results:
> Text is all gibberish. If you look at it very carefully, sorting "reads" the 
> text vertically and you find first characters of each line first, then second 
> characters of each line, etc.
> Moreover, on jdk7: java.lang.IllegalArgumentException: Comparison method 
> violates its general contract! (that's the text position sorting comparator)
> Poking around the code indicates that sorting is correct *if* character 
> rotation was 270 degrees. It (correctly?) calculates it as zero instead.
> 2b. java -jar pdfbox-app-1.6.0.jar ExtractText rtf.pdf extr.txt
> Actual results:
> Text is fine, but each page is glued to a single line. Poking around the code 
> indicates that character offsets go down correctly, but expected line height 
> is huge (full page height or width?) and therefore they never go down 
> sufficiently to trigger a newline detection.
> So, there's something very wrong with character positions in those files, 
> making pdfbox not extract text correctly.
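
For background on the JDK 7 error quoted above: the sort used since Java 7 may throw IllegalArgumentException when a comparator is not a consistent total order, for example when it treats nearby coordinates as equal in a way that is not transitive. The sketch below is not the PDFBox text position comparator. Glyph and ReadingOrder are hypothetical types, and the ordering (top to bottom, then left to right, exact coordinates) is an assumption made only to show a comparator that cannot violate the contract.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for an extracted character with a page position.
final class Glyph
{
    final String text;
    final float x;
    final float y;

    Glyph(String text, float x, float y)
    {
        this.text = text;
        this.x = x;
        this.y = y;
    }
}

final class ReadingOrder
{
    // Smaller y first (top to bottom), then smaller x first (left to right).
    static final Comparator<Glyph> COMPARATOR =
            Comparator.<Glyph>comparingDouble(g -> g.y).thenComparingDouble(g -> g.x);

    static String sortAndJoin(List<Glyph> glyphs)
    {
        List<Glyph> sorted = new ArrayList<>(glyphs); // leave the caller's list untouched
        sorted.sort(COMPARATOR);
        StringBuilder result = new StringBuilder();
        for (Glyph glyph : sorted)
        {
            result.append(glyph.text);
        }
        return result.toString();
    }
}
```

Real pages usually have slightly uneven baselines, so a practical sorter would first group glyphs into lines and then sort within each line; comparing exact coordinates as done here is only the simplest consistent choice.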





[jira] [Closed] (PDFBOX-1244) the text content extracted by PDFBOX is not as the same as it is displayed in Adobe reader

2014-10-23 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler closed PDFBOX-1244.
--
Resolution: Not a Problem
  Assignee: Andreas Lehmkühler

PDFBox extracts the very same text as Acrobat Reader. And yes, it's not 
the displayed text, which suggests that the ToUnicode mapping of 
the PDF is broken. 

Closed as "Not a Problem".



> the text content extracted by PDFBOX is not as the same as it is displayed in 
> Adobe reader
> --
>
> Key: PDFBOX-1244
> URL: https://issues.apache.org/jira/browse/PDFBOX-1244
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 1.6.0, 1.7.0
> Environment: windows xp, Eclipse 3.2.0
>Reporter: huangchangan
>Assignee: Andreas Lehmkühler
> Attachments: P020101210619863754780 214.pdf
>
>
> Hello, 
> I used PDFBox to extract the text content from the attached PDF document and 
> found that the extracted text is "年预", but the text displayed in Adobe Reader 
> is "年期". I want to know how to get the correct text content (as Adobe Reader 
> shows it) from this kind of PDF document with PDFBox.





[jira] [Resolved] (PDFBOX-1152) Gets scrambled japanese text while reading a PDF file

2014-10-23 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler resolved PDFBOX-1152.

   Resolution: Fixed
Fix Version/s: 2.0.0
 Assignee: Andreas Lehmkühler

I've no idea how you created that xml output (AFAIK PDFBox doesn't provide any 
tool doing that), but what I know is that the text extraction works fine with 
the current trunk. 

> Gets scrambled japanese text while reading a PDF file
> -
>
> Key: PDFBOX-1152
> URL: https://issues.apache.org/jira/browse/PDFBOX-1152
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 1.6.0
> Environment: Windows XP Service Pack 3, P4, 1GB 
>Reporter: Suresh Somanathan
>Assignee: Andreas Lehmkühler
>  Labels: PDFBox
> Fix For: 2.0.0
>
> Attachments: SamplePDF.pdf, SamplePDF.xml
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> During conversion of a Japanese PDF file to XML the output Japanese text gets 
> scrambled.





[jira] [Updated] (PDFBOX-1151) StreamCorruptedException on bad PDF with -force

2014-10-23 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler updated PDFBOX-1151:
---
Component/s: (was: Text extraction)
 Parsing

> StreamCorruptedException on bad PDF with -force
> ---
>
> Key: PDFBOX-1151
> URL: https://issues.apache.org/jira/browse/PDFBOX-1151
> Project: PDFBox
>  Issue Type: Bug
>  Components: Parsing
>Affects Versions: 1.6.0, 1.8.7, 2.0.0
> Environment: Windows Vista
> Sun JDK 1.6.0_26
>Reporter: Stas Shaposhnikov
> Attachments: PDFStreamEngine.patch, test.pdf
>
>
> I am getting the StreamCorruptedException when trying to parse a possibly 
> invalid PDF document even if the -force option is specified.
> Stack trace:
> java.io.StreamCorruptedException: Error: data is null
>   at org.apache.pdfbox.filter.LZWFilter.decode(LZWFilter.java:82)
>   at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:301)
>   at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:221)
>   at 
> org.apache.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:156)
>   at 
> org.apache.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:105)
>   at 
> org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:264)
>   at 
> org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:251)
>   at 
> org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
>   at 
> org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:442)
>   at 
> org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
>   at 
> org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
>   at org.apache.pdfbox.ExtractText.startExtraction(ExtractText.java:256)
>   at org.apache.pdfbox.ExtractText.main(ExtractText.java:76)
>   at org.apache.pdfbox.PDFBox.main(PDFBox.java:42)
> My suggestion is to skip bad sub-streams without throwing exceptions in 
> PDFStreamEngine.processSubStream() when forceParsing is true.
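
A minimal sketch of the suggestion quoted above, assuming a lenient flag that decides whether a failing sub-stream is skipped or its exception is rethrown. The names LenientStreamSketch, SubStream and processAll are hypothetical and are not PDFBox API; the sketch only handles IOException, so other failures (for example a NullPointerException) would still propagate.

```java
import java.io.IOException;
import java.io.StreamCorruptedException;
import java.util.ArrayList;
import java.util.List;

public class LenientStreamSketch {
    /** Hypothetical stand-in for a content sub-stream; not a PDFBox type. */
    public interface SubStream {
        String decode() throws IOException;
    }

    /**
     * Decodes every sub-stream in order. When forceParsing is true, a
     * sub-stream that fails with an IOException is skipped; otherwise the
     * first failure is rethrown to the caller.
     */
    public static List<String> processAll(List<SubStream> streams, boolean forceParsing)
            throws IOException {
        List<String> decoded = new ArrayList<>();
        for (SubStream stream : streams) {
            try {
                decoded.add(stream.decode());
            } catch (IOException e) {
                if (!forceParsing) {
                    throw e;
                }
                // Lenient mode: report the problem and continue with the next sub-stream.
                System.err.println("Skipping bad sub-stream: " + e.getMessage());
            }
        }
        return decoded;
    }

    public static void main(String[] args) throws IOException {
        List<SubStream> streams = new ArrayList<>();
        streams.add(() -> "page text");
        streams.add(() -> { throw new StreamCorruptedException("Error: data is null"); });
        streams.add(() -> "more text");
        // With forceParsing enabled the corrupted entry is skipped.
        System.out.println(processAll(streams, true));
    }
}
```

Whether silently skipping is acceptable depends on the caller; logging each skipped sub-stream, as the sketch does, at least keeps the data loss visible.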





[jira] [Commented] (PDFBOX-1151) StreamCorruptedException on bad PDF with -force

2014-10-23 Thread JIRA

[ 
https://issues.apache.org/jira/browse/PDFBOX-1151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181559#comment-14181559
 ] 

Andreas Lehmkühler commented on PDFBOX-1151:


Even the new self-repair doesn't work here. I get a different exception, but the 
result is the same: nothing is rendered.

{code}
Exception in thread "AWT-EventQueue-0" java.lang.NullPointerException
at org.apache.pdfbox.filter.LZWFilter.doLZWDecode(LZWFilter.java:120)
at org.apache.pdfbox.filter.LZWFilter.decode(LZWFilter.java:95)
at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:386)
at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:327)
at 
org.apache.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:244)
at 
org.apache.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:109)
at 
org.apache.pdfbox.contentstream.PDFStreamEngine.processSubStream(PDFStreamEngine.java:216)
at 
org.apache.pdfbox.contentstream.PDFStreamEngine.processSubStream(PDFStreamEngine.java:198)
at 
org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:152)
at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:179)
at 
org.apache.pdfbox.rendering.PDFRenderer.renderPage(PDFRenderer.java:215)
at 
org.apache.pdfbox.rendering.PDFRenderer.renderPageToGraphics(PDFRenderer.java:177)
at 
org.apache.pdfbox.rendering.PDFRenderer.renderPageToGraphics(PDFRenderer.java:161)
at org.apache.pdfbox.tools.gui.PDFPagePanel.paint(PDFPagePanel.java:87)
{code}


> StreamCorruptedException on bad PDF with -force
> ---
>
> Key: PDFBOX-1151
> URL: https://issues.apache.org/jira/browse/PDFBOX-1151
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 1.6.0, 1.8.7, 2.0.0
> Environment: Windows Vista
> Sun JDK 1.6.0_26
>Reporter: Stas Shaposhnikov
> Attachments: PDFStreamEngine.patch, test.pdf
>
>
> I am getting the StreamCorruptedException when trying to parse a possibly 
> invalid PDF document even if the -force option is specified.
> Stack trace:
> java.io.StreamCorruptedException: Error: data is null
>   at org.apache.pdfbox.filter.LZWFilter.decode(LZWFilter.java:82)
>   at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:301)
>   at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:221)
>   at 
> org.apache.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:156)
>   at 
> org.apache.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:105)
>   at 
> org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:264)
>   at 
> org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:251)
>   at 
> org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
>   at 
> org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:442)
>   at 
> org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
>   at 
> org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
>   at org.apache.pdfbox.ExtractText.startExtraction(ExtractText.java:256)
>   at org.apache.pdfbox.ExtractText.main(ExtractText.java:76)
>   at org.apache.pdfbox.PDFBox.main(PDFBox.java:42)
> My suggestion is to skip bad sub-streams without throwing exceptions in 
> PDFStreamEngine.processSubStream() when forceParsing is true.





[jira] [Updated] (PDFBOX-1151) StreamCorruptedException on bad PDF with -force

2014-10-23 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler updated PDFBOX-1151:
---
Affects Version/s: 2.0.0
   1.8.7

> StreamCorruptedException on bad PDF with -force
> ---
>
> Key: PDFBOX-1151
> URL: https://issues.apache.org/jira/browse/PDFBOX-1151
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 1.6.0, 1.8.7, 2.0.0
> Environment: Windows Vista
> Sun JDK 1.6.0_26
>Reporter: Stas Shaposhnikov
> Attachments: PDFStreamEngine.patch, test.pdf
>
>
> I am getting the StreamCorruptedException when trying to parse a possibly 
> invalid PDF document even if the -force option is specified.
> Stack trace:
> java.io.StreamCorruptedException: Error: data is null
>   at org.apache.pdfbox.filter.LZWFilter.decode(LZWFilter.java:82)
>   at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:301)
>   at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:221)
>   at 
> org.apache.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:156)
>   at 
> org.apache.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:105)
>   at 
> org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:264)
>   at 
> org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:251)
>   at 
> org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
>   at 
> org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:442)
>   at 
> org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
>   at 
> org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
>   at org.apache.pdfbox.ExtractText.startExtraction(ExtractText.java:256)
>   at org.apache.pdfbox.ExtractText.main(ExtractText.java:76)
>   at org.apache.pdfbox.PDFBox.main(PDFBox.java:42)
> My suggestion is to skip bad sub-streams without throwing exceptions in 
> PDFStreamEngine.processSubStream() when forceParsing is true.





[jira] [Comment Edited] (PDFBOX-2447) "Cannot save a document which has been closed" when encrypting

2014-10-23 Thread Ralf Hauser (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181510#comment-14181510
 ] 

Ralf Hauser edited comment on PDFBOX-2447 at 10/23/14 4:09 PM:
---

Good point, but I guess the patch works with a minor enhancement, as per 
patch2447a.txt


was (Author: hau...@acm.org):
Good point, but I guess the patch works with a minor enhancement

> "Cannot save a document which has been closed" when encrypting
> --
>
> Key: PDFBOX-2447
> URL: https://issues.apache.org/jira/browse/PDFBOX-2447
> Project: PDFBox
>  Issue Type: Bug
>  Components: PDModel
>Affects Versions: 2.0.0
> Environment: java7 deb7
>Reporter: Ralf Hauser
> Attachments: patch2447.txt, patch2447a.txt
>
>
> InputStream content = ...;
> int keyLength = 256;
> AccessPermission ap = new AccessPermission();
> StandardProtectionPolicy spp = new 
> StandardProtectionPolicy(
> symmPw, symmPw, ap);
> spp.setEncryptionKeyLength(keyLength);
> document.protect(spp);
> ByteArrayOutputStream baos = new ByteArrayOutputStream();
> document.save(baos);
> in the save() the above mentioned exception is thrown (wasn't with the 
> 2013-11 snapshot)





[jira] [Updated] (PDFBOX-2447) "Cannot save a document which has been closed" when encrypting

2014-10-23 Thread Ralf Hauser (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ralf Hauser updated PDFBOX-2447:

Attachment: patch2447a.txt

Good point, but I guess the patch works with a minor enhancement

> "Cannot save a document which has been closed" when encrypting
> --
>
> Key: PDFBOX-2447
> URL: https://issues.apache.org/jira/browse/PDFBOX-2447
> Project: PDFBox
>  Issue Type: Bug
>  Components: PDModel
>Affects Versions: 2.0.0
> Environment: java7 deb7
>Reporter: Ralf Hauser
> Attachments: patch2447.txt, patch2447a.txt
>
>
> InputStream content = ...;
> int keyLength = 256;
> AccessPermission ap = new AccessPermission();
> StandardProtectionPolicy spp = new 
> StandardProtectionPolicy(
> symmPw, symmPw, ap);
> spp.setEncryptionKeyLength(keyLength);
> document.protect(spp);
> ByteArrayOutputStream baos = new ByteArrayOutputStream();
> document.save(baos);
> in the save() the above mentioned exception is thrown (wasn't with the 
> 2013-11 snapshot)





[jira] [Commented] (PDFBOX-2447) "Cannot save a document which has been closed" when encrypting

2014-10-23 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181486#comment-14181486
 ] 

Tilman Hausherr commented on PDFBOX-2447:
-

I'm afraid this won't work: what if the user really did close the 
document? I didn't test it, but from looking at the source code, I'd expect an 
NPE.

> "Cannot save a document which has been closed" when encrypting
> --
>
> Key: PDFBOX-2447
> URL: https://issues.apache.org/jira/browse/PDFBOX-2447
> Project: PDFBox
>  Issue Type: Bug
>  Components: PDModel
>Affects Versions: 2.0.0
> Environment: java7 deb7
>Reporter: Ralf Hauser
> Attachments: patch2447.txt
>
>
> InputStream content = ...;
> int keyLength = 256;
> AccessPermission ap = new AccessPermission();
> StandardProtectionPolicy spp = new 
> StandardProtectionPolicy(
> symmPw, symmPw, ap);
> spp.setEncryptionKeyLength(keyLength);
> document.protect(spp);
> ByteArrayOutputStream baos = new ByteArrayOutputStream();
> document.save(baos);
> in the save() the above mentioned exception is thrown (wasn't with the 
> 2013-11 snapshot)





[jira] [Commented] (PDFBOX-1595) PDFMerger failed with the following exception: java.lang.NullPointerException

2014-10-23 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181485#comment-14181485
 ] 

Tilman Hausherr commented on PDFBOX-1595:
-

Did you or somebody you know create the "bid.pdf" file? It has no xref table.

> PDFMerger failed with the following exception: java.lang.NullPointerException
> -
>
> Key: PDFBOX-1595
> URL: https://issues.apache.org/jira/browse/PDFBOX-1595
> Project: PDFBox
>  Issue Type: Bug
>  Components: Utilities
>Affects Versions: 1.8.1
> Environment: Windows
>Reporter: Ernst Eibensteiner
>Assignee: Andreas Lehmkühler
>  Labels: PDFMergerUtility
> Attachments: 2nd Testfile.pdf, bid.pdf
>
>
> Merging two PDF documents leads to a NullPointerException.
> From my point of view, the PDF document is missing the xref and startxref entries.
> java -jar pdfbox-app-1.8.1.jar PDFMerger "bid.pdf" "2nd Testfile.pdf" 
> output.pdf
> Mai 08, 2013 12:52:03 PM org.apache.pdfbox.pdfparser.XrefTrailerResolver 
> setTrailer
> WARNING: Cannot add trailer because XRef start was not signalled.
> Mai 08, 2013 12:52:03 PM org.apache.pdfbox.pdfparser.XrefTrailerResolver 
> setStartxref
> WARNING: Did not found XRef object at specified startxref position 0
> PDFMerger failed with the following exception:
> java.lang.NullPointerException
> at 
> org.apache.pdfbox.util.PDFMergerUtility.appendDocument(PDFMergerUtility.java:257)
> at 
> org.apache.pdfbox.util.PDFMergerUtility.mergeDocuments(PDFMergerUtility.java:188)
> at org.apache.pdfbox.PDFMerger.merge(PDFMerger.java:68)
> at org.apache.pdfbox.PDFMerger.main(PDFMerger.java:44)
> at org.apache.pdfbox.PDFBox.main(PDFBox.java:83)





[jira] [Updated] (PDFBOX-2409) got the wrong result from Arabic text extraction

2014-10-23 Thread EugenePig (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

EugenePig updated PDFBOX-2409:
--
Attachment: THESSALONIANS_win7_firefox.jpg

That is not my Mac, so I can't install fonts on it. Instead I tested on the 
Arabic version of Windows 7 with Firefox and got the same result as you. Please see 
THESSALONIANS_win7_firefox.jpg. It is very strange, however: only the diacritic 
marks on the first line are mostly wrong; the others are correct.

> got the wrong result from Arabic text extraction
> 
>
> Key: PDFBOX-2409
> URL: https://issues.apache.org/jira/browse/PDFBOX-2409
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 1.8.7, 2.0.0
> Environment: Ubuntu 14.04 64bit
> java version "1.8.0_20"
>Reporter: EugenePig
>Assignee: John Hewson
> Attachments: THESSALONIANS.pdf, THESSALONIANS.txt, 
> THESSALONIANS_win7_firefox.jpg, jahewson.mac.png
>
>
> java -jar pdfbox-app-1.8.7.jar ExtractText -sort -encoding UTF-8 
> THESSALONIANS.pdf
> java -jar pdfbox-app-2.0.0-SNAPSHOT.jar ExtractText -sort -encoding UTF-8 
> THESSALONIANS.pdf
> Please compare THESSALONIANS.txt.jpg with THESSALONIANS.pdf. There are a lot 
> of differences. I just marked a few differences with red circles.





[jira] [Updated] (PDFBOX-2409) got the wrong result from Arabic text extraction

2014-10-23 Thread EugenePig (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

EugenePig updated PDFBOX-2409:
--
Attachment: (was: THESSALONIANS.txt.jpg)

> got the wrong result from Arabic text extraction
> 
>
> Key: PDFBOX-2409
> URL: https://issues.apache.org/jira/browse/PDFBOX-2409
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 1.8.7, 2.0.0
> Environment: Ubuntu 14.04 64bit
> java version "1.8.0_20"
>Reporter: EugenePig
>Assignee: John Hewson
> Attachments: THESSALONIANS.pdf, THESSALONIANS.txt, jahewson.mac.png
>
>
> java -jar pdfbox-app-1.8.7.jar ExtractText -sort -encoding UTF-8 
> THESSALONIANS.pdf
> java -jar pdfbox-app-2.0.0-SNAPSHOT.jar ExtractText -sort -encoding UTF-8 
> THESSALONIANS.pdf
> Please compare THESSALONIANS.txt.jpg with THESSALONIANS.pdf. There are a lot 
> of differences. I just marked a few differences with red circles.





[jira] [Updated] (PDFBOX-2409) got the wrong result from Arabic text extraction

2014-10-23 Thread EugenePig (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

EugenePig updated PDFBOX-2409:
--
Attachment: (was: THESSALONIANS.txt.mac.jpg)

> got the wrong result from Arabic text extraction
> 
>
> Key: PDFBOX-2409
> URL: https://issues.apache.org/jira/browse/PDFBOX-2409
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 1.8.7, 2.0.0
> Environment: Ubuntu 14.04 64bit
> java version "1.8.0_20"
>Reporter: EugenePig
>Assignee: John Hewson
> Attachments: THESSALONIANS.pdf, THESSALONIANS.txt, 
> THESSALONIANS.txt.jpg, jahewson.mac.png
>
>
> java -jar pdfbox-app-1.8.7.jar ExtractText -sort -encoding UTF-8 
> THESSALONIANS.pdf
> java -jar pdfbox-app-2.0.0-SNAPSHOT.jar ExtractText -sort -encoding UTF-8 
> THESSALONIANS.pdf
> Please compare THESSALONIANS.txt.jpg with THESSALONIANS.pdf. There are a lot 
> of differences. I just marked a few differences with red circles.





[jira] [Commented] (PDFBOX-2439) ArrayIndexOutOfBoundsException in multithreaded system

2014-10-23 Thread simon steiner (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181359#comment-14181359
 ] 

simon steiner commented on PDFBOX-2439:
---

I also get
Exception LCMS error 13: Couldn't link the profiles

> ArrayIndexOutOfBoundsException in multithreaded system
> --
>
> Key: PDFBOX-2439
> URL: https://issues.apache.org/jira/browse/PDFBOX-2439
> Project: PDFBox
>  Issue Type: Bug
>  Components: FontBox
>Affects Versions: 2.0.0
>Reporter: simon steiner
>Priority: Critical
> Fix For: 2.0.0
>
> Attachments: TestPDFBox.java
>
>
> When it loads a replacement font from the OS, I sometimes get:
> Exception in thread "Thread-27" java.lang.ArrayIndexOutOfBoundsException: 
> 40036
>   at 
> org.apache.fontbox.ttf.GlyfSimpleDescript.readFlags(GlyfSimpleDescript.java:197)
>   at 
> org.apache.fontbox.ttf.GlyfSimpleDescript.<init>(GlyfSimpleDescript.java:78)
>   at org.apache.fontbox.ttf.GlyphData.initData(GlyphData.java:58)
>   at org.apache.fontbox.ttf.GlyphTable.getGlyph(GlyphTable.java:161)
>   at 
> org.apache.pdfbox.rendering.font.TTFGlyph2D.getPathForGID(TTFGlyph2D.java:140)
>   at 
> org.apache.pdfbox.rendering.font.TTFGlyph2D.getPathForCharacterCode(TTFGlyph2D.java:92)
>   at 
> org.apache.pdfbox.rendering.PageDrawer.drawGlyph2D(PageDrawer.java:392)
>   at org.apache.pdfbox.rendering.PageDrawer.showGlyph(PageDrawer.java:372)
>   at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.showText(PDFStreamEngine.java:411)
>   at org.apache.pdfbox.rendering.PageDrawer.showText(PageDrawer.java:346)
>   at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.showTextStrings(PDFStreamEngine.java:322)
>   at 
> org.apache.pdfbox.contentstream.operator.text.ShowTextAdjusted.process(ShowTextAdjusted.java:38)
>   at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:482)
>   at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processSubStream(PDFStreamEngine.java:236)
>   at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processSubStream(PDFStreamEngine.java:201)
>   at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:155)
>   at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:177)
>   at 
> org.apache.pdfbox.rendering.PDFRenderer.renderPage(PDFRenderer.java:228)
>   at 
> org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:160)
>   at 
> org.apache.pdfbox.rendering.PDFRenderer.renderImageWithDPI(PDFRenderer.java:109)





[jira] [Commented] (PDFBOX-2439) ArrayIndexOutOfBoundsException in multithreaded system

2014-10-23 Thread simon steiner (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181353#comment-14181353
 ] 

simon steiner commented on PDFBOX-2439:
---

Is GlyphTable.cache used?

> ArrayIndexOutOfBoundsException in multithreaded system
> --
>
> Key: PDFBOX-2439
> URL: https://issues.apache.org/jira/browse/PDFBOX-2439
> Project: PDFBox
>  Issue Type: Bug
>  Components: FontBox
>Affects Versions: 2.0.0
>Reporter: simon steiner
>Priority: Critical
> Fix For: 2.0.0
>
> Attachments: TestPDFBox.java
>
>
> When it loads a replacement font from the OS, I sometimes get:
> Exception in thread "Thread-27" java.lang.ArrayIndexOutOfBoundsException: 
> 40036
>   at 
> org.apache.fontbox.ttf.GlyfSimpleDescript.readFlags(GlyfSimpleDescript.java:197)
>   at 
> org.apache.fontbox.ttf.GlyfSimpleDescript.<init>(GlyfSimpleDescript.java:78)
>   at org.apache.fontbox.ttf.GlyphData.initData(GlyphData.java:58)
>   at org.apache.fontbox.ttf.GlyphTable.getGlyph(GlyphTable.java:161)
>   at 
> org.apache.pdfbox.rendering.font.TTFGlyph2D.getPathForGID(TTFGlyph2D.java:140)
>   at 
> org.apache.pdfbox.rendering.font.TTFGlyph2D.getPathForCharacterCode(TTFGlyph2D.java:92)
>   at 
> org.apache.pdfbox.rendering.PageDrawer.drawGlyph2D(PageDrawer.java:392)
>   at org.apache.pdfbox.rendering.PageDrawer.showGlyph(PageDrawer.java:372)
>   at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.showText(PDFStreamEngine.java:411)
>   at org.apache.pdfbox.rendering.PageDrawer.showText(PageDrawer.java:346)
>   at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.showTextStrings(PDFStreamEngine.java:322)
>   at 
> org.apache.pdfbox.contentstream.operator.text.ShowTextAdjusted.process(ShowTextAdjusted.java:38)
>   at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:482)
>   at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processSubStream(PDFStreamEngine.java:236)
>   at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processSubStream(PDFStreamEngine.java:201)
>   at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:155)
>   at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:177)
>   at 
> org.apache.pdfbox.rendering.PDFRenderer.renderPage(PDFRenderer.java:228)
>   at 
> org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:160)
>   at 
> org.apache.pdfbox.rendering.PDFRenderer.renderImageWithDPI(PDFRenderer.java:109)





[jira] [Commented] (PDFBOX-2439) ArrayIndexOutOfBoundsException in multithreaded system

2014-10-23 Thread simon steiner (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181334#comment-14181334
 ] 

simon steiner commented on PDFBOX-2439:
---

Quick hack:
--- 
a/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/font/FileSystemFontProvider.java
+++ 
b/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/font/FileSystemFontProvider.java
@@ -206,7 +206,7 @@ final class FileSystemFontProvider implements FontProvider
 {
 ttf = ttfParser.parse(file);
 
-ttfFonts.put(postScriptName, ttf);
+//ttfFonts.put(postScriptName, ttf);
 if (LOG.isDebugEnabled())
 {
 LOG.debug("Loaded " + postScriptName + " from " + file);
@@ -244,7 +244,7 @@ final class FileSystemFontProvider implements FontProvider
 byte[] bytes = IOUtils.toByteArray(input);
 CFFParser cffParser = new CFFParser();
 cff = cffParser.parse(bytes).get(0);
-cffFonts.put(postScriptName, cff);
+//cffFonts.put(postScriptName, cff);
 if (LOG.isDebugEnabled())
 {
 LOG.debug("Loaded " + postScriptName + " from " + file);
@@ -280,7 +280,7 @@ final class FileSystemFontProvider implements FontProvider
 {
 input = new FileInputStream(file);
 type1 = Type1Font.createWithPFB(input);
-type1Fonts.put(postScriptName, type1);
+//type1Fonts.put(postScriptName, type1);

> ArrayIndexOutOfBoundsException in multithreaded system
> --
>
> Key: PDFBOX-2439
> URL: https://issues.apache.org/jira/browse/PDFBOX-2439
> Project: PDFBox
>  Issue Type: Bug
>  Components: FontBox
>Affects Versions: 2.0.0
>Reporter: simon steiner
>Priority: Critical
> Fix For: 2.0.0
>
> Attachments: TestPDFBox.java
>
>
> When it loads a replacement font from the OS, I sometimes get:
> Exception in thread "Thread-27" java.lang.ArrayIndexOutOfBoundsException: 
> 40036
>   at 
> org.apache.fontbox.ttf.GlyfSimpleDescript.readFlags(GlyfSimpleDescript.java:197)
>   at 
> org.apache.fontbox.ttf.GlyfSimpleDescript.<init>(GlyfSimpleDescript.java:78)
>   at org.apache.fontbox.ttf.GlyphData.initData(GlyphData.java:58)
>   at org.apache.fontbox.ttf.GlyphTable.getGlyph(GlyphTable.java:161)
>   at 
> org.apache.pdfbox.rendering.font.TTFGlyph2D.getPathForGID(TTFGlyph2D.java:140)
>   at 
> org.apache.pdfbox.rendering.font.TTFGlyph2D.getPathForCharacterCode(TTFGlyph2D.java:92)
>   at 
> org.apache.pdfbox.rendering.PageDrawer.drawGlyph2D(PageDrawer.java:392)
>   at org.apache.pdfbox.rendering.PageDrawer.showGlyph(PageDrawer.java:372)
>   at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.showText(PDFStreamEngine.java:411)
>   at org.apache.pdfbox.rendering.PageDrawer.showText(PageDrawer.java:346)
>   at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.showTextStrings(PDFStreamEngine.java:322)
>   at 
> org.apache.pdfbox.contentstream.operator.text.ShowTextAdjusted.process(ShowTextAdjusted.java:38)
>   at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:482)
>   at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processSubStream(PDFStreamEngine.java:236)
>   at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processSubStream(PDFStreamEngine.java:201)
>   at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:155)
>   at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:177)
>   at 
> org.apache.pdfbox.rendering.PDFRenderer.renderPage(PDFRenderer.java:228)
>   at 
> org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:160)
>   at 
> org.apache.pdfbox.rendering.PDFRenderer.renderImageWithDPI(PDFRenderer.java:109)





[jira] [Comment Edited] (PDFBOX-2439) ArrayIndexOutOfBoundsException in multithreaded system

2014-10-23 Thread simon steiner (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181309#comment-14181309
 ] 

simon steiner edited comment on PDFBOX-2439 at 10/23/14 1:09 PM:
-

I guess you can't share FileSystemFontProvider

from ExternalFonts.java:
// lazy thread safe singleton
private static class DefaultFontProvider
{
private static final FontProvider INSTANCE = new 
FileSystemFontProvider();
}


was (Author: ssteiner1):
I guess you can't share FileSystemFontProvider

// lazy thread safe singleton
private static class DefaultFontProvider
{
private static final FontProvider INSTANCE = new 
FileSystemFontProvider();
}

> ArrayIndexOutOfBoundsException in multithreaded system
> --
>
> Key: PDFBOX-2439
> URL: https://issues.apache.org/jira/browse/PDFBOX-2439
> Project: PDFBox
>  Issue Type: Bug
>  Components: FontBox
>Affects Versions: 2.0.0
>Reporter: simon steiner
>Priority: Critical
> Fix For: 2.0.0
>
> Attachments: TestPDFBox.java
>
>
> When it loads a replacement font from the OS, I sometimes get:
> Exception in thread "Thread-27" java.lang.ArrayIndexOutOfBoundsException: 
> 40036
>   at 
> org.apache.fontbox.ttf.GlyfSimpleDescript.readFlags(GlyfSimpleDescript.java:197)
>   at 
> org.apache.fontbox.ttf.GlyfSimpleDescript.<init>(GlyfSimpleDescript.java:78)
>   at org.apache.fontbox.ttf.GlyphData.initData(GlyphData.java:58)
>   at org.apache.fontbox.ttf.GlyphTable.getGlyph(GlyphTable.java:161)
>   at 
> org.apache.pdfbox.rendering.font.TTFGlyph2D.getPathForGID(TTFGlyph2D.java:140)
>   at 
> org.apache.pdfbox.rendering.font.TTFGlyph2D.getPathForCharacterCode(TTFGlyph2D.java:92)
>   at 
> org.apache.pdfbox.rendering.PageDrawer.drawGlyph2D(PageDrawer.java:392)
>   at org.apache.pdfbox.rendering.PageDrawer.showGlyph(PageDrawer.java:372)
>   at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.showText(PDFStreamEngine.java:411)
>   at org.apache.pdfbox.rendering.PageDrawer.showText(PageDrawer.java:346)
>   at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.showTextStrings(PDFStreamEngine.java:322)
>   at 
> org.apache.pdfbox.contentstream.operator.text.ShowTextAdjusted.process(ShowTextAdjusted.java:38)
>   at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:482)
>   at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processSubStream(PDFStreamEngine.java:236)
>   at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processSubStream(PDFStreamEngine.java:201)
>   at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:155)
>   at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:177)
>   at 
> org.apache.pdfbox.rendering.PDFRenderer.renderPage(PDFRenderer.java:228)
>   at 
> org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:160)
>   at 
> org.apache.pdfbox.rendering.PDFRenderer.renderImageWithDPI(PDFRenderer.java:109)





[jira] [Commented] (PDFBOX-2439) ArrayIndexOutOfBoundsException in multithreaded system

2014-10-23 Thread simon steiner (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181309#comment-14181309
 ] 

simon steiner commented on PDFBOX-2439:
---

I guess you can't share FileSystemFontProvider

// lazy thread safe singleton
private static class DefaultFontProvider
{
private static final FontProvider INSTANCE = new 
FileSystemFontProvider();
}
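
To illustrate the concern, here is a minimal sketch, assuming the problem is shared mutable state behind a lazily created singleton. The holder idiom only guarantees that the instance is constructed once; it says nothing about the thread safety of what the instance caches. SharedFontCacheSketch, ParsedFont and getFont are hypothetical names, not PDFBox or FontBox API, and this is not claimed to be the actual cause or fix of this issue.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class SharedFontCacheSketch {
    /** Hypothetical immutable stand-in for a parsed font object. */
    public static final class ParsedFont {
        public final String name;
        ParsedFont(String name) { this.name = name; }
    }

    /** Counts how often the (hypothetical) parse step ran. */
    public static final AtomicInteger PARSE_COUNT = new AtomicInteger();

    // A concurrent map makes the cache itself safe to share between threads.
    private final Map<String, ParsedFont> cache = new ConcurrentHashMap<>();

    // Initialization-on-demand holder: the JVM creates INSTANCE exactly once,
    // on first access, without explicit locking.
    private static class Holder {
        private static final SharedFontCacheSketch INSTANCE = new SharedFontCacheSketch();
    }

    public static SharedFontCacheSketch getInstance() {
        return Holder.INSTANCE;
    }

    public ParsedFont getFont(String postScriptName) {
        // computeIfAbsent is atomic on ConcurrentHashMap, so each name is parsed once.
        return cache.computeIfAbsent(postScriptName, name -> {
            PARSE_COUNT.incrementAndGet();
            return new ParsedFont(name);
        });
    }

    public static void main(String[] args) throws InterruptedException {
        Thread[] workers = new Thread[8];
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Thread(() -> getInstance().getFont("Helvetica"));
            workers[i].start();
        }
        for (Thread worker : workers) {
            worker.join();
        }
        System.out.println("parse count: " + PARSE_COUNT.get());
    }
}
```

The cached objects must themselves be immutable or synchronized for this to help; if a cached font reads lazily from a shared stream, a thread-safe map alone is not enough.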

> ArrayIndexOutOfBoundsException in multithreaded system
> --
>
> Key: PDFBOX-2439
> URL: https://issues.apache.org/jira/browse/PDFBOX-2439
> Project: PDFBox
>  Issue Type: Bug
>  Components: FontBox
>Affects Versions: 2.0.0
>Reporter: simon steiner
>Priority: Critical
> Fix For: 2.0.0
>
> Attachments: TestPDFBox.java
>
>
> When it loads a replacement font from the OS, I sometimes get:
> Exception in thread "Thread-27" java.lang.ArrayIndexOutOfBoundsException: 
> 40036
>   at 
> org.apache.fontbox.ttf.GlyfSimpleDescript.readFlags(GlyfSimpleDescript.java:197)
>   at 
> org.apache.fontbox.ttf.GlyfSimpleDescript.<init>(GlyfSimpleDescript.java:78)
>   at org.apache.fontbox.ttf.GlyphData.initData(GlyphData.java:58)
>   at org.apache.fontbox.ttf.GlyphTable.getGlyph(GlyphTable.java:161)
>   at 
> org.apache.pdfbox.rendering.font.TTFGlyph2D.getPathForGID(TTFGlyph2D.java:140)
>   at 
> org.apache.pdfbox.rendering.font.TTFGlyph2D.getPathForCharacterCode(TTFGlyph2D.java:92)
>   at 
> org.apache.pdfbox.rendering.PageDrawer.drawGlyph2D(PageDrawer.java:392)
>   at org.apache.pdfbox.rendering.PageDrawer.showGlyph(PageDrawer.java:372)
>   at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.showText(PDFStreamEngine.java:411)
>   at org.apache.pdfbox.rendering.PageDrawer.showText(PageDrawer.java:346)
>   at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.showTextStrings(PDFStreamEngine.java:322)
>   at 
> org.apache.pdfbox.contentstream.operator.text.ShowTextAdjusted.process(ShowTextAdjusted.java:38)
>   at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:482)
>   at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processSubStream(PDFStreamEngine.java:236)
>   at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processSubStream(PDFStreamEngine.java:201)
>   at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:155)
>   at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:177)
>   at 
> org.apache.pdfbox.rendering.PDFRenderer.renderPage(PDFRenderer.java:228)
>   at 
> org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:160)
>   at 
> org.apache.pdfbox.rendering.PDFRenderer.renderImageWithDPI(PDFRenderer.java:109)





[jira] [Updated] (PDFBOX-2447) "Cannot save a document which has been closed" when encrypting

2014-10-23 Thread Ralf Hauser (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ralf Hauser updated PDFBOX-2447:

Attachment: patch2447.txt

It turns out that if I add an otherwise pointless call to

PDDocumentCatalog cat = document.getDocumentCatalog();

before the save(), the exception does not occur.

Apparently, the 
   PDDocument.documentCatalog
field is no longer initialized...

> "Cannot save a document which has been closed" when encrypting
> --
>
> Key: PDFBOX-2447
> URL: https://issues.apache.org/jira/browse/PDFBOX-2447
> Project: PDFBox
>  Issue Type: Bug
>  Components: PDModel
>Affects Versions: 2.0.0
> Environment: java7 deb7
>Reporter: Ralf Hauser
> Attachments: patch2447.txt
>
>
> InputStream content = ...;
> int keyLength = 256;
> AccessPermission ap = new AccessPermission();
> StandardProtectionPolicy spp = new 
> StandardProtectionPolicy(
> symmPw, symmPw, ap);
> spp.setEncryptionKeyLength(keyLength);
> document.protect(spp);
> ByteArrayOutputStream baos = new ByteArrayOutputStream();
> document.save(baos);
> In save(), the exception quoted in the summary is thrown (it wasn't with the 
> 2013-11 snapshot).





[jira] [Created] (PDFBOX-2447) "Cannot save a document which has been closed" when encrypting

2014-10-23 Thread Ralf Hauser (JIRA)
Ralf Hauser created PDFBOX-2447:
---

 Summary: "Cannot save a document which has been closed" when 
encrypting
 Key: PDFBOX-2447
 URL: https://issues.apache.org/jira/browse/PDFBOX-2447
 Project: PDFBox
  Issue Type: Bug
  Components: PDModel
Affects Versions: 2.0.0
 Environment: java7 deb7
Reporter: Ralf Hauser


InputStream content = ...;
int keyLength = 256;
AccessPermission ap = new AccessPermission();
StandardProtectionPolicy spp = new StandardProtectionPolicy(
symmPw, symmPw, ap);
spp.setEncryptionKeyLength(keyLength);
document.protect(spp);
ByteArrayOutputStream baos = new ByteArrayOutputStream();
document.save(baos);

In save(), the exception quoted in the summary is thrown (it wasn't with the 2013-11 
snapshot).





[jira] [Updated] (PDFBOX-1031) PDFMergerUtility - form fields disappear

2014-10-23 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler updated PDFBOX-1031:
---
Labels: PDFMergerUtility  (was: form util)

> PDFMergerUtility - form fields disappear
> 
>
> Key: PDFBOX-1031
> URL: https://issues.apache.org/jira/browse/PDFBOX-1031
> Project: PDFBox
>  Issue Type: Bug
>  Components: Utilities
>Affects Versions: 1.5.0
> Environment: Windows 7, Acrobat Pro 9, Eclipse Helios SR2
>Reporter: Gilad Denneboom
>  Labels: PDFMergerUtility
> Attachments: 1.pdf, 2.pdf, 3.pdf
>
>
> I merge two PDF files with form fields in them, but the resulting PDF contains no 
> fields.
> I believe this is related to 
> https://issues.apache.org/jira/browse/PDFBOX-930, which remains unsolved.





[jira] [Updated] (PDFBOX-1595) PDFMerger failed with the following exception: java.lang.NullPointerException

2014-10-23 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler updated PDFBOX-1595:
---
Labels: PDFMergerUtility  (was: )

> PDFMerger failed with the following exception: java.lang.NullPointerException
> -
>
> Key: PDFBOX-1595
> URL: https://issues.apache.org/jira/browse/PDFBOX-1595
> Project: PDFBox
>  Issue Type: Bug
>  Components: Utilities
>Affects Versions: 1.8.1
> Environment: Windows
>Reporter: Ernst Eibensteiner
>  Labels: PDFMergerUtility
> Attachments: 2nd Testfile.pdf, bid.pdf
>
>
> Merging two PDF documents leads to a NullPointerException.
> As far as I can tell, the PDF document is missing the xref and startxref entries.
> java -jar pdfbox-app-1.8.1.jar PDFMerger "bid.pdf" "2nd Testfile.pdf" 
> output.pdf
> Mai 08, 2013 12:52:03 PM org.apache.pdfbox.pdfparser.XrefTrailerResolver 
> setTrailer
> WARNING: Cannot add trailer because XRef start was not signalled.
> Mai 08, 2013 12:52:03 PM org.apache.pdfbox.pdfparser.XrefTrailerResolver 
> setStartxref
> WARNING: Did not found XRef object at specified startxref position 0
> PDFMerger failed with the following exception:
> java.lang.NullPointerException
> at 
> org.apache.pdfbox.util.PDFMergerUtility.appendDocument(PDFMergerUtility.java:257)
> at 
> org.apache.pdfbox.util.PDFMergerUtility.mergeDocuments(PDFMergerUtility.java:188)
> at org.apache.pdfbox.PDFMerger.merge(PDFMerger.java:68)
> at org.apache.pdfbox.PDFMerger.main(PDFMerger.java:44)
> at org.apache.pdfbox.PDFBox.main(PDFBox.java:83)





[jira] [Assigned] (PDFBOX-1595) PDFMerger failed with the following exception: java.lang.NullPointerException

2014-10-23 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler reassigned PDFBOX-1595:
--

Assignee: Andreas Lehmkühler

> PDFMerger failed with the following exception: java.lang.NullPointerException
> -
>
> Key: PDFBOX-1595
> URL: https://issues.apache.org/jira/browse/PDFBOX-1595
> Project: PDFBox
>  Issue Type: Bug
>  Components: Utilities
>Affects Versions: 1.8.1
> Environment: Windows
>Reporter: Ernst Eibensteiner
>Assignee: Andreas Lehmkühler
>  Labels: PDFMergerUtility
> Attachments: 2nd Testfile.pdf, bid.pdf
>
>
> Merging two PDF documents leads to a NullPointerException.
> As far as I can tell, the PDF document is missing the xref and startxref entries.
> java -jar pdfbox-app-1.8.1.jar PDFMerger "bid.pdf" "2nd Testfile.pdf" 
> output.pdf
> Mai 08, 2013 12:52:03 PM org.apache.pdfbox.pdfparser.XrefTrailerResolver 
> setTrailer
> WARNING: Cannot add trailer because XRef start was not signalled.
> Mai 08, 2013 12:52:03 PM org.apache.pdfbox.pdfparser.XrefTrailerResolver 
> setStartxref
> WARNING: Did not found XRef object at specified startxref position 0
> PDFMerger failed with the following exception:
> java.lang.NullPointerException
> at 
> org.apache.pdfbox.util.PDFMergerUtility.appendDocument(PDFMergerUtility.java:257)
> at 
> org.apache.pdfbox.util.PDFMergerUtility.mergeDocuments(PDFMergerUtility.java:188)
> at org.apache.pdfbox.PDFMerger.merge(PDFMerger.java:68)
> at org.apache.pdfbox.PDFMerger.main(PDFMerger.java:44)
> at org.apache.pdfbox.PDFBox.main(PDFBox.java:83)





[jira] [Updated] (PDFBOX-2003) Merging PDFs with interactive forms produces incorrect result

2014-10-23 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler updated PDFBOX-2003:
---
Labels: PDFMergerUtility  (was: )

> Merging PDFs with interactive forms produces incorrect result
> -
>
> Key: PDFBOX-2003
> URL: https://issues.apache.org/jira/browse/PDFBOX-2003
> Project: PDFBox
>  Issue Type: Bug
>  Components: AcroForm, Utilities
>Affects Versions: 1.8.4
>Reporter: Gerhard Temper
>  Labels: PDFMergerUtility
> Attachments: page1.pdf, page2.pdf, sample.pdf, sample2.pdf
>
>
> When merging a PDF with form fields (page2.pdf) to a PDF generated via FOP 
> (page1.pdf), the form fields of page2.pdf are not shown in the result in 
> Acrobat Reader.
> When merging page2.pdf twice, the form fields are not shown for the first 
> occurrence but are shown for the second occurrence.
> When merging page2.pdf with a PDF created by MS Word, the problem is not 
> reproducible. 
> Command line to reproduce the problem:
> java -classpath pdfbox-app-1.8.4.jar org.apache.pdfbox.PDFMerger page1.pdf 
> page2.pdf page2.pdf result.pdf





[jira] [Assigned] (PDFBOX-2001) Digital Signature information

2014-10-23 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler reassigned PDFBOX-2001:
--

Assignee: Andreas Lehmkühler

> Digital Signature information
> -
>
> Key: PDFBOX-2001
> URL: https://issues.apache.org/jira/browse/PDFBOX-2001
> Project: PDFBox
>  Issue Type: Bug
>Affects Versions: 1.8.3
>Reporter: Nicolas Kaczmarski
>Assignee: Andreas Lehmkühler
> Attachments: D.1_signiert.pdf, acrobatSignatureExample.PNG
>
>
> We have a signed PDF whose signature dictionary has no Type entry with the 
> value "Sig".
> As you can see in the standard PDF 32000-1:2008 - Table 252 - Entries in a 
> signature dictionary, this entry is optional:
> "(Optional) The type of PDF object that this dictionary describes; if 
> present, shall be Sig for a signature dictionary."
> But PDFBox seems to find a signature only if this "Sig" entry is 
> present.
> What is your position on that?
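
To make the question concrete, here is a minimal sketch of a check that treats the Type entry as optional. It is not PDFBox code: the dictionary is modelled as a plain map, the class and method names are hypothetical, and falling back on the Filter and Contents entries (both listed as required in Table 252) is an assumption of this sketch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of PDFBox. A PDF dictionary is modelled
// as a simple map from entry name to a string value.
final class SignatureDictCheck {

    // Returns true if the dictionary could be a signature dictionary.
    // Type is optional; if it is present it has to be "Sig".
    static boolean looksLikeSignature(Map<String, String> dict) {
        String type = dict.get("Type");
        if (type != null && !type.equals("Sig")) {
            return false; // explicitly typed as something else
        }
        // Assumption: without Type, rely on entries that Table 252 marks as required.
        return dict.containsKey("Filter") && dict.containsKey("Contents");
    }

    public static void main(String[] args) {
        Map<String, String> untyped = new HashMap<>();
        untyped.put("Filter", "Adobe.PPKLite");
        untyped.put("Contents", "<hex string>");
        System.out.println(looksLikeSignature(untyped));
    }
}
```

A check of this kind would accept the file described above, while still rejecting dictionaries that declare a different Type.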





[jira] [Updated] (PDFBOX-2308) setPageSeparator method in PDFTextStripper class has no effect

2014-10-23 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler updated PDFBOX-2308:
---
Fix Version/s: 2.0.0
   1.8.8

> setPageSeparator method in PDFTextStripper class has no effect
> --
>
> Key: PDFBOX-2308
> URL: https://issues.apache.org/jira/browse/PDFBOX-2308
> Project: PDFBox
>  Issue Type: Bug
>  Components: Rendering
>Affects Versions: 1.8.6, 1.8.7, 2.0.0
> Environment: Eclipse/Windows 64
>Reporter: Julien Savoyet
>Priority: Minor
> Fix For: 1.8.8, 2.0.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> I tried to use the setPageSeparator method of the PDFTextStripper class, 
> but it had no effect. 
> A quick look at the sources showed that the writePageSeparator method, 
> which uses the pageSeparator attribute, is never called anywhere.
> So, for this to work, it seems to me that a call to writePageSeparator 
> should be added, for example at the beginning of the writePage() method. 
> Could someone from the core team check whether I am right?
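
The suggestion above can be illustrated with a small stand-in class. This is only a sketch, not the PDFTextStripper source: MiniTextStripper and its getText(List) method are hypothetical, and skipping the separator before the first page is an extra assumption that the report does not make.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a text stripper; NOT the PDFBox class.
final class MiniTextStripper {
    private String pageSeparator = System.lineSeparator();
    private final StringBuilder out = new StringBuilder();

    void setPageSeparator(String separator) {
        this.pageSeparator = separator;
    }

    // Appends the configured separator. If nothing calls this method,
    // setPageSeparator() cannot have any visible effect.
    private void writePageSeparator() {
        out.append(pageSeparator);
    }

    // The separator is written at the start of each page except the first one.
    private void writePage(String pageText, boolean firstPage) {
        if (!firstPage) {
            writePageSeparator();
        }
        out.append(pageText);
    }

    // Concatenates the given page texts, separated by the page separator.
    String getText(List<String> pages) {
        out.setLength(0);
        for (int i = 0; i < pages.size(); i++) {
            writePage(pages.get(i), i == 0);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        MiniTextStripper stripper = new MiniTextStripper();
        stripper.setPageSeparator("\n---\n");
        System.out.println(stripper.getText(Arrays.asList("page one", "page two")));
    }
}
```

Removing the call inside writePage() reproduces the reported behaviour: the setter still stores the value, but the output never contains it.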





[jira] [Updated] (PDFBOX-2308) setPageSeparator method in PDFTextStripper class has no effect

2014-10-23 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler updated PDFBOX-2308:
---
Affects Version/s: 2.0.0
   1.8.7

> setPageSeparator method in PDFTextStripper class has no effect
> --
>
> Key: PDFBOX-2308
> URL: https://issues.apache.org/jira/browse/PDFBOX-2308
> Project: PDFBox
>  Issue Type: Bug
>  Components: Rendering
>Affects Versions: 1.8.6, 1.8.7, 2.0.0
> Environment: Eclipse/Windows 64
>Reporter: Julien Savoyet
>Priority: Minor
> Fix For: 1.8.8, 2.0.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> I tried to use the setPageSeparator method of the PDFTextStripper class, 
> but it had no effect. 
> A quick look at the sources showed that the writePageSeparator method, 
> which uses the pageSeparator attribute, is never called anywhere.
> So, for this to work, it seems to me that a call to writePageSeparator 
> should be added, for example at the beginning of the writePage() method. 
> Could someone from the core team check whether I am right?





[jira] [Updated] (PDFBOX-2377) Apparent regression in character mapping in a few files from govdocs1

2014-10-23 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler updated PDFBOX-2377:
---
Fix Version/s: 1.8.8

> Apparent regression in character mapping in a few files from govdocs1
> -
>
> Key: PDFBOX-2377
> URL: https://issues.apache.org/jira/browse/PDFBOX-2377
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 1.8.7, 2.0.0
>Reporter: Tim Allison
>Assignee: Andreas Lehmkühler
>Priority: Minor
>  Labels: regression
> Fix For: 1.8.8, 2.0.0
>
> Attachments: 290991-6.txt, 290991-7.txt, 290991-8.txt, 290991.pdf, 
> 312888.pdf, 357094-1.8.6.txt, 357094-1.8.8.txt, 357094.pdf, 764929.pdf, 
> PDFBOX2247-701542.pdf
>
>
> On a small number of test files in a 50k sample of pdfs from govdocs1, it 
> appears that some characters are no longer being extracted correctly in 1.8.7 
> when compared to 1.8.6.  I ran pdfbox's app.jar with ExtractText
> {noformat}
> 764929.pdf
> 1.8.6: Lang, Astrophysical Data: Planets and Stars
> 1.8.7: Lang, AefdaphyeiUSl DSfS: PlSnefe Snd EfSde,
> {noformat}
> and
> {noformat}
> 312888.pdf
> 1.8.6: Self-Assessment & Capability Description
> 1.8.7: Seff-Ammemmmehn & Cajabcfcns Demclcjncih
> {noformat}





[jira] [Updated] (PDFBOX-2377) Apparent regression in character mapping in a few files from govdocs1

2014-10-23 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler updated PDFBOX-2377:
---
Affects Version/s: 2.0.0

> Apparent regression in character mapping in a few files from govdocs1
> -
>
> Key: PDFBOX-2377
> URL: https://issues.apache.org/jira/browse/PDFBOX-2377
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 1.8.7, 2.0.0
>Reporter: Tim Allison
>Assignee: Andreas Lehmkühler
>Priority: Minor
>  Labels: regression
> Fix For: 1.8.8, 2.0.0
>
> Attachments: 290991-6.txt, 290991-7.txt, 290991-8.txt, 290991.pdf, 
> 312888.pdf, 357094-1.8.6.txt, 357094-1.8.8.txt, 357094.pdf, 764929.pdf, 
> PDFBOX2247-701542.pdf
>
>
> On a small number of test files in a 50k sample of pdfs from govdocs1, it 
> appears that some characters are no longer being extracted correctly in 1.8.7 
> when compared to 1.8.6.  I ran pdfbox's app.jar with ExtractText
> {noformat}
> 764929.pdf
> 1.8.6: Lang, Astrophysical Data: Planets and Stars
> 1.8.7: Lang, AefdaphyeiUSl DSfS: PlSnefe Snd EfSde,
> {noformat}
> and
> {noformat}
> 312888.pdf
> 1.8.6: Self-Assessment & Capability Description
> 1.8.7: Seff-Ammemmmehn & Cajabcfcns Demclcjncih
> {noformat}





[jira] [Updated] (PDFBOX-2377) Apparent regression in character mapping in a few files from govdocs1

2014-10-23 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler updated PDFBOX-2377:
---
Fix Version/s: 2.0.0

> Apparent regression in character mapping in a few files from govdocs1
> -
>
> Key: PDFBOX-2377
> URL: https://issues.apache.org/jira/browse/PDFBOX-2377
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 1.8.7, 2.0.0
>Reporter: Tim Allison
>Assignee: Andreas Lehmkühler
>Priority: Minor
>  Labels: regression
> Fix For: 1.8.8, 2.0.0
>
> Attachments: 290991-6.txt, 290991-7.txt, 290991-8.txt, 290991.pdf, 
> 312888.pdf, 357094-1.8.6.txt, 357094-1.8.8.txt, 357094.pdf, 764929.pdf, 
> PDFBOX2247-701542.pdf
>
>
> On a small number of test files in a 50k sample of pdfs from govdocs1, it 
> appears that some characters are no longer being extracted correctly in 1.8.7 
> when compared to 1.8.6.  I ran pdfbox's app.jar with ExtractText
> {noformat}
> 764929.pdf
> 1.8.6: Lang, Astrophysical Data: Planets and Stars
> 1.8.7: Lang, AefdaphyeiUSl DSfS: PlSnefe Snd EfSde,
> {noformat}
> and
> {noformat}
> 312888.pdf
> 1.8.6: Self-Assessment & Capability Description
> 1.8.7: Seff-Ammemmmehn & Cajabcfcns Demclcjncih
> {noformat}





[jira] [Updated] (PDFBOX-2412) Loading XFDF document fails with ClassCastException

2014-10-23 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler updated PDFBOX-2412:
---
Component/s: Parsing

> Loading XFDF document fails with ClassCastException
> ---
>
> Key: PDFBOX-2412
> URL: https://issues.apache.org/jira/browse/PDFBOX-2412
> Project: PDFBox
>  Issue Type: Bug
>  Components: Parsing
>Affects Versions: 1.8.7, 2.0.0
> Environment: MacOS X 10.9.5, Java 1.7.0_65
>Reporter: Thomas Krammer
>
> When loading this XFDF document
> {code:xml}
> 
> <xfdf xmlns="http://ns.adobe.com/xfdf/" xml:space="preserve">
> 
> Erblasser
> 
> 
> {code}
> using {{FDFDocument.loadXFDF(new File("ttt.xfdf"));}}
> I get the following exception:
> {code}
> java.lang.ClassCastException: 
> com.sun.org.apache.xerces.internal.dom.DeferredTextImpl cannot be cast to 
> org.w3c.dom.Element
>   at 
> org.apache.pdfbox.pdmodel.fdf.FDFDictionary.<init>(FDFDictionary.java:105)
>   at org.apache.pdfbox.pdmodel.fdf.FDFCatalog.<init>(FDFCatalog.java:68)
>   at 
> org.apache.pdfbox.pdmodel.fdf.FDFDocument.<init>(FDFDocument.java:101)
>   at 
> org.apache.pdfbox.pdmodel.fdf.FDFDocument.loadXFDF(FDFDocument.java:251)
>   at 
> org.apache.pdfbox.pdmodel.fdf.FDFDocument.loadXFDF(FDFDocument.java:236)
> ...
> {code}
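
A ClassCastException of this kind typically appears when every child node of a DOM element is cast to Element although whitespace between tags is delivered as Text nodes. Whether that is what happens at FDFDictionary.java:105 is an assumption; the sketch below only shows the general pattern with the JDK's own DOM parser. The class ChildElements and the sample XML are made up for illustration and are not the reporter's file.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of PDFBox.
final class ChildElements {

    // Returns only the element children, skipping text, comment and other nodes.
    static List<Element> of(Element parent) {
        List<Element> result = new ArrayList<>();
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child instanceof Element) { // line breaks between tags are Text nodes
                result.add((Element) child);
            }
        }
        return result;
    }

    // Parses an XML string with the default (non-validating) JDK parser.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample: a line break separates the tags.
        String xml = "<root>\n<child name=\"a\"/>\n<child name=\"b\"/>\n</root>";
        Element root = parse(xml).getDocumentElement();
        System.out.println("child nodes:      " + root.getChildNodes().getLength());
        System.out.println("element children: " + of(root).size());
    }
}
```

Casting root.getFirstChild() to Element in this sample would fail in the same way as the trace above, because the first child is the line break after the opening tag.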





[jira] [Updated] (PDFBOX-2412) Loading XFDF document fails with ClassCastException

2014-10-23 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler updated PDFBOX-2412:
---
Affects Version/s: 2.0.0

> Loading XFDF document fails with ClassCastException
> ---
>
> Key: PDFBOX-2412
> URL: https://issues.apache.org/jira/browse/PDFBOX-2412
> Project: PDFBox
>  Issue Type: Bug
>  Components: Parsing
>Affects Versions: 1.8.7, 2.0.0
> Environment: MacOS X 10.9.5, Java 1.7.0_65
>Reporter: Thomas Krammer
>
> When loading this XFDF document
> {code:xml}
> 
> <xfdf xmlns="http://ns.adobe.com/xfdf/" xml:space="preserve">
> 
> Erblasser
> 
> 
> {code}
> using {{FDFDocument.loadXFDF(new File("ttt.xfdf"));}}
> I get the following exception:
> {code}
> java.lang.ClassCastException: 
> com.sun.org.apache.xerces.internal.dom.DeferredTextImpl cannot be cast to 
> org.w3c.dom.Element
>   at 
> org.apache.pdfbox.pdmodel.fdf.FDFDictionary.<init>(FDFDictionary.java:105)
>   at org.apache.pdfbox.pdmodel.fdf.FDFCatalog.<init>(FDFCatalog.java:68)
>   at 
> org.apache.pdfbox.pdmodel.fdf.FDFDocument.<init>(FDFDocument.java:101)
>   at 
> org.apache.pdfbox.pdmodel.fdf.FDFDocument.loadXFDF(FDFDocument.java:251)
>   at 
> org.apache.pdfbox.pdmodel.fdf.FDFDocument.loadXFDF(FDFDocument.java:236)
> ...
> {code}





[jira] [Updated] (PDFBOX-2413) Loaded FDF document returns null fields

2014-10-23 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler updated PDFBOX-2413:
---
Affects Version/s: 2.0.0

> Loaded FDF document returns null fields
> ---
>
> Key: PDFBOX-2413
> URL: https://issues.apache.org/jira/browse/PDFBOX-2413
> Project: PDFBox
>  Issue Type: Bug
>  Components: Parsing
>Affects Versions: 1.8.7, 2.0.0
> Environment: MacOS X 10.9.5, Java 1.7.0_65
>Reporter: Thomas Krammer
> Attachments: erbst_erkl_form.fdf
>
>
> When loading the FDF document below using {{FDFDocument.load(InputStream)}} 
> it will load fine but the returned FDFDocument instance returns null when I 
> call {{fdf.getCatalog().getFDF().getFields()}}.
> In the log I get the following warnings:
> {code}
> Oct 08, 2014 4:40:37 PM org.apache.pdfbox.pdfparser.XrefTrailerResolver 
> setTrailer
> WARNING: Cannot add trailer because XRef start was not signalled.
> Oct 08, 2014 4:40:37 PM org.apache.pdfbox.pdfparser.XrefTrailerResolver 
> setStartxref
> WARNING: Did not found XRef object at specified startxref position 0
> {code}
> Loading the same FDF file using PDFBox 1.4.0 works fine. All later versions I 
> tried have the same problem (including 1.8.7).
> The FDF document was created using Adobe's FDF Toolkit 6.0 on Windows 8.1.
> You can download the FDF file from 
> https://www.sixtyten.de/ifam/erbst_erkl_form.fdf.zip





[jira] [Updated] (PDFBOX-2413) Loaded FDF document returns null fields

2014-10-23 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler updated PDFBOX-2413:
---
Component/s: Parsing

> Loaded FDF document returns null fields
> ---
>
> Key: PDFBOX-2413
> URL: https://issues.apache.org/jira/browse/PDFBOX-2413
> Project: PDFBox
>  Issue Type: Bug
>  Components: Parsing
>Affects Versions: 1.8.7, 2.0.0
> Environment: MacOS X 10.9.5, Java 1.7.0_65
>Reporter: Thomas Krammer
> Attachments: erbst_erkl_form.fdf
>
>
> When loading the FDF document below using {{FDFDocument.load(InputStream)}} 
> it will load fine but the returned FDFDocument instance returns null when I 
> call {{fdf.getCatalog().getFDF().getFields()}}.
> In the log I get the following warnings:
> {code}
> Oct 08, 2014 4:40:37 PM org.apache.pdfbox.pdfparser.XrefTrailerResolver 
> setTrailer
> WARNING: Cannot add trailer because XRef start was not signalled.
> Oct 08, 2014 4:40:37 PM org.apache.pdfbox.pdfparser.XrefTrailerResolver 
> setStartxref
> WARNING: Did not found XRef object at specified startxref position 0
> {code}
> Loading the same FDF file using PDFBox 1.4.0 works fine. All later versions I 
> tried have the same problem (including 1.8.7).
> The FDF document was created using Adobe's FDF Toolkit 6.0 on Windows 8.1.
> You can download the FDF file from 
> https://www.sixtyten.de/ifam/erbst_erkl_form.fdf.zip





[jira] [Updated] (PDFBOX-2413) Loaded FDF document returns null fields

2014-10-23 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler updated PDFBOX-2413:
---
Attachment: erbst_erkl_form.fdf

> Loaded FDF document returns null fields
> ---
>
> Key: PDFBOX-2413
> URL: https://issues.apache.org/jira/browse/PDFBOX-2413
> Project: PDFBox
>  Issue Type: Bug
>Affects Versions: 1.8.7
> Environment: MacOS X 10.9.5, Java 1.7.0_65
>Reporter: Thomas Krammer
> Attachments: erbst_erkl_form.fdf
>
>
> When loading the FDF document below using {{FDFDocument.load(InputStream)}} 
> it will load fine but the returned FDFDocument instance returns null when I 
> call {{fdf.getCatalog().getFDF().getFields()}}.
> In the log I get the following warnings:
> {code}
> Oct 08, 2014 4:40:37 PM org.apache.pdfbox.pdfparser.XrefTrailerResolver 
> setTrailer
> WARNING: Cannot add trailer because XRef start was not signalled.
> Oct 08, 2014 4:40:37 PM org.apache.pdfbox.pdfparser.XrefTrailerResolver 
> setStartxref
> WARNING: Did not found XRef object at specified startxref position 0
> {code}
> Loading the same FDF file using PDFBox 1.4.0 works fine. All later versions I 
> tried have the same problem (including 1.8.7).
> The FDF document was created using Adobe's FDF Toolkit 6.0 on Windows 8.1.
> You can download the FDF file from 
> https://www.sixtyten.de/ifam/erbst_erkl_form.fdf.zip





[jira] [Resolved] (PDFBOX-2441) Improve XRef self healing mechanism when more than one xref table

2014-10-23 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler resolved PDFBOX-2441.

Resolution: Fixed

{quote}
Sorry, I just see that the file I attached doesn't display properly in AR.
{quote}
No need to worry, I got caught in the same trap. I opened it in AR and 
the first page looked fine. No error shows up until I scroll down. 
That said, thanks for the sample pdf.

However, the xref stream issue is solved. The remaining issue is something that 
can't be fixed by any more or less intelligent algorithm. We will have to skip such 
broken parts in the future, but that is another story, so I'm setting this 
issue to resolved.

> Improve XRef self healing mechanism when more than one xref table
> -
>
> Key: PDFBOX-2441
> URL: https://issues.apache.org/jira/browse/PDFBOX-2441
> Project: PDFBox
>  Issue Type: Bug
>  Components: Parsing
>Affects Versions: 1.8.7, 1.8.8, 2.0.0
>Reporter: Tilman Hausherr
>Assignee: Andreas Lehmkühler
> Fix For: 1.8.8, 2.0.0
>
> Attachments: 260105.pdf
>
>
> This is a follow-up issue to PDFBOX-2250:
> {quote}
> the xref repair algorithm simply searches for the nearest offset, which may 
> fail if more than one xref table is present
> ...
> Once we have a sample pdf which can't be parsed with the simple algorithm, we 
> can open a new issue.
> {quote}
> And here's one:
> {code}
> Exception in thread "main" java.io.IOException: Error: Expected a long type 
> at offset 1180, instead got '50/Filter/FlateDecode/DecodeParms'
> at 
> org.apache.pdfbox.pdfparser.BaseParser.readLong(BaseParser.java:1690)
> {code}
> That file does have more than one xref table.
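
The limitation quoted from PDFBOX-2250 can be shown with a few lines of plain Java. This is a sketch of the idea "pick the candidate nearest to the declared offset", not the PDFBox implementation. The class NearestXref and the offsets in main are made up.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of a nearest-offset repair; NOT PDFBox code.
final class NearestXref {

    // Returns the candidate offset closest to the declared (possibly wrong)
    // offset, or -1 if there are no candidates. On a tie the earlier
    // candidate in the list wins.
    static long nearest(List<Long> candidates, long declared) {
        long best = -1;
        for (long c : candidates) {
            if (best < 0 || Math.abs(c - declared) < Math.abs(best - declared)) {
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Two xref tables found in the file; the declared offset is wrong.
        List<Long> tables = Arrays.asList(116L, 9000L);
        // Picks 116 because it is closer, even if the table at 9000 was meant.
        System.out.println(nearest(tables, 4000L));
    }
}
```

With a single table there is only one candidate, so the choice is always right. With two or more tables, distance alone cannot tell which one the broken offset was supposed to point to.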





[jira] [Comment Edited] (PDFBOX-2441) Improve XRef self healing mechanism when more than one xref table

2014-10-23 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180256#comment-14180256
 ] 

Tilman Hausherr edited comment on PDFBOX-2441 at 10/23/14 7:26 AM:
---

{code}
I got a DataFormatException while parsing the object stream 69 0 R. Any ideas?
{code}
Sorry, I just see that the file I attached doesn't display properly in AR.

Recently I rechecked many of the files that had such offset problems, and a few 
of them had new (different) problems, also in AR.

Here's what probably happened: these files were uploaded in ASCII mode. This 
didn't cause any trouble for some older PDF files that were encoded with 
ASCII85, but it did for files like this one, which use Flate decode.
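
The effect can be reproduced with the JDK's zlib classes. The sketch below assumes that "ASCII mode" means every LF byte is rewritten as CR LF on the way. The class and method names are hypothetical. A Flate stream is binary, so any 0x0A byte inside it gets a CR inserted in front of it, while ASCII85 data is plain text in which white space is ignored. The inserted bytes also shift everything that follows, which fits the offset problems mentioned above.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical demo class; it uses only java.util.zip, not PDFBox.
final class AsciiModeDemo {

    // Mimics a text-mode transfer that rewrites every LF (0x0A) as CR LF.
    static byte[] lfToCrLf(byte[] in) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte b : in) {
            if (b == '\n') {
                out.write('\r');
            }
            out.write(b);
        }
        return out.toByteArray();
    }

    // Compresses the data into a zlib stream, as the Flate filter does.
    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    // Returns the inflated bytes, or null if the stream is damaged or truncated.
    static byte[] inflateOrNull(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    return null; // ran out of usable input
                }
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (DataFormatException e) {
            return null;
        } finally {
            inflater.end();
        }
    }

    // True if the data still decodes to the original after the simulated transfer.
    static boolean survivesTransfer(byte[] original) {
        return Arrays.equals(original, inflateOrNull(lfToCrLf(deflate(original))));
    }
}
```

Whether a particular stream is damaged depends on whether its compressed bytes happen to contain 0x0A, so survivesTransfer() may return true for some inputs and false for others.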


was (Author: tilman):
{code}
I got a DataFormatException while parsing the object stream 69 0 R. Any ideas?
{code}
Sorry, I just see that the file I attached doesn't display properly in AR.

Recently I rechecked many of the files that had such offset problems, and a few 
of them had new problems, also in AR.

Here's what probably happened: these files were uploaded in ASCII mode. This 
didn't cause any trouble for some older PDF files that were encoded with 
ASCII85, but it did for files like this one, which use Flate decode.

> Improve XRef self healing mechanism when more than one xref table
> -
>
> Key: PDFBOX-2441
> URL: https://issues.apache.org/jira/browse/PDFBOX-2441
> Project: PDFBox
>  Issue Type: Bug
>  Components: Parsing
>Affects Versions: 1.8.7, 1.8.8, 2.0.0
>Reporter: Tilman Hausherr
>Assignee: Andreas Lehmkühler
> Fix For: 1.8.8, 2.0.0
>
> Attachments: 260105.pdf
>
>
> This is a follow-up issue to PDFBOX-2250:
> {quote}
> the xref repair algorithm simply searches for the nearest offset, which may 
> fail if more than one xref table is present
> ...
> Once we have a sample pdf which can't be parsed with the simple algorithm, we 
> can open a new issue.
> {quote}
> And here's one:
> {code}
> Exception in thread "main" java.io.IOException: Error: Expected a long type 
> at offset 1180, instead got '50/Filter/FlateDecode/DecodeParms'
> at 
> org.apache.pdfbox.pdfparser.BaseParser.readLong(BaseParser.java:1690)
> {code}
> That file does have more than one xref table.


