[jira] [Commented] (PDFBOX-4116) could not add text without unicode in the font

2018-02-20 Thread xing Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369988#comment-16369988
 ] 

xing Wang commented on PDFBOX-4116:
---

Hi [~tilman],

Thanks for your follow-up. I have no more questions about how to get an 
accurate bounding box now. 

> could not add text without unicode in the font
> --
>
> Key: PDFBOX-4116
> URL: https://issues.apache.org/jira/browse/PDFBOX-4116
> Project: PDFBox
>  Issue Type: Wish
>  Components: PDModel
>Affects Versions: 2.0.8
> Environment: Windows
>Reporter: xing Wang
>Priority: Minor
> Attachments: 6076-learn589519560.pdf.CMSY10.minus.pdf, 
> 6076-learn589519560.pdf.CMSY10.minus.pdf.adj_char_bbox.0.png, 
> 6076-learn589519560.pdf.CMSY10.minus.pdf.org_char_bbox.0.png, 
> image-2018-02-19-09-23-00-110.png, image-2018-02-19-16-11-24-611.png, 
> image-2018-02-19-16-12-23-438.png, image-2018-02-19-17-42-18-260.png
>
>
> !image-2018-02-19-09-23-00-110.png!
> As shown in the debugger, the PDType1Font maps code 33 to "minus", but 
> there is no Unicode value associated with it. 
> If we use the code `contentStream.showText("\u0021");` to add content, it 
> causes the following error. 
> Exception in thread "main" java.lang.IllegalArgumentException: U+0021 
> ('exclam') is not available in this font AMZNGR+CMSY10 (generic: 
> FREBPT+CMSY10) encoding: built-in (Type 1) with differences
> at org.apache.pdfbox.pdmodel.font.PDType1Font.encode(PDType1Font.java:439)
> at org.apache.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:323)
> at org.apache.pdfbox.debugger.CreatePDF.main(CreatePDF.java:63)
> The best I could do was to use "appendRawCommands", but I find it is 
> marked as deprecated. I am wondering why, and whether there is any 
> replacement for this function?
>  
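
The exception quoted above can be pictured with a toy model of the lookup that text encoding needs: Unicode code point to glyph name, then glyph name to a character code in the font. The sketch below is only an illustration under that assumption; the class `ToyFontEncoder` and its two maps are hypothetical and are not PDFBox's implementation.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of a simple-font encoder; hypothetical, not PDFBox code. */
public class ToyFontEncoder {
    // Hypothetical slice of a glyph list: Unicode code point -> glyph name.
    private final Map<Integer, String> unicodeToName = new HashMap<>();
    // Hypothetical slice of the font's encoding: glyph name -> character code.
    private final Map<String, Integer> nameToCode = new HashMap<>();

    public ToyFontEncoder() {
        unicodeToName.put(0x0021, "exclam");
        // The font in the report maps code 33 to "minus" and has no "exclam" entry.
        nameToCode.put("minus", 33);
    }

    /** Returns the character code for a code point, or throws if the font lacks it. */
    public int encode(int codePoint) {
        String name = unicodeToName.get(codePoint);
        Integer code = (name == null) ? null : nameToCode.get(name);
        if (code == null) {
            throw new IllegalArgumentException(
                String.format("U+%04X ('%s') is not available in this font", codePoint, name));
        }
        return code;
    }

    public static void main(String[] args) {
        ToyFontEncoder encoder = new ToyFontEncoder();
        try {
            encoder.encode('!');
        } catch (IllegalArgumentException e) {
            // The numeric code 33 exists in the font, but only under the name "minus".
            System.out.println(e.getMessage());
        }
    }
}
```

In this model the byte value 33 is present in the font, yet asking for U+0021 fails because the lookup goes through the glyph name rather than the numeric code.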



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



[jira] [Commented] (PDFBOX-4116) could not add text without unicode in the font

2018-02-19 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369763#comment-16369763
 ] 

Tilman Hausherr commented on PDFBOX-4116:
-

If it works fine, what is your question now? Re DrawPrintTextLocations, see the 
comments there... the cyan bounds are the most reliable, except for Type 3 fonts.




[jira] [Commented] (PDFBOX-4116) could not add text without unicode in the font

2018-02-19 Thread xing Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369571#comment-16369571
 ] 

xing Wang commented on PDFBOX-4116:
---

Hi [~tilman]

> So sometimes the code there is identical to the unicode value, but often it 
> is not.

I agree with your statement, especially since I am working on papers that are 
heavy on math expressions. 

The reason I want to create a PDF document is to get a tight bounding box for 
the glyph, which is critical for my analysis. For example, in the attached 
file I could not get the correct tight bounding box of the glyph. 

[^6076-learn589519560.pdf.CMSY10.minus.pdf]

I am using a different tool, but I will try PDFBox soon. My procedure is as 
follows:
 # Use PDFBox to render the glyph as the PDFBox debugger does, and use the 
rendered glyph image to work out how to adjust the top and bottom. An example 
is the glyph with code 33 and glyph name "minus".
!image-2018-02-19-16-11-24-611.png!

 # Use pdfminer, a Python tool, to get the glyph bounding box, shown in red. 
!image-2018-02-19-16-12-23-438.png!

 # I assumed the black pixels would sit at the same relative position inside 
the red bbox as in the rendered BufferedImage from step 1. But this is not 
true: for the minus in step 1 the ink is roughly in the middle, whereas in the 
red bbox it sits a bit in the upper half. 

I will try to repeat the process in PDFBox. If possible, could you point me 
to the procedure for getting the glyph bbox?
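
The pixel measurement in the steps above can be sketched with the standard `java.awt.image.BufferedImage` class alone. This is a minimal illustration, not the reporter's code: the class `TightGlyphBounds`, the darkness threshold of 128, and the synthetic 20x20 "minus" image are all assumptions, and rendering the glyph with PDFBox is deliberately left out.

```java
import java.awt.Rectangle;
import java.awt.image.BufferedImage;

/** Hypothetical helper: finds the tight box around dark pixels in a rendered glyph image. */
public class TightGlyphBounds {

    /** Returns the smallest rectangle containing all pixels darker than threshold, or null. */
    public static Rectangle inkBounds(BufferedImage image, int threshold) {
        int minX = Integer.MAX_VALUE;
        int minY = Integer.MAX_VALUE;
        int maxX = -1;
        int maxY = -1;
        for (int y = 0; y < image.getHeight(); y++) {
            for (int x = 0; x < image.getWidth(); x++) {
                int rgb = image.getRGB(x, y);
                int gray = (((rgb >> 16) & 0xFF) + ((rgb >> 8) & 0xFF) + (rgb & 0xFF)) / 3;
                if (gray < threshold) {
                    minX = Math.min(minX, x);
                    minY = Math.min(minY, y);
                    maxX = Math.max(maxX, x);
                    maxY = Math.max(maxY, y);
                }
            }
        }
        if (maxX < 0) {
            return null; // no dark pixel found
        }
        return new Rectangle(minX, minY, maxX - minX + 1, maxY - minY + 1);
    }

    /** Vertical centre of the ink as a fraction of the image height (0 = top, 1 = bottom). */
    public static double verticalCentre(Rectangle ink, int imageHeight) {
        return (ink.y + ink.height / 2.0) / imageHeight;
    }

    /** Synthetic stand-in for a rendered "minus": a black bar on a white 20x20 image. */
    public static BufferedImage sampleMinus() {
        BufferedImage image = new BufferedImage(20, 20, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < 20; y++) {
            for (int x = 0; x < 20; x++) {
                image.setRGB(x, y, 0xFFFFFF);
            }
        }
        for (int y = 9; y <= 10; y++) {
            for (int x = 3; x <= 16; x++) {
                image.setRGB(x, y, 0x000000);
            }
        }
        return image;
    }

    public static void main(String[] args) {
        BufferedImage image = sampleMinus();
        Rectangle ink = inkBounds(image, 128);
        System.out.println(ink + " centre=" + verticalCentre(ink, image.getHeight()));
    }
}
```

Comparing the centre fraction from the rendered image with the same fraction inside the box reported by another tool is one way to quantify the offset described in step 3.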

 




[jira] [Commented] (PDFBOX-4116) could not add text without unicode in the font

2018-02-19 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369570#comment-16369570
 ] 

Tilman Hausherr commented on PDFBOX-4116:
-

The newline is missing after the operator in the PDF you attached. (if that 
was your question... I am now going to sleep)




[jira] [Commented] (PDFBOX-4116) could not add text without unicode in the font

2018-02-19 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369555#comment-16369555
 ] 

Tilman Hausherr commented on PDFBOX-4116:
-

"embedded" is the correct word. The embedded subsetted fonts shouldn't be 
reused for the two reasons I mentioned, i.e. (1) sometimes missing Unicode, and 
(2) missing glyphs. (1) is what you had; (2) is because the font is subsetted, 
i.e. such a subset won't have all the glyphs, so you may have "a", "b" and "d" 
but not "c".
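
As a rough illustration of reason (2), the check below reports which characters of a new text have no glyph in a subset. It is a sketch only: the class `SubsetCoverage` is hypothetical, and the subset is modelled as a plain set of characters rather than read from a real font.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical check of whether a subsetted font could show a given text. */
public class SubsetCoverage {

    /** Returns the characters of text that are not among the glyphs kept in the subset. */
    public static Set<Character> missingGlyphs(String text, Set<Character> subsetGlyphs) {
        Set<Character> missing = new LinkedHashSet<>();
        for (char c : text.toCharArray()) {
            if (!subsetGlyphs.contains(c)) {
                missing.add(c);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // The subset kept only the glyphs the original document used.
        Set<Character> subset = new HashSet<>(Arrays.asList('a', 'b', 'd'));
        System.out.println(missingGlyphs("abcd", subset));
        System.out.println(missingGlyphs("bad", subset));
    }
}
```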

"This may or may not work" is because the "raw" command parameter is just the 
end of a chain of calculations. So sometimes the code there is identical to the 
Unicode value, but often it is not.

In other words: you'll often have a very bad day reusing subsetted fonts. 
Better get the original font.

I am surprised that you wrote "I am working on extracting information from PDF 
publications" when what you really did was add text.




[jira] [Commented] (PDFBOX-4116) could not add text without unicode in the font

2018-02-19 Thread xing Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369547#comment-16369547
 ] 

xing Wang commented on PDFBOX-4116:
---

Hi Tilman, 

Thanks for your response.

I am working on extracting information from PDF publications. It seems very 
common for subsetted fonts to be "embedded" (not sure whether I used the 
right word). I agree that the missing information did cause great trouble in 
my analysis. 

May I ask why they should not be reused? And why does the "first" use work?

Another thing: you mentioned "This may or may not work". Do you mean that 
appendRawCommands is not stable? I ask because I later managed to convert a 
PDF command string into bytes and change the content of a page. 
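
A minimal sketch of what "a PDF command string converted into bytes" might look like for the glyph at code 33 follows. It only builds the bytes; how they are added to a page is not shown. The class `RawTextCommands`, the font resource name `F1`, and the position and size values are assumptions for illustration. Each operator ends with a newline, and the glyph is addressed by its character code through a hex string, so no Unicode mapping is involved.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical builder for a raw text-showing fragment of a PDF content stream. */
public class RawTextCommands {

    /** Builds BT ... ET with one operator per line; codes are single-byte character codes. */
    public static byte[] showCodes(String fontResourceName, float fontSize,
                                   float x, float y, int... codes) {
        StringBuilder hex = new StringBuilder();
        for (int code : codes) {
            if (code < 0 || code > 255) {
                throw new IllegalArgumentException("single-byte codes only: " + code);
            }
            hex.append(String.format("%02X", code));
        }
        String fragment =
              "BT\n"
            + "/" + fontResourceName + " " + fontSize + " Tf\n"  // select font and size
            + x + " " + y + " Td\n"                              // move to the text position
            + "<" + hex + "> Tj\n"                               // show glyphs by code
            + "ET\n";
        return fragment.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] bytes = showCodes("F1", 12f, 100f, 700f, 33);
        System.out.print(new String(bytes, StandardCharsets.US_ASCII));
    }
}
```

Whether the result renders as intended still depends on the font actually containing a glyph at that code, which is the risk raised elsewhere in this thread.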

 




[jira] [Commented] (PDFBOX-4116) could not add text without unicode in the font

2018-02-19 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369465#comment-16369465
 ] 

Tilman Hausherr commented on PDFBOX-4116:
-

You shouldn't reuse subsetted fonts. Many things can go wrong; the most 
common are missing Unicode mappings (which PDFBox needs) or missing glyphs.

We're not planning to remove {{appendRawCommands}} (Maruan and I agreed on 
that a few days ago), but it is risky. This may or may not work.
