
[jira] [Updated] (PDFBOX-4834) Wrong read characters for Hindi conjuncts

2020-05-14 Thread Hesham (Jira)



 [ 
https://issues.apache.org/jira/browse/PDFBOX-4834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hesham updated PDFBOX-4834:
---
Priority: Minor  (was: Major)

> Wrong read characters for Hindi conjuncts
> -
>
> Key: PDFBOX-4834
> URL: https://issues.apache.org/jira/browse/PDFBOX-4834
> Project: PDFBox
>  Issue Type: Bug
>  Components: Parsing, PDModel
>Affects Versions: 2.0.19
> Environment: Windows 10, Java 9.
>Reporter: Hesham
>Priority: Minor
>
> When reading this Hindi PDF book using PDFBox 2.0.19:
> [https://dl.dropboxusercontent.com/s/laixlb5omvjqr7y/Hindi%20Book.pdf?dl=0]
>  
> The extracted text contains wrong characters for some conjuncts, as shown in 
> this file:
> [https://dl.dropboxusercontent.com/s/efyxz2eg37gvn4c/Text%20read%20by%20PDFBox%202.0.19.txt?dl=0]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org

[jira] [Created] (PDFBOX-4834) Wrong read characters for Hindi conjuncts

2020-05-14 Thread Hesham (Jira)

Hesham created PDFBOX-4834:
--

 Summary: Wrong read characters for Hindi conjuncts
 Key: PDFBOX-4834
 URL: https://issues.apache.org/jira/browse/PDFBOX-4834
 Project: PDFBox
  Issue Type: Bug
  Components: Parsing, PDModel
Affects Versions: 2.0.19
 Environment: Windows 10, Java 9.
Reporter: Hesham


When reading this Hindi PDF book using PDFBox 2.0.19:

[https://dl.dropboxusercontent.com/s/laixlb5omvjqr7y/Hindi%20Book.pdf?dl=0]

 

The extracted text contains wrong characters for some conjuncts, as shown in this file:

[https://dl.dropboxusercontent.com/s/efyxz2eg37gvn4c/Text%20read%20by%20PDFBox%202.0.19.txt?dl=0]




[jira] [Updated] (PDFBOX-1552) Uppercase letters are read in lowercase manner

2013-03-26 Thread Hesham (JIRA)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-1552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hesham updated PDFBOX-1552:
---

Attachment: pdf_with_uppercase_letters.pdf

This is a 1 page sample file to test.

 Uppercase letters are read in lowercase manner
 --

 Key: PDFBOX-1552
 URL: https://issues.apache.org/jira/browse/PDFBOX-1552
 Project: PDFBox
  Issue Type: Bug
  Components: Text extraction
Affects Versions: 1.7.1
 Environment: Windows XP
Reporter: Hesham
 Attachments: pdf_with_uppercase_letters.pdf


 I have a PDF in which some uppercase letters are read as lowercase when I 
 extract its contents using PDFBox. For example:
 - The word "Testing" is read as "testing"
 - The word "Eve" is read as "eve"
 - The word "Deuteronomy" is read as "deuteronomy"
 Andreas commented on this: "The pdf uses marked content to replace a 
 string (14.9.4 Replacement Text of the PDF specs provides a simple example). 
 And yes, PDFBox doesn't support it, yet."
 Please check this 1-page sample PDF.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Created] (PDFBOX-1552) Uppercase letters are read in lowercase manner

2013-03-26 Thread Hesham (JIRA)

Hesham created PDFBOX-1552:
--

 Summary: Uppercase letters are read in lowercase manner
 Key: PDFBOX-1552
 URL: https://issues.apache.org/jira/browse/PDFBOX-1552
 Project: PDFBox
  Issue Type: Bug
  Components: Text extraction
Affects Versions: 1.7.1
 Environment: Windows XP
Reporter: Hesham
 Attachments: pdf_with_uppercase_letters.pdf

I have a PDF in which some uppercase letters are read as lowercase when I 
extract its contents using PDFBox. For example:
- The word "Testing" is read as "testing"
- The word "Eve" is read as "eve"
- The word "Deuteronomy" is read as "deuteronomy"

Andreas commented on this: "The pdf uses marked content to replace a string 
(14.9.4 Replacement Text of the PDF specs provides a simple example). And yes, 
PDFBox doesn't support it, yet."


Please check this 1-page sample PDF.


[jira] [Commented] (PDFBOX-1423) An error exists on this page. Acrobat may not display the page correctly.

2013-02-06 Thread Hesham (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-1423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13572312#comment-13572312
 ] 

Hesham commented on PDFBOX-1423:


The problem in my case is that I write text, then draw shapes, then write text, 
then draw shapes, and so on, many times on the same page. If I call endText() 
every time before fillRect(...) and then beginText() after fillRect(...) to 
continue writing text, I think problems may occur.

 An error exists on this page. Acrobat may not display the page correctly.
 ---

 Key: PDFBOX-1423
 URL: https://issues.apache.org/jira/browse/PDFBOX-1423
 Project: PDFBox
  Issue Type: Bug
Affects Versions: 1.6.0
 Environment: Windows 7, WebLogic 10.3.0 and a jsp
Reporter: wentao
 Attachments: generate_pdf.pdf


 After generating the PDF, opening it in Adobe Reader X causes no problem, but 
 printing it pops up a window saying "An error exists on this page. Acrobat 
 may not display the page correctly. Please contact the person who created the 
 PDF document to correct the problem." The printed result looks OK. 
 There seems to be no such popup message in Adobe Reader 9.


[jira] [Commented] (PDFBOX-1423) An error exists on this page. Acrobat may not display the page correctly.

2013-02-05 Thread Hesham (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-1423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13572251#comment-13572251
 ] 

Hesham commented on PDFBOX-1423:


After some investigation I now know the reason for this. I had started a text 
block in the content stream, written some text, and then started drawing a 
rectangle without ending the text block first. That is the main problem: the 
text block has to be ended before drawing anything.

Example (this is the code that triggers the error):
PDPage p = new PDPage();
PDPageContentStream ps = new PDPageContentStream( pdfFile, p );
ps.beginText();
ps.drawString( "Write some text" );
ps.fillRect(...); // problem: rectangle drawn before endText()
ps.endText();
ps.close();
pdfFile.save( path );

I've also found this reported in here: http://forums.adobe.com/thread/464841
What do you think Andreas ?
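
The ordering rule described above can be illustrated without PDFBox. The following is only a sketch: TextBlockCheck and firstPathOperatorInsideText are hypothetical names, the operator list is a small subset, and it assumes that fillRect ends up writing a rectangle operator (re) followed by a fill operator (f) into the content stream. It scans a list of operators and reports the first path operator that appears between BT (begin text) and ET (end text).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of PDFBox: scans content stream operators
// and reports the first path operator found inside a BT ... ET text block.
class TextBlockCheck {

    // A small subset of path construction and painting operators.
    private static final Set<String> PATH_OPERATORS =
            new HashSet<>(Arrays.asList("m", "l", "c", "re", "h", "S", "f", "B", "n"));

    // Returns the index of the first offending operator, or -1 if none is found.
    static int firstPathOperatorInsideText(List<String> operators) {
        boolean insideText = false;
        for (int i = 0; i < operators.size(); i++) {
            String op = operators.get(i);
            if (op.equals("BT")) {
                insideText = true;
            } else if (op.equals("ET")) {
                insideText = false;
            } else if (insideText && PATH_OPERATORS.contains(op)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Assumed operator order when the rectangle is drawn before endText().
        List<String> broken = Arrays.asList("BT", "Tj", "re", "f", "ET");
        // Assumed operator order when the text block is ended first.
        List<String> fixed = Arrays.asList("BT", "Tj", "ET", "re", "f");
        System.out.println(firstPathOperatorInsideText(broken)); // prints 2
        System.out.println(firstPathOperatorInsideText(fixed));  // prints -1
    }
}
```

In PDFBox terms this corresponds to calling endText() before fillRect(...) and calling beginText() again only when more text follows.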



[jira] [Commented] (PDFBOX-1423) An error exists on this page. Acrobat may not display the page correctly.

2013-02-03 Thread Hesham (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-1423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13569781#comment-13569781
 ] 

Hesham commented on PDFBOX-1423:


I can replicate this when printing a PDF generated by PDFBox 1.7.1 using Adobe 
Reader version 9.4.6.


[jira] [Commented] (PDFBOX-954) Embedded font: value for /Widths faulty (worked in PDFBox 1.3.0!)

2012-07-31 Thread Hesham (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13425684#comment-13425684
 ] 

Hesham commented on PDFBOX-954:
---

I have tested this on Windows and Mac OS X, and it works fine.
Thanks Wolfgang ... Thanks Andreas :)

 Embedded font: value for /Widths faulty (worked in PDFBox 1.3.0!)
 -

 Key: PDFBOX-954
 URL: https://issues.apache.org/jira/browse/PDFBOX-954
 Project: PDFBox
  Issue Type: Bug
  Components: FontBox
Affects Versions: 1.4.0
 Environment: JDK1.6.0_23, Windows XP
Reporter: MH
Assignee: Andreas Lehmkühler
 Fix For: 1.7.1

 Attachments: Imagen 1.png, Imagen 2.png, Imagen 3.png, Main.java, 
 MainVer2.java, MainVer2.java, hello_ttf_1.1.0.pdf, hello_ttf_1.4.0.pdf, 
 out.pdf, outVer2.pdf, pdfbox-1.7.0-ttf-widths-encoding-fix.patch


 We have a problem with the font "LucidaSansUnicode" (l_10646.ttf). It is 
 embedded in a PDF and when viewing this PDF (with Acrobat Reader 9), an error
 "In der Schrift LucidaSansUnicode ist der Wert für /Widths fehlerhaft."
 occurs (roughly translated: "In font LucidaSansUnicode the value for 
 /Widths is faulty."). I noticed that this error only occurs when the first 
 page that has text added by PDFBox is displayed! The same font is also used 
 for all other text (generated by Apache FOP). When I look at the "Fonts" tab 
 (the 3rd tab) of the Acrobat dialog window, I notice lots of entries
 LucidaSansUnicode (Eingebettete Untergruppe)
 Typ: TrueType (CID)
 Kodierung: Identity-H
 but only 1 entry at the very top that looks different:
 LucidaSansUnicode (Eingebettet)
  Typ: TrueType
 Kodierung: Ansi
 I guess one is from Apache FOP (generation of PDF) and one is from PDFBox 
 (adding additional text to the PDF). However, both use the same source file 
 l_10646.ttf!
 Using PDFBox 1.3.0-snapshot (or iText 2.1.7), this problem does NOT occur!
 This only occurs with this LucidaSansUnicode font - all our other custom 
 fonts don't cause this problem.
 The difference I notice in Acrobat Reader Fonts tab is the first font entry:
 PDFBox 1.4.0:
 LucidaSansUnicode (Eingebettet)
 Typ: TrueType
 Kodierung: Ansi
 PDFBox 1.3.0 or iText 2.1.7:
 LucidaSansUnicode (Eingebettete Untergruppe)
 Typ: TrueType
 Kodierung: Ansi
 So, PDFBox 1.4.0 only shows "embedded" (Eingebettet) but the PDFBox 1.3.0/iText 
 version shows "embedded subgroup" (Eingebettete Untergruppe)! Perhaps this 
 is the problem?


[jira] [Commented] (PDFBOX-954) Embedded font: value for /Widths faulty (worked in PDFBox 1.3.0!)

2011-09-07 Thread Hesham (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13098862#comment-13098862
 ] 

Hesham commented on PDFBOX-954:
---

Will this issue be fixed in the next version? I see this as a critical 
issue!


[jira] Commented: (PDFBOX-938) Wrong extracted text using PDFBox 1.4

2011-02-03 Thread Hesham (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990014#comment-12990014
 ] 

Hesham commented on PDFBOX-938:
---

@Andreas ... That is why I sent you a sample application. The font used 
in the JTextArea is Tahoma:
pdfTextArea.setFont(new Font("Tahoma", Font.PLAIN, 12));

And the encoding used to extract text:
PDFTextStripper stripper = new PDFTextStripper( "utf-8" );

Is there anything else that may cause such a problem?

 Wrong extracted text using PDFBox 1.4
 -

 Key: PDFBOX-938
 URL: https://issues.apache.org/jira/browse/PDFBOX-938
 Project: PDFBox
  Issue Type: Bug
  Components: Text extraction
Affects Versions: 1.4.0
Reporter: Hesham
Assignee: Andreas Lehmkühler
 Fix For: 1.5.0

 Attachments: Another book - Wrong extracted f char.pdf, 
 Another+book+-+Wrong+extracted+f+char.txt, Sample.zip, Wrong extracted f 
 char.pdf


 Hello ,
  
 I am using PDFBox v1.4 to extract some text from a PDF, but some words are 
 not extracted right.
 For example words :
 Nefteiugansk is read: Nežeiugansk
 fiancee is read: Äancée
 first is read: Ärst
  
 Please check the attached file to test this.
 Best regards


[jira] Commented: (PDFBOX-938) Wrong extracted text using PDFBox 1.4

2011-01-26 Thread Hesham (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12987368#action_12987368
 ] 

Hesham commented on PDFBOX-938:
---

@Andreas ... Did the jar work fine with you ?


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PDFBOX-938) Wrong extracted text using PDFBox 1.4

2011-01-23 Thread Hesham (JIRA)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hesham updated PDFBOX-938:
--

Attachment: Sample.zip


[jira] Commented: (PDFBOX-938) Wrong extracted text using PDFBox 1.4

2011-01-23 Thread Hesham (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12985341#action_12985341
 ] 

Hesham commented on PDFBOX-938:
---

@Andreas ... Thanks for your reply.
I have attached a sample executable jar Sample.zip to test it ... Please 
download it, extract the zip and just double click the jar file. The source 
code is also inside.

If you see any problems with it please tell me about it. I am still getting the 
same problems when using it. 


[jira] Commented: (PDFBOX-588) Problem extracting text in newline characters

2011-01-18 Thread Hesham (JIRA)

[
https://issues.apache.org/jira/browse/PDFBOX-588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12983065#action_12983065
]

Hesham commented on PDFBOX-588:
---

Just a note ... I tested extracting the PDF reference text on my Mac
today, and it worked fine ... it took 2 minutes. The last trial was on my
normal PC (Windows XP - Core 2 Duo - 2.5 GB RAM), which took about 6 minutes. I
don't know why it is that slow! ... If I find any reason for this I will
write it here.

Problem extracting text in newline characters
-

Key: PDFBOX-588
URL: https://issues.apache.org/jira/browse/PDFBOX-588
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 0.8.0-incubator, 1.3.1, 1.4.0
Environment: Win XP
Reporter: Hesham
Assignee: Andreas Lehmkühler
Attachments: Enters-sample.pdf, PDFBOX588-Enters-sample.txt,
PDFBOX588-Enters-sample1.png, PDFBOX588-Enters-sample1.png,
PDFTextStripper.patch

Hello ,

I have a PDF file with 1 page only, when I try to extract its text using :
String pageData = stripper.getText( pdfFile );
It ignores some Enter characters between lines, so the last word in the line
and the first word in the next line appear as 1 word without spaces between
them !!
While if I copy the PDF text manually from the PDF and paste it in a text
editor, Enter characters appear after the same lines that caused the problem
in PDFBox.
Please check the attached file as a sample.

Is there a way to fix this ?

Best regards ,


[jira] Updated: (PDFBOX-943) Creating a link without borders appears with borders in Mac's Preview

2011-01-18 Thread Hesham (JIRA)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hesham updated PDFBOX-943:
--

Attachment: links_testing.pdf

 Creating a link without borders appears with borders in Mac's Preview
 -

 Key: PDFBOX-943
 URL: https://issues.apache.org/jira/browse/PDFBOX-943
 Project: PDFBox
  Issue Type: Bug
  Components: Writing
Affects Versions: 1.4.0
 Environment: Mac book
Reporter: Hesham
 Attachments: links_testing.pdf


 I am trying to create a link with no border. The link appears and works 
 perfectly in Adobe Reader, but in Mac Preview the link appears with a border 
 around it. Here is my code:
PDAnnotationLink link = new PDAnnotationLink();
PDBorderStyleDictionary border = new PDBorderStyleDictionary();
border.setWidth( 0f );
link.setBorderStyle( border );
 Can this be fixed to show no border in Mac's Preview ?
 I have attached a sample PDF with a link in its last page ... You can test it 
 on Adobe reader and Mac's Preview programs.


[jira] Commented: (PDFBOX-588) Problem extracting text in newline characters

2011-01-14 Thread Hesham (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12981910#action_12981910
 ] 

Hesham commented on PDFBOX-588:
---

I do not know what a fragmented font is!
But I have created a sample project to test extracting text from the PDF 
reference, and it took the same time I mentioned for the 2 PDFBox versions. I 
do not understand how it works fine for you!

Here is my code:
private void readPDFButtonActionPerformed() {
    try {
        PDDocument pdfRef = PDDocument.load( "C:\\pdf_reference_1.7.pdf" );
        PDFTextStripper stripper = new PDFTextStripper();

        // Page numbers are 1-based, so <= includes the last page.
        for( int pageNum = 1; pageNum <= pdfRef.getNumberOfPages(); pageNum++ ) {
            System.out.println( pageNum );
            // Extract one page at a time.
            stripper.setStartPage( pageNum );
            stripper.setEndPage( pageNum );
            stripper.getText( pdfRef );
        }
        System.out.println( "Done" );
    } catch (IOException e) {
        e.printStackTrace();
    }
}


[jira] Commented: (PDFBOX-588) Problem extracting text in newline characters

2011-01-13 Thread Hesham (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12981472#action_12981472
 ] 

Hesham commented on PDFBOX-588:
---

Strange !!

I have PDF Reference v1.7 ... It is 1310 pages, right ?
Extracting all its text using PDFBox v0.7.3 took 35 seconds.
Extracting all the text using PDFBox v1.4 took 6 minutes and 10 seconds.
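
To narrow down where the extra time goes, the time per page could be recorded instead of only the total. The following is a minimal sketch, assuming Java 8 or later; PageTimer and timePages are hypothetical names, and the per-page task is a placeholder for whatever extraction call is being measured.

```java
import java.util.function.IntConsumer;

// Hypothetical timing helper; it does not depend on PDFBox.
class PageTimer {

    // Runs the task once per page (1-based) and returns elapsed milliseconds per page.
    static long[] timePages(int pageCount, IntConsumer pageTask) {
        long[] millis = new long[pageCount];
        for (int page = 1; page <= pageCount; page++) {
            long start = System.nanoTime();
            pageTask.accept(page);
            millis[page - 1] = (System.nanoTime() - start) / 1_000_000;
        }
        return millis;
    }

    public static void main(String[] args) {
        // Placeholder task: put the per-page extraction call in the lambda body.
        long[] millis = timePages(3, page -> { /* extract the text of this page here */ });
        for (int i = 0; i < millis.length; i++) {
            System.out.println("page " + (i + 1) + ": " + millis[i] + " ms");
        }
    }
}
```

Comparing the per-page numbers from the two PDFBox versions would show whether the slowdown is spread evenly or concentrated on a few pages.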




[jira] Commented: (PDFBOX-938) Wrong extracted text using PDFBox 1.4

2011-01-12 Thread Hesham (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12980623#action_12980623
 ] 

Hesham commented on PDFBOX-938:
---

I am using Windows XP ... I have tested ICU4J with an Arabic PDF and it parses 
it correctly (from right to left, whereas without ICU4J it reads the Arabic 
characters reversed).

Can I do anything else?


[jira] Commented: (PDFBOX-938) Wrong extracted text using PDFBox 1.4

2011-01-12 Thread Hesham (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12980713#action_12980713
 ] 

Hesham commented on PDFBOX-938:
---

I am using Eclipse. Its default encoding is CP1252. There are 2 points here:
1. Arabic characters appear fine, which needs a similar encoding.
2. I have created a sample jar that reads the PDF and writes the output to a 
text area (or whatever output component) to view it (the component font is 
Tahoma).
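
For comparison, this is what a plain encoding mismatch between UTF-8 and CP1252 would look like. It is only a sketch using the JDK; EncodingMismatch and decodeUtf8BytesAs are hypothetical names, and it does not establish whether such a mismatch is involved in this issue.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it only uses the JDK.
class EncodingMismatch {

    // Encodes the text as UTF-8 and decodes the bytes with another charset.
    static String decodeUtf8BytesAs(String text, Charset wrongCharset) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, wrongCharset);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        String original = "fianc\u00e9e"; // the word with U+00E9 (e with acute accent)
        String garbled = decodeUtf8BytesAs(original, cp1252);
        // The two-byte UTF-8 sequence for U+00E9 becomes two CP1252 characters.
        System.out.println(garbled.length() - original.length());   // prints 1
        System.out.println(garbled.equals("fianc\u00c3\u00a9e"));   // prints true
    }
}
```

If the wrong characters in the output do not follow this pattern (one accented character turning into two), a mismatch of this kind is less likely to be the cause.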


[jira] Created: (PDFBOX-938) Wrong extracted text using PDFBox 1.4

2011-01-11 Thread Hesham (JIRA)

Wrong extracted text using PDFBox 1.4
-

 Key: PDFBOX-938
 URL: https://issues.apache.org/jira/browse/PDFBOX-938
 Project: PDFBox
  Issue Type: Bug
  Components: Text extraction
Affects Versions: 1.4.0
Reporter: Hesham
 Attachments: Wrong extracted f char.pdf

Hello ,
 
I am using PDFBox v1.4 to extract some text from a PDF, but some words are not 
extracted right.
For example words :
Nefteiugansk is read: Nežeiugansk
fiancee is read: Äancée
first is read: Ärst
 
Please check the attached file to test this.

Best regards


[jira] Updated: (PDFBOX-938) Wrong extracted text using PDFBox 1.4

2011-01-11 Thread Hesham (JIRA)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hesham updated PDFBOX-938:
--

Attachment: Wrong extracted f char.pdf


[jira] Issue Comment Edited: (PDFBOX-938) Wrong extracted text using PDFBox 1.4

2011-01-11 Thread Hesham (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12980608#action_12980608
 ] 

Hesham edited comment on PDFBOX-938 at 1/12/11 2:44 AM:


Thanks Johannes ... I see that ICU4J is now included in PDFBox 1.4. I have 
tried it but it is still giving the same results! You can try it yourself.

Should I add special code to apply ICU4J? I only use this:
PDFTextStripper myStripper = new PDFTextStripper();
myStripper.getText( myPDFFile );

  was (Author: hesham):
Thanks Johannes ... I see that ICU4J is now included in PDFBox 1.4. I have 
tried it but it is still giving the same results !

You can try it yourself.
  

[jira] Commented: (PDFBOX-588) Problem extracting text in newline characters

2011-01-05 Thread Hesham (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12977745#action_12977745
 ] 

Hesham commented on PDFBOX-588:
---

@Andreas ... Nice work :)
As you say, it just merges the 2 lines together in the left & right 
paragraphs.

 Problem extracting text in newline characters
 -

 Key: PDFBOX-588
 URL: https://issues.apache.org/jira/browse/PDFBOX-588
 Project: PDFBox
  Issue Type: Bug
  Components: Text extraction
Affects Versions: 0.8.0-incubator, 1.3.1, 1.4.0
 Environment: Win XP
Reporter: Hesham
Assignee: Andreas Lehmkühler
 Attachments: Enters-sample.pdf, PDFBOX588-Enters-sample.txt, 
 PDFBOX588-Enters-sample1.png, PDFBOX588-Enters-sample1.png, 
 PDFTextStripper.patch


 Hello ,
  
 I have a PDF file with only 1 page. When I try to extract its text using:
 String pageData = stripper.getText( pdfFile );
 it ignores some Enter characters between lines, so the last word of a line 
 and the first word of the next line appear as 1 word with no space between 
 them.
 However, if I copy the text manually from the PDF and paste it into a text 
 editor, Enter characters appear after the same lines that caused the problem 
 in PDFBox.
 Please check the attached file as a sample.
  
 Is there a way to fix this?
  
 Best regards ,


[jira] Updated: (PDFBOX-935) Text not extracted with PDFBox 1.4

2011-01-05 Thread Hesham (JIRA)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hesham updated PDFBOX-935:
--

Attachment: data_not_extracted.pdf

 Text not extracted with PDFBox 1.4
 --

 Key: PDFBOX-935
 URL: https://issues.apache.org/jira/browse/PDFBOX-935
 Project: PDFBox
  Issue Type: Bug
  Components: Text extraction
Affects Versions: 1.4.0
Reporter: Hesham
 Fix For: 1.2.1

 Attachments: data_not_extracted.pdf


 I used PDFBox v1.2.1 to extract text from a PDF file, and it worked 
 perfectly. But now I have tested it with PDFBox v1.4 and most of the text is 
 not extracted.
 I have attached a 1-page PDF file to test.


[jira] Commented: (PDFBOX-588) Problem extracting text in newline characters

2011-01-04 Thread Hesham (JIRA)

[
https://issues.apache.org/jira/browse/PDFBOX-588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12977281#action_12977281
]

Hesham commented on PDFBOX-588:
---

Thanks a lot Mel and Andreas for the investigation ...
The 'PDFTextStripper.setSpacingTolerance(float)' method is very interesting. I have
tested it on 5 PDFs, and the best value for me was 0.3f. It extracts most
words correctly.

As for the PDF attached to this issue, the spacing problem is now limited to
the last words of the paragraph at the lower left, such as:
able to - ableto
in order - inorder
But not - Butnot
who set - whoset

I think this is because of the 'Enters' problem. I will check it now in
PDFBox-521.

Problem extracting text in newline characters
-

Key: PDFBOX-588
URL: https://issues.apache.org/jira/browse/PDFBOX-588
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 0.8.0-incubator
Environment: Win XP
Reporter: Hesham
Attachments: Enters-sample.pdf, PDFBOX588-Enters-sample1.png,
PDFTextStripper.patch

Hello ,

Is there a way to fix this ?

Best regards ,


[jira] Issue Comment Edited: (PDFBOX-588) Problem extracting text in newline characters

2011-01-04 Thread Hesham (JIRA)

[
https://issues.apache.org/jira/browse/PDFBOX-588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12977281#action_12977281
]

Hesham edited comment on PDFBOX-588 at 1/4/11 9:39 AM:
---

As for the PDF attached to this issue, the spacing problem is now limited to
the last words of the paragraph at the lower left, such as:
be able to read about Paul Revere's midnight -
beabletoreadaboutPaulRevere'smidnight
journey only a - journeyonlya

If I use a spacing tolerance of 0.1f, those words are extracted correctly, but
in return other words come out wrong, such as:
UNCENSORED REVOLUTIONARY WAR HISTORY - U N C E N S O R E D R E V O L U T I
O N A R Y W A R H I S T O R Y

So I guess I will leave it at 0.3f, which is much better. I will now check
the Enters problem in PDFBox-521.
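
The trade-off described in this comment can be illustrated with a simplified model. The sketch below is an assumption for illustration only, not PDFBox's actual heuristic: it inserts a space whenever the horizontal gap between two text chunks exceeds tolerance times the width of a space. The class, the Chunk type, and the coordinates are all hypothetical.

```java
import java.util.Arrays;
import java.util.List;

/** Simplified gap-threshold model of word splitting; NOT PDFBox's real algorithm. */
final class SpacingToleranceSketch {

    /** A piece of text with its left edge and width, in arbitrary page units. */
    static final class Chunk {
        final String text;
        final float x;
        final float width;

        Chunk(String text, float x, float width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /** Joins chunks on one line, adding a space when the gap exceeds tolerance * spaceWidth. */
    static String join(List<Chunk> chunks, float spaceWidth, float tolerance) {
        StringBuilder sb = new StringBuilder();
        Chunk prev = null;
        for (Chunk c : chunks) {
            if (prev != null) {
                float gap = c.x - (prev.x + prev.width);
                if (gap > tolerance * spaceWidth) {
                    sb.append(' ');
                }
            }
            sb.append(c.text);
            prev = c;
        }
        return sb.toString();
    }

    /** Letter-spaced heading: 2-unit gaps between letters. */
    static List<Chunk> heading() {
        return Arrays.asList(new Chunk("W", 0f, 8f), new Chunk("A", 10f, 8f), new Chunk("R", 20f, 8f));
    }

    /** Tightly set words: only a 2.5-unit gap between them. */
    static List<Chunk> tightWords() {
        return Arrays.asList(new Chunk("able", 0f, 30f), new Chunk("to", 32.5f, 12f));
    }

    public static void main(String[] args) {
        float spaceWidth = 10f;
        for (float tolerance : new float[] {0.1f, 0.3f}) {
            System.out.println(tolerance + ": " + join(heading(), spaceWidth, tolerance)
                    + " | " + join(tightWords(), spaceWidth, tolerance));
        }
    }
}
```

In this model a low tolerance separates the tight words but also breaks the letter-spaced heading into single letters, while a higher tolerance keeps the heading intact and merges the tight words, which matches the behaviour reported above.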

was (Author: hesham):
Thanks a lot Mel and Andreas for the investigation ...
'PDFTextStripper.setSpacingTolerance(float)' method is very interesting. I have
tested it on 5 PDFs the best value for me was (0.3f). It mostly extracts all
words right.

I think this is because of the 'Enters' problem. I will check it now in
PDFBox-521.

Problem extracting text in newline characters
-

Hello ,

Is there a way to fix this ?

Best regards ,


[jira] Commented: (PDFBOX-588) Problem extracting text in newline characters

2011-01-04 Thread Hesham (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12977390#action_12977390
 ] 

Hesham commented on PDFBOX-588:
---

I have checked the Enters problem in PDFBox-521. I am still trying to understand 
things ... Should I use the isParagraphSeparation(...) method? Can you please 
give me an example so I can understand this?
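
As a rough illustration of the underlying idea only, the sketch below decides between a space and a line break by comparing the baseline of each word with the previous one. It does not use or reproduce the isParagraphSeparation(...) method, whose signature is not shown in this thread; the class, the Word type, and the tolerance parameter are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical baseline-based line-break detection; not PDFBox code. */
final class LineBreakSketch {

    /** A word with the vertical position of its baseline, in arbitrary page units. */
    static final class Word {
        final String text;
        final float baselineY;

        Word(String text, float baselineY) {
            this.text = text;
            this.baselineY = baselineY;
        }
    }

    /** Joins words with a space, or with a newline when the baseline moves by more than lineTolerance. */
    static String joinWithLineBreaks(List<Word> words, float lineTolerance) {
        StringBuilder sb = new StringBuilder();
        Word prev = null;
        for (Word w : words) {
            if (prev != null) {
                boolean newLine = Math.abs(w.baselineY - prev.baselineY) > lineTolerance;
                sb.append(newLine ? '\n' : ' ');
            }
            sb.append(w.text);
            prev = w;
        }
        return sb.toString();
    }

    /** Two lines: "midnight" ends the first, "journey only" starts the second. */
    static List<Word> sample() {
        return Arrays.asList(new Word("midnight", 700f), new Word("journey", 686f), new Word("only", 686f));
    }

    public static void main(String[] args) {
        System.out.println(joinWithLineBreaks(sample(), 2f));
    }
}
```

With a separator emitted at every baseline change, the last word of one line and the first word of the next can no longer run together.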

 Problem extracting text in newline characters
 -

 Key: PDFBOX-588
 URL: https://issues.apache.org/jira/browse/PDFBOX-588
 Project: PDFBox
  Issue Type: Bug
  Components: Text extraction
Affects Versions: 0.8.0-incubator
 Environment: Win XP
Reporter: Hesham
 Attachments: Enters-sample.pdf, PDFBOX588-Enters-sample1.png, 
 PDFTextStripper.patch


 Hello ,
  
 I have a PDF file with only 1 page. When I try to extract its text using:
 String pageData = stripper.getText( pdfFile );
 it ignores some Enter characters between lines, so the last word of a line 
 and the first word of the next line appear as 1 word with no space between 
 them.
 However, if I copy the text manually from the PDF and paste it into a text 
 editor, Enter characters appear after the same lines that caused the problem 
 in PDFBox.
 Please check the attached file as a sample.
  
 Is there a way to fix this?
  
 Best regards ,


