[jira] [Updated] (PDFBOX-4834) Wrong read characters for Hindi conjuncts
    [ https://issues.apache.org/jira/browse/PDFBOX-4834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hesham updated PDFBOX-4834:
---------------------------
    Priority: Minor  (was: Major)

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Created] (PDFBOX-4834) Wrong read characters for Hindi conjuncts
Hesham created PDFBOX-4834:
---------------------------

    Summary: Wrong read characters for Hindi conjuncts
    Key: PDFBOX-4834
    URL: https://issues.apache.org/jira/browse/PDFBOX-4834
    Project: PDFBox
    Issue Type: Bug
    Components: Parsing, PDModel
    Affects Versions: 2.0.19
    Environment: Windows 10, Java 9.
    Reporter: Hesham

When I read this Hindi PDF book with PDFBox 2.0.19:
[https://dl.dropboxusercontent.com/s/laixlb5omvjqr7y/Hindi%20Book.pdf?dl=0]

some conjuncts are extracted as wrong characters, as shown in this file:
[https://dl.dropboxusercontent.com/s/efyxz2eg37gvn4c/Text%20read%20by%20PDFBox%202.0.19.txt?dl=0]
[jira] [Updated] (PDFBOX-1552) Uppercase letters are read in lowercase manner
    [ https://issues.apache.org/jira/browse/PDFBOX-1552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hesham updated PDFBOX-1552:
---------------------------
    Attachment: pdf_with_uppercase_letters.pdf

This is a 1-page sample file for testing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (PDFBOX-1552) Uppercase letters are read in lowercase manner
Hesham created PDFBOX-1552:
---------------------------

    Summary: Uppercase letters are read in lowercase manner
    Key: PDFBOX-1552
    URL: https://issues.apache.org/jira/browse/PDFBOX-1552
    Project: PDFBox
    Issue Type: Bug
    Components: Text extraction
    Affects Versions: 1.7.1
    Environment: Windows XP
    Reporter: Hesham
    Attachments: pdf_with_uppercase_letters.pdf

I have a PDF in which some uppercase letters are read as lowercase when I extract its contents with PDFBox. For example:
- The word "Testing" is read as "testing"
- The word "Eve" is read as "eve"
- The word "Deuteronomy" is read as "deuteronomy"

Andreas commented on this:
"The pdf uses marked content to replace a string (14.9.4 Replacement Text of the PDF specs provides a simple example). And yes, PDFBox doesn't support it, yet."

Please check the attached 1-page sample PDF.
[jira] [Commented] (PDFBOX-1423) An error exists on this page. Acrobat may not display the page correctly.
    [ https://issues.apache.org/jira/browse/PDFBOX-1423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13572312#comment-13572312 ]

Hesham commented on PDFBOX-1423:
--------------------------------

The problem in my case is that I write text, then draw shapes, then write text, then draw shapes, and so on, many times on the same page.
If I call endText() before every fillRect(...) and beginText() after it to continue writing text, I think problems may occur.

> An error exists on this page. Acrobat may not display the page correctly.
> -------------------------------------------------------------------------
>
>     Key: PDFBOX-1423
>     URL: https://issues.apache.org/jira/browse/PDFBOX-1423
>     Project: PDFBox
>     Issue Type: Bug
>     Affects Versions: 1.6.0
>     Environment: Windows 7, WebLogic 10.3.0 and a jsp
>     Reporter: wentao
>     Attachments: generate_pdf.pdf
>
> After generating the PDF, opening it in Adobe Reader X causes no problem, but printing it pops up a window with "An error exists on this page. Acrobat may not display the page correctly. Please contact the person who created the PDF document to correct the problem". The printed result looks OK.
> There seems to be no such popup message in Adobe Reader 9.
[jira] [Commented] (PDFBOX-1423) An error exists on this page. Acrobat may not display the page correctly.
    [ https://issues.apache.org/jira/browse/PDFBOX-1423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13572251#comment-13572251 ]

Hesham commented on PDFBOX-1423:
--------------------------------

After some investigation I now know the reason for this.
I opened a text block with beginText(), wrote some text, and then started drawing a rectangle without first closing the text block with endText(). That is the main problem: I have to end the text block before drawing anything.

Example of the problematic order:

    PDPage p = new PDPage();
    PDPageContentStream ps = new PDPageContentStream( pdfFile, p );
    ps.beginText();
    ps.drawString( "Write some text" );
    ps.fillRect(...);   // drawn while the text block is still open
    ps.endText();
    ps.close();
    pdfFile.save( path );

I've also found this reported here: http://forums.adobe.com/thread/464841

What do you think, Andreas?
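
The rule behind the comment above is that path operators must not appear between the BT and ET operators of a content stream. The following is a minimal sketch of a check for that, not part of PDFBox: the class and method names are hypothetical, and it assumes that beginText(), endText(), drawString() and fillRect() end up as the standard operators BT, ET, Tj and re/f.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; it is not a PDFBox class.
public class TextObjectCheck {
    // Path construction and path painting operators of the PDF content stream syntax.
    private static final Set<String> PATH_OPERATORS = new HashSet<>(Arrays.asList(
            "m", "l", "c", "v", "y", "h", "re",
            "S", "s", "f", "F", "f*", "B", "B*", "b", "b*", "n"));

    /** Returns the index of the first path operator found between BT and ET, or -1 if none. */
    public static int firstPathOperatorInsideText(List<String> operators) {
        boolean inText = false;
        for (int i = 0; i < operators.size(); i++) {
            String op = operators.get(i);
            if (op.equals("BT")) {
                inText = true;
            } else if (op.equals("ET")) {
                inText = false;
            } else if (inText && PATH_OPERATORS.contains(op)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Assumed operator order of the example in the comment: rectangle inside the text block.
        List<String> broken = Arrays.asList("BT", "Tj", "re", "f", "ET");
        // Same content with the text block closed before the rectangle is drawn.
        List<String> fixed = Arrays.asList("BT", "Tj", "ET", "re", "f");
        System.out.println(firstPathOperatorInsideText(broken)); // prints 2
        System.out.println(firstPathOperatorInsideText(fixed));  // prints -1
    }
}
```

A page that alternates text and shapes many times would simply repeat the BT ... ET and re f groups in sequence, which this check accepts.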
[jira] [Commented] (PDFBOX-1423) An error exists on this page. Acrobat may not display the page correctly.
    [ https://issues.apache.org/jira/browse/PDFBOX-1423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13569781#comment-13569781 ]

Hesham commented on PDFBOX-1423:
--------------------------------

I can replicate this when printing a PDF generated by PDFBox 1.7.1 using Adobe Reader version 9.4.6.
[jira] [Commented] (PDFBOX-954) Embedded font: value for /Widths faulty (worked in PDFBox 1.3.0!)
    [ https://issues.apache.org/jira/browse/PDFBOX-954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13425684#comment-13425684 ]

Hesham commented on PDFBOX-954:
-------------------------------

I have tested this on Windows & Mac OS X, and it works fine.
Thanks Wolfgang ... Thanks Andreas :)

> Embedded font: value for /Widths faulty (worked in PDFBox 1.3.0!)
> -----------------------------------------------------------------
>
>     Key: PDFBOX-954
>     URL: https://issues.apache.org/jira/browse/PDFBOX-954
>     Project: PDFBox
>     Issue Type: Bug
>     Components: FontBox
>     Affects Versions: 1.4.0
>     Environment: JDK1.6.0_23, Windows XP
>     Reporter: MH
>     Assignee: Andreas Lehmkühler
>     Fix For: 1.7.1
>     Attachments: Imagen 1.png, Imagen 2.png, Imagen 3.png, Main.java, MainVer2.java, MainVer2.java, hello_ttf_1.1.0.pdf, hello_ttf_1.4.0.pdf, out.pdf, outVer2.pdf, pdfbox-1.7.0-ttf-widths-encoding-fix.patch
>
> We have a problem with the font 'LucidaSansUnicode' (l_10646.ttf). It is embedded in a PDF, and when viewing this PDF (with Acrobat Reader 9) the error "In der Schrift LucidaSansUnicode ist der Wert für /Widths fehlerhaft." occurs (roughly translated: "In font LucidaSansUnicode the value for /Widths is faulty.").
> I noticed that this error only occurs when the first page that has text added by PDFBox is displayed! The same font is also used for all other text (used by Apache FOP to generate the PDF).
> When I look at the third tab, "Fonts", of the Acrobat dialog window, I notice lots of entries
>     LucidaSansUnicode (Eingebettete Untergruppe)
>         Typ: TrueType (CID)
>         Kodierung: Identity-H
> but only 1 entry at the very top that looks different:
>     LucidaSansUnicode (Eingebettet)
>         Typ: TrueType
>         Kodierung: Ansi
> I guess one is from Apache FOP (generation of the PDF) and one is from PDFBox (adding additional text to the PDF). However, both use the same source file l_10646.ttf!
> Using PDFBox 1.3.0-snapshot (or iText 2.1.7), this problem does NOT occur!
> This only occurs with this LucidaSansUnicode font; all our other custom fonts don't cause this problem.
> The difference I notice in the Acrobat Reader "Fonts" tab is the first font entry:
> PDFBox 1.4.0:
>     LucidaSansUnicode (Eingebettet)
>         Typ: TrueType
>         Kodierung: Ansi
> PDFBox 1.3.0 or iText 2.1.7:
>     LucidaSansUnicode (Eingebettete Untergruppe)
>         Typ: TrueType
>         Kodierung: Ansi
> So, PDFBox 1.4.0 only shows "embedded" (Eingebettet), but the PDFBox 1.3.0/iText version shows "embedded subgroup" (Eingebettete Untergruppe)! Perhaps this is the problem?
[jira] [Commented] (PDFBOX-954) Embedded font: value for /Widths faulty (worked in PDFBox 1.3.0!)
    [ https://issues.apache.org/jira/browse/PDFBOX-954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13098862#comment-13098862 ]

Hesham commented on PDFBOX-954:
-------------------------------

Will this issue be fixed in the next version? I see this as a critical issue!
[jira] Commented: (PDFBOX-938) Wrong extracted text using PDFBox 1.4
    [ https://issues.apache.org/jira/browse/PDFBOX-938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990014#comment-12990014 ]

Hesham commented on PDFBOX-938:
-------------------------------

@Andreas ... That is why I sent you a sample application.
The font used in the JTextArea is Tahoma:

    pdfTextArea.setFont(new Font("Tahoma", Font.PLAIN, 12));

And the encoding used to extract text:

    PDFTextStripper stripper = new PDFTextStripper( "utf-8" );

Is there anything else that may cause such a problem?

> Wrong extracted text using PDFBox 1.4
> -------------------------------------
>
>     Key: PDFBOX-938
>     URL: https://issues.apache.org/jira/browse/PDFBOX-938
>     Project: PDFBox
>     Issue Type: Bug
>     Components: Text extraction
>     Affects Versions: 1.4.0
>     Reporter: Hesham
>     Assignee: Andreas Lehmkühler
>     Fix For: 1.5.0
>     Attachments: Another book - Wrong extracted f char.pdf, Another+book+-+Wrong+extracted+f+char.txt, Sample.zip, Wrong extracted f char.pdf
>
> Hello,
> I am using PDFBox v1.4 to extract some text from a PDF, but some words are not extracted correctly. For example:
>     Nefteiugansk is read: Nežeiugansk
>     fiancee is read: Äancée
>     first is read: Ärst
> Please check the attached file to test this.
> Best regards
[jira] Commented: (PDFBOX-938) Wrong extracted text using PDFBox 1.4
    [ https://issues.apache.org/jira/browse/PDFBOX-938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12987368#action_12987368 ]

Hesham commented on PDFBOX-938:
-------------------------------

@Andreas ... Did the jar work fine for you?
[jira] Updated: (PDFBOX-938) Wrong extracted text using PDFBox 1.4
    [ https://issues.apache.org/jira/browse/PDFBOX-938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hesham updated PDFBOX-938:
--------------------------
    Attachment: Sample.zip
[jira] Commented: (PDFBOX-938) Wrong extracted text using PDFBox 1.4
    [ https://issues.apache.org/jira/browse/PDFBOX-938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12985341#action_12985341 ]

Hesham commented on PDFBOX-938:
-------------------------------

@Andreas ... Thanks for your reply.
I have attached a sample executable jar ("Sample.zip") to test it. Please download it, extract the zip and just double-click the jar file. The source code is also inside.
If you see any problems with it, please tell me.
I am still getting the same problems when using it.
[jira] Commented: (PDFBOX-588) Problem extracting text in newline characters
    [ https://issues.apache.org/jira/browse/PDFBOX-588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12983065#action_12983065 ]

Hesham commented on PDFBOX-588:
-------------------------------

Just a note ... I tested extracting the PDF Reference text on my Mac today, and it worked fine; it took 2 minutes.
The last trial was on my normal PC (Windows XP, Core 2 Duo, 2.5 GB RAM), which took about 6 minutes.
I don't know why it is that slow! If I find a reason for this I will write it here.

> Problem extracting text in newline characters
> ---------------------------------------------
>
>     Key: PDFBOX-588
>     URL: https://issues.apache.org/jira/browse/PDFBOX-588
>     Project: PDFBox
>     Issue Type: Bug
>     Components: Text extraction
>     Affects Versions: 0.8.0-incubator, 1.3.1, 1.4.0
>     Environment: Win XP
>     Reporter: Hesham
>     Assignee: Andreas Lehmkühler
>     Attachments: Enters-sample.pdf, PDFBOX588-Enters-sample.txt, PDFBOX588-Enters-sample1.png, PDFBOX588-Enters-sample1.png, PDFTextStripper.patch
>
> Hello,
> I have a PDF file with 1 page only. When I try to extract its text using:
>     String pageData = stripper.getText( pdfFile );
> it ignores some Enter (newline) characters between lines, so the last word of a line and the first word of the next line appear as 1 word with no space between them!
> Yet if I copy the text manually from the PDF and paste it into a text editor, Enter characters appear after the same lines that caused the problem in PDFBox.
> Please check the attached file as a sample.
> Is there a way to fix this?
> Best regards,
[jira] Updated: (PDFBOX-943) Creating a link without borders appears with borders in Mac's Preview
    [ https://issues.apache.org/jira/browse/PDFBOX-943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hesham updated PDFBOX-943:
--------------------------
    Attachment: links_testing.pdf

> Creating a link without borders appears with borders in Mac's Preview
> ---------------------------------------------------------------------
>
>     Key: PDFBOX-943
>     URL: https://issues.apache.org/jira/browse/PDFBOX-943
>     Project: PDFBox
>     Issue Type: Bug
>     Components: Writing
>     Affects Versions: 1.4.0
>     Environment: Mac book
>     Reporter: Hesham
>     Attachments: links_testing.pdf
>
> I am trying to create a link with no border. The link appears and works perfectly in Adobe Reader, but in Mac's Preview the link appears with a border around it.
> Here is my code:
>     PDAnnotationLink link = new PDAnnotationLink();
>     PDBorderStyleDictionary border = new PDBorderStyleDictionary();
>     border.setWidth( 0f );
>     link.setBorderStyle( border );
> Can this be fixed to show no border in Mac's Preview?
> I have attached a sample PDF with a link on its last page. You can test it in Adobe Reader and in Mac's Preview.
[jira] Commented: (PDFBOX-588) Problem extracting text in newline characters
    [ https://issues.apache.org/jira/browse/PDFBOX-588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12981910#action_12981910 ]

Hesham commented on PDFBOX-588:
-------------------------------

I do not know what a fragmented font is!
But I have created a sample project to test extracting text from the PDF Reference, and it took the same time I mentioned for the 2 PDFBox versions. I do not understand how it works fine for you!
Here is my code:

    private void readPDFButtonActionPerformed()
    {
        try
        {
            PDDocument pdfRef = PDDocument.load( "C:\\pdf_reference_1.7.pdf" );
            PDFTextStripper stripper = new PDFTextStripper();
            // Page numbers are 1-based, so the last page is getNumberOfPages().
            for( int pageNum = 1; pageNum <= pdfRef.getNumberOfPages(); pageNum++ )
            {
                System.out.println( pageNum );
                stripper.setStartPage( pageNum );
                stripper.setEndPage( pageNum );
                stripper.getText( pdfRef );
            }
            System.out.println( "Done" );
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
    }
[jira] Commented: (PDFBOX-588) Problem extracting text in newline characters
    [ https://issues.apache.org/jira/browse/PDFBOX-588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12981472#action_12981472 ]

Hesham commented on PDFBOX-588:
-------------------------------

Strange!!
I have PDF Reference v1.7 ... It is 1310 pages, right?
Extracting all its text using PDFBox v0.7.3 took 35 seconds.
Extracting all its text using PDFBox v1.4 took 6 minutes and 10 seconds.
[jira] Commented: (PDFBOX-938) Wrong extracted text using PDFBox 1.4
    [ https://issues.apache.org/jira/browse/PDFBOX-938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12980623#action_12980623 ]

Hesham commented on PDFBOX-938:
-------------------------------

I am using Windows XP ... I have tested ICU4J with an Arabic PDF and it parses it correctly (from right to left, whereas without ICU4J the Arabic characters are read reversed).
Can I do anything else?
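
To illustrate what "read reversed" means for right-to-left text, here is a rough sketch that uses only the JDK's java.text.Bidi, not ICU4J and not PDFBox's own code. It assumes the extracted run is purely right-to-left and arrived in visual (left-to-right on the page) order; the class and method names are hypothetical, and real bidirectional reordering handles mixed-direction text, digits and combining marks, which this does not.

```java
import java.text.Bidi;

// Hypothetical helper for illustration only.
public class VisualToLogical {
    /**
     * Reverses a run that consists entirely of right-to-left text, turning
     * visual order into logical (reading) order. Other text is returned unchanged.
     */
    public static String visualToLogical(String visual) {
        Bidi bidi = new Bidi(visual, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        if (bidi.isRightToLeft()) {
            return new StringBuilder(visual).reverse().toString();
        }
        return visual;
    }

    public static void main(String[] args) {
        // An Arabic word in logical order, written with escapes to stay ASCII-safe.
        String logical = "\u0645\u0631\u062D\u0628\u0627";
        // The same letters as they would be collected left to right from the page.
        String visual = new StringBuilder(logical).reverse().toString();
        System.out.println(visualToLogical(visual).equals(logical));
        System.out.println(visualToLogical("first"));
    }
}
```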
[jira] Commented: (PDFBOX-938) Wrong extracted text using PDFBox 1.4
    [ https://issues.apache.org/jira/browse/PDFBOX-938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12980713#action_12980713 ]

Hesham commented on PDFBOX-938:
-------------------------------

I am using Eclipse. Its default encoding is CP1252.
There are 2 points here:
1. Arabic characters appear fine, and they need a similar encoding.
2. I have created a sample jar that reads the PDF and writes the output to a text area (or whatever output component) so I can see it (the component font is Tahoma).
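
As a point of comparison for the encoding question, the sketch below shows the kind of damage a plain encoding mismatch produces: text encoded as UTF-8 and then decoded with a single-byte charset. It uses ISO-8859-1 as a stand-in because every JVM is guaranteed to provide it; for the character used here CP1252 gives the same result. The class and method names are hypothetical, and nothing here establishes that this is what happens in this issue.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
public class EncodingMismatch {
    /** Encodes the text as UTF-8, then decodes those bytes as ISO-8859-1. */
    public static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "fiancée" with an accented e (U+00E9), which UTF-8 encodes as two bytes.
        String original = "fianc\u00e9e";
        String damaged = misdecode(original);
        // The ASCII letters survive; only the accented character is replaced by two characters.
        System.out.println(damaged.startsWith("fianc"));
        System.out.println(damaged.length() - original.length()); // prints 1
    }
}
```

Note that a mismatch of this kind leaves ASCII letters such as "f" and "i" untouched, which is one way to tell it apart from a wrong glyph-to-character mapping.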
[jira] Created: (PDFBOX-938) Wrong extracted text using PDFBox 1.4
Wrong extracted text using PDFBox 1.4
-------------------------------------

    Key: PDFBOX-938
    URL: https://issues.apache.org/jira/browse/PDFBOX-938
    Project: PDFBox
    Issue Type: Bug
    Components: Text extraction
    Affects Versions: 1.4.0
    Reporter: Hesham
    Attachments: Wrong extracted f char.pdf

Hello,
I am using PDFBox v1.4 to extract some text from a PDF, but some words are not extracted correctly. For example:
    Nefteiugansk is read: Nežeiugansk
    fiancee is read: Äancée
    first is read: Ärst
Please check the attached file to test this.
Best regards
[jira] Updated: (PDFBOX-938) Wrong extracted text using PDFBox 1.4
    [ https://issues.apache.org/jira/browse/PDFBOX-938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hesham updated PDFBOX-938:
--------------------------
    Attachment: Wrong extracted f char.pdf
[jira] Issue Comment Edited: (PDFBOX-938) Wrong extracted text using PDFBox 1.4
    [ https://issues.apache.org/jira/browse/PDFBOX-938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12980608#action_12980608 ]

Hesham edited comment on PDFBOX-938 at 1/12/11 2:44 AM:
--------------------------------------------------------

Thanks Johannes ... I see that ICU4J is now included in PDFBox 1.4. I have tried it but it still gives the same results! You can try it yourself.
Should I add special code to enable ICU4J? I only use this:

    PDFTextStripper myStripper = new PDFTextStripper();
    myStripper.getText( myPDFFile );

      was (Author: hesham):
    Thanks Johannes ... I see that ICU4J is now included in PDFBox 1.4. I have tried it but it is still giving the same results ! You can try it yourself.
[jira] Commented: (PDFBOX-588) Problem extracting text in newline characters
    [ https://issues.apache.org/jira/browse/PDFBOX-588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12977745#action_12977745 ]

Hesham commented on PDFBOX-588:
-------------------------------

@Andreas ... Nice work :)
As you say, it just merges the 2 lines together in the left & right paragraphs.
[jira] Updated: (PDFBOX-935) Text not extracted with PDFBox 1.4
    [ https://issues.apache.org/jira/browse/PDFBOX-935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hesham updated PDFBOX-935:
--------------------------
    Attachment: data_not_extracted.pdf

> Text not extracted with PDFBox 1.4
> ----------------------------------
>
>     Key: PDFBOX-935
>     URL: https://issues.apache.org/jira/browse/PDFBOX-935
>     Project: PDFBox
>     Issue Type: Bug
>     Components: Text extraction
>     Affects Versions: 1.4.0
>     Reporter: Hesham
>     Fix For: 1.2.1
>     Attachments: data_not_extracted.pdf
>
> I used PDFBox v1.2.1 to extract text from a PDF file, and it works perfectly. But now I have tested it with PDFBox v1.4 and most of the text is not extracted.
> I have attached a 1-page PDF file to test.
[jira] Commented: (PDFBOX-588) Problem extracting text in newline characters
[ https://issues.apache.org/jira/browse/PDFBOX-588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12977281#action_12977281 ] Hesham commented on PDFBOX-588: --- Thanks a lot Mel and Andreas for the investigation ... The 'PDFTextStripper.setSpacingTolerance(float)' method is very interesting. I have tested it on 5 PDFs; the best value for me was 0.3f. It extracts most words correctly. As for the PDF attached to this issue, the spacing problem is now limited to the last words of the paragraph at the lower left side, like: able to - ableto in order - inorder But not - Butnot who set - whoset I think this is because of the 'Enters' problem. I will check it now in PDFBOX-521. Problem extracting text in newline characters - Key: PDFBOX-588 URL: https://issues.apache.org/jira/browse/PDFBOX-588 Project: PDFBox Issue Type: Bug Components: Text extraction Affects Versions: 0.8.0-incubator Environment: Win XP Reporter: Hesham Attachments: Enters-sample.pdf, PDFBOX588-Enters-sample1.png, PDFTextStripper.patch Hello, I have a PDF file with only 1 page. When I try to extract its text using: String pageData = stripper.getText( pdfFile ); it ignores some Enter (newline) characters between lines, so the last word of a line and the first word of the next line appear as 1 word with no space between them. However, if I copy the text manually from the PDF and paste it into a text editor, Enter characters appear after the same lines that caused the problem in PDFBox. Please check the attached file as a sample. Is there a way to fix this? Best regards,
[jira] Issue Comment Edited: (PDFBOX-588) Problem extracting text in newline characters
[ https://issues.apache.org/jira/browse/PDFBOX-588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12977281#action_12977281 ] Hesham edited comment on PDFBOX-588 at 1/4/11 9:39 AM: --- Thanks a lot Mel and Andreas for the investigation ... The 'PDFTextStripper.setSpacingTolerance(float)' method is very interesting. I have tested it on 5 PDFs; the best value for me was 0.3f. It extracts most words correctly. As for the PDF attached to this issue, the spacing problem is now limited to the last words of the paragraph at the lower left side, like: be able to read about Paul Revere's midnight - beabletoreadaboutPaulRevere'smidnight journey only a - journeyonlya If I use a spacing tolerance of 0.1f, those words are extracted correctly, but in return other words come out wrong, like: UNCENSORED REVOLUTIONARY WAR HISTORY - U N C E N S O R E D R E V O L U T I O N A R Y W A R H I S T O R Y So I guess I will leave it at 0.3f, which is much better. I will now check the Enters problem in PDFBOX-521. was (Author: hesham): Thanks a lot Mel and Andreas for the investigation ... The 'PDFTextStripper.setSpacingTolerance(float)' method is very interesting. I have tested it on 5 PDFs; the best value for me was 0.3f. It extracts most words correctly. As for the PDF attached to this issue, the spacing problem is now limited to the last words of the paragraph at the lower left side, like: able to - ableto in order - inorder But not - Butnot who set - whoset I think this is because of the 'Enters' problem. I will check it now in PDFBOX-521.
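The trade-off described in the edited comment above (0.3f merges tightly set words, 0.1f splits letter-spaced headings) can be illustrated with a toy model. This is only a sketch of the general idea of a gap threshold, not PDFBox's actual word-separation logic: the Glyph class, the layout helper, and all widths and gaps are hypothetical values chosen for the illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: NOT PDFBox code. All names and numbers are hypothetical.
class SpacingToleranceSketch {

    /** One glyph on a line: character, left edge, and width (arbitrary units). */
    static final class Glyph {
        final char ch;
        final double x;
        final double width;

        Glyph(char ch, double x, double width) {
            this.ch = ch;
            this.x = x;
            this.width = width;
        }
    }

    /** Places glyphs left to right; a ' ' in the text becomes a gap of wordGap. */
    static List<Glyph> layout(String text, double glyphWidth, double letterGap, double wordGap) {
        List<Glyph> glyphs = new ArrayList<>();
        double x = 0;
        for (char c : text.toCharArray()) {
            if (c == ' ') {
                x += wordGap - letterGap; // widen the pending gap to wordGap
                continue;
            }
            glyphs.add(new Glyph(c, x, glyphWidth));
            x += glyphWidth + letterGap;
        }
        return glyphs;
    }

    /** Emits a space whenever the gap to the previous glyph exceeds tolerance * spaceWidth. */
    static String join(List<Glyph> glyphs, double spaceWidth, double tolerance) {
        StringBuilder out = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null) {
                double gap = g.x - (prev.x + prev.width);
                if (gap > tolerance * spaceWidth) {
                    out.append(' ');
                }
            }
            out.append(g.ch);
            prev = g;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Tightly set body text: narrow word gap (0.2), no letter spacing.
        List<Glyph> body = layout("able to", 0.5, 0.0, 0.2);
        // Letter-spaced heading: visible gap (0.15) between every letter.
        List<Glyph> heading = layout("WAR HISTORY", 0.5, 0.15, 0.6);

        System.out.println(join(body, 1.0, 0.3));    // ableto
        System.out.println(join(heading, 1.0, 0.3)); // WAR HISTORY
        System.out.println(join(body, 1.0, 0.1));    // able to
        System.out.println(join(heading, 1.0, 0.1)); // W A R H I S T O R Y
    }
}
```

In this model no single threshold handles both inputs, because the heading's letter gap (0.15) is nearly as wide as the body text's word gap (0.2). That matches the behaviour reported in the comment, though the real cause inside PDFBox may differ.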
[jira] Commented: (PDFBOX-588) Problem extracting text in newline characters
[ https://issues.apache.org/jira/browse/PDFBOX-588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12977390#action_12977390 ] Hesham commented on PDFBOX-588: --- I have checked the Enters problem in PDFBOX-521. I am still trying to understand things ... Should I use the isParagraphSeparation(...) method? Can you please give me an example so I can understand this?