[jira] [Commented] (PDFBOX-5029) Tika - Issues extracting Arabic script from pdf

2021-01-07 Thread Christian (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17260786#comment-17260786
 ] 

Christian  commented on PDFBOX-5029:


Hi Tilman, first of all, Happy New Year. I have been very busy over the past 
weeks and have only now returned to the issue of scraping PDF files using Tika. 
I tried all the possible combinations; the only way to get the correct text is 
to copy and paste the PDF content into a txt file and then run the script on it. 
If I do it with Word there are still mistakes. In any case, this won't solve the 
issue, because I want to extract the text from the original PDF.

> Tika - Issues extracting Arabic script from pdf
> ---
>
> Key: PDFBOX-5029
> URL: https://issues.apache.org/jira/browse/PDFBOX-5029
> Project: PDFBox
> Issue Type: Bug
> Environment: Windows - Anaconda / Spyder
> Reporter: Christian
> Priority: Major
> Attachments: PDFBOX-5029-not-sorted-2.0.21.txt, 
> PDFBOX-5029-not-sorted-trunk.txt, PDFBOX-5029-sorted-2.0.21.txt, 
> PDFBOX-5029-sorted-trunk.txt, extracting_text_asian_pdf.py, test.pdf, 
> test_scraped.utf8
>
>
> I'm working on building a corpus of Uygur texts and some of the content is 
> coming from pdf files. I wrote a short python script to scrape text from pdf 
> using tika-python. The script is Arabic, and the output looks good but there 
> is one major problem: there are many missing spaces between words and I 
> really do not know how to address this issue. I am attaching a pdf file, the 
> script to scrape its text and the output (test_scraped.utf8). Thanks in 
> advance for your help.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



[jira] [Commented] (PDFBOX-5029) Tika - Issues extracting Arabic script from pdf

2020-12-01 Thread Christian (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17241476#comment-17241476
 ] 

Christian  commented on PDFBOX-5029:


Hi Tilman, in your "sorted" files there are spaces between words, but the word 
order within each sentence is reversed. The text also does not follow the column 
order of the pdf file: it jumps from "first line, first column" to "first line, 
second column" to "first line, third column" and so on. 
In addition, there is a problem with the positioning of some vowel signs on top 
of consonants: it is sometimes correct and sometimes wrong, even for the same 
vowel+consonant combination. The same goes for the order of some "consonant 
clusters". I'm not sure this is the correct way to describe it, but it would 
be like the word "the" rendered as "hte", if that makes sense.

The "not sorted" files are even worse: spaces are missing, the word order is 
reversed, and the letters in each word are backward. There is no "column issue" 
in this case. I summarize it with two examples:

Ex - sorted files: "the cat is red" --> "red is cat the" + the column issue.
Ex - not sorted files: "the cat is red" --> "dersitaceht" (no column issue)

In terms of "accuracy", my original utf-8 file attached above has no column 
issue and the words are in the right order within the sentences. We also noticed 
that the first word of each line in the first pdf column is missing. This does 
not make things easier, I guess. 

Ex - test_scraped.utf8 file: "the cat is red" --> "catisred" (no column issue + 
missing first word)
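
The three patterns in these examples can be reproduced on the English stand-in 
sentence with a few lines of Java. This is only an illustrative sketch: the 
class and method names are made up for this example, and it mimics the symptoms 
on plain ASCII text rather than reproducing what PDFBox or Tika actually does 
with Arabic script.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper class; it only mimics the symptoms described in this comment.
public class ExtractionSymptoms {

    // "Sorted" files: words are intact but appear in reverse order.
    public static String reverseWordOrder(String sentence) {
        List<String> words = new ArrayList<>(Arrays.asList(sentence.split(" ")));
        Collections.reverse(words);
        return String.join(" ", words);
    }

    // "Not sorted" files: spaces are lost and all characters are reversed.
    public static String reverseAllWithoutSpaces(String sentence) {
        return new StringBuilder(sentence.replace(" ", "")).reverse().toString();
    }

    // test_scraped.utf8: the first word is missing and the rest run together.
    public static String dropFirstWordAndSpaces(String sentence) {
        String[] words = sentence.split(" ");
        return String.join("", Arrays.copyOfRange(words, 1, words.length));
    }

    public static void main(String[] args) {
        String expected = "the cat is red";
        System.out.println(reverseWordOrder(expected));        // red is cat the
        System.out.println(reverseAllWithoutSpaces(expected)); // dersitaceht
        System.out.println(dropFirstWordAndSpaces(expected));  // catisred
    }
}
```

Comparing extracted lines against a hand-corrected reference with helpers like 
these may help to tell which of the three patterns a given output file shows.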

Thanks again for your help.




[jira] [Comment Edited] (PDFBOX-5029) Tika - Issues extracting Arabic script from pdf

2020-11-29 Thread Christian (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17240362#comment-17240362
 ] 

Christian  edited comment on PDFBOX-5029 at 11/29/20, 8:26 PM:
---

Thanks Tilman, will do. Tomorrow I will be in touch with a colleague of mine 
who is a native speaker, and I will provide you with the exact lines and missing 
spaces (if any). I guess the files to look at are the "sorted" ones. Did you 
use my script to extract the text? 


was (Author: faggionato):
Thanks Tilman, will do. Tomorrow I will be in touch with a native speaker and 
will provide you with the exact lines and missing spaces.




[jira] [Issue Comment Deleted] (PDFBOX-5029) Tika - Issues extracting Arabic script from pdf

2020-11-29 Thread Christian (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian  updated PDFBOX-5029:
---
Comment: was deleted

(was: Also, what is the difference between the sorted and not-sorted files you 
attached? Did you use my script to extract the text? Thanks again.)




[jira] [Commented] (PDFBOX-5029) Tika - Issues extracting Arabic script from pdf

2020-11-29 Thread Christian (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17240363#comment-17240363
 ] 

Christian  commented on PDFBOX-5029:


Also, what is the difference between the sorted and not-sorted files you 
attached? Did you use my script to extract the text? Thanks again.




[jira] [Commented] (PDFBOX-5029) Tika - Issues extracting Arabic script from pdf

2020-11-29 Thread Christian (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17240362#comment-17240362
 ] 

Christian  commented on PDFBOX-5029:


Thanks Tilman, will do. Tomorrow I will be in touch with a native speaker and 
will provide you with the exact lines and missing spaces.




[jira] [Created] (PDFBOX-5029) Tika - Issues extracting Arabic script from pdf

2020-11-27 Thread Christian (Jira)
Christian  created PDFBOX-5029:
--

 Summary: Tika - Issues extracting Arabic script from pdf
 Key: PDFBOX-5029
 URL: https://issues.apache.org/jira/browse/PDFBOX-5029
 Project: PDFBox
  Issue Type: Bug
 Environment: Windows - Anaconda / Spyder
Reporter: Christian 
 Attachments: extracting_text_asian_pdf.py, test.pdf, test_scraped.utf8

I'm working on building a corpus of Uygur texts and some of the content is 
coming from pdf files. I wrote a short python script to scrape text from pdf 
using tika-python. The script is Arabic, and the output looks good but there is 
one major problem: there are many missing spaces between words and I really do 
not know how to address this issue. I am attaching a pdf file, the script to 
scrape its text and the output (test_scraped.utf8). Thanks in advance for your 
help.






[jira] [Created] (PDFBOX-4149) PDF consisting on one page with 5 MB renders until the end of time using renderImageWithDPI

2018-03-11 Thread Christian (JIRA)
Christian created PDFBOX-4149:
-

 Summary: PDF consisting on one page with 5 MB renders until the 
end of time using renderImageWithDPI
 Key: PDFBOX-4149
 URL: https://issues.apache.org/jira/browse/PDFBOX-4149
 Project: PDFBox
  Issue Type: Bug
  Components: Rendering
Affects Versions: 2.0.8
Reporter: Christian
 Attachments: SB_Flyer_SpargelPromo_03-062018.pdf

I am using PDFBox 2.0.8 on Java VM 1.8.0_151.

The attached PDF, which is valid, should be rendered by calling:

BufferedImage bim = pdfRenderer.renderImageWithDPI(i, 50);

But the rendering never ends. The only thing I see is this line, repeated very 
often in the console:

[Finalizer] DEBUG org.apache.pdfbox.io.ScratchFileBuffer - ScratchFileBuffer not closed!

Here is the code that is used to open the document and then start the rendering:
{code:java}
// Buffer the document in temporary files only, not in main memory
PDDocument document = PDDocument.load(file, MemoryUsageSetting.setupTempFileOnly());
try {
    PDFRenderer pdfRenderer = new PDFRenderer(document);
    int numberOfPages = document.getPages().getCount();
    // Render each page at 50 DPI
    for (int i = 0; i < numberOfPages; i++) {
        BufferedImage bim = pdfRenderer.renderImageWithDPI(i, 50);
[...]
{code}

The line 

BufferedImage bim = pdfRenderer.renderImageWithDPI(i, 50); 

never returns. I ran a test and waited 30 minutes for it to complete, but 
nothing happened. 

Please advise what to do and how to solve the issue.
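
Independently of the root cause, a caller can avoid blocking forever on a call 
like this by running it on a worker thread with a time limit. The following is 
a minimal sketch using only java.util.concurrent; the class name BoundedCall 
and the method callWithTimeout are hypothetical, and the lambdas in main are 
stand-ins for the real render call. Whether the PDFBox rendering loop reacts to 
a thread interrupt is not established here, so the timeout only frees the 
caller and may leave the worker running in the background.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of PDFBox.
public class BoundedCall {

    // Runs the task on a worker thread; returns null if the time limit is exceeded.
    public static <T> T callWithTimeout(Callable<T> task, long timeout, TimeUnit unit)
            throws InterruptedException, ExecutionException {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "bounded-call");
            t.setDaemon(true); // a stuck task must not keep the JVM alive
            return t;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            future.cancel(true); // only interrupts; the task must cooperate to stop
            return null;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-ins for the render call; in the real code the task would be
        // () -> pdfRenderer.renderImageWithDPI(i, 50)
        String fast = callWithTimeout(() -> "rendered", 1, TimeUnit.SECONDS);
        String slow = callWithTimeout(() -> {
            Thread.sleep(5_000);
            return "rendered";
        }, 200, TimeUnit.MILLISECONDS);
        System.out.println(fast); // rendered
        System.out.println(slow); // null
    }
}
```

Returning null on timeout keeps the sketch short; code that can legitimately 
produce null results would need a different signal, such as rethrowing the 
TimeoutException.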





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org