[jira] [Commented] (PDFBOX-4324) while extracting text from region : "Error: expected hex character and not s:115"

Tilman Hausherr (JIRA) Wed, 26 Sep 2018 21:00:22 -0700


    [ 
https://issues.apache.org/jira/browse/PDFBOX-4324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16629715#comment-16629715
 ]


Tilman Hausherr commented on PDFBOX-4324:
-----------------------------------------

The ExtractTextByArea goes through the whole content stream internally, which is 
why it hits the error even if you don't use that area. The first page doesn't 
use the font (F5 = Rupakara), although it is in the resources of the first 
page.

The ToUnicode stream is font-related, not page-related. It is present in some 
fonts, depending on the type etc. (see the PDF specification). In your PDF it 
is only in the F5 (Rupakara) font.

I didn't test it, but a possible explanation is that the stream is Flate 
compressed but that the creator of the file didn't add the filter name to the 
dictionary.
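
One way to check this guess outside PDFBox, assuming you have the raw 
ToUnicode stream bytes dumped to a file or byte array: look for a zlib header 
and try to inflate the data. The sketch below uses only the JDK. The class 
FlateProbe and its method names are hypothetical and not part of PDFBox, and 
the header check is a heuristic, not proof.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper, not a PDFBox class.
public class FlateProbe {

    /** Heuristic: does the data start with a valid zlib (RFC 1950) header? */
    public static boolean looksLikeZlib(byte[] data) {
        if (data.length < 2) {
            return false;
        }
        int cmf = data[0] & 0xFF;
        int flg = data[1] & 0xFF;
        // Low nibble 8 means "deflate"; the 16-bit header must be divisible by 31.
        return (cmf & 0x0F) == 8 && ((cmf << 8) | flg) % 31 == 0;
    }

    /** Returns the inflated bytes, or null if the data is not a complete zlib stream. */
    public static byte[] tryInflate(byte[] data) {
        Inflater inflater = new Inflater();
        inflater.setInput(data);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    return null; // truncated input or preset dictionary required
                }
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (DataFormatException e) {
            return null; // not zlib/Flate data
        } finally {
            inflater.end();
        }
    }

    /** Test helper: compresses data the way a Flate-encoding PDF writer would. */
    public static byte[] deflate(byte[] data) {
        Deflater deflater = new Deflater();
        deflater.setInput(data);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Made-up CMap fragment standing in for a real ToUnicode stream.
        byte[] cmap = "1 beginbfchar\n<01> <0041>\nendbfchar\n"
                .getBytes(StandardCharsets.US_ASCII);
        byte[] packed = deflate(cmap);
        System.out.println("plain looks compressed:  " + looksLikeZlib(cmap));
        System.out.println("packed looks compressed: " + looksLikeZlib(packed));
        System.out.println("packed inflates cleanly: " + (tryInflate(packed) != null));
    }
}
```

If tryInflate returns readable CMap text for the F5 stream while its 
dictionary has no /Filter entry, that would support the explanation above.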

> while extracting text from region : "Error: expected hex character and not 
> s:115"
> ---------------------------------------------------------------------------------
>
>                 Key: PDFBOX-4324
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4324
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.2
>            Reporter: Amit Maheshwari
>            Priority: Major
>         Attachments: SRI NAGAR.PDF, ToUnicode.txt
>
>
> I am getting the following error when I try to extract text from any specific 
> region of the 2nd page of the attached PDF (while the 1st page works fine).
>  
> Error Message:
> "Error: expected hex character and not s:115"
>  
> Stack-trace:
>  at org.apache.pdfbox.contentstream.PDFStreamEngine.operatorException(Operator operator, List operands, IOException e)
>  at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(Operator operator, List operands)
>  at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDContentStream)
>  at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDContentStream)
>  at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDPage page)
>  at org.apache.pdfbox.text.PDFTextStreamEngine.processPage(PDPage)
>  at org.apache.pdfbox.text.PDFTextStripper.processPage(PDPage page)
>  at org.apache.pdfbox.text.PDFTextStripperByArea.extractRegions(PDPage page)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
