[jira] [Created] (PDFBOX-3248) Unwanted spaces in text extraction (2)

Tilman Hausherr (JIRA) Thu, 25 Feb 2016 09:07:46 -0800

Tilman Hausherr created PDFBOX-3248:
---------------------------------------


             Summary: Unwanted spaces in text extraction (2)
                 Key: PDFBOX-3248
                 URL: https://issues.apache.org/jira/browse/PDFBOX-3248
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 1.8.11, 2.0.0
            Reporter: Tilman Hausherr


The attached file, provided by Francisco from the user mailing list, yields 
unwanted spaces in text extraction regardless of the spacingTolerance or 
averageCharTolerance settings. I was unable to extract "Cada frasco ampolla", 
which looks straightforward when rendered; it always comes out as "Ca da fras co 
ampo lla". Adobe Reader has no such problem.

The content stream has this:
{code}
     6 0 1.058 6 122.0924 312.51 Tm
     (Ca) Tj
     /Span << /ActualText (\376\377\000\255) >> BDC
       ( ) Tj
     EMC
     [ (da ) -301 (fras) ] TJ
     /Span << /ActualText (\376\377\000\255) >> BDC
       ( ) Tj
     EMC
     [ (co ) -301 (ampo) ] TJ
     /Span << /ActualText (\376\377\000\255) >> BDC
       ( ) Tj
     EMC
     [ (lla ) -301 (con) ] TJ
{code}
So the spaces really are in the content stream, and we keep them. Adobe is 
smarter and ignores them because they are overwritten thanks to the "-301" 
backwards positioning.

Would /ActualText help? However, it always has the same value here...
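
For reference, the /ActualText value can be decoded by hand. The sketch below is a standalone illustration, not PDFBox code: the class and method names are made up, it only handles \ddd octal escapes, and it falls back to Latin-1 instead of PDFDocEncoding. It shows that (\376\377\000\255) is the UTF-16BE byte order mark followed by U+00AD, the soft hyphen.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class ActualTextDecode {

    /** Turns the body of a PDF literal string with \ddd octal escapes into raw bytes. */
    public static byte[] unescapeOctal(String literal) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < literal.length()) {
            char c = literal.charAt(i);
            if (c == '\\' && i + 1 < literal.length() && isOctal(literal.charAt(i + 1))) {
                int value = 0;
                int digits = 0;
                i++; // skip the backslash
                while (digits < 3 && i < literal.length() && isOctal(literal.charAt(i))) {
                    value = value * 8 + (literal.charAt(i) - '0');
                    i++;
                    digits++;
                }
                out.write(value & 0xFF);
            } else {
                // Simplification: other escapes such as \n or \( are not handled here
                out.write(c);
                i++;
            }
        }
        return out.toByteArray();
    }

    private static boolean isOctal(char c) {
        return c >= '0' && c <= '7';
    }

    /** Decodes a text string; a leading FE FF marks UTF-16BE. */
    public static String decodeTextString(byte[] bytes) {
        if (bytes.length >= 2 && (bytes[0] & 0xFF) == 0xFE && (bytes[1] & 0xFF) == 0xFF) {
            return new String(bytes, 2, bytes.length - 2, StandardCharsets.UTF_16BE);
        }
        // Assumption for this sketch: Latin-1 stands in for PDFDocEncoding
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String decoded = decodeTextString(unescapeOctal("\\376\\377\\000\\255"));
        for (char ch : decoded.toCharArray()) {
            System.out.printf("U+%04X%n", (int) ch); // prints U+00AD
        }
    }
}
```

So each of the three spans marks its space glyph as a soft hyphen, which fits the word fragments "Ca/da", "fras/co" and "ampo/lla".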

Would it help to ignore space characters and decide on word breaks based on 
positions only, maybe as an option? I added these two lines below the first 
existing one:
{code}
                String characterValue = position.getUnicode();
                if (" ".equals(characterValue))
                    continue;
{code}
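
To make the idea concrete outside of PDFBox, here is a minimal standalone sketch of "drop the space glyphs, then decide word breaks from positions only". Everything in it is an assumption made for illustration: the Glyph class is a hypothetical stand-in for a positioned character (it is not TextPosition), the gap rule and the 0.3 factor are invented, and the coordinates are made up rather than taken from the attached file.

```java
import java.util.Arrays;
import java.util.List;

public class PositionOnlySpacing {

    /** Hypothetical positioned glyph: text, left edge and advance width. */
    public static final class Glyph {
        final String unicode;
        final float x;
        final float width;

        public Glyph(String unicode, float x, float width) {
            this.unicode = unicode;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Builds a line of text, skipping literal spaces and inserting a word
     * break only where the horizontal gap to the previous kept glyph
     * exceeds gapFactor times that glyph's width.
     */
    public static String extractLine(List<Glyph> glyphs, float gapFactor) {
        StringBuilder sb = new StringBuilder();
        Glyph previous = null;
        for (Glyph g : glyphs) {
            if (" ".equals(g.unicode)) {
                continue; // same test as in the two added lines above
            }
            if (previous != null) {
                float gap = g.x - (previous.x + previous.width);
                if (gap > gapFactor * previous.width) {
                    sb.append(' ');
                }
            }
            sb.append(g.unicode);
            previous = g;
        }
        return sb.toString();
    }

    /** Invented layout: the first space is overprinted by "d", the second leaves a gap. */
    public static String demo() {
        List<Glyph> line = Arrays.asList(
                new Glyph("C", 0f, 5f), new Glyph("a", 5f, 5f),
                new Glyph(" ", 10f, 3f),  // space glyph that the next letter covers
                new Glyph("d", 10f, 5f), new Glyph("a", 15f, 5f),
                new Glyph(" ", 20f, 3f),  // space glyph followed by a visible gap
                new Glyph("f", 23f, 5f), new Glyph("r", 28f, 5f),
                new Glyph("a", 33f, 5f), new Glyph("s", 38f, 5f));
        return extractLine(line, 0.3f);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "Cada fras"
    }
}
```

With these made-up coordinates the overprinted space disappears and the gap after "Cada" still produces a word break. The sketch assumes left-to-right text in reading order, so it says nothing about right-to-left runs.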

The output looks promising:
{quote}
F ó r m u l a :
Cronopen® Balsámico Adultos:
Cada frasco ampolla contiene: ampicilina (como ampicilina sódica)
100 mg; ampicilina (como ampicilina benzatínica) 500 mg.
Cada ampolla solvente de 5 ml contiene: dipirona 1000 mg; guaife
nesina 100 mg. Exc.: bisulfito de sodio; agua destilada.
{quote}

A complete test run shows many differences; most are harmless or are 
improvements. Only one test case really fails: hello3.pdf. The original extract 
is "Hello محمد World.", the new extract is "Hello .Worldمحمد".

More from Francisco:
{quote}
As additional information, I've found 2 related posts (about other tools)
on StackOverflow:
http://stackoverflow.com/questions/34579824/itext-how-to-tweak-text-extraction
http://stackoverflow.com/questions/22671974/itext-reading-pdf-1s-as-up-arrows-error/22688775#22688775
{quote}




