[ 
https://issues.apache.org/jira/browse/PDFBOX-951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dusan Radojevic updated PDFBOX-951:
-----------------------------------

    Attachment: file1.pdf

> Text extraction has issues on some pdfs
> ---------------------------------------
>
>                 Key: PDFBOX-951
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-951
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction
>    Affects Versions: 1.4.0
>            Reporter: Dusan Radojevic
>            Priority: Minor
>             Fix For: 1.5.0
>
>         Attachments: file1.pdf
>
>
> Hi,
> I have noticed a big improvement in the latest releases. Extraction is better but 
> still has some problems.
> I have attached some files where I had problems.
> This is the code I use when extracting text:
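> // doc is the loaded PDDocument, sw is a java.io.Writer, i is the last page to extract
> PDFTextStripper stripper = new PDFTextStripper();
> stripper.setStartPage( 1 );
> stripper.setEndPage( i );
> stripper.setSortByPosition(true);   // order text by its position on the page
> stripper.setWordSeparator("~");     // write "~" between words instead of a space
> stripper.writeText(doc, sw);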
> And here are some extracted lines from file1.pdf (I have skipped a few lines 
> and shortened them because the problem is at the beginning of the line):
> 01. ENGLESKA CARLING KUP~KONA^AN ISHOD~DUPLA [ANSA~PRVO POL....
> 02. Dan~^as~R.B.~1/2 
> finala~90'~1~X~2~1X~12~X2~1~X~2~1-1~1-X~1-2~X-1~X-X~X-2~2-1~2-...
> 03. Uto~20:45~2101~ARSENAL      0~1      
> IPSWICH~1.15~6.25~12.00~1.00~4.03~1.30~3.10~11.00~1.35~29.0~50~4.50~1....
> 04. Sre~20:45~2102~BIRMINGHAM     1~2    WEST 
> HAM~2.10~3.10~3.15~1.25~1.26~1.56~2.60~1.95~3.85~3.....
> 05. ENGLESKA 1~KONA^AN GOLOVI ISHOD~DUPLA [ANSA~PRVO POL.~45' / 90'~POL.~...
> 06. Dan~^as~R.B.~igre bez 
> uslova~90'~1~X~2~1X~12~X2~1~X~2~1-1~1-X~1-2~X-1~X-X~X-2~2-1~2-....
> 07. Uto~20:30~2001~BLACKPOOL~MANCHESTER 
> UTD~7.50~4.20~1.36~2.69~1.12~1.00~7.00~2.35~1.75~13.0.....
> 08. Uto~20:45~2002~WIGAN ASTON 
> VILLA~2.80~3.00~2.35~1.45~1.28~1.32~3.65~1.93~2.9.....
> 09. 
> Sre~21:00~2003~LIVERPOOL~FULHAM~1.50~3.50~6.05~1.05~1.20~2.22~1.95~2.13~6.55...
> 10. ENGLESKA 2~KONA^AN GOLOVI A TIMA DUPLA ISHOD~DUPLA [ANSA~PRVO POL.~45' / 
> 90'~POL....
> 11. 
> Dan~^as~R.B.~90'~1~X~2~1X~12~X2~1~X~2~1-1~1-X~1-2~X-1~X-X~X-2~2-1~2-X~2-2~1p2+2p2+~0-1.....
> 12. Uto~20:45~2171~DONCASTER 
> BARNSLEY~1.95~3.10~3.55~1.20~1.26~1.65~2.50~2.00~4.25~3.15~13.0~30~4.70~....
> 13. Uto~20:45~2172NOTTINGHAM FOREST~BRISTOL 
> CITY~1.55~3.45~5.50~1.07~1.21~2.12~2.00~2.12~....
> 14. ENGLESKA 3~KONA^AN DUPLA [ANSA~PRVO POL.~45' / 90'~GOLOVI LA 
> ISHOD~POL.~....
> 15. 
> Dan~^as~R.B.~90'~1~X~2~1X~12~X2~1~X~2~1-1~1-X~1-2~X-1~X-X~X-2~2-1~2-X~2-2....
> 16. Uto~20:45~2201~BRIGHTON COLCHESTER 
> 1.70~3.30~4.45~1.12~1.23~1.89~2.15~2.10~5.20~.....
> 17. Uto~20:45~2202~HARTLEPOOL~NOTTS COUNTY~2.50~3.05~2.60~1.37~1.27.....
> 18. Uto~20:45~2203~LEYTON MK 
> DONS~2.20~3.10~2.95~1.29~1.26~1.51~2.75~1.98~3.....
> 19. Uto~20:45~2204~SHEFFIELD WED~YEOVIL 
> 1.67~3.35~4.70~1.11~1.22~1.96~2.10~2.10.....
> Lines 07, 09 and 17 are extracted correctly and are well formatted.
> Lines 03 and 04 share the same problem: there are unnecessary spaces which 
> should be word separators (in my case "~" separates words). I have seen this 
> in other documents.
> Lines 08 and 18, for example, don't have a word separator ("~") between the two 
> team names. The space in the document between "Wigan" and "Aston Villa" 
> is really big.
> Lines 16 and 19 don't have a word separator between the second team name and 
> the first quota (COLCHESTER 1.70 and YEOVIL 1.67).
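
Until the extraction itself is fixed, the symptom on lines 03 and 04 (runs of spaces where a "~" was expected) could probably be worked around after extraction. The following is only a minimal sketch in plain Java: the class SeparatorFixup and its normalize method are hypothetical names, not part of PDFBox, and the rule "two or more consecutive spaces mean a missing separator" is an assumption based on the sample output above.

```java
import java.util.regex.Matcher;

class SeparatorFixup {

    // Hypothetical helper (not a PDFBox API): replaces every run of two or
    // more spaces with the word separator. Single spaces are kept, so
    // multi-word names such as "WEST HAM" stay together.
    static String normalize(String line, String separator) {
        // quoteReplacement keeps "$" or "\" in the separator from being
        // interpreted as regex replacement syntax.
        return line.trim().replaceAll(" {2,}", Matcher.quoteReplacement(separator));
    }

    public static void main(String[] args) {
        // Shortened form of line 03 from the report.
        String line03 = "Uto~20:45~2101~ARSENAL      0~1      IPSWICH~1.15";
        System.out.println(normalize(line03, "~"));
        // prints "Uto~20:45~2101~ARSENAL~0~1~IPSWICH~1.15"
    }
}
```

This does not help with lines 08, 16, 18 and 19, where only a single space separates the two values, because a single space cannot be told apart from the space inside a team name.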

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
