[jira] Updated: (PDFBOX-838) Problem with text extraction

Dusan Radojevic (JIRA) Thu, 23 Sep 2010 02:03:00 -0700

     [ 
https://issues.apache.org/jira/browse/PDFBOX-838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Dusan Radojevic updated PDFBOX-838:
-----------------------------------

    Description: 
I want to write a parser for bookmakers' PDF odds lists. I have two files. 
One works flawlessly and the other has problems, although the two files have 
almost identical layouts. The uploaded file (listaMillenium.pdf) has problems 
with text extraction, while the other file (listaMeridian.pdf) works fine.

This is the code I used:

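                 // imports: java.io.StringWriter,
                 //          org.apache.pdfbox.pdmodel.PDDocument,
                 //          org.apache.pdfbox.util.PDFTextStripper
                 PDDocument doc = null;
                 StringWriter sw = new StringWriter();
                 try {
                    doc = PDDocument.load("listaMillenium.pdf");

                    PDFTextStripper stripper = new PDFTextStripper();

                    // extract page 6 only
                    stripper.setStartPage( 6 );
                    stripper.setEndPage( 6 );

                    stripper.setSortByPosition(true);
                    stripper.setShouldSeparateByBeads(true);
                    stripper.setSuppressDuplicateOverlappingText(true);

                    // mark word boundaries with "~" so fields can be split
                    stripper.setWordSeparator("~");
                    stripper.writeText(doc, sw);
                } finally {
                     if (doc != null) {
                         doc.close();
                     }
                }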

For page 6 of the uploaded document (listaMillenium.pdf), the output looks 
like this:

nedelja 37 - 14.09. Utorak, 15.09. Sreda i 16.09. Četvrtak~strana 6
~Football~UEFA Europa League~Rezultat~KONAČAN ISHOD~DUPLA 
ŠANSA~POLUVREME-KRAJ~Hen~HENDIKEP
~dan~čas~šifra~45~90~1~X~2~1X~12~X2~1-1~1-X~1-2~X-1~X-X~X-2~2-1~2-X~2-2~H~H1~HX~H2
~Cet~19:00~4041*~Salzburg~Man. 
City~5.60~3.25~1.60~2.06~1.24~1.07~10.5~13.5~32.0~10.5~5.65~4.25~35.0~13.0~2.50~1~2.06~3.50~2.07
~Cet~19:00~4042*~Juventus~Lech 
P.~1.20~5.25~10.5~1.08~3.50~1.50~21.0~70.0~4.75~9.00~20.0~40.0~19.0~27.0~-1~1.40~3.85~3.50
~Cet~19:00~4043*~Aris~Atl. 
Madrid~3.50~3.20~1.95~1.67~1.25~1.21~7.00~13.0~30.0~7.25~5.05~4.80~30.0~13.0~3.25~1~1.67~3.30~2.80
~Cet~19:00~4044*~Leverkusen~Rosenborg~1.35~4.00~8.30~1.01~1.16~2.70~1.95~17.0~50.0~4.05~7.00~17.0~35.0~15.0~15.0~-1~1.63~3.70~2.70
~Cet~19:00~4045*~Lille~Sporting 
L.~1.80~3.20~4.10~1.15~1.25~1.80~2.95~13.0~30.0~4.65~5.25~7.95~30.0~13.0~7.80~-1~2.45~3.45~1.80
~Cet~19:00~4046*~Levski 
Sofia~Gent~2.00~3.20~3.35~1.23~1.25~1.64~3.35~13.0~30.0~4.85~5.00~7.00~30.0~13.0~6.75~-1~2.95~3.25~1.63
~Cet~19:00~4047*~Dinamo 
Z.~Villarreal~3.35~3.20~2.00~1.64~1.25~1.23~6.75~13.0~30.0~7.00~5.00~4.85~30.0~13.0~3.35~1~1.63~3.25~2.95
~Cet~19:00~4048*~Club 
Brugge~PAOK~2.10~3.15~3.15~1.26~1.26~1.58~3.50~13.0~30.0~4.95~5.00~6.65~30.0~13.0~6.40~-1~3.20~3.25~1.57
~Cet~19:00~4049*~AZ Alkmaar~Sheriff 
Tiraspol~1.50~3.40~6.70~1.04~1.23~2.26~2.25~15.0~40.0~4.15~6.05~12.5~32.0~14.0~11.5~-1~1.87~3.60~2.24
~Cet~19:00~4050*~Dinamo 
K.~BATE~1.40~3.75~7.65~1.02~1.18~2.52~2.05~17.0~40.0~4.10~6.65~15.0~32.0~14.0~14.0~-1~1.70~3.70~2.52
~Cet~19:00~4051*~Sparta 
P.~Palermo~2.50~3.05~2.60~1.37~1.27~1.40~4.45~12.5~30.0~5.65~5.00~5.80~28.0~12.5~4.65~-1~4.40~3.20~1.40
~Cet~19:00~4052*~Lausanne~CSKA 
Moscow~6.70~3.40~1.50~2.26~1.23~1.04~11.5~14.0~32.0~12.5~6.05~4.15~40.0~15.0~2.25~1~2.24~3.60~1.87
~Cet~21:05~4053*~Anderlecht~Zenit~2.60~3.05~2.50~1.40~1.27~1.37~4.65~12.5~28.0~5.80~5.00~5.65~30.0~12.5~4.45~1~1.40~3.20~4.40
~Cet~21:05~4054*~AEK~Hajduk~1.60~3.25~5.60~1.07~1.24~2.06~2.50~13.0~35.0~4.25~5.65~10.5~32.0~13.5~10.5~-1~2.07~3.50~2.06
~CeCet~21:021:05~4055*~Stuttgart~Y. 
Boys~1.60~3.25~5.60~1.07~1.24~2.06~2.50~13.0~35.0~4.25~5.65~10.5~32.0~13.5~10.5~-1~2.07~3.50~2.06

The last line in this listing has a problem: some of its values are 
duplicated ("CeCet" instead of "Cet" and "21:021:05" instead of "21:05").
This issue appears on almost every page of this list. Other lists (which I 
have not uploaded) have the same problem.
As I said, the other file (listaMeridian.pdf) does not have this issue.

I hope this helps you fix it; a fix would certainly help me. :)
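
Until the extraction itself is fixed, a check on the parsing side could at 
least flag the affected rows. The following is only a sketch and does not use 
or change PDFBox: the class OddsRowChecker and its methods are hypothetical, 
and the expected field shapes (a three-letter day, an HH:MM time) are 
assumptions inferred from the sample output above. It applies to match rows 
only, not to the header rows, and the rows in main are shortened copies of 
the last two rows of the listing.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class OddsRowChecker {
    // Assumed shapes of the first two fields of a match row,
    // inferred from the sample output (e.g. "Cet" and "21:05").
    private static final Pattern DAY = Pattern.compile("\\p{L}{3}");
    private static final Pattern TIME = Pattern.compile("\\d{2}:\\d{2}");

    /** Splits one extracted row on the "~" separator, ignoring the leading "~". */
    public static List<String> fields(String row) {
        String body = row.startsWith("~") ? row.substring(1) : row;
        return Arrays.asList(body.split("~", -1));
    }

    /** Returns true when the day or time field does not have the assumed shape. */
    public static boolean looksCorrupted(String row) {
        List<String> f = fields(row);
        if (f.size() < 2) {
            return true;
        }
        return !DAY.matcher(f.get(0)).matches()
            || !TIME.matcher(f.get(1)).matches();
    }

    public static void main(String[] args) {
        String good = "~Cet~21:05~4054*~AEK~Hajduk~1.60~3.25~5.60";
        String bad = "~CeCet~21:021:05~4055*~Stuttgart~Y. Boys~1.60~3.25~5.60";
        System.out.println(looksCorrupted(good)); // prints false
        System.out.println(looksCorrupted(bad));  // prints true
    }
}
```

A check like this only detects the symptom in the first two fields; 
duplicated text elsewhere in a row would need further rules.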


  was:I will add description later.


> Problem with text extraction
> ----------------------------
>
>                 Key: PDFBOX-838
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-838
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.2.1
>            Reporter: Dusan Radojevic
>             Fix For: 1.3.0
>
>         Attachments: listaMillenium.pdf
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
