[ https://issues.apache.org/jira/browse/PDFBOX-838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jukka Zitting resolved PDFBOX-838.
----------------------------------
Resolution: Cannot Reproduce
Fix Version/s: (was: 1.3.0)
Assignee: Jukka Zitting
This seems to have been fixed in some other issue, as I can't reproduce the
problem anymore with the latest trunk.
> Problem with text extraction
> ----------------------------
>
> Key: PDFBOX-838
> URL: https://issues.apache.org/jira/browse/PDFBOX-838
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.2.1
> Reporter: Dusan Radojevic
> Assignee: Jukka Zitting
> Attachments: listaMeridian.pdf, listaMillenium.pdf
>
>
> I want to write a parser for bookmaker PDF lists with odds. I have two
> files. One works flawlessly and the other has problems, although the two
> files are almost identical in form. The uploaded file listaMillenium.pdf
> has problems with text extraction, while the other file
> (listaMeridian.pdf) works fine.
> This is the code I used:
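> // Declarations were not part of the original snippet; sw is assumed
> // to be a java.io.StringWriter that collects the extracted text.
> PDDocument doc = null;
> StringWriter sw = new StringWriter();
> try {
>     doc = PDDocument.load("listaMillenium.pdf");
>
>     PDFTextStripper stripper = new PDFTextStripper();
>
>     // Extract page 6 only.
>     stripper.setStartPage(6);
>     stripper.setEndPage(6);
>
>     stripper.setSortByPosition(true);
>     stripper.setShouldSeparateByBeads(true);
>     stripper.setSuppressDuplicateOverlappingText(true);
>     stripper.setWordSeparator("~");
>     stripper.writeText(doc, sw);
> } finally {
>     // Always release the document, even if extraction fails.
>     if (doc != null) {
>         doc.close();
>     }
> }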
> For page 6 of the uploaded document (listaMillenium.pdf), the output
> looks like this:
> nedelja 37 - 14.09. Utorak, 15.09. Sreda i 16.09. Četvrtak~strana 6
> ~Football~UEFA Europa League~Rezultat~KONAČAN ISHOD~DUPLA ŠANSA~POLUVREME-KRAJ~Hen~HENDIKEP
> ~dan~čas~šifra~45~90~1~X~2~1X~12~X2~1-1~1-X~1-2~X-1~X-X~X-2~2-1~2-X~2-2~H~H1~HX~H2
> ~Cet~19:00~4041*~Salzburg~Man. City~5.60~3.25~1.60~2.06~1.24~1.07~10.5~13.5~32.0~10.5~5.65~4.25~35.0~13.0~2.50~1~2.06~3.50~2.07
> ~Cet~19:00~4042*~Juventus~Lech P.~1.20~5.25~10.5~1.08~3.50~1.50~21.0~70.0~4.75~9.00~20.0~40.0~19.0~27.0~-1~1.40~3.85~3.50
> ~Cet~19:00~4043*~Aris~Atl. Madrid~3.50~3.20~1.95~1.67~1.25~1.21~7.00~13.0~30.0~7.25~5.05~4.80~30.0~13.0~3.25~1~1.67~3.30~2.80
> ~Cet~19:00~4044*~Leverkusen~Rosenborg~1.35~4.00~8.30~1.01~1.16~2.70~1.95~17.0~50.0~4.05~7.00~17.0~35.0~15.0~15.0~-1~1.63~3.70~2.70
> ~Cet~19:00~4045*~Lille~Sporting L.~1.80~3.20~4.10~1.15~1.25~1.80~2.95~13.0~30.0~4.65~5.25~7.95~30.0~13.0~7.80~-1~2.45~3.45~1.80
> ~Cet~19:00~4046*~Levski Sofia~Gent~2.00~3.20~3.35~1.23~1.25~1.64~3.35~13.0~30.0~4.85~5.00~7.00~30.0~13.0~6.75~-1~2.95~3.25~1.63
> ~Cet~19:00~4047*~Dinamo Z.~Villarreal~3.35~3.20~2.00~1.64~1.25~1.23~6.75~13.0~30.0~7.00~5.00~4.85~30.0~13.0~3.35~1~1.63~3.25~2.95
> ~Cet~19:00~4048*~Club Brugge~PAOK~2.10~3.15~3.15~1.26~1.26~1.58~3.50~13.0~30.0~4.95~5.00~6.65~30.0~13.0~6.40~-1~3.20~3.25~1.57
> ~Cet~19:00~4049*~AZ Alkmaar~Sheriff Tiraspol~1.50~3.40~6.70~1.04~1.23~2.26~2.25~15.0~40.0~4.15~6.05~12.5~32.0~14.0~11.5~-1~1.87~3.60~2.24
> ~Cet~19:00~4050*~Dinamo K.~BATE~1.40~3.75~7.65~1.02~1.18~2.52~2.05~17.0~40.0~4.10~6.65~15.0~32.0~14.0~14.0~-1~1.70~3.70~2.52
> ~Cet~19:00~4051*~Sparta P.~Palermo~2.50~3.05~2.60~1.37~1.27~1.40~4.45~12.5~30.0~5.65~5.00~5.80~28.0~12.5~4.65~-1~4.40~3.20~1.40
> ~Cet~19:00~4052*~Lausanne~CSKA Moscow~6.70~3.40~1.50~2.26~1.23~1.04~11.5~14.0~32.0~12.5~6.05~4.15~40.0~15.0~2.25~1~2.24~3.60~1.87
> ~Cet~21:05~4053*~Anderlecht~Zenit~2.60~3.05~2.50~1.40~1.27~1.37~4.65~12.5~28.0~5.80~5.00~5.65~30.0~12.5~4.45~1~1.40~3.20~4.40
> ~Cet~21:05~4054*~AEK~Hajduk~1.60~3.25~5.60~1.07~1.24~2.06~2.50~13.0~35.0~4.25~5.65~10.5~32.0~13.5~10.5~-1~2.07~3.50~2.06
> ~CeCet~21:021:05~4055*~Stuttgart~Y. Boys~1.60~3.25~5.60~1.07~1.24~2.06~2.50~13.0~35.0~4.25~5.65~10.5~32.0~13.5~10.5~-1~2.07~3.50~2.06
> The last line in this listing is wrong: the day and time values are
> partially duplicated ("CeCet" and "21:021:05" instead of "Cet" and "21:05").
> This issue appears on almost every page of this list. Other lists (which
> I have not uploaded) have the same problem.
> As I said, the other file (listaMeridian.pdf) does not have this issue.
> Maybe this will help you fix this, and it will surely help me. :)
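
For anyone still on an affected version, a minimal sketch of how the corrupted rows could be flagged after extraction. It is not part of the original report and does not use PDFBox at all: it only inspects the "~"-separated text. The class name RowCheck, the method hasValidTime, and the assumed "HH:mm" shape of the time column are hypothetical choices made for illustration.

```java
import java.util.regex.Pattern;

class RowCheck {
    // Assumption: the time column ("čas") always looks like 19:00 or 21:05.
    private static final Pattern TIME = Pattern.compile("\\d{1,2}:\\d{2}");

    /** Returns true if the time column of a '~'-separated row is well-formed. */
    static boolean hasValidTime(String row) {
        // Each row starts with the separator, so fields[0] is the empty string,
        // fields[1] is the day and fields[2] is the time.
        String[] fields = row.split("~");
        return fields.length > 2 && TIME.matcher(fields[2]).matches();
    }

    public static void main(String[] args) {
        // Shortened versions of two rows from the report above.
        String good = "~Cet~21:05~4054*~AEK~Hajduk~1.60~3.25~5.60";
        String bad = "~CeCet~21:021:05~4055*~Stuttgart~Y. Boys~1.60~3.25~5.60";
        System.out.println(hasValidTime(good));
        System.out.println(hasValidTime(bad));
    }
}
```

A field count alone would not catch the problem, because the damaged row still has the same number of "~"-separated fields as a healthy one; the duplication happens inside the day and time fields, so the check has to look at a field's content.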