[
https://issues.apache.org/jira/browse/PDFBOX-838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dusan Radojevic updated PDFBOX-838:
-----------------------------------
Description:
I want to write a parser for bookmakers' PDF odds lists. I have two files in
almost identical form. Text extraction works flawlessly on one of them
(listaMeridian.pdf) but has problems on the other, which I have uploaded
(listaMillenium.pdf).
This is the code I used:
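// PDFBox 1.x package names
import java.io.StringWriter;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

// Declarations assumed; they were not shown in the original snippet.
PDDocument doc = null;
StringWriter sw = new StringWriter(); // receives the extracted text
try {
    doc = PDDocument.load("listaMillenium.pdf");
    PDFTextStripper stripper = new PDFTextStripper();
    // Extract page 6 only.
    stripper.setStartPage(6);
    stripper.setEndPage(6);
    stripper.setSortByPosition(true);
    stripper.setShouldSeparateByBeads(true);
    stripper.setSuppressDuplicateOverlappingText(true);
    // Mark word boundaries with "~" so the columns can be split later.
    stripper.setWordSeparator("~");
    stripper.writeText(doc, sw);
} finally {
    if (doc != null) {
        doc.close();
    }
}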
On page 6 of the uploaded document (listaMillenium.pdf) you can see the output
lines like this:
nedelja 37 - 14.09. Utorak, 15.09. Sreda i 16.09. Četvrtak~strana 6
~Football~UEFA Europa League~Rezultat~KONAČAN ISHOD~DUPLA ŠANSA~POLUVREME-KRAJ~Hen~HENDIKEP
~dan~čas~šifra~45~90~1~X~2~1X~12~X2~1-1~1-X~1-2~X-1~X-X~X-2~2-1~2-X~2-2~H~H1~HX~H2
~Cet~19:00~4041*~Salzburg~Man. City~5.60~3.25~1.60~2.06~1.24~1.07~10.5~13.5~32.0~10.5~5.65~4.25~35.0~13.0~2.50~1~2.06~3.50~2.07
~Cet~19:00~4042*~Juventus~Lech P.~1.20~5.25~10.5~1.08~3.50~1.50~21.0~70.0~4.75~9.00~20.0~40.0~19.0~27.0~-1~1.40~3.85~3.50
~Cet~19:00~4043*~Aris~Atl. Madrid~3.50~3.20~1.95~1.67~1.25~1.21~7.00~13.0~30.0~7.25~5.05~4.80~30.0~13.0~3.25~1~1.67~3.30~2.80
~Cet~19:00~4044*~Leverkusen~Rosenborg~1.35~4.00~8.30~1.01~1.16~2.70~1.95~17.0~50.0~4.05~7.00~17.0~35.0~15.0~15.0~-1~1.63~3.70~2.70
~Cet~19:00~4045*~Lille~Sporting L.~1.80~3.20~4.10~1.15~1.25~1.80~2.95~13.0~30.0~4.65~5.25~7.95~30.0~13.0~7.80~-1~2.45~3.45~1.80
~Cet~19:00~4046*~Levski Sofia~Gent~2.00~3.20~3.35~1.23~1.25~1.64~3.35~13.0~30.0~4.85~5.00~7.00~30.0~13.0~6.75~-1~2.95~3.25~1.63
~Cet~19:00~4047*~Dinamo Z.~Villarreal~3.35~3.20~2.00~1.64~1.25~1.23~6.75~13.0~30.0~7.00~5.00~4.85~30.0~13.0~3.35~1~1.63~3.25~2.95
~Cet~19:00~4048*~Club Brugge~PAOK~2.10~3.15~3.15~1.26~1.26~1.58~3.50~13.0~30.0~4.95~5.00~6.65~30.0~13.0~6.40~-1~3.20~3.25~1.57
~Cet~19:00~4049*~AZ Alkmaar~Sheriff Tiraspol~1.50~3.40~6.70~1.04~1.23~2.26~2.25~15.0~40.0~4.15~6.05~12.5~32.0~14.0~11.5~-1~1.87~3.60~2.24
~Cet~19:00~4050*~Dinamo K.~BATE~1.40~3.75~7.65~1.02~1.18~2.52~2.05~17.0~40.0~4.10~6.65~15.0~32.0~14.0~14.0~-1~1.70~3.70~2.52
~Cet~19:00~4051*~Sparta P.~Palermo~2.50~3.05~2.60~1.37~1.27~1.40~4.45~12.5~30.0~5.65~5.00~5.80~28.0~12.5~4.65~-1~4.40~3.20~1.40
~Cet~19:00~4052*~Lausanne~CSKA Moscow~6.70~3.40~1.50~2.26~1.23~1.04~11.5~14.0~32.0~12.5~6.05~4.15~40.0~15.0~2.25~1~2.24~3.60~1.87
~Cet~21:05~4053*~Anderlecht~Zenit~2.60~3.05~2.50~1.40~1.27~1.37~4.65~12.5~28.0~5.80~5.00~5.65~30.0~12.5~4.45~1~1.40~3.20~4.40
~Cet~21:05~4054*~AEK~Hajduk~1.60~3.25~5.60~1.07~1.24~2.06~2.50~13.0~35.0~4.25~5.65~10.5~32.0~13.5~10.5~-1~2.07~3.50~2.06
~CeCet~21:021:05~4055*~Stuttgart~Y. Boys~1.60~3.25~5.60~1.07~1.24~2.06~2.50~13.0~35.0~4.25~5.65~10.5~32.0~13.5~10.5~-1~2.07~3.50~2.06
The last line of this listing is wrong: the day and time fields contain partly
duplicated characters ("CeCet" instead of "Cet", "21:021:05" instead of
"21:05"). The same problem appears on almost every page of this list, and other
lists (which I have not uploaded) are affected as well.
As I said, the other file (listaMeridian.pdf) does not have this issue.
I hope this helps you fix the bug; a fix would certainly help me. :)
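Until the cause is known, tokens like "CeCet" and "21:021:05" could be cleaned
up after extraction. The following is only a rough sketch of such a workaround,
not a fix for the extraction itself. The class and method names are
hypothetical, and it assumes the damage always takes the form seen above: the
first two or more characters of a field repeated directly in front of it.

```java
class DuplicatePrefixCheck {

    // If the token starts with some prefix (2+ characters) that is immediately
    // repeated, drop the first copy; otherwise return the token unchanged.
    // A minimum prefix length of 2 keeps values such as "11.5" intact.
    static String collapseRepeatedPrefix(String token) {
        for (int k = token.length() / 2; k >= 2; k--) {
            if (token.startsWith(token.substring(0, k), k)) {
                return token.substring(k);
            }
        }
        return token;
    }

    // Applies the check to every "~"-separated field of one extracted row.
    static String cleanRow(String row) {
        String[] fields = row.split("~", -1);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) {
                sb.append('~');
            }
            sb.append(collapseRepeatedPrefix(fields[i]));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(cleanRow("~CeCet~21:021:05~4055*~Stuttgart~Y. Boys"));
    }
}
```

This is a heuristic: a field that legitimately begins with a repeated pair of
characters would be shortened too, so it is only suitable as a stopgap.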
was: I will add description later.
> Problem with text extraction
> ----------------------------
>
> Key: PDFBOX-838
> URL: https://issues.apache.org/jira/browse/PDFBOX-838
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.2.1
> Reporter: Dusan Radojevic
> Fix For: 1.3.0
>
> Attachments: listaMillenium.pdf