Hi all!

First of all, thanks for the great work that has gone into poppler.

I have a question about extracting text from a PDF. Basically, I want to extract the departures from a timetable stored in a PDF (e.g. http://4n4.de/vvs/501.pdf). For this I use "pdftotext -layout", which works well but not perfectly. The extracted text is the input for another program of mine, so it needs to be in a shape that is easy to import. With the pdftotext provided by poppler (xpdf), the problem is that the columns of the PDF are no longer aligned in columns in the exported text file.
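To make the setup concrete, here is a minimal sketch in Python of the extraction and import steps described above. It is only an illustration, not the actual import program: the names extract_layout_text and parse_row are hypothetical, and it assumes that every departure is written as HH.MM and that pdftotext is on the PATH and writes UTF-8. Each time is kept together with its character offset in the line, so rows whose columns have drifted apart can be compared.

```python
import re
import subprocess

# Matches departure times written as HH.MM, e.g. "08.45" (assumed format).
TIME_RE = re.compile(r"\b\d{2}\.\d{2}\b")


def extract_layout_text(pdf_path):
    """Run `pdftotext -layout` on pdf_path and return the extracted text.

    The "-" output argument makes pdftotext write to stdout.
    """
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        check=True,
        capture_output=True,
        encoding="utf-8",
    )
    return result.stdout


def parse_row(line):
    """Split one timetable row into (station, [(offset, time), ...]).

    The offset is the character position where the time starts in the line.
    Returns None for lines without any time token (headers, blank lines).
    """
    matches = list(TIME_RE.finditer(line))
    if not matches:
        return None
    station = line[: matches[0].start()].strip()
    times = [(m.start(), m.group()) for m in matches]
    return station, times
```

Calling parse_row on every line of extract_layout_text("501.pdf") and comparing the offsets between rows shows which departures no longer line up. The station name is taken to be everything before the first time, which is also an assumption.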
An example of the timetable rows:

    Station A 08.45 10.45 12.38 14.38
    Station B 08.53 10.53 12.46 14.46
    Station C 08.56 10.56 12.56 14.56
    Station D 08.57 10.57 12.57 14.57

I already improved the output of pdftotext by changing this setting in TextOutputDev.cc:

    // Minimum spacing between columns, as a fraction of the font size.
    #define minColSpacing2 0.5   // originally 0.3

But I still have problems getting a good output file. Does anyone have a good idea how to get the data out of the PDF? Are there other switches or options I could change to get better results? I already tried a couple, but the only real improvement came from the change mentioned above.

Here are a couple of links:

    source PDF:                 http://4n4.de/vvs/501.pdf
    original pdftotext output:  http://4n4.de/vvs/501.txt.oldversion
    modified pdftotext output:  http://4n4.de/vvs/501.txt

I would be glad to get some hints from you.

Michael

_______________________________________________
poppler mailing list
http://lists.freedesktop.org/mailman/listinfo/poppler
