Interesting: running "pdftotext.exe -raw file.pdf file.txt" dumps the text in a format similar to what you'd get if you selected all and then copied/pasted.
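
For anyone following along, here is a rough sketch of the "pdftotext, then parse from there" approach. It is written in Python rather than Perl, and it is only an illustration under some assumptions: that pdftotext is on the PATH, that "-" as the output file sends the text to stdout (as poppler's pdftotext does), and that the -layout output separates columns with runs of two or more spaces. I have not checked that last assumption against Business_List.pdf, so the splitting rule may need adjusting. The function names (pdf_to_text, split_columns, text_to_rows) are made up for this example.

```python
import csv
import re
import subprocess
import sys


def pdf_to_text(pdf_path, mode="-layout"):
    """Run pdftotext on pdf_path and return the extracted text.

    mode is passed straight through as a pdftotext switch, e.g. "-layout"
    or "-raw". Using "-" as the output file asks pdftotext to write to stdout.
    """
    result = subprocess.run(
        ["pdftotext", mode, pdf_path, "-"],
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout


def split_columns(line):
    """Split one line into fields on runs of two or more whitespace characters.

    Assumption: single spaces occur inside a field, wider gaps separate fields.
    """
    return [field for field in re.split(r"\s{2,}", line.strip()) if field]


def text_to_rows(text):
    """Turn extracted text into a list of rows, skipping blank lines."""
    return [split_columns(line) for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    # Usage: python pdf2tsv.py Business_List.pdf > Business_List.tsv
    rows = text_to_rows(pdf_to_text(sys.argv[1]))
    csv.writer(sys.stdout, delimiter="\t").writerows(rows)
```

If the -raw output turns out to be easier to work with, pass mode="-raw" to pdf_to_text and replace split_columns with whatever rule fits that output.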
On 10/11/06, Roger E. Rustad, Jr. <[email protected]> wrote:
Just ran pdftotext, and it comes out worse than had I just selectAll/copy'd and pasted into another application. It looks like I might need to play with some of those switches a bit more...

Thanks for the recommendation, Joel.

On 10/11/06, Joel Brauer <[email protected]> wrote:
>
> I would start with pdftotext and then parse from there...
>
> pdftotext is part of the poppler-utils package on my system (Ubuntu)
>
> -joel
>
> On Wed, 2006-10-11 at 15:13 -0700, Roger E. Rustad, Jr. wrote:
> > I need to parse this PDF into a delimited text format
> >
> > http://www.riversideca.gov/finance/pdf/Business_List.pdf
> >
> > (Ideally, I'd like to do it in Perl b/c I hear Perl has some great
> > scraping/parsing features that would benefit me later on when I need
> > to do this kind of thing again.)
> >
> > Any suggestions?
> >
> > (I can copy the text into a txt file first, in case that makes the
> > scraping/parsing easier)
> > _______________________________________________
> > 909linux mailing list
> > [email protected]
> > http://909linux.org/cgi-bin/mailman/listinfo/909linux
> --
> Joel Brauer
> Manager IS
> Communications and Web Technologies
> [email protected]
> pager: [email protected]
> office: 909-558-7713
> cell: 909-534-1934
>
> Only you can decide to be happy! The rest of life is working out the
> details...
>
