This is long, but thought I would ask. Part of the problem is with the PDF file that I have no control over how it is created, but perhaps I'm missing something simple. Have a solution that works about 95%, but does need manual tweaking. Thanks for the time. Just joined list, don't know how active it is, but thought I would ask. Just sitting at home with no much to do.. Be Safe.
Have been pulling a staffing pattern pdf file for some time from my college. The data seems to be converted from an excell spreadsheet, but the data can not be copied directly?? https://guamcc.edu/sites/default/files/2020_07_31_staffing_pattern_final. pdf Is the latest version of file. If I copy and paste from Firefox to libreoffice calc, it puts everything in cell A1 with no delimeters between fields? Same thing if I paste to a text editor (gedit). Using google chrome get the same if I paste to spreadsheet, but pasting to the text editor it shows the data line by line, but no delimeter between fields. Example line of data B00010286 Guam Community College FA 00 2 PRE005 1010 Office of the President President Okada, Mary A. 06/16/07 R-16-a 01/01/21 185,378 48,717 0 2,688 186 4,273 1,301 26 57,165 242,543 Some fields are single items, while other can have variable number of words. Have used a complex sed script to add ; delimeters, and correctly distinguish fields. Saving the pdf file to disk, and trying utilities. using pdftotext it creates a txt file, but it is rather strange in the output. Seems to show two columns of first page, then it shows the next column, and then mixed number of columns so is useless. File 5687 lines long, but main data should be 252 lines plus headers.. pdf2txt.py-2.7 does the same but is 5527 lines long? Have much better luck using pdftohtml. With the following script it converts it to an html file that has all the fields listed basically one by one. Have to go thru a number of steps to get it fixed. Has some errors with empty fields that require manual corrections. pdftohtml -nomerge -noframes $1.pdf sed -e 's/;<br/<br/g' $1.html |sed -e 's_<br/>_;_g' |tr '\n' ' ' | sed -e 's/; /;/g' |sed -e 's/B0/\nB0/g' | sed -e 's/<b>/\n<b>/g' |sed -e 's/<hr/\n<hr/g' | grep "B0" | sed -e 's/\&/\& /g' | sed -e 's/\ //g' >$1.csv sed -i 's/Accomodative/Accommodative/g' $1.csv sed -i 's/Telecomunications/Telecommunications/g' $1.csv sed -i 's/Administative/Administrative/g' $1.csv gedit $1.csv gawk -i inplace '$5=++i' FS=\; OFS=\; $1.csv libreoffice $1.csv The first sed line does most of the cleaning. There are a bunch of   that seem to be randomly in file. The first one needs leave the ; or it throws off columns. Then change <br/> to ; to be field separators; then convert all lines to a long single line by replace newlines with spaces. Then clean the spaces after ; to make it cleaner. Then to make the lines separate, the first field of data lines start with a B0 (all but 3, that have the fields blank?). Other things I don't need start with <b> or <hr, so make those separate lines as well. Then use grep to only get the B0 lines. File has some & and   in it. For some reason, fields that have this are combined with the next field, so have to leave the ;?? The next sed lines just fix spelling errors I've found in file. I then edit the file, since the 3 lines that don't have the B0 are added to the previous line, so I have to insert a newline, and add blank; to be that field. There is one other field on one record that is blank, so it also needs to be fixed. Save the changes. The gawk line has nother to do with conversion, but that column use to have the record numbers, but in this version of file, the first 200 or so do, but then the number jump to higher values, and even have records with same numbers, so I just replace the field with the number 1 to 252. File is the imported to libreoffice calc with ; set as only delimeter. Final fix is that 4 records have the field in column P as empty, so data is all shifted over by one, so just need to insert the blank cell and shift right. Don't know if there is an easier way, or why the other options didn't seem to work for me. Thanks sorry for the length. +------------------------------------------------------------+ Michael D. Setzer II - Computer Science Instructor (Retired) mailto:[email protected] mailto:[email protected] Guam - Where America's Day Begins G4L Disk Imaging Project maintainer http://sourceforge.net/projects/g4l/ +------------------------------------------------------------+ _______________________________________________ poppler mailing list [email protected] https://lists.freedesktop.org/mailman/listinfo/poppler
