i have included some debug messages in config.cpp, they tell me:

parsed from: application/pdf
parsed to: text/html
parsed cmd: charset=iso8859-1 /users/aspseek/sbin/pdftohtml -i -noframes -stdout $in > $out
parsed charset: iso8859-1 /users/aspseek/sbin/pdftohtml -i -noframes -stdout $in > $out
parsed cmd(2): /users/aspseek/sbin/pdftohtml -i -noframes -stdout $in > $out

so charset contains too much

Regards,

Markus Rietzler
* <rietzler_software/>
* RZF NRW
* Tel: 0211.4572-130

-----Original Message-----
From: Kir Kolyshkin [mailto:[EMAIL PROTECTED]]
Sent: Thursday, 12 September 2002 12:29
To: [EMAIL PROTECTED]
Subject: Re: AW: AW: AW: AW: [aseek-users] external converters, pdf files

Attached patch should fix it. Can you try it out?

[EMAIL PROTECTED] wrote:
>
> mh,
> with adding charset there is no exec-line in the index-log, so no
> pdf-conversion at all.
> tried to
>
> Converter application/pdf text/html; charset=iso8859-1 /usr/bin/pdftohtml -i -noframes -stdout $in >$out
> and
> Converter application/pdf "text/html; charset=iso8859-1" /usr/bin/pdftohtml -i -noframes -stdout $in >$out
>
> without charset option pdf's are indexed...
>
> Regards,
>
> Markus Rietzler
> * <rietzler_software/>
> * RZF NRW
> * Tel: 0211.4572-130
>
> -----Original Message-----
> From: Kir Kolyshkin [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, 12 September 2002 11:26
> To: [EMAIL PROTECTED]
> Subject: Re: AW: AW: AW: [aseek-users] external converters, pdf files
>
> Try to add ";charset=iso8859-1" (substitute with charset you need)
> after "text/html" in Converter line.
>
> Sorry, seems it is missing from man page; I will add it.
>
> So, it will look like:
>
> Converter application/pdf text/html; charset=iso8859-5 /usr/bin/pdftohtml -i -noframes -stdout $in >$out
>
> Please report here if it works or not :)
>
> [EMAIL PROTECTED] wrote:
> >
> > that trick worked (half).
> > the pdf-file is being indexed but i can't search for words with e.g. umlauts.
> > in the excerpt i see "?" on the places where umlauts (ä,ö,ü) should be.
> > so the charset of the document is wrong. any ideas
> >
> > Regards,
> >
> > Markus Rietzler
> > * <rietzler_software/>
> > * RZF NRW
> > * Tel: 0211.4572-130
> >
> > -----Original Message-----
> > From: Gerrit Hannaert [mailto:[EMAIL PROTECTED]]
> > Sent: Thursday, 12 September 2002 02:48
> > To: [EMAIL PROTECTED]
> > Subject: Re: AW: AW: [aseek-users] external converters, pdf files
> >
> > While trying out pdftohtml for myself (yeah, it's nicer than plain text
> > and provides a title too), I figured it out.
> > pdftohtml will add an ".html" extension to the output file; and hence
> > index won't use it nor delete it afterwards. The solution I use is simply:
> >
> > Converter application/pdf text/html /usr/bin/pdftohtml -i -noframes -stdout $in >$out
> >
> > And so far it seems to be working...
> >
> > Cheers,
> >
> > [EMAIL PROTECTED] wrote:
> >
> > >in aspseek.conf i have
> > >
> > > Converter application/pdf text/html /users/aspseek/sbin/pdftohtml -i -noframes $in $out.html
> > >
> > >and this is what index says, looks good to me
> > >
> > >www@I011-32:/users/aspseek/sbin> ./index -a -m -u http://..../test.pdf
> > >Loading configuration from /users/aspseek/etc/db.conf
> > >Loading configuration from /users/aspseek/etc/ucharset.conf
> > >Loading configuration from /users/aspseek/etc/stopwords.conf
> > >Loading configuration from /users/aspseek/etc/server.url
> > >Loading configuration from /users/aspseek/etc/allow.url
> > >Loading configuration from /users/aspseek/etc/aspseek.conf
> > >Adding URL: http://..../test.pdf
> > >exec /users/aspseek/sbin/pdftohtml -i -noframes /tmp/asiLa7HF2 /tmp/asoOpsGUU.html
> > >Page-1
> > >Page-2
> > >Page-3
> > >Page-4
> > >Page-5
> > >Page-6
> > >Page-7
> > >Page-8
> > >Saving real-time database ... done.
> > >Saving delta files [..................................................] done.
> > >Deleting 'deleted' records from urlword[s] ... done. (0 records deleted)
> > >Saving real-time ... done
> > >Saving redirects ... done
> > >Splitting href delta file ... done
> > >Saving href delta files ... done
> > >Saving direct href delta files ... done
> > >Calculating ranks [................................................] done.
> > >Saving lastmods ... done
> > >Generating word site ... done
> > >Generating subset http://..../% ... done (193 URLs)
> > >index process finished.
> > >
> > >btw: those two tempfiles are not deleted in /tmp, maybe another bug
> > >/tmp/asoOpsGUU.html is a html-export of the pdf file. so text is recognized
> > >and exported correctly, but when i search for one of the words from this file
> > >i get no results...
> > >
> > >urlword-table says
> > >
> > >*************************** 1. row ***************************
> > >         url_id: 100
> > >        site_id: 1
> > >        deleted: 0
> > >            url: http://.../versorgungsreform.pdf
> > >next_index_time: 1031828953
> > >         status: 200
> > >            crc: d41d8cd98f00b204e9800998ecf8427e
> > >  last_modified: Wed, 11 Sep 2002 02:00:11 GMT
> > >           etag: "1cc11a-cb26-3d7ea3ab"
> > >last_index_time: 1031742553
> > >       referrer: 23
> > >            tag: 0
> > >           hops: 3
> > >          redir: 0
> > >         origin: 0
> > >1 row in set (0.00 sec)
> > >
> > >Regards,
> > >
> > >Markus Rietzler
> > >* <rietzler_software/>
> > >* RZF NRW
> > >* Tel: 0211.4572-130
> > >
> > >-----Original Message-----
> > >From: Gerrit Hannaert [mailto:[EMAIL PROTECTED]]
> > >Sent: Wednesday, 11 September 2002 10:48
> > >To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
> > >Subject: Re: AW: [aseek-users] external converters, pdf files
> > >
> > >Hi,
> > >What does your 'converter' line in aspseek.conf look like?
> > >Also try running index -a -m -u "%.pdf" and see what the output is
> > >(perhaps an error message is displayed).
> > >
> > >Cheers,
> > >
> > >[EMAIL PROTECTED] wrote:
> > >
> > >>nono,
> > >>these are "plain" pdf files, mostly converted from winword. so there is a
> > >>lot of text. when i use pdf2text or pdftohtml and look in the result, i get
> > >>all the words/text from the pdf file. so something different happens here...
> > >>
> > >>Regards,
> > >>
> > >>Markus Rietzler
> > >>* <rietzler_software/>
> > >>* RZF NRW
> > >>* Tel: 0211.4572-130
> > >>
> > >>-----Original Message-----
> > >>From: Gregory Kozlovsky [mailto:[EMAIL PROTECTED]]
> > >>Sent: Wednesday, 11 September 2002 10:07
> > >>To: '[EMAIL PROTECTED]'
> > >>Subject: RE: [aseek-users] external converters, pdf files
> > >>
> > >>Sometimes, what appears to be text in .pdf files is actually scanned images
> > >>that cannot be indexed. Check for it.
> > >>
> > >> Gregory Kozlovsky
> > >>
> > >>-----Original Message-----
> > >>From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
> > >>Sent: Wednesday, 11 September 2002 09:59
> > >>To: [EMAIL PROTECTED]
> > >>Subject: [aseek-users] external converters, pdf files
> > >>
> > >>hi,
> > >>i am trying to set up aspseek with external converter support. i installed
> > >>pdftohtml, indexing works fine, pdf files seem to be processed, i can find
> > >>the urls to the pdf files in urlword table even with status code 200. but
> > >>when i do a search with words from the pdf-files i get no result, pdf files
> > >>were not listed in the results...
> > >>
> > >>any idea?
> > >>
> > >>thanks
> > >>
> > >>Regards,
> > >>
> > >>Markus Rietzler
> > >>* <rietzler_software/>
> > >>* RZF NRW
> > >>* Tel: 0211.4572-130
> > >>
> > >>-----Original Message-----
> > >>From: Charlie Farinella [mailto:[EMAIL PROTECTED]]
> > >>Sent: Tuesday, 10 September 2002 23:35
> > >>To: [EMAIL PROTECTED]
> > >>Subject: [aseek-users] selective removal of urls
> > >>
> > >>Is there a way to selectively remove a url from our database after it
> > >>has been indexed? We would like to remove porn sites from a family
> > >>friendly database.
>
> -- [EMAIL PROTECTED] ICQ7551596 [EMAIL PROTECTED] --
> Guinness a Day Keeps a Doctor Away (people's wisdom)

-- [EMAIL PROTECTED] ICQ7551596 [EMAIL PROTECTED] --
Guinness a Day Keeps a Doctor Away (people's wisdom)
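
For anyone reading this thread later: the debug output at the top shows the charset field swallowing the whole command, when it should stop at the first whitespace after "charset=". Below is a minimal Python sketch of the split one would expect from a Converter line. It is not ASPseek's config.cpp code and not the attached patch; the names parse_converter_line and Converter are made up for illustration, and accepting the quoted "text/html; charset=..." form is an assumption based only on the two variants tried in this thread.

```python
import re
from typing import NamedTuple, Optional


class Converter(NamedTuple):
    """Hypothetical container for the four parts of a Converter line."""
    from_type: str
    to_type: str
    charset: Optional[str]
    command: str


# from-type, to-type, optional "; charset=NAME", then the rest is the command.
# The charset value stops at the first whitespace, semicolon or quote.
_CONVERTER_RE = re.compile(
    r"""^\s*Converter\s+
        (?P<from>\S+)\s+
        "?(?P<to>[^\s;"]+)
        (?:\s*;\s*charset=(?P<charset>[^\s;"]+))?
        "?\s+
        (?P<cmd>.+?)\s*$""",
    re.VERBOSE,
)


def parse_converter_line(line: str) -> Converter:
    """Split a Converter line into types, optional charset and command."""
    m = _CONVERTER_RE.match(line)
    if m is None:
        raise ValueError(f"not a Converter line: {line!r}")
    return Converter(m["from"], m["to"], m["charset"], m["cmd"])


if __name__ == "__main__":
    line = ("Converter application/pdf text/html; charset=iso8859-1 "
            "/users/aspseek/sbin/pdftohtml -i -noframes -stdout $in > $out")
    print(parse_converter_line(line))
```

With the line from the debug output, charset comes out as just iso8859-1 and the command starts at /users/aspseek/sbin/pdftohtml, which matches what the "parsed cmd(2)" message suggests the indexer should end up with.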

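
A side note on the urlword row quoted in the thread: the crc value d41d8cd98f00b204e9800998ecf8427e is the MD5 digest of zero bytes of input. Assuming the crc column holds an MD5 of the converted content (the 32 hex digits suggest it, but nothing in this thread confirms it), that fits Gerrit's explanation: pdftohtml wrote to $out.html, so index read an empty $out. A small Python check, where looks_like_empty_document is a hypothetical helper name:

```python
import hashlib

# MD5 digest of an empty byte string.
EMPTY_MD5 = hashlib.md5(b"").hexdigest()

# Value copied from the urlword row quoted in the thread.
crc_from_urlword = "d41d8cd98f00b204e9800998ecf8427e"


def looks_like_empty_document(crc: str) -> bool:
    """True if the stored digest equals the MD5 of empty input.

    Assumes the crc column is an MD5 hex digest of the indexed content.
    """
    return crc.strip().lower() == EMPTY_MD5


if __name__ == "__main__":
    print(looks_like_empty_document(crc_from_urlword))
```

If the assumption holds, a query for rows with this crc value would be a quick way to spot other URLs whose converter produced no output.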