Try adding ";charset=iso8859-1" (substitute the charset you need) after "text/html" in the Converter line.
Sorry, it seems this is missing from the man page; I will add it. The line will then look like:

Converter application/pdf text/html; charset=iso8859-5 /usr/bin/pdftohtml -i -noframes -stdout $in >$out

Please report here if it works or not :)

[EMAIL PROTECTED] wrote:
>
> that trick worked (half).
> the pdf file is being indexed, but i can't search for words with e.g. umlauts.
> in the excerpt i see "?" in the places where umlauts (ä, ö, ü) should be.
> so the charset of the document is wrong. any ideas?
>
> Regards
>
> Markus Rietzler
> * <rietzler_software/>
> * RZF NRW
> * Tel: 0211.4572-130
>
> -----Original Message-----
> From: Gerrit Hannaert [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, 12 September 2002 02:48
> To: [EMAIL PROTECTED]
> Subject: Re: AW: AW: [aseek-users] external converters, pdf files
>
> While trying out pdftohtml for myself (yeah, it's nicer than plain text
> and provides a title too), I figured it out.
> pdftohtml will add an ".html" extension to the output file; and hence
> index won't use it nor delete it afterwards. The solution I use is simply:
>
> Converter application/pdf text/html /usr/bin/pdftohtml -i -noframes -stdout $in >$out
>
> And so far it seems to be working...
>
> Cheers,
>
> [EMAIL PROTECTED] wrote:
>
> >in aspseek.conf i have
> >
> > Converter application/pdf text/html /users/aspseek/sbin/pdftohtml -i -noframes $in $out.html
> >
> >and this is what index says, looks good to me
> >
> >www@I011-32:/users/aspseek/sbin> ./index -a -m -u http://..../test.pdf
> >Loading configuration from /users/aspseek/etc/db.conf
> >Loading configuration from /users/aspseek/etc/ucharset.conf
> >Loading configuration from /users/aspseek/etc/stopwords.conf
> >Loading configuration from /users/aspseek/etc/server.url
> >Loading configuration from /users/aspseek/etc/allow.url
> >Loading configuration from /users/aspseek/etc/aspseek.conf
> >Adding URL: http://..../test.pdf
> >exec /users/aspseek/sbin/pdftohtml -i -noframes /tmp/asiLa7HF2 /tmp/asoOpsGUU.html
> >Page-1
> >Page-2
> >Page-3
> >Page-4
> >Page-5
> >Page-6
> >Page-7
> >Page-8
> >Saving real-time database ... done.
> >Saving delta files [..................................................] done.
> >Deleting 'deleted' records from urlword[s] ... done. (0 records deleted)
> >Saving real-time ... done
> >Saving redirects ... done
> >Splitting href delta file ... done
> >Saving href delta files ... done
> >Saving direct href delta files ... done
> >Calculating ranks [................................................] done.
> >Saving lastmods ... done
> >Generating word site ... done
> >Generating subset http://..../% ... done (193 URLs)
> >index process finished.
> >
> >btw: those two tempfiles are not deleted in /tmp, maybe another bug.
> >/tmp/asoOpsGUU.html is an html export of the pdf file. so the text is recognized
> >and exported correctly, but when i search for one of the words from this file
> >i get no results...
> >
> >urlword-table says
> >
> >*************************** 1. row ***************************
> >         url_id: 100
> >        site_id: 1
> >        deleted: 0
> >            url: http://.../versorgungsreform.pdf
> >next_index_time: 1031828953
> >         status: 200
> >            crc: d41d8cd98f00b204e9800998ecf8427e
> >  last_modified: Wed, 11 Sep 2002 02:00:11 GMT
> >           etag: "1cc11a-cb26-3d7ea3ab"
> >last_index_time: 1031742553
> >       referrer: 23
> >            tag: 0
> >           hops: 3
> >          redir: 0
> >         origin: 0
> >1 row in set (0.00 sec)
> >
> >Regards
> >
> >Markus Rietzler
> >* <rietzler_software/>
> >* RZF NRW
> >* Tel: 0211.4572-130
> >
> >-----Original Message-----
> >From: Gerrit Hannaert [mailto:[EMAIL PROTECTED]]
> >Sent: Wednesday, 11 September 2002 10:48
> >To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
> >Subject: Re: AW: [aseek-users] external converters, pdf files
> >
> >Hi,
> >What does your 'converter' line in aspseek.conf look like?
> >Also try running index -a -m -u "%.pdf" and see what the output is
> >(perhaps an error message is displayed).
> >
> >Cheers,
> >
> >[EMAIL PROTECTED] wrote:
> >
> >>nono,
> >>these are "plain" pdf files, mostly converted from winword. so there is a
> >>lot of text. when i use pdf2text or pdftohtml and look at the result, i get
> >>all the words/text from the pdf file. so something different happens here...
> >>
> >>Regards
> >>
> >>Markus Rietzler
> >>* <rietzler_software/>
> >>* RZF NRW
> >>* Tel: 0211.4572-130
> >>
> >>-----Original Message-----
> >>From: Gregory Kozlovsky [mailto:[EMAIL PROTECTED]]
> >>Sent: Wednesday, 11 September 2002 10:07
> >>To: '[EMAIL PROTECTED]'
> >>Subject: RE: [aseek-users] external converters, pdf files
> >>
> >>Sometimes, what appears to be text in .pdf files is actually scanned images
> >>that cannot be indexed. Check for it.
> >>
> >> Gregory Kozlovsky
> >>
> >>-----Original Message-----
> >>From: [EMAIL PROTECTED]
> >>[mailto:[EMAIL PROTECTED]]
> >>Sent: Wednesday, 11 September 2002 09:59
> >>To: [EMAIL PROTECTED]
> >>Subject: [aseek-users] external converters, pdf files
> >>
> >>hi,
> >>i am trying to set up aspseek with external converter support. i installed
> >>pdftohtml, indexing works fine, pdf files seem to be processed, i can find
> >>the urls to the pdf files in the urlword table, even with status code 200. but
> >>when i do a search with words from the pdf files i get no result; the pdf files
> >>are not listed in the results...
> >>
> >>any idea?
> >>
> >>thanks
> >>
> >>Regards
> >>
> >>Markus Rietzler
> >>* <rietzler_software/>
> >>* RZF NRW
> >>* Tel: 0211.4572-130
> >>
> >>-----Original Message-----
> >>From: Charlie Farinella [mailto:[EMAIL PROTECTED]]
> >>Sent: Tuesday, 10 September 2002 23:35
> >>To: [EMAIL PROTECTED]
> >>Subject: [aseek-users] selective removal of urls
> >>
> >>Is there a way to selectively remove a url from our database after it
> >>has been indexed? We would like to remove porn sites from a family
> >>friendly database.
> >>

--
[EMAIL PROTECTED] ICQ7551596 [EMAIL PROTECTED]
-- Guinness a Day Keeps a Doctor Away (people's wisdom)
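
For anyone who finds this thread later: the "?" in the excerpts is consistent with a charset mismatch between what the converter writes and what the indexer assumes. The Python sketch below is only an illustration of that general effect. It is not ASPseek code, the sample word is made up, and the assumption that the converter output is ISO-8859-1 is mine, not something confirmed in the thread.

```python
# Illustration only: this is not ASPseek code, and the sample word is made up.
text = "Gr\u00fc\u00dfe"  # "Grüße", a word containing an umlaut and a sharp s

# Assumption: the converter writes its HTML as ISO-8859-1 bytes.
raw = text.encode("iso-8859-1")

# Decoding with the charset the bytes were written in round-trips cleanly.
right = raw.decode("iso-8859-1")

# Decoding the same bytes as ISO-8859-5 silently yields Cyrillic letters instead.
wrong = raw.decode("iso-8859-5")

# Forcing the text into a charset that lacks these characters leaves "?" behind.
lossy = text.encode("ascii", errors="replace").decode("ascii")

print(right == text)
print(wrong == text)
print(lossy)
```

If something similar happens during indexing, the stored words would no longer match what is typed into the search form, which would fit the symptom described above. Declaring the charset on the Converter line is meant to avoid exactly this kind of guesswork.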
