sorry, yes, i have patched config.cpp... after that i added the additional logging...
Regards,

Markus Rietzler
* <rietzler_software/>
* RZF NRW
* Tel: 0211.4572-130

-----Original Message-----
From: Kir Kolyshkin [mailto:[EMAIL PROTECTED]]
Sent: Thursday, 12 September 2002 13:25
To: [EMAIL PROTECTED]
Subject: Re: AW: AW: AW: AW: AW: [aseek-users] external converters, pdf files

My first patch should fix this. Was it applied?

[EMAIL PROTECTED] wrote:
>
> i have included some debug messages in config.cpp, they tell me:
>
> parsed from: application/pdf
> parsed to: text/html
> parsed cmd: charset=iso8859-1 /users/aspseek/sbin/pdftohtml -i -noframes -stdout $in > $out
> parsed charset: iso8859-1 /users/aspseek/sbin/pdftohtml -i -noframes -stdout $in > $out
> parsed cmd(2): /users/aspseek/sbin/pdftohtml -i -noframes -stdout $in > $out
>
> so charset contains too much
>
> Regards,
>
> Markus Rietzler
> * <rietzler_software/>
> * RZF NRW
> * Tel: 0211.4572-130
>
> -----Original Message-----
> From: Kir Kolyshkin [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, 12 September 2002 12:29
> To: [EMAIL PROTECTED]
> Subject: Re: AW: AW: AW: AW: [aseek-users] external converters, pdf files
>
> Attached patch should fix it. Can you try it out?
>
> [EMAIL PROTECTED] wrote:
> >
> > hm,
> > with the charset added there is no exec line in the index log, so no
> > pdf conversion at all.
> > i tried
> >
> > Converter application/pdf text/html; charset=iso8859-1 /usr/bin/pdftohtml -i -noframes -stdout $in >$out
> > and
> > Converter application/pdf "text/html; charset=iso8859-1" /usr/bin/pdftohtml -i -noframes -stdout $in >$out
> >
> > without the charset option the pdfs are indexed...
> >
> > Regards,
> >
> > Markus Rietzler
> > * <rietzler_software/>
> > * RZF NRW
> > * Tel: 0211.4572-130
> >
> > -----Original Message-----
> > From: Kir Kolyshkin [mailto:[EMAIL PROTECTED]]
> > Sent: Thursday, 12 September 2002 11:26
> > To: [EMAIL PROTECTED]
> > Subject: Re: AW: AW: AW: [aseek-users] external converters, pdf files
> >
> > Try to add ";charset=iso8859-1" (substitute the charset you need)
> > after "text/html" in the Converter line.
> >
> > Sorry, it seems this is missing from the man page; I will add it.
> >
> > So, it will look like:
> >
> > Converter application/pdf text/html; charset=iso8859-5 /usr/bin/pdftohtml -i -noframes -stdout $in >$out
> >
> > Please report here if it works or not :)
> >
> > [EMAIL PROTECTED] wrote:
> > >
> > > that trick worked (half).
> > > the pdf file is being indexed, but i can't search for words with e.g. umlauts.
> > > in the excerpt i see "?" in the places where umlauts (ä, ö, ü) should be.
> > > so the charset of the document is wrong. any ideas?
> > >
> > > Regards,
> > >
> > > Markus Rietzler
> > > * <rietzler_software/>
> > > * RZF NRW
> > > * Tel: 0211.4572-130
> > >
> > > -----Original Message-----
> > > From: Gerrit Hannaert [mailto:[EMAIL PROTECTED]]
> > > Sent: Thursday, 12 September 2002 02:48
> > > To: [EMAIL PROTECTED]
> > > Subject: Re: AW: AW: [aseek-users] external converters, pdf files
> > >
> > > While trying out pdftohtml for myself (yeah, it's nicer than plain text
> > > and provides a title too), I figured it out.
> > > pdftohtml will add an ".html" extension to the output file, and hence
> > > index won't use it nor delete it afterwards. The solution I use is simply:
> > >
> > > Converter application/pdf text/html /usr/bin/pdftohtml -i -noframes -stdout $in >$out
> > >
> > > And so far it seems to be working...
> > >
> > > Cheers,
> > >
> > > [EMAIL PROTECTED] wrote:
> > >
> > > >in aspseek.conf i have
> > > >
> > > > Converter application/pdf text/html /users/aspseek/sbin/pdftohtml -i -noframes $in $out.html
> > > >
> > > >and this is what index says, looks good to me
> > > >
> > > >www@I011-32:/users/aspseek/sbin> ./index -a -m -u http://..../test.pdf
> > > >Loading configuration from /users/aspseek/etc/db.conf
> > > >Loading configuration from /users/aspseek/etc/ucharset.conf
> > > >Loading configuration from /users/aspseek/etc/stopwords.conf
> > > >Loading configuration from /users/aspseek/etc/server.url
> > > >Loading configuration from /users/aspseek/etc/allow.url
> > > >Loading configuration from /users/aspseek/etc/aspseek.conf
> > > >Adding URL: http://..../test.pdf
> > > >exec /users/aspseek/sbin/pdftohtml -i -noframes /tmp/asiLa7HF2 /tmp/asoOpsGUU.html
> > > >Page-1
> > > >Page-2
> > > >Page-3
> > > >Page-4
> > > >Page-5
> > > >Page-6
> > > >Page-7
> > > >Page-8
> > > >Saving real-time database ... done.
> > > >Saving delta files [..................................................] done.
> > > >Deleting 'deleted' records from urlword[s] ... done. (0 records deleted)
> > > >Saving real-time ... done
> > > >Saving redirects ... done
> > > >Splitting href delta file ... done
> > > >Saving href delta files ... done
> > > >Saving direct href delta files ... done
> > > >Calculating ranks [................................................] done.
> > > >Saving lastmods ... done
> > > >Generating word site ... done
> > > >Generating subset http://..../% ... done (193 URLs)
> > > >index process finished.
> > > >
> > > >btw: those two temp files are not deleted in /tmp, maybe another bug.
> > > >/tmp/asoOpsGUU.html is an html export of the pdf file. so the text is recognized
> > > >and exported correctly, but when i search for one of the words from this file
> > > >i get no results...
> > > >
> > > >the urlword table says
> > > >
> > > >*************************** 1. row ***************************
> > > >         url_id: 100
> > > >        site_id: 1
> > > >        deleted: 0
> > > >            url: http://.../versorgungsreform.pdf
> > > >next_index_time: 1031828953
> > > >         status: 200
> > > >            crc: d41d8cd98f00b204e9800998ecf8427e
> > > >  last_modified: Wed, 11 Sep 2002 02:00:11 GMT
> > > >           etag: "1cc11a-cb26-3d7ea3ab"
> > > >last_index_time: 1031742553
> > > >       referrer: 23
> > > >            tag: 0
> > > >           hops: 3
> > > >          redir: 0
> > > >         origin: 0
> > > >1 row in set (0.00 sec)
> > > >
> > > >Regards,
> > > >
> > > >Markus Rietzler
> > > >* <rietzler_software/>
> > > >* RZF NRW
> > > >* Tel: 0211.4572-130
> > > >
> > > >-----Original Message-----
> > > >From: Gerrit Hannaert [mailto:[EMAIL PROTECTED]]
> > > >Sent: Wednesday, 11 September 2002 10:48
> > > >To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
> > > >Subject: Re: AW: [aseek-users] external converters, pdf files
> > > >
> > > >Hi,
> > > >What does your 'converter' line in aspseek.conf look like?
> > > >Also try running index -a -m -u "%.pdf" and see what the output is
> > > >(perhaps an error message is displayed).
> > > >
> > > >Cheers,
> > > >
> > > >[EMAIL PROTECTED] wrote:
> > > >
> > > >>no, no,
> > > >>these are "plain" pdf files, mostly converted from winword, so there is a
> > > >>lot of text. when i use pdf2text or pdftohtml and look at the result, i get
> > > >>all the words/text from the pdf file. so something different happens here...
> > > >>
> > > >>Regards,
> > > >>
> > > >>Markus Rietzler
> > > >>* <rietzler_software/>
> > > >>* RZF NRW
> > > >>* Tel: 0211.4572-130
> > > >>
> > > >>-----Original Message-----
> > > >>From: Gregory Kozlovsky [mailto:[EMAIL PROTECTED]]
> > > >>Sent: Wednesday, 11 September 2002 10:07
> > > >>To: '[EMAIL PROTECTED]'
> > > >>Subject: RE: [aseek-users] external converters, pdf files
> > > >>
> > > >>Sometimes, what appears to be text in .pdf files is actually scanned images
> > > >>that cannot be indexed. Check for it.
> > > >>
> > > >> Gregory Kozlovsky
> > > >>
> > > >>-----Original Message-----
> > > >>From: [EMAIL PROTECTED]
> > > >>[mailto:[EMAIL PROTECTED]]
> > > >>Sent: Wednesday, 11 September 2002 09:59
> > > >>To: [EMAIL PROTECTED]
> > > >>Subject: [aseek-users] external converters, pdf files
> > > >>
> > > >>hi,
> > > >>i am trying to set up aspseek with external converter support. i installed
> > > >>pdftohtml, indexing works fine, pdf files seem to be processed, and i can find
> > > >>the urls of the pdf files in the urlword table, even with status code 200. but
> > > >>when i search for words from the pdf files i get no result; the pdf files
> > > >>are not listed in the results...
> > > >>
> > > >>any idea?
> > > >>
> > > >>thanks
> > > >>
> > > >>Regards,
> > > >>
> > > >>Markus Rietzler
> > > >>* <rietzler_software/>
> > > >>* RZF NRW
> > > >>* Tel: 0211.4572-130
> > > >>
> > > >>-----Original Message-----
> > > >>From: Charlie Farinella [mailto:[EMAIL PROTECTED]]
> > > >>Sent: Tuesday, 10 September 2002 23:35
> > > >>To: [EMAIL PROTECTED]
> > > >>Subject: [aseek-users] selective removal of urls
> > > >>
> > > >>Is there a way to selectively remove a url from our database after it
> > > >>has been indexed? We would like to remove porn sites from a family
> > > >>friendly database.
> > > >>
> >
> > -- [EMAIL PROTECTED] ICQ7551596 [EMAIL PROTECTED] --
> > Guinness a Day Keeps a Doctor Away (people's wisdom)
>
> -- [EMAIL PROTECTED] ICQ7551596 [EMAIL PROTECTED] --
> Guinness a Day Keeps a Doctor Away (people's wisdom)

-- [EMAIL PROTECTED] ICQ7551596 [EMAIL PROTECTED] --
Guinness a Day Keeps a Doctor Away (people's wisdom)
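
For reference, the debug output quoted above ("parsed charset" swallowing the whole command) points at the charset value not being cut off at the first whitespace. The split one would expect can be sketched roughly as below. This is only an illustration under assumptions, not the code in ASPseek's config.cpp and not the patch discussed in this thread: `ConverterSpec`, `parseConverter`, `skipSpace` and `nextToken` are made-up names, the input is assumed to be the text after the `Converter` keyword, the `charset=` key is matched case-sensitively, and the quoted form `"text/html; charset=..."` is not handled.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <optional>
#include <string>

// Hypothetical result type; ASPseek's real structures may look different.
struct ConverterSpec {
    std::string from;     // source MIME type, e.g. application/pdf
    std::string to;       // target MIME type, e.g. text/html
    std::string charset;  // empty when no charset parameter was given
    std::string cmd;      // the rest of the line, passed on untouched
};

// Return the index of the first non-whitespace character at or after pos.
static std::size_t skipSpace(const std::string& s, std::size_t pos) {
    while (pos < s.size() && std::isspace(static_cast<unsigned char>(s[pos]))) ++pos;
    return pos;
}

// Read one whitespace-delimited token starting at pos and advance pos past it.
static std::string nextToken(const std::string& s, std::size_t& pos) {
    assert(pos <= s.size());
    pos = skipSpace(s, pos);
    const std::size_t start = pos;
    while (pos < s.size() && !std::isspace(static_cast<unsigned char>(s[pos]))) ++pos;
    return s.substr(start, pos - start);
}

// Parse "<from> <to>[; charset=<name>] <command...>".
// Returns std::nullopt when a required part is missing or the parameter
// after ';' is not a charset.
std::optional<ConverterSpec> parseConverter(const std::string& args) {
    ConverterSpec spec;
    std::size_t pos = 0;
    spec.from = nextToken(args, pos);
    spec.to = nextToken(args, pos);
    if (spec.from.empty() || spec.to.empty()) return std::nullopt;

    // Accept both "text/html;charset=x" and "text/html; charset=x".
    std::string param;
    const std::size_t semi = spec.to.find(';');
    if (semi != std::string::npos) {
        param = spec.to.substr(semi + 1);
        spec.to.erase(semi);
        // The parameter is a separate token when a space follows the ';'.
        if (param.empty()) param = nextToken(args, pos);
        const std::string key = "charset=";
        if (param.compare(0, key.size(), key) != 0) return std::nullopt;
        // Only this single token is the charset; it ends at the whitespace.
        spec.charset = param.substr(key.size());
    }

    // Everything after the charset token is the converter command.
    spec.cmd = args.substr(skipSpace(args, pos));
    if (spec.cmd.empty()) return std::nullopt;
    return spec;
}
```

With the line from this thread, `parseConverter("application/pdf text/html; charset=iso8859-1 /usr/bin/pdftohtml -i -noframes -stdout $in >$out")` would yield the charset `iso8859-1` and leave the `/usr/bin/pdftohtml ...` part as the command. Taking the charset from a single whitespace-delimited token is the point of the sketch: the command, including `$in` and `>$out`, is never scanned for parameters.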
