AW: AW: AW: AW: AW: AW: [aseek-users] external converters, pdf files

Markus . Rietzler Thu, 12 Sep 2002 04:02:07 -0700

Sorry,
yes, I had already patched config.cpp. I added the extra logging after
that...


Regards

Markus Rietzler
* <rietzler_software/>
* RZF NRW
* Tel: 0211.4572-130



-----Original Message-----
From: Kir Kolyshkin [mailto:[EMAIL PROTECTED]]
Sent: Thursday, 12 September 2002 13:25
To: [EMAIL PROTECTED]
Subject: Re: AW: AW: AW: AW: AW: [aseek-users] external converters, pdf files

My first patch should fix this. Was it applied?
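
For illustration only, the intended split of a Converter line can be
sketched in Python. This is not ASPseek's config.cpp code and not the
patch discussed here; the regular expression and the name parse_converter
are assumptions made up for this example. It only shows the behaviour the
debug output quoted below is missing: the charset value should end at the
first whitespace, and everything after it is the command.

```python
import re

# Hypothetical pattern (an assumption, not ASPseek's real parser):
# "Converter", source type, target type, optional "; charset=NAME",
# then the command up to the end of the line.
CONVERTER_RE = re.compile(
    r'^\s*Converter\s+(?P<src>\S+)\s+(?P<dst>[^\s;]+)'
    r'(?:\s*;\s*charset=(?P<charset>\S+))?'
    r'\s+(?P<cmd>.+?)\s*$'
)


def parse_converter(line):
    """Split a Converter line into (src, dst, charset, cmd).

    charset is None when the line carries no "; charset=..." part.
    """
    m = CONVERTER_RE.match(line)
    if m is None:
        raise ValueError("not a Converter line: %r" % line)
    return m.group("src"), m.group("dst"), m.group("charset"), m.group("cmd")


# The line from this thread, with the charset added after text/html.
example = ("Converter application/pdf text/html; charset=iso8859-1 "
           "/usr/bin/pdftohtml -i -noframes -stdout $in >$out")
print(parse_converter(example))
```

The point of the sketch is that the charset group is limited to
non-whitespace characters, so it cannot swallow the command that follows.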

[EMAIL PROTECTED] wrote:
> 
> i have included some debug messages in config.cpp, they tell me:
> 
> parsed from: application/pdf
> parsed to: text/html
> parsed cmd:  charset=iso8859-1 /users/aspseek/sbin/pdftohtml -i -noframes -stdout $in > $out
> parsed charset: iso8859-1 /users/aspseek/sbin/pdftohtml -i -noframes -stdout $in > $out
> parsed cmd(2): /users/aspseek/sbin/pdftohtml -i -noframes  -stdout $in > $out
> 
> so the parsed charset contains too much: it includes the whole command
> 
> Regards
> 
> Markus Rietzler
> * <rietzler_software/>
> * RZF NRW
> * Tel: 0211.4572-130
> 
> -----Original Message-----
> From: Kir Kolyshkin [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, 12 September 2002 12:29
> To: [EMAIL PROTECTED]
> Subject: Re: AW: AW: AW: AW: [aseek-users] external converters, pdf files
> 
> Attached patch should fix it. Can you try it out?
> 
> [EMAIL PROTECTED] wrote:
> >
> > hm,
> > with the charset added there is no exec line in the index log, so no
> > pdf conversion happens at all.
> > i tried
> >
> >         Converter application/pdf text/html; charset=iso8859-1 /usr/bin/pdftohtml -i -noframes -stdout $in >$out
> > and
> >         Converter application/pdf "text/html; charset=iso8859-1" /usr/bin/pdftohtml -i -noframes -stdout $in >$out
> >
> > without the charset option pdfs are indexed...
> >
> > Regards
> >
> > Markus Rietzler
> > * <rietzler_software/>
> > * RZF NRW
> > * Tel: 0211.4572-130
> >
> > -----Original Message-----
> > From: Kir Kolyshkin [mailto:[EMAIL PROTECTED]]
> > Sent: Thursday, 12 September 2002 11:26
> > To: [EMAIL PROTECTED]
> > Subject: Re: AW: AW: AW: [aseek-users] external converters, pdf files
> >
> > Try to add ";charset=iso8859-1" (substitute with charset you need)
> > after "text/html" in Converter line.
> >
> > Sorry, seems it is missing from man page; I will add it.
> >
> > So, it will look like:
> >
> > Converter application/pdf text/html; charset=iso8859-5 /usr/bin/pdftohtml -i -noframes -stdout $in >$out
> >
> > Please report here if it works or not :)
> >
> > [EMAIL PROTECTED] wrote:
> > >
> > > that trick worked (half).
> > > the pdf-file is being indexed but i can't search for words with e.g. umlauts.
> > > in the excerpt i see "?" in the places where umlauts (ä, ö, ü) should be.
> > > so the charset of the document is wrong. any ideas?
> > >
> > > Regards
> > >
> > > Markus Rietzler
> > > * <rietzler_software/>
> > > * RZF NRW
> > > * Tel: 0211.4572-130
> > >
> > > -----Original Message-----
> > > From: Gerrit Hannaert [mailto:[EMAIL PROTECTED]]
> > > Sent: Thursday, 12 September 2002 02:48
> > > To: [EMAIL PROTECTED]
> > > Subject: Re: AW: AW: [aseek-users] external converters, pdf files
> > >
> > > While trying out pdftohtml for myself (yeah, it's nicer than plain text
> > > and provides a title too), I figured it out.
> > > pdftohtml will add an ".html" extension to the output file; and hence
> > > index won't use it nor delete it afterwards. The solution I use is simply:
> > >
> > > Converter application/pdf            text/html /usr/bin/pdftohtml -i -noframes -stdout $in >$out
> > >
> > > And so far it seems to be working...
> > >
> > > Cheers,
> > >
> > > [EMAIL PROTECTED] wrote:
> > >
> > > >in aspseek.conf i have
> > > >
> > > >       Converter application/pdf text/html /users/aspseek/sbin/pdftohtml -i -noframes $in $out.html
> > > >
> > > >and this is what index says, looks good for me
> > > >
> > > >www@I011-32:/users/aspseek/sbin> ./index -a -m -u http://..../test.pdf
> > > >Loading configuration from /users/aspseek/etc/db.conf
> > > >Loading configuration from /users/aspseek/etc/ucharset.conf
> > > >Loading configuration from /users/aspseek/etc/stopwords.conf
> > > >Loading configuration from /users/aspseek/etc/server.url
> > > >Loading configuration from /users/aspseek/etc/allow.url
> > > >Loading configuration from /users/aspseek/etc/aspseek.conf
> > > >Adding URL: http://..../test.pdf
> > > >exec /users/aspseek/sbin/pdftohtml -i -noframes /tmp/asiLa7HF2
> > > >/tmp/asoOpsGUU.html
> > > >Page-1
> > > >Page-2
> > > >Page-3
> > > >Page-4
> > > >Page-5
> > > >Page-6
> > > >Page-7
> > > >Page-8
> > > >Saving real-time database ... done.
> > > >Saving delta files [..................................................] done.
> > > >Deleting 'deleted' records from urlword[s] ... done. (0 records deleted)
> > > >Saving real-time ... done
> > > >Saving redirects ... done
> > > >Splitting href delta file ... done
> > > >Saving href delta files ... done
> > > >Saving direct href delta files ... done
> > > >Calculating ranks  [................................................] done.
> > > >Saving lastmods ... done
> > > >Generating word site ... done
> > > >Generating subset http://..../% ... done (193 URLs)
> > > >index process finished.
> > > >
> > > >btw: those two tempfiles are not deleted in /tmp, maybe another bug
> > > >/tmp/asoOpsGUU.html is a html-export of the pdf file. so text is recognized
> > > >and exported correct, but when i search for one of the words from this file
> > > >i get no results...
> > > >
> > > >urlword-table says
> > > >
> > > >*************************** 1. row ***************************
> > > >         url_id: 100
> > > >        site_id: 1
> > > >        deleted: 0
> > > >            url: http://.../versorgungsreform.pdf
> > > >next_index_time: 1031828953
> > > >         status: 200
> > > >            crc: d41d8cd98f00b204e9800998ecf8427e
> > > >  last_modified: Wed, 11 Sep 2002 02:00:11 GMT
> > > >           etag: "1cc11a-cb26-3d7ea3ab"
> > > >last_index_time: 1031742553
> > > >       referrer: 23
> > > >            tag: 0
> > > >           hops: 3
> > > >          redir: 0
> > > >         origin: 0
> > > >1 row in set (0.00 sec)
> > > >
> > > >Regards
> > > >
> > > >Markus Rietzler
> > > >* <rietzler_software/>
> > > >* RZF NRW
> > > >* Tel: 0211.4572-130
> > > >
> > > >
> > > >
> > > >-----Original Message-----
> > > >From: Gerrit Hannaert [mailto:[EMAIL PROTECTED]]
> > > >Sent: Wednesday, 11 September 2002 10:48
> > > >To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
> > > >Subject: Re: AW: [aseek-users] external converters, pdf files
> > > >
> > > >Hi,
> > > >What does your 'converter' line in aspseek.conf look like?
> > > >Also try running index -a -m -u "%.pdf" and see what the output is
> > > >(perhaps an error message is displayed).
> > > >
> > > >Cheers,
> > > >
> > > >[EMAIL PROTECTED] wrote:
> > > >
> > > >
> > > >
> > > >>nono,
> > > >>these are "plain" pdf files, mostly converted from winword. so there is a
> > > >>lot of text. when i use pdf2text or pdftohtml and look in the result, i get
> > > >>all the words/text from the pdf file. so something different happens here...
> > > >>
> > > >>Regards
> > > >>
> > > >>Markus Rietzler
> > > >>* <rietzler_software/>
> > > >>* RZF NRW
> > > >>* Tel: 0211.4572-130
> > > >>
> > > >>
> > > >>
> > > >>-----Original Message-----
> > > >>From: Gregory Kozlovsky [mailto:[EMAIL PROTECTED]]
> > > >>Sent: Wednesday, 11 September 2002 10:07
> > > >>To: '[EMAIL PROTECTED]'
> > > >>Subject: RE: [aseek-users] external converters, pdf files
> > > >>
> > > >>Sometimes, what appears to be text in .pdf files is actually scanned images
> > > >>that cannot be indexed. Check for it.
> > > >>
> > > >>   Gregory Kozlovsky
> > > >>
> > > >>-----Original Message-----
> > > >>From: [EMAIL PROTECTED]
> > > >>[mailto:[EMAIL PROTECTED]]
> > > >>Sent: Wednesday, 11 September 2002 09:59
> > > >>To: [EMAIL PROTECTED]
> > > >>Subject: [aseek-users] external converters, pdf files
> > > >>
> > > >>
> > > >>hi,
> > > >>i am trying to set up aspseek with external converter support. i installed
> > > >>pdftohtml, indexing works fine, pdf files seem to be processed, i can find
> > > >>the urls to the pdf files in urlword table even with status code 200. but
> > > >>when i do a search with words from the pdf-files i get no result, pdf files
> > > >>were not listed in the results...
> > > >>
> > > >>any idea?
> > > >>
> > > >>thanks
> > > >>
> > > >>Regards
> > > >>
> > > >>Markus Rietzler
> > > >>* <rietzler_software/>
> > > >>* RZF NRW
> > > >>* Tel: 0211.4572-130
> > > >>
> > > >>
> > > >>
> > > >>-----Original Message-----
> > > >>From: Charlie Farinella [mailto:[EMAIL PROTECTED]]
> > > >>Sent: Tuesday, 10 September 2002 23:35
> > > >>To: [EMAIL PROTECTED]
> > > >>Subject: [aseek-users] selective removal of urls
> > > >>
> > > >>Is there a way to selectively remove a url from our database after it
> > > >>has been indexed?  We would like to remove porn sites from a family
> > > >>friendly database.
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >
> > > >
> > > >
> > > >
> >
> > -- [EMAIL PROTECTED]  ICQ7551596  [EMAIL PROTECTED] --
> >    Guinness a Day Keeps a Doctor Away (people's wisdom)
> 
> -- [EMAIL PROTECTED]  ICQ7551596  [EMAIL PROTECTED] --
>    Guinness a Day Keeps a Doctor Away (people's wisdom)

-- [EMAIL PROTECTED]  ICQ7551596  [EMAIL PROTECTED] --
   Guinness a Day Keeps a Doctor Away (people's wisdom)
