Webboard: PDF parser will not run

Alexander Barkov Sat, 08 Sep 2001 02:57:23 -0700
Author: Alexander Barkov
Email: [EMAIL PROTECTED]
Message:

The server is probably returning something other than
  "application/pdf; charset=iso-8859-1"

Check which Content-Type is sent in the HTTP response, using
for example:

  wget -s http://servername/path/to/file.pdf
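
If wget is not at hand, the same check can be sketched in Python. This is
only an illustration: the function names are made up for this example, and
how strictly the indexer compares the header against the Mime line (case,
spacing) is not stated in this thread, so the script reports an exact match
and a near match separately.

```python
from urllib.request import Request, urlopen

# Value copied from the Mime line in the quoted indexer.conf.
EXPECTED = "application/pdf; charset=iso-8859-1"


def normalize(content_type: str) -> str:
    """Lowercase the value and use a single '; ' between parameters."""
    parts = [p.strip().lower() for p in content_type.split(";")]
    return "; ".join(p for p in parts if p)


def compare(actual: str, expected: str = EXPECTED) -> str:
    """Return 'exact', 'near' (differs only in case/spacing) or 'different'."""
    if actual == expected:
        return "exact"
    if normalize(actual) == normalize(expected):
        return "near"
    return "different"


def fetch_content_type(url: str) -> str:
    """Send a HEAD request and return the raw Content-Type header ('' if absent)."""
    with urlopen(Request(url, method="HEAD"), timeout=10) as resp:
        return resp.headers.get("Content-Type", "")
```

Usage, with the placeholder URL from above:
compare(fetch_content_type("http://servername/path/to/file.pdf")).
Anything other than "exact" suggests the Mime line and the server's header
disagree.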


  

> I could not get my pdftotext parser going ...
> 
> $ /fs1/sw/xpdf/pdftotext 
>   # works from the command line, executable by world.
> 
> #My indexer.conf:
> ---
> 
> DBAddr                mysql://xyz@localhost/mnogosearch/
> 
> #DBMode single
> #VarDir /usr/local/mnogosearch/var
> 
> LocalCharset Phrase yes
> #CrossWords no
> 
> #StopwordFile stopwords.txt
> StopwordTable stopword
> 
> MinWordLength 1
> MaxWordLength 32
> 
> MaxDocSize 10000000
> 
> HTTPHeader User-Agent: MnoGoSearch_RZ_UOS
> #HTTPHeader Accept-Language: de, en
> #HTTPHeader From: [EMAIL PROTECTED]
> 
> # ServerTable server
> 
> DeleteNoServer yes
> 
> # Exclude cgi-bin and non-parsed-headers using "string" match:
> Disallow /cgi-bin/* */nph-*
> 
> # Exclude anything with '?' sign in URL. Note that '?' sign has a
> # special meaning in "string" match, so we have to use "regex" match here:
> Disallow Regex  \?
> 
> # Exclude Apache directory list in different sort order using "string" match:
> Disallow *D=A *D=D *M=A *M=D *N=A *N=D *S=A *S=D
> 
> CheckOnly *.pl   *.cgi 
> CheckOnly *.b  *.sh   *.md5
> CheckOnly *.arj  *.tar  *.zip  *.tgz  *.gz
> CheckOnly *.lha  *.lzh  *.rar  *.zoo  *.tar*.Z
> CheckOnly *.gif  *.jpg  *.jpeg *.bmp  *.tiff 
> CheckOnly *.vdo  *.mpeg *.mpe  *.mpg  *.avi  *.movie
> CheckOnly *.mid  *.mp3  *.rm   *.ram  *.wav  *.aiff
> CheckOnly *.vrml *.wrl  *.png
> CheckOnly *.exe  *.cab  *.dll  *.bin  *.class
> CheckOnly *.tex  *.texi *.xls  *.doc  *.texinfo
> CheckOnly *.ai   *.eps  *.ppt  *.hqx
> CheckOnly *.cpt  *.bms  *.oda  *.tcl
> CheckOnly *.rpm  *.m3u  *.qt   *.mov
> CheckOnly *.map  *.aif  *.sit  *.sea
> # CheckOnly *.rtf  *.pdf  *.cdf  *.ps
> 
> UseRemoteContentType no
> 
> AddType       text/plain      *.txt
> AddType text/plain  *.js *.java
> AddType text/plain  *.h *.c *.cpp
> 
> AddType       text/html       *.html *.htm
> AddType text/html   *.cfm *.cfml
> 
> AddType image/x-xpixmap       *.xpm
> AddType image/x-xbitmap       *.xbm
> AddType image/gif     *.gif
> 
> AddType       application/pdf *.pdf
> AddType       application/unknown *.*
> 
> Mime "application/pdf; charset=iso-8859-1"  "text/plain"  "/fs1/sw/xpdf/pdftotext $1 $2"
> 
> Period 7d
> 
> Robots yes
> 
> Clones yes
> 
> BodyWeight 2
> #CrossWeight 32
> TitleWeight 4
> KeywordWeight 8
> DescWeight 16
> #UrlWeight 0
> #UrlHostWeight 0
> #UrlPathWeight 0
> #UrlFileWeight 0
> 
> DeleteBad yes
> Index yes
> 
> Follow site
> 
> CharSet iso-8859-1
> 
> # Server      http://localhost/
> Server        http://rz-intern.rz.uni-osnabrueck.de/
> 
> ---
> 
> When running indexer with -v 5, the PDF file is indexed
> like all other html/txt stuff, no special treatment ...
> 
> Anything wrong/missing?
> 
> Frank
> 

Reply: <http://www.mnogosearch.org/board/message.php?id=3025>
