Author: Alexander Barkov
Email: [EMAIL PROTECTED]
Message:
The server is probably returning something other than
"application/pdf; charset=iso-8859-1".
Check which Content-Type is sent in the HTTP response,
for example with:
wget -s http://servername/path/to/file.pdf
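
If wget is not at hand, the same check can be sketched in a few lines of Python. This is only a rough illustration, not part of mnoGoSearch: the function names fetch_content_type and split_content_type are made up here, and the URL is the same placeholder as in the wget line. It sends a HEAD request, returns the raw Content-Type header, and splits it into the media type and its parameters, so the result can be compared by eye with the type given in the Mime line of indexer.conf.

```python
from urllib.request import Request, urlopen


def split_content_type(value):
    """Split a Content-Type header value into (media_type, params).

    Hypothetical helper for illustration only; it does simple string
    splitting and does not handle every corner case of the header syntax.
    """
    parts = [p.strip() for p in value.split(";")]
    media_type = parts[0].lower()
    params = {}
    for item in parts[1:]:
        if "=" in item:
            key, _, val = item.partition("=")
            params[key.strip().lower()] = val.strip().strip('"')
    return media_type, params


def fetch_content_type(url):
    """Send a HEAD request and return the raw Content-Type header, or None."""
    req = Request(url, method="HEAD")
    with urlopen(req, timeout=10) as resp:
        return resp.headers.get("Content-Type")
```

For example, print(fetch_content_type("http://servername/path/to/file.pdf")) shows the header exactly as the server sends it, and split_content_type() on that string shows whether a charset parameter is present at all.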
> I could not get my pdftotext parser going ...
>
> $ /fs1/sw/xpdf/pdftotext
> # works from the command line, executable by world.
>
> #My indexer.conf:
> ---
>
> DBAddr mysql://xyz@localhost/mnogosearch/
>
> #DBMode single
> #VarDir /usr/local/mnogosearch/var
>
> LocalCharset Phrase yes
> #CrossWords no
>
> #StopwordFile stopwords.txt
> StopwordTable stopword
>
> MinWordLength 1
> MaxWordLength 32
>
> MaxDocSize 10000000
>
> HTTPHeader User-Agent: MnoGoSearch_RZ_UOS
> #HTTPHeader Accept-Language: de, en
> #HTTPHeader From: [EMAIL PROTECTED]
>
> # ServerTable server
>
> DeleteNoServer yes
>
> # Exclude cgi-bin and non-parsed-headers using "string" match:
> Disallow /cgi-bin/* */nph-*
>
> # Exclude anything with '?' sign in URL. Note that '?' sign has a
> # special meaning in "string" match, so we have to use "regex" match here:
> Disallow Regex \?
>
> # Exclude Apache directory list in different sort order using "string" match:
> Disallow *D=A *D=D *M=A *M=D *N=A *N=D *S=A *S=D
>
> CheckOnly *.pl *.cgi
> CheckOnly *.b *.sh *.md5
> CheckOnly *.arj *.tar *.zip *.tgz *.gz
> CheckOnly *.lha *.lzh *.rar *.zoo *.tar*.Z
> CheckOnly *.gif *.jpg *.jpeg *.bmp *.tiff
> CheckOnly *.vdo *.mpeg *.mpe *.mpg *.avi *.movie
> CheckOnly *.mid *.mp3 *.rm *.ram *.wav *.aiff
> CheckOnly *.vrml *.wrl *.png
> CheckOnly *.exe *.cab *.dll *.bin *.class
> CheckOnly *.tex *.texi *.xls *.doc *.texinfo
> CheckOnly *.ai *.eps *.ppt *.hqx
> CheckOnly *.cpt *.bms *.oda *.tcl
> CheckOnly *.rpm *.m3u *.qt *.mov
> CheckOnly *.map *.aif *.sit *.sea
> # CheckOnly *.rtf *.pdf *.cdf *.ps
>
> UseRemoteContentType no
>
> AddType text/plain *.txt
> AddType text/plain *.js *.java
> AddType text/plain *.h *.c *.cpp
>
> AddType text/html *.html *.htm
> AddType text/html *.cfm *.cfml
>
> AddType image/x-xpixmap *.xpm
> AddType image/x-xbitmap *.xbm
> AddType image/gif *.gif
>
> AddType application/pdf *.pdf
> AddType application/unknown *.*
>
> Mime "application/pdf; charset=iso-8859-1" "text/plain"
> "/fs1/sw/xpdf/pdftotext $1 $2"
>
> Period 7d
>
> Robots yes
>
> Clones yes
>
> BodyWeight 2
> #CrossWeight 32
> TitleWeight 4
> KeywordWeight 8
> DescWeight 16
> #UrlWeight 0
> #UrlHostWeight 0
> #UrlPathWeight 0
> #UrlFileWeight 0
>
> DeleteBad yes
> Index yes
>
> Follow site
>
> CharSet iso-8859-1
>
> # Server http://localhost/
> Server http://rz-intern.rz.uni-osnabrueck.de/
>
> ---
>
> When running indexer with -v 5, the PDF file is indexed
> like all other html/txt stuff, no special treatment ...
>
> Anything wrong/missing?
>
> Frank
>
Reply: <http://www.mnogosearch.org/board/message.php?id=3025>