Hi Mike,
Yes, you were right, I was missing that part and I didn't even notice!
I changed the config file and wrote this:

application/pdf->text/html /usr/local/apache/htdocs/htdig-3.1.6/contrib/parsepdf.pl \
application/vnd.wap.xhtml+xml->text/html /opt/vin/mht2html.pl
vnd.wap.xhtml+xml was the MIME type for my mht documents.
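
One way to double-check which MIME type the server really reports for the mht files is to ask it directly. What follows is only a minimal sketch in Python, not part of ht://Dig: the helper names (base_mime_type, served_mime_type, parsers_entry) are made up for illustration, and it assumes the web server answers HEAD requests.

```python
import urllib.request


def base_mime_type(content_type_header):
    """Drop parameters such as '; charset=utf-8' and lower-case the type."""
    return content_type_header.split(";", 1)[0].strip().lower()


def served_mime_type(url):
    """Send a HEAD request and return the MIME type the server reports."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=10) as response:
        return base_mime_type(response.headers.get("Content-Type", ""))


def parsers_entry(mime_type, converter_path):
    """Build one external_parsers entry in the 'type->text/html' form."""
    return f"{mime_type}->text/html {converter_path}"


# Example use (needs network access to the test server):
#   mime = served_mime_type("http://172.26.0.169/testdig/beepmacro.mht")
#   print(parsers_entry(mime, "/opt/vin/mht2html.pl"))
```

If the printed entry differs from the one in the config file, the declaration will probably never match the mht documents.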
So I ran the dig and everything seemed to go fine, ending with:


0/http://172.26.0.169/testdig/
1/http://172.26.0.169/testdig/About_comments_eex3.mht
2/http://172.26.0.169/testdig/aster.pdf
3/http://172.26.0.169/testdig/beepmacro.mht
4/http://172.26.0.169/testdig/index.txt
5/http://172.26.0.169/testdig/test.html

(I am doing this in a test folder)

But when I go to the search page, it won't find words inside the mht files.
It works for the pdf, txt and html ones, but can't find the words that are
in the mht ones.

I suppose I am missing something here... do I need to set up any other
settings for the search engine?
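
To rule out the converter itself, it can be run by hand on one of the mht files and its output checked for a word that the search should find. This is a rough sketch in Python under one assumption: as far as I understand the ht://Dig documentation, an external parser or converter is called with four arguments (input file, content type, URL, configuration file). The function names and the paths in the example are hypothetical.

```python
import re
import subprocess
import sys


def converter_output(command, infile, content_type, url, config_file):
    """Run the converter with the four assumed arguments and return its stdout."""
    result = subprocess.run(
        list(command) + [infile, content_type, url, config_file],
        capture_output=True,
        text=True,
    )
    if result.stderr:
        # Pass on anything the converter complained about.
        print(result.stderr, file=sys.stderr)
    return result.stdout


def contains_word(html, word):
    """Check whether a word appears in the text once the tags are stripped."""
    text = re.sub(r"<[^>]*>", " ", html).lower()
    return word.lower() in text.split()


# Example use (the local copy and the config path are hypothetical):
#   html = converter_output(
#       ["/opt/vin/mht2html.pl"],
#       "/tmp/beepmacro.mht",
#       "application/vnd.wap.xhtml+xml",
#       "http://172.26.0.169/testdig/beepmacro.mht",
#       "/path/to/htdig.conf",
#   )
#   print(contains_word(html, "macro"))
```

If the word is missing from the converter output, the problem is in the parser rather than in the search settings.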

Thanks a lot for all your help,

Ainhoa

On Feb 8, 2008 12:58 PM, <[EMAIL PROTECTED]> wrote:

>  Ainhoa,
> Can I ask you to check whether _new_ PDFs are getting indexed correctly?
>
> I notice that the syntax used in the very first, commented, line of the
> external_parsers section looks different to the rest:
>
> application/pdf->text/html /usr/local/bin/conv_doc.pl
>
> Note the 'arrow' and mime-type bit after application/pdf. All of the
> external_parsers declarations in my config have this same bit, which makes
> me suspect that none of your declarations will be working just now, though
> if you have not rebuilt your databases from scratch this may not be obvious.
> You probably want to be using at least -vv (two letter v's) to get verbose
> output from the dig process - this should tell you what is happening during
> the indexing. My other thought is to check whether the mht files are being
> served to you with that mime-type - this won't work correctly if not, and
> you may need more than one external_parsers declaration to cover all
> possibilities.
>
> Regards,
> Mike
>
>  ------------------------------
> *From:* Ainhoa L [mailto:[EMAIL PROTECTED]
> *Sent:* Wednesday, February 06, 2008 5:29 PM
> *To:* Brockington,MJ,Michael,JPGA4X R
> *Cc:* htdig-general@lists.sourceforge.net
> *Subject:* Re: [htdig] Htdig and MHT files
>
>   Hi Mike,
>
> You are talking about the version with the mht parser, right?
> Below is an extract of the parts that mention mht; I have attached the
> whole file and the parser. (Originally the parser created separate files for
> the files contained in the mht; I modified it so that it only outputs the
> code of the htm file.) Maybe the parser I modified is sending some other
> garbage that the indexer can't read?
>
> bad_extensions: .wav .gz .z .sit .au .zip .tar .hqx .exe .com .gif \
> .jpg .jpeg .aiff .class .map .ram .tgz .bin .rpm .mpg .mov .avi .css
>
> valid_extensions: .html .htm .shtml .php .uhtml .phtml .txt .pdf .mht
>
> external_parsers: application/postscript /usr/local/apache/htdocs/htdig-3.1.6/contrib/parsepdf.pl \
>                   application/pdf /usr/local/apache/htdocs/htdig-3.1.6/contrib/parsepdf.pl \
>                   application/mht /opt/vin/mht2html2.pl
>
> Thanks a lot for your help!
> Regards,
>
> Ainhoa
>
>
>
> On Feb 5, 2008 9:58 PM, <[EMAIL PROTECTED]> wrote:
>
> > Can you show us at least an extract of your config file - as you
> > describe it this should work.
> >
> > Regards,
> > Mike
> >
> >
> > -----Original Message-----
> > From: [EMAIL PROTECTED] on behalf of Ainhoa L
> > Sent: Tue 2/5/2008 4:09 PM
> > To: htdig-general@lists.sourceforge.net
> > Subject: [htdig] Htdig and MHT files
> >
> > Hi! Maybe this is a very stupid question, but is it possible to index mht
> > files with htdig?
> > I have tried adding mht to the valid_extensions list, etc. Obviously htdig
> > doesn't treat them as html and refuses to index them. I looked for a parser
> > and found an mht2html parser, which I modified so that it just writes the
> > html to its output. I added it to the parsers in the htdig config file.
> > This didn't work, although the parser returns valid html...
> > I would like to know if there is any way to index mht files with htdig?
> > Thanks a lot for your help.
> >
> >
>