Ainhoa,
I see what you mean - that certainly doesn't look quite right!
Before we go any further, can I ask if you have tried indexing this file
as 'plain' HTML? I know that it doesn't look quite right, but it would
appear to me that the content should be okay for htdig's own HTML
parser, if things are set up correctly. Since we know that the config
wasn't correct (at first) for the MIME type etc., it would be worth
checking that over - no point doing an MHT -> HTML translation if it is
HTML to begin with!
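One rough way to check that, sketched in Python: look at the first bytes of the file and see whether they start like an HTML document or like a MIME archive. The function name looks_like_html is hypothetical, and the test is only a heuristic (a plain XML file would also pass), so treat it as a quick sanity check rather than a definitive answer.

```python
def looks_like_html(data: bytes) -> bool:
    """Heuristic: True if the content starts like an HTML/XHTML document.

    A real MHT archive normally starts with MIME headers such as
    "MIME-Version:" or "Content-Type: multipart/related" instead.
    """
    head = data[:1024]
    # Skip a UTF-8 byte-order mark if one is present.
    if head.startswith(b"\xef\xbb\xbf"):
        head = head[3:]
    head = head.lstrip().lower()
    # "<?xml" is included because XHTML files often begin with an XML declaration.
    return head.startswith((b"<!doctype html", b"<html", b"<?xml"))
```

Reading one of the .mht files in binary mode and passing its contents to this function should show whether a conversion step is needed at all.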
I am afraid that I don't have a working example, but
http://www.htdig.org/attrs.html#external_parsers describes how to
target a file at the internal parser.
Regards,
Mike
PS None of my messages have been coming back to me via the list - have
you been getting one copy or two?
________________________________
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Ainhoa L
Sent: Monday, February 11, 2008 3:19 PM
To: Brockington,MJ,Michael,JPGA4X R
Cc: htdig-general@lists.sourceforge.net
Subject: Re: [htdig] Htdig and MHT files
Yeah, you are right, I think it doesn't like the output at all.
Instead of the real words, it is picking these up as words:
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
:S
So I suppose htdig just doesn't really like the output of the
parser. I'm attaching the output of parser (executed manually) and the
output of dig just in case you have any more ideas :)
Thanks a lot!
Ainhoa
On Feb 11, 2008 12:16 PM, <[EMAIL PROTECTED]> wrote:
Ainhoa,
My first instinct would now be to check the parser
output - try adding another -v to your htdig command line (and possibly
restricting your indexing to just this one file) and check the log
output - it may be that htdig does not like the output from your Perl
script. www.htdig.org explains what the output means. I seem to recall
you saying that you had already tested that it ran on its own, but
possibly there is something not right there, or a typo in the config
that neither of us can see.
Regards,
Mike
________________________________
From: Ainhoa L [mailto:[EMAIL PROTECTED]
Sent: Monday, February 11, 2008 9:33 AM
To: Brockington,MJ,Michael,JPGA4X R
Cc: htdig-general@lists.sourceforge.net
Subject: Re: [htdig] Htdig and MHT files
Hi Mike,
Yes, you were right, I was missing that part and
I didn't even notice!
I changed the config file and wrote this:
application/pdf->text/html /usr/local/apache/htdocs/htdig-3.1.6/contrib/parsepdf.pl \
application/vnd.wap.xhtml+xml->text/html /opt/vin/mht2html.pl
vnd.wap.xhtml+xml was the MIME type for my mht
documents.
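For readers following the thread, here is a minimal sketch of what a converter in the role of mht2html.pl is expected to do. It is written in Python rather than Perl and is not the script used above. It assumes that htdig passes the path of a temporary input file as the first command-line argument and reads the converted HTML from standard output, and that the MHT file is an ordinary MIME multipart message containing a text/html part. The names extract_html and main are hypothetical.

```python
import email
import sys
from email import policy


def extract_html(mht_bytes: bytes) -> str:
    """Return the first text/html part of an MHT (MIME HTML) archive, or ''."""
    msg = email.message_from_bytes(mht_bytes, policy=policy.default)
    for part in msg.walk():
        if part.get_content_type() == "text/html":
            # get_content() undoes base64 / quoted-printable transfer
            # encoding and decodes the part using its declared charset.
            return part.get_content()
    return ""


def main(argv) -> int:
    """Assumed calling convention: argv[1] is the input file path."""
    if len(argv) < 2:
        print("usage: mht2html.py INFILE [CONTENT-TYPE URL CONFIG]", file=sys.stderr)
        return 2
    with open(argv[1], "rb") as fh:
        html = extract_html(fh.read())
    if not html:
        return 1
    sys.stdout.write(html)
    return 0

# When saved as a script, finish with: sys.exit(main(sys.argv))
```

Running a converter like this by hand on a single .mht file and reading what it prints is a quick way to see whether the output is readable HTML or still encoded text, before involving htdig at all.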
So I ran dig and everything seemed to go fine,
ending with:
0/http://172.26.0.169/testdig/
1/http://172.26.0.169/testdig/About_comments_eex3.mht
2/http://172.26.0.169/testdig/aster.pdf
3/http://172.26.0.169/testdig/beepmacro.mht
4/http://172.26.0.169/testdig/index.txt
5/http://172.26.0.169/testdig/test.html
(I am doing this in a test folder)
But when I go to the search page, it won't find
words inside the mht files. It works for the pdf, txt and html ones, but
can't find the words that are in the mht ones.
I suppose I am missing something here... do I
need to configure anything else for the search engine?
Thanks a lot for all your help,
Ainhoa
_______________________________________________
ht://Dig general mailing list: <htdig-general@lists.sourceforge.net>
ht://Dig FAQ: http://htdig.sourceforge.net/FAQ.html
List information (subscribe/unsubscribe, etc.)
https://lists.sourceforge.net/lists/listinfo/htdig-general