Hi,

I built a MetaTagsManager class.

In HTMLDocument, there is a line of code that causes the doc
to be parsed:

HTMLParser parser = new HTMLParser(f);

After parsing, all of the parse results are available. I added:

Properties prop = parser.getMetaTags();
//prop.list(System.out);
MetaTagManager.process(doc, prop);

Then, in the "process" method of MetaTagManager, you have
access to all of the meta tags as well as the document. You
can add them as any sort of fields you need. A typical bit might be:

        String desc = prop.getProperty(META_DESCRIPTION);
        if (desc != null) {
            System.out.println("DESC: " + desc);
            doc.add(Field.UnIndexed(DOC_DESCRIPTION, desc));
        }


Be warned, though: As has been discussed on this list several times, IndexHTML has some significant limitations. The most important one for me has been the fact that it's not unicode-clean.

However, I'm still using it!

Good luck,

Fred


At 07:06 PM 10/14/2004, you wrote:
Hi Fred,

Thanks for replying. I see what you mean by creating tags for name and description. I am not sure about how to hack the methods and where. If you can point me in right direction that would really be appreciated. I am thinking about editing the shipped HTMLDocument.java and HTMLParser.java files but not sure what I need to add. Can you please explain a little more.

TIA.
-H

Fred Toth wrote:
Hi,
Could be your best bet is to use HTML <meta> tags. Create tags
for name, description, etc. (title is already parsed). The HTML parser
that ships with Lucene will parse these tags into java Properties. You
will need to hack a bit, but you can easily pick these up and add them
as specific fields to your index.
Fred
At 02:42 PM 10/13/2004, you wrote:

Hello,

I am using the IndexHTML class to index around 30,000 files and it is working fine. Question that I have is, is there a way to add multiple fields to index so that when the actual search is performed I can extract the exact match.
E.g.
the fields can be
1) title - abc
2) name - foo inc,
3) description - Lorem ipsum dolor sit
4) URL - www.lorem.ipsum


and so on,

From search when the match for title 'abc' is found then searching for doc.get("name") can return foo inc and so on.

Is this already happening in any other indexing class if not what do I need to add to IndexHTML class to accomplish this?

thanks for all the help gang.
-H


--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]



Reply via email to