I think Lucene is a very good fit for what you are looking for.
Just use a Lucene 'field' for each of your
'tags', and write the code to extract the content of the various tags so
you can pass it to Lucene for indexing.
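
As a rough illustration of the extraction step, here is a minimal sketch that uses only the JDK's DOM parser. It assumes each message file is well-formed XML (so a stray '<' or '&' in a message body would need escaping first). The class name TagFieldExtractor and the choice of lower-cased tag names as field names are made up for this example, and the Lucene indexing calls themselves are left out.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: turns one message file into "field name -> values".
class TagFieldExtractor {

    /** Maps each leaf tag name (lower-cased) to the trimmed text found in it. */
    static Map<String, List<String>> extractFields(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Map<String, List<String>> fields = new LinkedHashMap<>();
            collect(doc.getDocumentElement(), fields);
            return fields;
        } catch (Exception e) {
            // Parser setup, I/O and malformed XML all end up here.
            throw new IllegalArgumentException("Could not parse message XML", e);
        }
    }

    // Walks the tree depth-first; only elements without child elements
    // (HEADERS, BODY, AUTHOR, KEYWORD in the example) become fields.
    private static void collect(Element element, Map<String, List<String>> fields) {
        boolean hasChildElements = false;
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                hasChildElements = true;
                collect((Element) child, fields);
            }
        }
        if (!hasChildElements) {
            String name = element.getTagName().toLowerCase(Locale.ROOT);
            fields.computeIfAbsent(name, k -> new ArrayList<>())
                  .add(element.getTextContent().trim());
        }
    }

    public static void main(String[] args) {
        String xml = "<MESSAGE>"
                + "<HEADERS>Subject: I really don't like Broccoli</HEADERS>"
                + "<BODY>I do like servlets, though.</BODY>"
                + "<ANNOTATION><AUTHOR>someone</AUTHOR><KEYWORD>noise</KEYWORD></ANNOTATION>"
                + "</MESSAGE>";
        extractFields(xml).forEach((name, values) -> System.out.println(name + " = " + values));
    }
}
```

Each entry in the returned map would then become one field on the document you hand to Lucene. Note that this flattening drops the nesting, so an author and a keyword from the same annotation are no longer tied together; if that pairing matters for your "not (author and keyword)" filter, you could index a combined value per annotation instead.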
Tal
> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED]]On Behalf Of Steven J.
> Owens
> Sent: Monday, June 18, 2001 5:54 PM
> To: [EMAIL PROTECTED]
> Subject: [Lucene-users] XML-aware Searching?
>
>
> Hey folks,
>
> I just skimmed the last couple dozen messages on Lucene-Users.
> Lucene looks kinda nifty and maybe just what I need, but I noticed the
> thread on XML searching. I'm doing a lot of XML stuff and I agree
> that providing XML XPath/XQuery searching is probably going to be
> non-trivial. Might be worth doing, but then again a lot of things are
> worth doing (just not all by me, and not all today :-).
>
> However, I'm curious as to whether I could use Lucene to do
> "XML-aware" searches. Not so much XPath or XQL queries, but being able to
> search across a file system of XML documents for documents containing
> specific values in certain XML tags in the documents.
>
> I want to use a fairly simple XML format to add some structure to
> mailing list posts and allow users to annotate them with keywords for
> later filtering, to make a more useful mailing list archive. One of
> my big problems with existing archives is that when I'm digging
> through a very spammy mailing list (say, for example,
> [EMAIL PROTECTED]) I have to wade through hundreds of
> obviously inappropriate messages. To make matters worse, I often run
> across the same inappropriate messages, time and again. If I could
> just mark them as inappropriate the first time and have them
> automatically be filtered out in the future, that'd be a huge step
> forward.
>
> So my current approach is that I have the archive split up into a
> big set of files (42707 at the moment, but that's a month or two out
> of date...). I'm trying to figure out the best way to implement a
> scheme where I wrap each message in a straightforward XML format (see
> example below). Then I (and other users) can add annotation tags and
> then do searches that use those annotations either specifically to
> look for certain annotations, or generally as a filter to be added to
> your search (or to be applied against search results).
>
> For example, you'd take a message:
>
> From: [EMAIL PROTECTED]
> Subject: I really don't like Broccoli
>
> Y'know, I've always hated Broccoli, but that doesn't mean I
> like President Bush. I do like servlets, though.
>
> [EMAIL PROTECTED]
>
> Which would then get wrapped in a basic XML schema, maybe with some
> basic parsing:
>
> <MESSAGE>
> <HEADERS>
> From: [EMAIL PROTECTED]
> Subject: I really don't like Broccoli
> </HEADERS>
> <BODY>
> Y'know, I've always hated Broccoli, but that doesn't mean I
> like President Bush. I do like servlets, though.
>
> [EMAIL PROTECTED]
> </BODY>
> </MESSAGE>
>
> And then at some later point somebody could come along and add
> an annotation:
>
> <MESSAGE>
> <HEADERS>
> From: [EMAIL PROTECTED]
> Subject: I really don't like Broccoli
> </HEADERS>
> <BODY>
> Y'know, I've always hated Broccoli, but that doesn't mean I
> like President Bush. I do like servlets, though.
>
> [EMAIL PROTECTED]
> </BODY>
> <ANNOTATION>
> <AUTHOR>[EMAIL PROTECTED]</AUTHOR>
> <KEYWORD>noise</KEYWORD>
> </ANNOTATION>
> </MESSAGE>
>
> And then later somebody else might run a search like:
>
> Search:
>
> servlets
> and not (tag:annotation containing
> tag:author=curmudgeon and
> tag:keyword=noise)
>
>
> Or something. Thoughts? Is Lucene appropriate for this? I'm
> already thinking in terms of a Java & servlets based approach, and
> I'm really leaning towards XML for the message bodies, because it
> gives me a good mix of simple and fast (one file per message, able to
> quickly serve up an individual message) and complex (able to have
> arbitrary data and structure in each file).
>
> Steven J. Owens
> [EMAIL PROTECTED]
>
>
>
>
> _______________________________________________
> Lucene-users mailing list
> [EMAIL PROTECTED]
> http://lists.sourceforge.net/lists/listinfo/lucene-users
>