I think Lucene is a very good fit for what you are looking for.
Just use a Lucene 'field' for each of your 'tags', and write the code
to extract the content of the various tags so you can pass it to
Lucene for indexing.
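
As a rough illustration of the extraction step only, here is a minimal
sketch that uses the JDK's built-in DOM parser. It deliberately makes no
Lucene calls; the class name TagExtractor, the method extractFields, and
the sample address are made up for this example, and it assumes each
message file is well-formed XML. It collects the text of every leaf tag
under a lower-cased field name, so each entry could then be added to the
index as one field value.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: not part of Lucene, just the "extract the tags" step.
class TagExtractor {

    // Returns leaf tag name (lower-cased) -> trimmed text values, in document order.
    static Map<String, List<String>> extractFields(String xml) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        Map<String, List<String>> fields = new LinkedHashMap<>();
        collect(doc.getDocumentElement(), fields);
        return fields;
    }

    // Walks the tree; only elements without child elements contribute a value.
    private static void collect(Element element, Map<String, List<String>> fields) {
        boolean hasChildElements = false;
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child instanceof Element) {
                hasChildElements = true;
                collect((Element) child, fields);
            }
        }
        if (!hasChildElements) {
            String text = element.getTextContent().trim();
            if (!text.isEmpty()) {
                String name = element.getTagName().toLowerCase(Locale.ROOT);
                fields.computeIfAbsent(name, k -> new ArrayList<>()).add(text);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Sample message in the shape Steven describes; values are placeholders.
        String xml =
            "<MESSAGE>"
            + "<HEADERS>Subject: I really don't like Broccoli</HEADERS>"
            + "<BODY>I do like servlets, though.</BODY>"
            + "<ANNOTATION><AUTHOR>someone@example.org</AUTHOR>"
            + "<KEYWORD>noise</KEYWORD></ANNOTATION>"
            + "</MESSAGE>";
        System.out.println(extractFields(xml));
    }
}
```

One caveat with flattening like this: the link between an AUTHOR and the
KEYWORD inside the same ANNOTATION is lost. If you need "keyword=noise by
this particular author", one option is to also store a combined value
(author plus keyword) as its own field.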

Tal

> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED]]On Behalf Of Steven J. Owens
> Sent: Monday, June 18, 2001 5:54 PM
> To: [EMAIL PROTECTED]
> Subject: [Lucene-users] XML-aware Searching?
> 
> 
> Hey folks,
> 
>      I just skimmed the last couple dozen messages on Lucene-Users.
> Lucene looks kinda nifty and maybe just what I need, but I noticed the
> thread on XML searching.  I'm doing a lot of XML stuff and I agree
> that providing XML XPath/XQuery searching is probably going to be
> non-trivial.  Might be worth doing, but then again a lot of things are
> worth doing (just not all by me, and not all today :-).
> 
>      However, I'm curious as to whether I could use Lucene to do
> "XML-aware" searches.  Not so much XPath or XQL queries, but being able to
> search across a file system of XML documents for documents containing
> specific values in certain XML tags in the documents.  
> 
>      I want to use a fairly simple XML format to add some structure to
> mailing list posts and allow users to annotate them with keywords for
> later filtering, to make a more useful mailing list archive.  One of
> my big problems with existing archives is that when I'm digging
> through a very spammy mailing list (say, for example,
> [EMAIL PROTECTED]) I have to wade through hundreds of
> obviously inappropriate messages.  To make matters worse, I often run
> across the same inappropriate messages, time and again.  If I could
> just mark them as inappropriate the first time and have them
> automatically be filtered out in the future, that'd be a huge step
> forward.
> 
>      So my current approach is that I have the archive split up into a
> big set of files (42707 at the moment, but that's a month or two out
> of date...).  I'm trying to figure out the best way to implement a
> scheme where I wrap each message in a straightforward XML format (see
> example below).  Then I (and other users) can add annotation tags and
> then do searches that use those annotations either specifically to
> look for certain annotations, or generally as a filter to be added to
> your search (or to be applied against search results).
> 
>      For example, you'd take a message:
> 
>       From: [EMAIL PROTECTED]
>       Subject:  I really don't like Broccoli 
> 
>       Y'know, I've always hated Broccoli, but that doesn't mean I
>       like President Bush.  I do like servlets, though.
> 
>       [EMAIL PROTECTED]
> 
>      Which would then get wrapped in a basic XML schema, maybe with some
> basic parsing:
> 
>       <MESSAGE>
>         <HEADERS>
>         From: [EMAIL PROTECTED]
>         Subject:  I really don't like Broccoli
>         </HEADERS>
>         <BODY>
>         Y'know, I've always hated Broccoli, but that doesn't mean I
>         like President Bush.  I do like servlets, though.
> 
>         [EMAIL PROTECTED]
>         </BODY>
>       </MESSAGE>
>  
>      And then at some later point somebody could come along and add
> an annotation:
> 
>       <MESSAGE>
>         <HEADERS>
>         From: [EMAIL PROTECTED]
>         Subject:  I really don't like Broccoli
>         </HEADERS>
>         <BODY>
>         Y'know, I've always hated Broccoli, but that doesn't mean I
>         like President Bush.  I do like servlets, though.
> 
>         [EMAIL PROTECTED]
>         </BODY>
>         <ANNOTATION>
>           <AUTHOR>[EMAIL PROTECTED]</AUTHOR>
>           <KEYWORD>noise</KEYWORD> 
>         </ANNOTATION>
>       </MESSAGE>
> 
>      And then later somebody else might run a search like:
> 
>      Search:  
> 
>      servlets 
>      and not (tag:annotation containing 
>       tag:author=curmudgeon and 
>       tag:keyword=noise)
> 
> 
>      Or something.  Thoughts?  Is Lucene appropriate for this?  I'm
> already thinking in terms of a Java & servlets based approach, and
> I'm really leaning towards XML for the message bodies, because it
> gives me a good mix of simple and fast (one file per message, able to
> quickly serve up an individual message) and complex (able to have
> arbitrary data and structure in each file).
> 
> Steven J. Owens
> [EMAIL PROTECTED]
> 
> 
> 
> 
> _______________________________________________
> Lucene-users mailing list
> [EMAIL PROTECTED]
> http://lists.sourceforge.net/lists/listinfo/lucene-users
> 
