First, I admit to not reading your whole email, but here are some thoughts
from just a quick scan of it.
Lucene is not really meant to search XML files as such (for example, with
XPath queries). But if your XML documents are simple, or you are only
interested in locating documents that have specific terms in a particular
tag, you could break each document down into tags and content, where each
tag becomes a field and its attributes and content become the values. The
index would be very flat and you would lose the hierarchy of the XML file,
but you could at least locate a document that matches and then process it.
You would not be able to run a search and have it return a particular node
or XML fragment.
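
To make the flattening idea concrete, here is a rough sketch in plain Java.
It deliberately makes no Lucene calls, since the exact indexing API depends
on your version. It walks the XML with the JDK's DOM parser and collects
tag name -> values, throwing away the nesting; each entry is what you would
then add to a Lucene document as a field. The class and method names
(XmlFlattener, flatten, fieldContains), the lowercased field names, the
"tag.attr" naming for attributes, and the example.com address are all my own
assumptions for illustration, not anything Lucene defines.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XmlFlattener {

    /** Parses the XML and returns field name -> values, ignoring nesting. */
    public static Map<String, List<String>> flatten(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Map<String, List<String>> fields = new LinkedHashMap<>();
            collect(doc.getDocumentElement(), fields);
            return fields;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    private static void collect(Element element, Map<String, List<String>> fields) {
        String tag = element.getTagName().toLowerCase(Locale.ROOT);

        // Attributes become their own flat fields, named "tag.attr" (an assumed convention).
        NamedNodeMap attrs = element.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            Node attr = attrs.item(i);
            String name = tag + "." + attr.getNodeName().toLowerCase(Locale.ROOT);
            fields.computeIfAbsent(name, k -> new ArrayList<>()).add(attr.getNodeValue());
        }

        // Recurse into child elements; gather this element's own direct text.
        StringBuilder text = new StringBuilder();
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            short type = child.getNodeType();
            if (type == Node.ELEMENT_NODE) {
                collect((Element) child, fields);
            } else if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
                text.append(child.getNodeValue());
            }
        }

        // Elements with only whitespace (pure containers) produce no field.
        String value = text.toString().trim();
        if (!value.isEmpty()) {
            fields.computeIfAbsent(tag, k -> new ArrayList<>()).add(value);
        }
    }

    /** Crude stand-in for a per-field term query: case-insensitive substring match. */
    public static boolean fieldContains(Map<String, List<String>> fields,
                                        String field, String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        for (String value : fields.getOrDefault(field, new ArrayList<>())) {
            if (value.toLowerCase(Locale.ROOT).contains(needle)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical annotated message, modelled on the example in the quoted mail.
        String xml = "<MESSAGE>"
                + "<HEADERS>Subject: I really don't like Broccoli</HEADERS>"
                + "<BODY>I do like servlets, though.</BODY>"
                + "<ANNOTATION>"
                + "<AUTHOR>curmudgeon@example.com</AUTHOR>"
                + "<KEYWORD>noise</KEYWORD>"
                + "</ANNOTATION>"
                + "</MESSAGE>";

        Map<String, List<String>> fields = flatten(xml);
        System.out.println(fields);

        // "servlets and not keyword:noise", evaluated against the flat fields.
        boolean hit = fieldContains(fields, "body", "servlets")
                && !fieldContains(fields, "keyword", "noise");
        System.out.println("keep this message: " + hit);
    }
}
```

In a real setup the fieldContains check would be replaced by a Lucene query
against the corresponding fields. Note how the flat map shows the limitation
mentioned above: if a message had two annotations, nothing would tie a given
author value to a given keyword value.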
-david
> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED]]On Behalf Of Steven J.
> Owens
> Sent: Monday, June 18, 2001 5:54 PM
> To: [EMAIL PROTECTED]
> Subject: [Lucene-users] XML-aware Searching?
>
>
> Hey folks,
>
> I just skimmed the last couple dozen messages on Lucene-Users.
> Lucene looks kinda nifty and maybe just what I need, but I noticed the
> thread on XML searching. I'm doing a lot of XML stuff and I agree
> that providing XML Xpath/Xquery searching is probably going to be
> non-trivial. Might be worth doing, but then again a lot of things are
> worth doing (just not all by me, and not all today :-).
>
> However, I'm curious as to whether I could use Lucene to do
> "XML-aware" searches. Not so much XPath or XQL queries, but being able to
> search across a file system of XML documents for documents containing
> specific values in certain XML tags in the documents.
>
> I want to use a fairly simple XML format to add some structure to
> mailing list posts and allow users to annotate them with keywords for
> later filtering, to make a more useful mailing list archive. One of
> my big problems with existing archives is that when I'm digging
> through a very spammy mailing list (say, for example,
> [EMAIL PROTECTED]) I have to wade through hundreds of
> obviously inappropriate messages. To make matters worse, I often run
> across the same inappropriate messages, time and again. If I could
> just mark them as inappropriate the first time and have them
> automatically be filtered out in the future, that'd be a huge step
> forward.
>
> So my current approach is that I have the archive split up into a
> big set of files (42707 at the moment, but that's a month or two out
> of date...). I'm trying to figure out the best way to implement a
> scheme where I wrap each message in a straightforward XML format (see
> example below). Then I (and other users) can add annotation tags and
> then do searches that use those annotations either specifically to
> look for certain annotations, or generally as a filter to be added to
> your search (or to be applied against search results).
>
> For example, you'd take a message:
>
> From: [EMAIL PROTECTED]
> Subject: I really don't like Broccoli
>
> Y'know, I've always hated Broccoli, but that doesn't mean I
> like President Bush. I do like servlets, though.
>
> [EMAIL PROTECTED]
>
> Which would then get wrapped in a basic XML schema, maybe with some
> basic parsing:
>
> <MESSAGE>
> <HEADERS>
> From: [EMAIL PROTECTED]
> Subject: I really don't like Broccoli
> </HEADERS>
> <BODY>
> Y'know, I've always hated Broccoli, but that doesn't mean I
> like President Bush. I do like servlets, though.
>
> [EMAIL PROTECTED]
> </BODY>
> </MESSAGE>
>
> And then at some later point somebody could come along and add
> an annotation:
>
> <MESSAGE>
> <HEADERS>
> From: [EMAIL PROTECTED]
> Subject: I really don't like Broccoli
> </HEADERS>
> <BODY>
> Y'know, I've always hated Broccoli, but that doesn't mean I
> like President Bush. I do like servlets, though.
>
> [EMAIL PROTECTED]
> </BODY>
> <ANNOTATION>
> <AUTHOR>[EMAIL PROTECTED]</AUTHOR>
> <KEYWORD>noise</KEYWORD>
> </ANNOTATION>
> </MESSAGE>
>
> And then later somebody else might run a search like:
>
> Search:
>
> servlets
> and not (tag:annotation containing
> tag:author=curmudgeon and
> tag:keyword=noise)
>
>
> Or something. Thoughts? Is Lucene appropriate for this? I'm
> already thinking in terms of a java & servlets based approach, and
> I'm really leaning towards XML for the message bodies, because it
> gives me a good mix of simple and fast (one file per message, able to
> quickly serve up an individual message) and complex (able to have
> arbitrary data and structure in each file).
>
> Steven J. Owens
> [EMAIL PROTECTED]
>
>
>
>
> _______________________________________________
> Lucene-users mailing list
> [EMAIL PROTECTED]
> http://lists.sourceforge.net/lists/listinfo/lucene-users