Hey folks,
I just skimmed the last couple dozen messages on Lucene-Users.
Lucene looks kinda nifty and maybe just what I need, but I noticed the
thread on XML searching. I'm doing a lot of XML stuff and I agree
that providing XML XPath/XQuery searching is probably going to be
non-trivial. Might be worth doing, but then again a lot of things are
worth doing (just not all by me, and not all today :-).
However, I'm curious as to whether I could use Lucene to do
"XML-aware" searches. Not so much XPath or XQL queries, but being able to
search across a file system of XML documents for documents containing
specific values in certain XML tags in the documents.
I want to use a fairly simple XML format to add some structure to
mailing list posts and allow users to annotate them with keywords for
later filtering, to make a more useful mailing list archive. One of
my big problems with existing archives is that when I'm digging
through a very spammy mailing list (say, for example,
[EMAIL PROTECTED]) I have to wade through hundreds of
obviously inappropriate messages. To make matters worse, I often run
across the same inappropriate messages, time and again. If I could
just mark them as inappropriate the first time and have them
automatically be filtered out in the future, that'd be a huge step
forward.
So my current approach is that I have the archive split up into a
big set of files (42707 at the moment, but that's a month or two out
of date...). I'm trying to figure out the best way to implement a
scheme where I wrap each message in a straightforward XML format (see
example below). Then I (and other users) can add annotation tags and
then do searches that use those annotations either specifically to
look for certain annotations, or generally as a filter to be added to
your search (or to be applied against search results).
For example, you'd take a message:
From: [EMAIL PROTECTED]
Subject: I really don't like Broccoli
Y'know, I've always hated Broccoli, but that doesn't mean I
like President Bush. I do like servlets, though.
[EMAIL PROTECTED]
Which would then get wrapped in a basic XML schema, maybe with some
basic parsing:
<MESSAGE>
<HEADERS>
From: [EMAIL PROTECTED]
Subject: I really don't like Broccoli
</HEADERS>
<BODY>
Y'know, I've always hated Broccoli, but that doesn't mean I
like President Bush. I do like servlets, though.
[EMAIL PROTECTED]
</BODY>
</MESSAGE>
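To make the wrapping step concrete, here's a rough sketch in plain Java
(no Lucene involved). The class name, the method names and the sample
address are all made up for illustration, and it assumes headers and body
are separated by the first blank line, as in a normal mail message:

```java
public class MessageWrapper {
    // Escape the characters that would otherwise break XML text content.
    public static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Split a raw message at the first blank line: headers above, body below.
    public static String wrap(String raw) {
        String normalized = raw.replace("\r\n", "\n");
        int split = normalized.indexOf("\n\n");
        String headers = split < 0 ? normalized : normalized.substring(0, split);
        String body = split < 0 ? "" : normalized.substring(split + 2);
        return "<MESSAGE>\n<HEADERS>\n" + escape(headers.trim()) + "\n</HEADERS>\n"
                + "<BODY>\n" + escape(body.trim()) + "\n</BODY>\n</MESSAGE>\n";
    }

    public static void main(String[] args) {
        // Hypothetical sample message; the address is a placeholder.
        String raw = "From: someone@example.com\n"
                + "Subject: I really don't like Broccoli\n"
                + "\n"
                + "I do like servlets, though.\n";
        System.out.print(wrap(raw));
    }
}
```

The escaping matters because real messages are full of stray "<" and "&"
characters that would otherwise make the wrapped file unparseable.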
And then at some later point somebody could come along and add
an annotation:
<MESSAGE>
<HEADERS>
From: [EMAIL PROTECTED]
Subject: I really don't like Broccoli
</HEADERS>
<BODY>
Y'know, I've always hated Broccoli, but that doesn't mean I
like President Bush. I do like servlets, though.
[EMAIL PROTECTED]
</BODY>
<ANNOTATION>
<AUTHOR>[EMAIL PROTECTED]</AUTHOR>
<KEYWORD>noise</KEYWORD>
</ANNOTATION>
</MESSAGE>
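Adding the annotation could be done with the standard Java DOM classes,
something like the sketch below. Again, the class and method names are
just placeholders I made up, and it assumes the message file is already
well-formed XML in the format shown above:

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class Annotator {
    // Parse the message, append one ANNOTATION element, and serialize it back.
    public static String annotate(String xml, String author, String keyword) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            Element annotation = doc.createElement("ANNOTATION");
            Element authorEl = doc.createElement("AUTHOR");
            authorEl.setTextContent(author);
            Element keywordEl = doc.createElement("KEYWORD");
            keywordEl.setTextContent(keyword);
            annotation.appendChild(authorEl);
            annotation.appendChild(keywordEl);
            doc.getDocumentElement().appendChild(annotation);

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not annotate message", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical input; the author address is a placeholder.
        String xml = "<MESSAGE><BODY>I do like servlets, though.</BODY></MESSAGE>";
        System.out.println(annotate(xml, "someone@example.com", "noise"));
    }
}
```

Since each annotation is its own element, several people can annotate the
same message without stepping on each other.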
And then later somebody else might run a search like:
Search:
servlets
and not (tag:annotation containing
tag:author=curmudgeon and
tag:keyword=noise)
Or something. Thoughts? Is Lucene appropriate for this? I'm
already thinking in terms of a Java & servlets based approach, and
I'm really leaning towards XML for the message bodies, because it
gives me a good mix of simple and fast (one file per message, able to
quickly serve up an individual message) and complex (able to have
arbitrary data and structure in each file).
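To pin down what I mean by the example search above, here's a
brute-force sketch in plain Java that checks one message at a time. It
doesn't use Lucene at all (whether Lucene can do the equivalent is exactly
my question), and the class name, method names and sample values are made
up. The point is the semantics: the author and keyword have to match
inside the same annotation for the message to be filtered out.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class AnnotationFilter {
    // Text of the first descendant element with the given tag, or "" if absent.
    static String firstText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }

    // True if the body contains the term and no single annotation carries
    // both the given author and the given keyword.
    public static boolean matches(String xml, String term, String author, String keyword) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse message", e);
        }
        Element root = doc.getDocumentElement();
        String body = firstText(root, "BODY").toLowerCase();
        if (!body.contains(term.toLowerCase())) {
            return false;
        }
        NodeList annotations = root.getElementsByTagName("ANNOTATION");
        for (int i = 0; i < annotations.getLength(); i++) {
            Element annotation = (Element) annotations.item(i);
            if (firstText(annotation, "AUTHOR").equals(author)
                    && firstText(annotation, "KEYWORD").equals(keyword)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical annotated message with placeholder values.
        String xml = "<MESSAGE><BODY>I do like servlets, though.</BODY>"
                + "<ANNOTATION><AUTHOR>curmudgeon</AUTHOR><KEYWORD>noise</KEYWORD></ANNOTATION>"
                + "</MESSAGE>";
        System.out.println(matches(xml, "servlets", "curmudgeon", "noise"));
    }
}
```

Parsing every file on every search obviously won't scale to tens of
thousands of messages, which is why I'm looking at an index in the first
place.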
Steven J. Owens
[EMAIL PROTECTED]
_______________________________________________
Lucene-users mailing list
[EMAIL PROTECTED]
http://lists.sourceforge.net/lists/listinfo/lucene-users