Hi,

I posted a mail some months ago (one of the owners replied to my post
but I don't think my mail got published on the list)
explaining what we had done for a specific project.

That was basically a way to index and then search for any XPath expression
in an XML document.

The idea was to specify, at indexing time, the XPath expressions you wanted
to be able to search on.

Then, before an XML file is indexed, it is parsed and each XPath expression
is evaluated against the document. The result(s), converted into a single
string, are stored and indexed as a particular field.
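
The steps above can be sketched roughly as follows. This is only an
illustration, not the code from the project: it uses the javax.xml.xpath
API from current JDKs, the class and method names (XPathFieldExtractor,
extractFields) are made up, and it assumes every expression selects nodes
rather than a number or boolean. It stops at the field name/value map;
handing those values to Lucene as document fields is left out.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: turns one XML document into (field name -> text) pairs.
class XPathFieldExtractor {

    // fieldToXPath maps a field name to the XPath expression that feeds it.
    static Map<String, String> extractFields(String xml, Map<String, String> fieldToXPath)
            throws Exception {
        // Parse the document once; this is the expensive step mentioned above.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();

        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : fieldToXPath.entrySet()) {
            // Assumption: the expression yields a node-set.
            NodeList nodes = (NodeList) xpath.evaluate(
                    entry.getValue(), doc, XPathConstants.NODESET);
            // Convert all results into a single space-separated string.
            StringBuilder joined = new StringBuilder();
            for (int i = 0; i < nodes.getLength(); i++) {
                if (i > 0) {
                    joined.append(' ');
                }
                joined.append(nodes.item(i).getTextContent().trim());
            }
            fields.put(entry.getKey(), joined.toString());
        }
        return fields;
    }
}
```

Each entry of the returned map would then be added to the index as one
field. An expression that matches nothing gives an empty string.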

It seemed to work well, although it made indexing much slower (since
parsing each file and evaluating the XPath expressions added a lot of
overhead).

I did send the code (very buggy though) to Eugene but haven't heard about it
since.

Mathias Bonnard
Valoris
Tel: 00 33 (0)663561625

-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED]]On Behalf Of Steven J. Owens
Sent: 19 June 2001 01:54
To: [EMAIL PROTECTED]
Subject: [Lucene-users] XML-aware Searching?


Hey folks,

     I just skimmed the last couple dozen messages on Lucene-Users.
Lucene looks kinda nifty and maybe just what I need, but I noticed the
thread on XML searching.  I'm doing a lot of XML stuff and I agree
that providing XML Xpath/Xquery searching is probably going to be
non-trivial.  Might be worth doing, but then again a lot of things are
worth doing (just not all by me, and not all today :-).

     However, I'm curious as to whether I could use Lucene to do
"XML-aware" searches.  Not so much XPath or XQL queries, but being able to
search across a file system of XML documents for documents containing
specific values in certain XML tags in the documents.  

     I want to use a fairly simple XML format to add some structure to
mailing list posts and allow users to annotate them with keywords for
later filtering, to make a more useful mailing list archive.  One of
my big problems with existing archives is that when I'm digging
through a very spammy mailing list (say, for example,
[EMAIL PROTECTED]) I have to wade through hundreds of
obviously inappropriate messages.  To make matters worse, I often run
across the same inappropriate messages, time and again.  If I could
just mark them as inappropriate the first time and have them
automatically be filtered out in the future, that'd be a huge step
forward.

     So my current approach is that I have the archive split up into a
big set of files (42707 at the moment, but that's a month or two out
of date...).  I'm trying to figure out the best way to implement a
scheme where I wrap each message in a straightforward XML format (see
example below).  Then I (and other users) can add annotation tags and
then do searches that use those annotations either specifically to
look for certain annotations, or generally as a filter to be added to
your search (or to be applied against search results).

     For example, you'd take a message:

        From: [EMAIL PROTECTED]
        Subject:  I really don't like Broccoli 

        Y'know, I've always hated Broccoli, but that doesn't mean I
        like President Bush.  I do like servlets, though.

        [EMAIL PROTECTED]

     Which would then get wrapped in a basic XML schema, maybe with some
basic parsing:

        <MESSAGE>
          <HEADERS>
          From: [EMAIL PROTECTED]
          Subject:  I really don't like Broccoli
          </HEADERS>
          <BODY>
          Y'know, I've always hated Broccoli, but that doesn't mean I
          like President Bush.  I do like servlets, though.

          [EMAIL PROTECTED]
          </BODY>
        </MESSAGE>
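
     The wrapping step above might look something like this in Java.  It
is only a sketch under a few assumptions: the class and method names
(MessageWrapper, wrap) are made up, headers are taken to end at the first
blank line, and the text is escaped so that a stray "<" or "&" in a message
doesn't break the XML.

```java
// Hypothetical helper: wraps one raw mail message in the MESSAGE format above.
class MessageWrapper {

    // Escape the characters that would otherwise be read as markup.
    static String escape(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;");
    }

    static String wrap(String rawMessage) {
        String normalized = rawMessage.replace("\r\n", "\n");
        // Assumption: headers and body are separated by the first blank line.
        int split = normalized.indexOf("\n\n");
        String headers = split < 0 ? normalized : normalized.substring(0, split);
        String body = split < 0 ? "" : normalized.substring(split + 2);
        return "<MESSAGE>\n"
                + "<HEADERS>\n" + escape(headers.trim()) + "\n</HEADERS>\n"
                + "<BODY>\n" + escape(body.trim()) + "\n</BODY>\n"
                + "</MESSAGE>\n";
    }
}
```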
 
     And then at some later point somebody could come along and add
an annotation:

        <MESSAGE>
          <HEADERS>
          From: [EMAIL PROTECTED]
          Subject:  I really don't like Broccoli
          </HEADERS>
          <BODY>
          Y'know, I've always hated Broccoli, but that doesn't mean I
          like President Bush.  I do like servlets, though.

          [EMAIL PROTECTED]
          </BODY>
          <ANNOTATION>
            <AUTHOR>[EMAIL PROTECTED]</AUTHOR>
            <KEYWORD>noise</KEYWORD> 
          </ANNOTATION>
        </MESSAGE>

     And then later somebody else might run a search like:

     Search:  

     servlets 
     and not (tag:annotation containing 
        tag:author=curmudgeon and 
        tag:keyword=noise)


     Or something.  Thoughts?  Is Lucene appropriate for this?  I'm
already thinking in terms of a Java & servlets based approach, and
I'm really leaning towards XML for the message bodies, because it
gives me a good mix of simple and fast (one file per message, able to
quickly serve up an individual message) and complex (able to have
arbitrary data and structure in each file).
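
     To make the example search above concrete, here is a rough sketch of
the filter in plain Java, without Lucene.  Everything in it is an
assumption for illustration: the class and method names (AnnotationFilter,
hasAnnotation, matches) are made up, the term match is a simple
case-insensitive substring test on the BODY text rather than real
tokenized search, and "containing" is read as "both tags inside the same
ANNOTATION element".

```java
import java.io.StringReader;
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical filter for messages stored in the MESSAGE format above.
class AnnotationFilter {

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // True if the element has a descendant <tag> whose trimmed text equals value.
    static boolean childTextEquals(Element parent, String tag, String value) {
        NodeList kids = parent.getElementsByTagName(tag);
        for (int i = 0; i < kids.getLength(); i++) {
            if (kids.item(i).getTextContent().trim().equals(value)) {
                return true;
            }
        }
        return false;
    }

    // True if a single ANNOTATION carries both the given AUTHOR and KEYWORD.
    static boolean hasAnnotation(Document doc, String author, String keyword) {
        NodeList annotations = doc.getElementsByTagName("ANNOTATION");
        for (int i = 0; i < annotations.getLength(); i++) {
            Element annotation = (Element) annotations.item(i);
            if (childTextEquals(annotation, "AUTHOR", author)
                    && childTextEquals(annotation, "KEYWORD", keyword)) {
                return true;
            }
        }
        return false;
    }

    // "term and not (annotation containing author and keyword)".
    static boolean matches(String xml, String term, String author, String keyword)
            throws Exception {
        Document doc = parse(xml);
        NodeList bodies = doc.getElementsByTagName("BODY");
        String body = bodies.getLength() == 0 ? "" : bodies.item(0).getTextContent();
        boolean containsTerm = body.toLowerCase(Locale.ROOT)
                .contains(term.toLowerCase(Locale.ROOT));
        return containsTerm && !hasAnnotation(doc, author, keyword);
    }
}
```

     Parsing every file at search time like this would be slow across
tens of thousands of messages; the point of an index is to do the parsing
once, up front, and only consult the indexed fields when searching.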

Steven J. Owens
[EMAIL PROTECTED]




_______________________________________________
Lucene-users mailing list
[EMAIL PROTECTED]
http://lists.sourceforge.net/lists/listinfo/lucene-users

