What is the best way to index xml data preserving the mark up?

David Neubert Wed, 07 Nov 2007 20:18:58 -0800

I am sure this is a 101 question, but I am a bit confused about indexing XML 
data using SOLR.


I have rich XML content (books) that needs to be searched at granular levels 
(specifically at the paragraph and sentence levels, very accurately, no 
approximations).  My source text has exact <p></p> and <s></s> tags for this 
purpose.  I have built this app in previous versions (using other search 
engines) by indexing the text twice: (1) where every paragraph was a virtual 
document and (2) where every sentence was a virtual document -- both extracted 
from the source file (which was a single XML file for the entire book).  I have 
of course thought about using an XML engine such as eXist or Xindice, but I 
prefer the stability, user base, and performance that Lucene/SOLR seems to 
have, and there is also a large body of text that consists of regular 
documents rather than well-formed XML.
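
To make the "virtual document" idea concrete, here is a minimal sketch in 
Python of splitting one book file into paragraph-level and sentence-level 
records. The sample book, the id attributes, and the virtual_docs helper are 
all hypothetical and only illustrate the approach; this is not the code from 
the earlier versions of the app.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; a real book file would be far larger.
BOOK_XML = """<book id="b1">
  <p id="p1"><s id="s1">First <i>sentence</i>.</s> <s id="s2">Second sentence.</s></p>
  <p id="p2"><s id="s3">Third sentence.</s></p>
</book>"""


def virtual_docs(book_xml, tag):
    """Return one (id, markup, plain_text) tuple per <tag> element."""
    root = ET.fromstring(book_xml)
    docs = []
    for elem in root.iter(tag):
        # ET.tostring() also emits the text that follows the closing tag,
        # so clear the tail temporarily to serialize the element alone.
        saved_tail = elem.tail
        elem.tail = None
        markup = ET.tostring(elem, encoding="unicode")
        elem.tail = saved_tail
        # Plain text with all nested tags removed.
        text = "".join(elem.itertext())
        docs.append((elem.get("id"), markup, text))
    return docs


paragraphs = virtual_docs(BOOK_XML, "p")  # one record per paragraph
sentences = virtual_docs(BOOK_XML, "s")   # one record per sentence
```

Each record keeps both the original markup and a tag-free text version, so 
either one could be sent to the index.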

I am brand new to SOLR (one day) and at a basic level understand SOLR's nice 
simple xml scheme to add documents:

<add>
  <doc>
    <field name="foo1">foo value 1</field>
    <field name="foo2">foo value 2</field>
  </doc>
  <doc>...</doc>
</add>

But my problem is that I believe I need to preserve the XML markup at the 
paragraph and sentence levels, so I was hoping to create a content field that 
could just contain the source XML for the paragraph or sentence respectively.  
There are reasons for this that I won't go into -- there is a lot of granular 
work in this app, accessing paragraphs and sentences.

Obviously an XML mechanism that could leverage the XML structure (via XPath or 
XPointer) would work great.  Still, I think Lucene can do this at the field 
level -- and I can't imagine that users who are indexing XML documents have 
to go through the trouble of stripping all the markup before indexing.  
Hopefully I am missing something basic?

It would be great to be pointed in the right direction on this matter.

I think I need something along this line:

<add>
  <doc>
    <field name="foo1">value 1</field>
    <field name="foo2">value 2</field>
    ....
    <field name="content"><an xml stream with embedded source markup></field>
  </doc>
</add>
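
One point worth spelling out: raw tags cannot sit unescaped inside a <field> 
of the <add> message, since the XML parser would read them as part of the 
message structure. A minimal sketch in Python, assuming the markup is sent as 
escaped text (the field names "id" and "content", the id value, and the 
solr_add_message helper are hypothetical):

```python
import xml.etree.ElementTree as ET


def solr_add_message(docs):
    """Build an <add> message from a list of {field name: value} dicts."""
    add = ET.Element("add")
    for fields in docs:
        doc = ET.SubElement(add, "doc")
        for name, value in fields.items():
            field = ET.SubElement(doc, "field", name=name)
            # Assigning to .text means <, > and & in the value are
            # escaped (&lt; &gt; &amp;) when the message is serialized.
            field.text = value
    return ET.tostring(add, encoding="unicode")


message = solr_add_message([
    {"id": "book1-p1-s1",
     "content": '<s id="s1">First <i>sentence</i>.</s>'},
])

# Parsing the message back recovers the original markup unchanged.
parsed = ET.fromstring(message)
content = parsed.find("doc/field[@name='content']").text
```

This only shows that the markup survives the trip inside the message. How 
the tags are treated at search time depends on the analyzer configured for 
the field, which this sketch does not address.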

Maybe the overall question is: what is the best way to index XML content 
using SOLR, and is all this tag stripping really necessary?

Thanks for any help,

Dave





