I am sure this is a 101 question, but I am a bit confused about indexing XML data using SOLR.
I have rich XML content (books) that needs to be searched at granular levels: specifically at the paragraph and sentence levels, very accurately, with no approximations. My source text has exact <p></p> and <s></s> tags for this purpose. I have built this app in previous versions (using other search engines) by indexing the text twice: (1) where every paragraph was a virtual document and (2) where every sentence was a virtual document. Both sets were extracted from the source file (which was a single XML file for the entire book).

I have of course thought about using an XML engine such as eXist or Xindice, but I prefer the stability, user base, and performance that Lucene/SOLR seems to have. There is also a large body of text that consists of regular documents rather than well-formed XML.

I am brand new to SOLR (one day) and at a basic level understand SOLR's nice, simple XML scheme for adding documents:

    <add>
      <doc>
        <field name="foo1">foo value 1</field>
        <field name="foo2">foo value 2</field>
      </doc>
      <doc>...</doc>
    </add>

But my problem is that I believe I need to preserve the XML markup at the paragraph and sentence levels, so I was hoping to create a content field that could just contain the source XML for the paragraph or sentence respectively. There are reasons for this that I won't go into -- there is a lot of granular work in this app, accessing paragraphs and sentences. Obviously an XML mechanism that could leverage the XML structure (via XPath or XPointer) would work great. Still, I think Lucene can do this at the field level, and I also can't imagine that users who are indexing XML documents have to go through the trouble of stripping all the markup before indexing. Hopefully I am missing something basic? It would be great to be pointed in the right direction on this matter.

I think I need something along these lines:

    <add>
      <doc>
        <field name="foo1">value 1</field>
        <field name="foo2">value 2</field>
        ....
        <field name="content"><an xml stream with embedded source markup></field>
      </doc>
    </add>

Maybe the overall question is: what is the best way to index XML content using SOLR, and is all this tag stripping really necessary?

Thanks for any help,
Dave
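P.S. To make the question concrete, here is a minimal Python sketch of the kind of thing I am hoping is possible. It is only an illustration, not something I have run against SOLR. It builds an <add> message in which the paragraph markup is carried as escaped character data inside the content field, then checks that the markup comes back unchanged when the message is parsed. The helper name build_add_xml, the field names id and content, and the sample paragraph are all made up for the example. How SOLR would analyze the tag text inside such a field is exactly the part I don't know.

```python
import xml.etree.ElementTree as ET


def build_add_xml(docs):
    """Build a Solr-style <add> message from a list of {field name: value} dicts.

    Hypothetical helper for illustration only; it does not talk to Solr.
    """
    add = ET.Element("add")
    for fields in docs:
        doc = ET.SubElement(add, "doc")
        for name, value in fields.items():
            field = ET.SubElement(doc, "field", {"name": name})
            # Assigning to .text makes ElementTree escape <, > and & when
            # serializing, so embedded markup is sent as character data
            # rather than as child elements of <field>.
            field.text = value
    return ET.tostring(add, encoding="unicode")


# Sample paragraph with sentence tags (invented for this example).
paragraph = (
    '<p id="p1"><s id="s1">First sentence.</s> '
    '<s id="s2">Second &amp; last.</s></p>'
)

# Field names "id" and "content" are assumptions, not taken from a real schema.
message = build_add_xml([{"id": "book1-p1", "content": paragraph}])

# Round trip: parse the message and read the content field back.
parsed = ET.fromstring(message)
recovered = parsed.find("doc/field[@name='content']").text

# The recovered string is itself well-formed XML for the paragraph.
sentences = ET.fromstring(recovered).findall("s")

print(message)
print(recovered == paragraph, len(sentences))
```

If storing the markup this way is reasonable, the stored field value could be parsed again on retrieval to get at the individual <s> elements, while a second, tag-free field could be used for the actual searching.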