Hi Eran, see my comments below inline:
On 03/11/2008 at 9:23 AM, Eran Sevi wrote:
> I would like to ask for suggestions of the best design for
> the following scenario:
>
> I have a very large number of XML files (around 1M).
> Each file contains several sections. Each section contains
> many elements (about 1000-5000).
> Each element has a value and some attributes describing the
> value (like metadata), for example:
>
> <Section1>
>   <Element1 id="0" type="A" meta1="val11" meta2="val21">value1</Element1>
>   <Element1 id="1" type="B" meta1="val12" meta2="val21">value2</Element1>
>   ...
> </Section1>
> <Section2>
>   <Element2 id="0" type="D" meta1="val11" meta3="val31">value3</Element2>
>   <Element2 id="1" type="B" meta1="val13" meta3="val34">value1</Element2>
>   ...
> </Section2>
> ...
>
> As you can see, each attribute can have any value, and
> attribute names can be the same in different sections.
>
> I would like to index the XML in such a way that I can perform
> queries like:
>
> element1=value1 AND type=A AND meta2=val21
>
> and also more complicated queries that include positions
> between elements, and even range queries on attribute values.
>
> Indexing each element as a different document might not be
> possible because of the large number of documents it might
> create (more than 5 billion docs), and might also make it
> difficult to parse results - I still want to know how
> many original XML documents contain the searched terms.
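A single Lucene index cannot hold 5 billion docs: document numbers are Java
ints, so the ceiling is roughly 2.1 billion per index. Split across several
indexes, though, that volume should be manageable, so I think you should try
doc = element and see how well it works.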
In order to know which original documents your hits come from, add an
"xml_doc_id" field, and collect the hits' xml_doc_id values in a set, then take
the set's cardinality.
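
To make the counting step concrete, here is a minimal sketch in plain Python
rather than Lucene API code. The `hits` list and the `count_source_documents`
helper are hypothetical stand-ins for whatever stored fields your search
actually returns:

```python
# Hypothetical hit records: in a real index these would be the stored
# fields of each matching element-level document.
hits = [
    {"xml_doc_id": "1", "xpath": "/Section1/Element1[1]"},
    {"xml_doc_id": "1", "xpath": "/Section1/Element1[2]"},
    {"xml_doc_id": "7", "xpath": "/Section2/Element2[1]"},
]


def count_source_documents(hits):
    """Return how many distinct original XML files the hits come from."""
    # A set keeps each xml_doc_id once, so its size is the cardinality.
    return len({hit["xml_doc_id"] for hit in hits})


print(count_source_documents(hits))  # 2
```

With billions of element docs the set only grows with the number of distinct
XML files that match (at most about 1M here), not with the number of hits.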
> Indexing each attribute as a different field is also
> difficult because I then need the positional information
> of the found terms and check that they were all found in
> the same place (and thus "belong" to the same element).
You could use an XPath(-ish, depending on requirements) field that represents
the element location, e.g.:

<Section1>
  <Element1 id="0" type="A" meta1="val11" meta2="val21">value1</Element1>
  <Element1 id="1" type="B" meta1="val12" meta2="val21">value2</Element1>
  ...
</Section1>

==>

Lucene Document (field-name:value)

doc #1
  xml_doc_id:1
  xpath:/Section1/Element1[1]
  id:0
  type:A
  meta1:val11
  meta2:val21
  value:value1

doc #2
  xml_doc_id:1
  xpath:/Section1/Element1[2]
  id:1
  type:B
  meta1:val12
  meta2:val21
  value:value2
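
As a rough illustration of that mapping, the sketch below flattens the sample
XML into one record per element using only Python's standard library. It is
not Lucene code; the `flatten` function, the dummy `<root>` wrapper, and the
in-memory filter at the end are assumptions made for the example. Each record
would become one Lucene document in the real indexer.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def flatten(xml_doc_id, xml_text):
    """Yield one field dict per element, one dict per future Lucene doc."""
    # The sample has several top-level sections, so wrap them in a dummy root.
    root = ET.fromstring("<root>" + xml_text + "</root>")
    for section in root:
        seen = Counter()  # per-tag position counter within this section
        for element in section:
            seen[element.tag] += 1
            record = {
                "xml_doc_id": str(xml_doc_id),
                "xpath": "/%s/%s[%d]"
                % (section.tag, element.tag, seen[element.tag]),
                "value": (element.text or "").strip(),
            }
            # Attributes become fields (id, type, meta1, ...). An attribute
            # literally named "value" or "xpath" would overwrite the above.
            record.update(element.attrib)
            yield record


sample = """
<Section1>
  <Element1 id="0" type="A" meta1="val11" meta2="val21">value1</Element1>
  <Element1 id="1" type="B" meta1="val12" meta2="val21">value2</Element1>
</Section1>
"""

records = list(flatten(1, sample))

# element1=value1 AND type=A AND meta2=val21, checked on a single record,
# so all three conditions necessarily hold for the same element.
matches = [
    r for r in records
    if r["value"] == "value1" and r.get("type") == "A"
    and r.get("meta2") == "val21"
]
print([r["xpath"] for r in matches])  # ['/Section1/Element1[1]']
```

Because every condition is tested against one record, there is no need for
positional bookkeeping to prove the terms belong to the same element.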
Hope it helps,
Steve