Hi...
I am a Final Year Undergrad.My Final year project is about search engine for
XML Document..I am currently building this system using Lucene.
The example of XML element from an XML document :
----------------------------------------------
<article>
<body>
<section>
This is my first text
</section>
<p>
This is my second text
</p>
</body>
</article>
----------------------------------------------
After the XML Document is parsed,I will get
tag : article
value :
tag : article/body
value :
tag : article/body/section
value : This is my first text
tag : article/body/p
value : This is my second text
Constructing the Lucene Index, I treat :
1.the XML tag as the field
2.the value of it as the terms to be indexed
Before implementing this search engine,I have designed to build the index in
such a way that every XML tag is converted using binary value,in order to
reduce the size index and perhaps for faster searching.To illustrate:
article will be converted to 0
article/body will be converted to 0.0
article/body/section will be converted to 0.0.0
article/body/p will be converted to 0.0.1
Now,because of using lucene for the implementation,i wonder wheter such
conversion will still be useful for efficiency..I wonder wheter inside the
lucene index itself, such kind of conversion or perhaps even further
optimization is already done in order to reduce the size index or for faster
searching.
Can anyone give me some information?
Really need help...Thanks a lot
Regards,
Maureen
---------------------------------
Be a PS3 game guru.
Get your game face on with the latest PS3 news and previews at Yahoo! Games.