Juan Jose Pablos wrote:
FYI:
Ricardo Beltran wrote:
I've CC'd Ricardo on this reply - please reply all.
...
My questions are: Do you think that Forrest is an appropriate framework
for this purpose? and Do you think that Lucene or
Google will do the job of indexing about (5 GB) of XML
files?
I can't comment with authority on the suitability of Google or Lucene
for this as I have no experience. My gut is telling me that this is not
the optimal solution.
I do have a project that has around 8 GB of dynamic data being published
via the Forrest webapp.
The solution I employed, and one that appears to be working well, was to
keep the data in an XML-enabled database. In this case we used Oracle,
but we have successfully used Xindice and eXist in similar, smaller
projects in the past. I wrote a custom generator to retrieve the data
from the DBMS.
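
To give a rough idea of the shape of such a generator, here is a minimal
sketch. It is not the code from my project and it does not use the Cocoon
Generator API. It assumes a hypothetical XmlRepository interface standing in
for the database lookup, and uses only the JDK's SAX classes to replay the
stored document as SAX events to whatever consumes them. All class and method
names are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical stand-in for the XML-enabled database (Oracle, Xindice, eXist, ...). */
interface XmlRepository {
    /** Returns the stored XML document for the given key, or null if there is none. */
    String fetch(String key);
}

/** Sketch of a generator: looks a document up and replays it as SAX events. */
class RepositoryGenerator {
    private final XmlRepository repository;

    RepositoryGenerator(XmlRepository repository) {
        this.repository = repository;
    }

    /** Fetches the document for key and streams it to the given SAX consumer. */
    void generate(String key, ContentHandler consumer) {
        String xml = repository.fetch(key);
        if (xml == null) {
            throw new IllegalArgumentException("No document for key: " + key);
        }
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            reader.setContentHandler(consumer);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            // Wrap parser and I/O failures so callers see a single unchecked error.
            throw new IllegalStateException("Could not stream document: " + key, e);
        }
    }
}

/** Trivial consumer used to demonstrate the flow: counts the elements it receives. */
class ElementCounter extends DefaultHandler {
    int count = 0;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        count++;
    }
}
```

In a real setup the XmlRepository implementation would run the query against
the DBMS, and the consumer would be the next stage of the pipeline rather
than a counter.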
It should be noted that Cocoon has some database components that can be
utilised (the results of some early experiments I did with these
components are in the whiteboard plugin
org.apache.forrest.plugin.Database). The reason I never completed work
on this plugin was not a problem with it, but additional requirements
that made it easier to build a custom generator (our requests were also
dependent on live data from sensor readings over an RS232 port).
The system has now been running for about 3 months and we are very happy
with it. Because we are using a database server as the repository, we
get all the indexing and optimisation provided by that server. We also
have the benefit of a very expressive and mature search language.
Of course, this solution requires that you run the system dynamically.
Using Google to index your site would allow you to run statically.
Trying to build a static site from 5 GB of data would be a wonderful
stress test. If you do this, please report your findings to us.
Ross