I salute the Lucene community! it will be a great help for me if I get your valuable opinions on the following issue; I know I could've find more answers to my questions from reading the documentation but I did invest some time on this and still have these questions:
I am (also) building a web crawler, a topic specific one to be more precise, for a vortal. I recently learned about Lucene and I'd very much like to use it in order to handle keyword specific searched on the info that I collect. I suspect this is a "classic" project, at least for Lucene, probably something like this has been addressed already on this disussion list, I'm interested to hear any experience anyone might have with this subject. My crawler goes on the internet, extracts/parse/ranks and saves websites, most of the information is also categoriezed and stored in the database but I also save about 10 top pages from each site in the filesystem. The first question is: should I care about indexing these files at the time I extract them from internet? Or should I index them later, when I make them available for search? If yes, then can I still name my files the way I want?(i.e. are there any constraints in the filenames from Lucene perspective?) Is it an OK idea to have the same files repository (or index) where the crawler writes (indexes files) and the search function searches? I guess performance issues are important here. Can I still organize the files that I save the way I want? (I planned to write all the files from a given website on different folders...and the folders will have as name the id from my database) I maintain a taxonomy (list of categories)...each website will fall into one or more of these categories, also each website will have a rank. Does Lucene have something that I should be aware of related to what I said? I guess that's it for now...this is more like a pet project for me, a pet which keeps growing :) I wouldn't mind any help and opinions you can provide, source code samples, etc. Big thanks in advance and good luck on your work. adrian. --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
