Hello, I'm testing how PmWiki, using the sqlite recipe as page store, can handle one hundred thousand (short) pages (and maybe more later, if I can make it work the way I need).
In my case, for a dictionary simulation, I define one group for Word and another for Quote. A pagelist in each Word page prints out the linked quotes. I imported 70k words (word, sense, etymology, synonyms, etc., all as PTVs, to be templated later) and 90k quotes (the quote, its source, its word). The sqlite database is now about 100 MB (for 100k "pages") and .pageindex is about 40 MB.

The main issue is how .pageindex handles its indexing task. It simply stops working when the amount of _new_ data is too big: the process seems to load all the new data first, rather than starting to index, so when there is too much new data you get a memory error and the game is over. I wish the page indexing would keep working, no matter how much new data there is to index, until it's done. When the amount of new data is acceptable, it does build the index, although not in one pass: you have to trigger it several times (ten searches, more or less), but in the end you know it's done and you have not hit any memory issue.

To import all my data, I had no choice but to split my original big files into 10 or 20 pieces. After each partial import, I had to run a few searches to trigger the process, until I was sure the indexing was done (watching .pageindex in the file explorer until its size stopped growing). In other words, there is no way to import 50 MB of new data and get it indexed; you have to split it first.

At the end of the story it works, and it doesn't work badly at all. Hans' TextExtract (limited to the Word group) does a nice job as well: 133 results from 90 pages, 72236 pages searched in 1.95 seconds (regex doesn't work, but you can target anchored text).

It works, yes, but I don't feel safe, mostly because of the trouble of getting .pageindex built. The biggest problem is that I cannot delete the current .pageindex, which took me more than an hour to build. If I deleted this file, the amount of _new_ data (all 100 MB of sqlite data would look new) would be far too vast: PmWiki would run out of memory on every search and the indexing process would never complete; it would fail first.

Is there something to do with the native search engine to avoid it failing whenever the amount of new data is too big? Or how can the pageindex processing be made safe? ImportText has an internal mechanism to avoid hitting PHP's "maximum execution time" limit: "this script will perform imports for up to 15 seconds (as set by the $ImportTime variable). If it's unable to process all of the imported files within that period of time, it closes out its work and queues the remaining files to be processed on subsequent requests." Would it be possible/easy/pertinent to implement a similar protection in the native search engine? (A rough sketch of the pattern I have in mind is below my signature.)

A related question: since I'm using sqlite to store a large number of short and very short pages, why use PmWiki's .pageindex process rather than performing a fulltext search directly in the database? (Also sketched below.)

Thank you for your advice.

Gilles.

--
---------------------------------------
| A | de la langue française
| B | http://www.languefrancaise.net
| C | [email protected]
---------------------------------------
@bobmonamour
---------------------------------------
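P.S. To make the ImportText comparison concrete, here is roughly the time-budget pattern I have in mind, as an untested sketch. IndexOnePage() and QueueRemaining() are hypothetical placeholders, not real PmWiki functions; only the $ImportTime idea is taken from ImportText.

    <?php
    # Untested sketch: apply ImportText's time budget idea to indexing.
    # $IndexTime plays the role of $ImportTime; the two helper functions
    # below are hypothetical placeholders, not actual PmWiki API calls.
    $IndexTime = 15;                 # seconds of work allowed per request
    $start = time();
    while ($pagesToIndex) {
      $pagename = array_shift($pagesToIndex);
      IndexOnePage($pagename);       # placeholder: update .pageindex for one page
      if (time() - $start >= $IndexTime) break;   # stop before hitting the limit
    }
    if ($pagesToIndex)
      QueueRemaining($pagesToIndex); # placeholder: save the rest for the next request
    ?>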
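P.P.S. On the fulltext question: since the pages already live in sqlite, a fulltext index inside the same database could answer searches without .pageindex at all. The sketch below is only an illustration; the table and column names ('pages', 'name', 'text') are guesses, not the recipe's real schema, and it assumes the SQLite build includes the FTS4 extension.

    <?php
    # Rough, untested illustration of a fulltext search done directly in the
    # sqlite page store. Schema names here are assumptions, not the recipe's layout.
    $db = new PDO('sqlite:/path/to/wiki.db');

    # One-off setup: build an FTS4 index over the page text.
    $db->exec("CREATE VIRTUAL TABLE pages_fts USING fts4(name, text)");
    $db->exec("INSERT INTO pages_fts (name, text) SELECT name, text FROM pages");

    # A search then uses the fulltext index instead of scanning every page.
    $q = $db->prepare("SELECT name FROM pages_fts WHERE text MATCH :term");
    $q->execute(array(':term' => 'etymology'));
    foreach ($q->fetchAll(PDO::FETCH_COLUMN) as $name) echo "$name\n";
    ?>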
