Hi Jan, hi Jay,

I was curious to check the speed of Lucene, so I am attaching two screenshots:
1. indexing 49K docs with one machine
2. indexing 49K docs in parallel with 3 machines

As for the setup: it is running on the SLC5 machines, with Python 2.5, and it is using pylucene. The hardware differs from machine to machine, so I don't know the exact specs. The files sit on one machine in the network, and we ignore all PDFs -- those are the red lines you see in the charts (please ignore the scale of the red line; I have no control over the chart generator [google]). The workers contact a remote machine to get URLs and then fetch the data, so there is a delay from the network transfer; they also report results back.

For the single-machine run I didn't print enough detail (it says 0.2, which might be 0.15-0.2 s per file). With 3 workers the average (also limited by the I/O) was 0.0077 s per file -- 3 * 0.0077 = 0.023 s per file per machine, which might mean roughly 2500-3000 files per minute on one machine (?). The resulting index was 773 MB (you can find it inside AFS).

The indexing used the default StandardAnalyzer, committing after every 200 files; we index only the url and the fulltext (two fields). A rough sketch of this configuration is at the bottom of this mail, after the quoted thread.

For those who have access to my AFS account, you can actually test it yourself:

    cd /afs/users/r/rchyla/w0/test
    ./start_jobs 1nw 3    # i.e. start the indexing job with 3 machines

or you can grab the code and run it on your machine:
https://svnweb.cern.ch/trac/rcarepo/browser/newseman/trunk/src/merkur/workflows/indexing/test_ft_indexing.py

I expect pylucene to be slower than plain (Java) Lucene, but it is still not slow; perhaps we can come up with a common configuration to assess them both? Jan's numbers show 10K files per 10 min, if I am not wrong.

Best,
roman

PS: somewhat related, you might also want to check this:
https://svnweb.cern.ch/trac/rcarepo/wiki/InspireSemanticSearch#Fulltextsearchwithsemanticfeatures
-- there the indexing speed is 0.19 s/file, but it is a much more complex setup.

On Mon, May 17, 2010 at 5:37 PM, Jan Iwaszkiewicz <[email protected]> wrote:
> Hi Jay,
>
> Thanks for the offer. Of course, it will be useful if you describe your Solr
> tests. Just start a new paragraph (like the "Test with 58k text files..."
> one). I also tested Solr before and it is interesting, but it has no support
> for parallel processing yet.
>
> This wiki is a draft summarising the different potential solutions for
> large-scale full-text indexing in Invenio. It's a work-in-progress
> description and the performance results are only indicative. Also, the tests
> were done on the same machine, so relative speeds are still informative.
> I've added basic hardware specs.
>
> Feel free to also express your view on the requirements for the full-text
> search, should it be different from the CDS/INSPIRE one. I heard you use
> different stemming. Full-text indexing for Invenio is my main task at the
> moment and I'm more than happy to work with your team to make sure that we
> don't duplicate effort and have a solution covering all needs.
>
> Best Regards,
> Jan
>
>
> Jay Luker wrote:
>>
>> Hi all,
>>
>> Since we at ADS are starting to experiment with Solr, I wondered if it
>> might be helpful if I added some embellishments to the content at
>> https://twiki.cern.ch/twiki/bin/view/CDS/TalkFullTextIndex. What is
>> the goal of that page exactly?
>>
>> Also, I can't help pointing out that the performance numbers at the
>> bottom aren't very informative due to a lack of context (hardware
>> specs, etc.)
>>
>>
>
>
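PS2: in case it helps when we compare configurations, here is a minimal sketch of the kind of indexing loop described above (default StandardAnalyzer, two fields, commit every 200 docs). It assumes the pylucene 2.9/3.x API (exact constructor signatures differ between versions); the index path, the iter_documents() helper, and the field store/tokenize options are made up for illustration -- the real loop is in test_ft_indexing.py linked above.

    import lucene
    lucene.initVM()
    from lucene import (SimpleFSDirectory, File, StandardAnalyzer,
                        IndexWriter, Document, Field, Version)

    # open/create the index on disk (path is just an example)
    store = SimpleFSDirectory(File("/tmp/ft-index"))
    analyzer = StandardAnalyzer(Version.LUCENE_CURRENT)
    writer = IndexWriter(store, analyzer, True,
                         IndexWriter.MaxFieldLength.UNLIMITED)

    # iter_documents() is a hypothetical helper yielding (url, fulltext) pairs;
    # in the real test the workers fetch these from the remote machine
    for i, (url, text) in enumerate(iter_documents()):
        doc = Document()
        # storing the url untokenized and not storing the fulltext is a guess,
        # only the two field names come from the description above
        doc.add(Field("url", url, Field.Store.YES, Field.Index.NOT_ANALYZED))
        doc.add(Field("fulltext", text, Field.Store.NO, Field.Index.ANALYZED))
        writer.addDocument(doc)
        if (i + 1) % 200 == 0:
            writer.commit()   # commit after every 200 files, as in the test

    writer.commit()
    writer.close()

The commit interval (and the RAM buffer size) probably affects the numbers quite a bit, so it might be worth fixing these settings when we compare against the Java Lucene / Solr runs.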
<<attachment: ft-indexing-one-machine.png>>
<<attachment: ft-indexing-3-machines.png>>
