Hi Jan, hi Jay,

I was curious to check the speed of Lucene, so I am attaching two screenshots:
1. indexing 49K docs with one machine
2. indexing 49K docs in parallel with 3 machines

As for the setup: it is running on the SLC5 machines, with Python 2.5, and it is using pylucene. The hardware differs from machine to machine, so I don't know the exact specs. The files sit on one machine in the network, and we ignore all PDFs -- those are the red lines you see in the charts (please ignore the scale of the red line; I have no control over the chart generator [google]). The workers contact a remote machine to get URLs and then fetch the data, so there is a delay from the network transfer; they also report results back.

For the single-machine run I didn't print enough detail (it says 0.2, which might be 0.15-0.2 s per file). With 3 workers the average (also limited by the I/O) was 0.0077 s per file -- 3 * 0.0077 = 0.023 s per file per machine, which might mean roughly 2500-3000 files per minute on one machine (?). The resulting index was 773 MB (you can find it inside AFS).

The indexing used the default StandardAnalyzer, committing after every 200 files; we index only the url and the fulltext (two fields). A rough sketch of this configuration is at the bottom of this mail, after the quoted thread.

For those who have access to my AFS account, you can actually test it yourself:

    cd /afs/users/r/rchyla/w0/test
    ./start_jobs 1nw 3    # i.e. start the indexing job with 3 machines

or you can grab the code and run it on your machine:
https://svnweb.cern.ch/trac/rcarepo/browser/newseman/trunk/src/merkur/workflows/indexing/test_ft_indexing.py

I expect pylucene to be slower than plain (Java) Lucene, but it is still not slow; perhaps we can come up with a common configuration to assess them both? Jan's numbers show 10K files per 10 min, if I am not wrong.

Best,
roman

PS: somewhat related, you might also want to check this:
https://svnweb.cern.ch/trac/rcarepo/wiki/InspireSemanticSearch#Fulltextsearchwithsemanticfeatures
-- there the indexing speed is 0.19 s/file, but it is a much more complex setup.

On Mon, May 17, 2010 at 5:37 PM, Jan Iwaszkiewicz <[email protected]> wrote:
> Hi Jay,
>
> Thanks for the offer. Of course, it will be useful if you describe your Solr
> tests. Just start a new paragraph (like the "Test with 58k text files..."
> one). I also tested Solr before and it is interesting, but it has no support
> for parallel processing yet.
>
> This wiki is a draft summarising the different potential solutions for
> large-scale full-text indexing in Invenio. It's a work-in-progress
> description and the performance results are only indicative. Also, the tests
> were done on the same machine, so relative speeds are still informative.
> I've added basic hardware specs.
>
> Feel free to also express your view on the requirements for the full-text
> search, should it be different from the CDS/INSPIRE one. I heard you use
> different stemming. Full-text indexing for Invenio is my main task at the
> moment and I'm more than happy to work with your team to make sure that we
> don't duplicate effort and have a solution covering all needs.
>
> Best Regards,
> Jan
>
>
> Jay Luker wrote:
>>
>> Hi all,
>>
>> Since we at ADS are starting to experiment with Solr, I wondered if it
>> might be helpful if I added some embellishments to the content at
>> https://twiki.cern.ch/twiki/bin/view/CDS/TalkFullTextIndex. What is
>> the goal of that page exactly?
>>
>> Also, I can't help pointing out that the performance numbers at the
>> bottom aren't very informative due to a lack of context (hardware
>> specs, etc.)
>>
>>
>
>
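PS2: in case it helps when we compare configurations, here is a minimal sketch of the kind of indexing loop described above (default StandardAnalyzer, two fields, commit every 200 docs). It assumes the pylucene 2.9/3.x API (exact constructor signatures differ between versions); the index path, the iter_documents() helper, and the field store/tokenize options are made up for illustration -- the real loop is in test_ft_indexing.py linked above.

    import lucene
    lucene.initVM()
    from lucene import (SimpleFSDirectory, File, StandardAnalyzer,
                        IndexWriter, Document, Field, Version)

    # open/create the index on disk (path is just an example)
    store = SimpleFSDirectory(File("/tmp/ft-index"))
    analyzer = StandardAnalyzer(Version.LUCENE_CURRENT)
    writer = IndexWriter(store, analyzer, True,
                         IndexWriter.MaxFieldLength.UNLIMITED)

    # iter_documents() is a hypothetical helper yielding (url, fulltext) pairs;
    # in the real test the workers fetch these from the remote machine
    for i, (url, text) in enumerate(iter_documents()):
        doc = Document()
        # storing the url untokenized and not storing the fulltext is a guess,
        # only the two field names come from the description above
        doc.add(Field("url", url, Field.Store.YES, Field.Index.NOT_ANALYZED))
        doc.add(Field("fulltext", text, Field.Store.NO, Field.Index.ANALYZED))
        writer.addDocument(doc)
        if (i + 1) % 200 == 0:
            writer.commit()   # commit after every 200 files, as in the test

    writer.commit()
    writer.close()

The commit interval (and the RAM buffer size) probably affects the numbers quite a bit, so it might be worth fixing these settings when we compare against the Java Lucene / Solr runs.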
<<attachment: ft-indexing-one-machine.png>>
<<attachment: ft-indexing-3-machines.png>>
