I would play along and provide some numbers myself, but in our current
Solr testbed the time is dominated by Solr document generation rather
than the indexing itself. There are upwards of 12-15 metadata fields
in our schema in addition to the full text, and the values are pulled
from various sources, including over http from an ADS web API.
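
To make that bottleneck concrete, here is a rough sketch of the
doc-generation step. The field names and the fetch_ads_metadata stub
are purely illustrative (the real call is an HTTP round trip to the
ADS API, and our schema has more fields), so treat this as the shape
of the work, not our actual code:

```python
def fetch_ads_metadata(bibcode):
    """Stand-in for the HTTP call to the ADS web API.

    In the real pipeline this is a network round trip per record,
    which is where most of the time goes.
    """
    return {
        "title": "An example paper",
        "author": ["Smith, J."],
        "pubdate": "2010-05",
        # ... roughly a dozen more fields in practice
    }

def build_solr_doc(bibcode, fulltext):
    """Assemble one Solr document: metadata fields plus the full text."""
    doc = {"id": bibcode, "fulltext": fulltext}
    doc.update(fetch_ads_metadata(bibcode))
    return doc

doc = build_solr_doc("2010ApJ...999..123S", "body text ...")
```

The point is that each document costs several lookups before Solr ever
sees it, so our wall-clock numbers measure generation, not indexing.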

I will say that I've found the existing Solr documentation to be
sufficient, though maybe my expectations weren't that high. There are
two full-length books available, one commercial [1] and one free [2],
and I find the Solr wiki [3] to be consistently helpful.

Jan is right that parallel indexing is not currently supported in
Solr. I had heard that the DataImportHandler, which pulls data from a
relational database (rather than having it pushed over http), was
going to provide for parallelization. I hadn't heard of Katta, but
that looks interesting.
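
In the meantime, the usual workaround is to parallelize on the client
side: split the documents into batches and let N worker processes each
post their share to Solr's update handler over http. Below is a
minimal sketch of just the partitioning step (batch size and worker
count are arbitrary, and the actual HTTP posting is left out):

```python
from itertools import islice

def batches(docs, size):
    """Yield successive fixed-size batches from an iterable of docs."""
    it = iter(docs)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def assign_round_robin(doc_batches, n_workers):
    """Deal batches out to workers round-robin; each worker would then
    post its batches to Solr's update handler independently."""
    queues = [[] for _ in range(n_workers)]
    for i, batch in enumerate(doc_batches):
        queues[i % n_workers].append(batch)
    return queues

docs = ["doc%d" % i for i in range(10)]
queues = assign_round_robin(list(batches(docs, 3)), 2)
# worker 0 gets batches 0 and 2; worker 1 gets batches 1 and 3
```

Since Solr serializes commits anyway, the win here is mainly in
overlapping the doc generation and network transfer, which is exactly
where our time goes.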

Tomorrow I'll try to collect and outline any special full-text
requirements we have and add them to the wiki page.

--jay

[1] https://www.packtpub.com/solr-1-4-enterprise-search-server/book
[2] http://www.lucidimagination.com/Downloads/LucidWorks-for-Solr/Reference-Guide
[3] http://wiki.apache.org/solr/

On Mon, May 17, 2010 at 4:09 PM, Roman Chyla <[email protected]> wrote:
> Hi Jan, hi Jay,
>
> I was curious to check the speed of Lucene, so I am attaching two screenshots:
>
> 1. indexing 49K docs with one machine
> 2. indexing 49K in parallel with 3 machines
>
> As for the setup, it is running on the SLC5 machines, with Python 2.5,
> and it is using PyLucene. The hardware is always different, so I don't
> know for sure; the files sit on one machine in the network (and we
> ignore all PDFs, which are the red lines you see in the charts. Please
> ignore the scale of the red line; I have no control over the chart
> generator [Google]).
>
> The workers contact a remote machine to get URLs and then fetch the
> data (so there is a delay from the network transfer; they also report
> back). For the 1-machine run, I didn't print enough detail (it says
> 0.2, which might be 0.15-0.2 s). The average time per file (also
> limited by I/O) across the 3 workers was 0.0077 s, i.e. 3 * 0.0077 =
> 0.023 s per file per machine, which might mean about 3000 files per
> minute per machine (?).
>
> The resulting index was 773 MB (you can find it inside AFS).
>
> The indexing used the default StandardAnalyzer, committing after
> every 200 files; we are indexing only the url and the fulltext (two
> fields).
>
>
> (for those who have access to my AFS account, you can actually test it
> yourself by:
> cd /afs/users/r/rchyla/w0/test
> ./start_jobs 1nw 3  # i.e. start the indexing job with 3 machines
>
> or you can grab the code and run it on your machine:
> https://svnweb.cern.ch/trac/rcarepo/browser/newseman/trunk/src/merkur/workflows/indexing/test_ft_indexing.py
> )
>
> I expect PyLucene to be slower than plain Lucene, but it is still not
> slow; perhaps we can come up with a configuration to assess them both?
> Jan's numbers show 10K files per 10 min, if I am not mistaken.
>
> Best,
>
> roman
>
> PS: somewhat related, you might also want to check this:
> https://svnweb.cern.ch/trac/rcarepo/wiki/InspireSemanticSearch#Fulltextsearchwithsemanticfeatures
>  -- there the indexing speed is 0.19 s/file, but with a much more
> complex setup
>
> On Mon, May 17, 2010 at 5:37 PM, Jan Iwaszkiewicz
> <[email protected]> wrote:
>> Hi Jay,
>>
>> Thanks for the offer. Of course, it will be useful if you describe your Solr
>> tests. Just start a new paragraph (like the "Test with 58k text files..."
>> one). I also tested Solr before and it is interesting but has no support for
>> parallel processing yet.
>> This wiki is a draft summarising the different potential solutions for large
>> scale full-text indexing in Invenio. It's a work-in-progress description and
>> the performance results are only indicative. Also, the tests were done on
>> the same machine, so the relative speeds are still informative. I've added
>> basic hardware specs.
>>
>> Feel free to also express your view on the requirements for the full-text
>> search, in case it differs from the CDS/INSPIRE one. I heard you use a
>> different stemming. Full-text indexing for Invenio is my main task at the
>> moment and I'm more than happy to work with your team to make sure that we
>> don't duplicate effort and have a solution covering all needs.
>>
>> Best Regards,
>> Jan
>>
>>
>> Jay Luker wrote:
>>>
>>> Hi all,
>>>
>>> Since we at ADS are starting to experiment with Solr I wondered if it
>>> might be helpful if I added some embellishments to the content at
>>> https://twiki.cern.ch/twiki/bin/view/CDS/TalkFullTextIndex. What is
>>> the goal of that page exactly?
>>>
>>> Also, I can't help pointing out that the performance numbers at the
>>> bottom aren't very informative due to a lack of context (hardware
>>> specs, etc.)
>>>
>>>
>>
>>
>



-- 
******************************************************
Jay Luker               Astrophysics Data System (ADS)
[email protected]  Center for Astrophysics
617-495-4588            60 Garden Street  MS 67
617-495-7356 fax        Cambridge, MA  02138
******************************************************
