Re: Lucene indexing questions

Roman Chyla Thu, 7 Oct 2010 19:24:22 +0200

On Thu, Oct 7, 2010 at 5:29 PM, Tibor Simko <[email protected]> wrote:
> On Thu, 07 Oct 2010, Roman Chyla wrote:
>> thank you, so i guess there is something that interprets those
>> relations
>
> Yes, there is Python code in the web app server that walks through the
> dictionaries as needed.
>
>>>> Do you have some reasons to believe that the pairs are more storage
>>>> effective, than the points in the index?
>>>
>>> A web app node does not have to contact the DB node in order to walk
>>> over the citation map to provide a cite summary, because the full
>>> citation map is readily available in its memory.  Good for load
>>> distribution, hence speed and scalability.
>>
>> strictly speaking, we were having discussion about the storage, so it
>> still seems to be not soooo much more swelled
>
> The web app nodes fetch and cache citation dictionaries from the DB
> storage space upon Apache startup.  Then they don't bother going to the
> DB server for citation data anymore, except for quick timestamp checks
> to see if everything is still up to date.  So, if we have N web app
> nodes, they can process N*WP cite summary queries in parallel (where WP
> is the number of worker processes per node) without ever charging DB
> server for the citation data.  (Only for that quick timestamp check.)
>
> If we have data points stored in an `index node' that is separated from
> the `web app nodes', then the web app nodes would have to dispatch
> queries to the `index node' and gather response back, taking some time.
> If there is only one `index node', then this would become a real
> bottleneck.  If there are several `index nodes', then it is not so
> different from having in-memory citation dictionary nodes indeed, from
> the scalability point of view.  It would be a bit like doubling or
> shadowing the web app nodes, so to speak.  But wouldn't it require some
> Solr extension?
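
for reference, the caching scheme described above could be sketched
roughly like this. it is only an illustration, not the actual Invenio
code: CitationCache, load_dict and load_timestamp are hypothetical
names, and the two callables stand in for the real DB queries.

```python
class CitationCache:
    """In-memory copy of the citation dictionary held by one web app worker."""

    def __init__(self, load_dict, load_timestamp):
        # load_dict and load_timestamp are hypothetical callables standing
        # in for the real DB queries (assumption, not the Invenio API).
        self._load_dict = load_dict
        self._load_timestamp = load_timestamp
        self._timestamp = load_timestamp()
        self._cited_by = load_dict()  # full fetch once, at startup

    def cited_by(self, recid):
        """Return the set of recids citing `recid`."""
        # Only the cheap timestamp check goes to the DB on each call;
        # the full dictionary is refetched only when it is stale.
        current = self._load_timestamp()
        if current != self._timestamp:
            self._cited_by = self._load_dict()
            self._timestamp = current
        return self._cited_by.get(recid, set())
```

with N nodes and WP workers per node there would be N*WP such caches,
each holding its own full copy of the dictionary.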


sorry, my brain is only half working... what i failed to write is that
the same thing, in the scenario with the citation index inside solr,
could be done with standard lucene (using filters) -- and I believe
that is what Grant Ingersoll had in mind (but we didn't get his view on
that) -- and i believe no extension is needed, because the mechanism is
already there

if the citation index is queried and maintained by the search engine,
the recids don't need to be shuffled between the 'web app' and the
'index app', so the bottleneck isn't there. and it is obviously as
scalable as the other solutions, in theory, while using less memory and
possibly being safer -- because one can distribute the searchers (and
they will not slow down or kill the main app if a lot of cite summaries
are 'suddenly' requested)
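
to make the idea concrete, here is a rough sketch of what would run on
the searcher side. it is plain python, not lucene code: the sets merely
stand in for the bitsets a lucene filter would work with, and
cite_summary, restrict_to_citing and the cited_by mapping are
hypothetical names. the point is that the hits and the citation data
sit in the same process, so only the small summary has to travel back
to the web app.

```python
def cite_summary(hits, cited_by):
    """Count citations for the records in a result set.

    hits     -- set of recids matching the user's query
    cited_by -- dict mapping recid -> set of recids citing it
    """
    counts = {recid: len(cited_by.get(recid, ())) for recid in hits}
    total = sum(counts.values())
    return {
        "records": len(hits),
        "total_citations": total,
        "average": total / len(hits) if hits else 0.0,
        "counts": counts,
    }


def restrict_to_citing(hits, cited_by, target):
    """Filter step: keep only the hits that cite `target`."""
    return hits & cited_by.get(target, set())
```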

to answer the question on extensions, the filtering classes would
probably be where the logic (which is now in parts of the python code)
would go - but i don't believe that we are talking about big chunks of
code, because lucene as a search engine offers some special features
that had to be done differently in python. so while i understand that
the prospect of moving that functionality into solr might be scary
(please note that i am just thinking aloud!), it might not be so bad -
in your last email you wrote that refers/cited took 2 hours of one
evening. While it would probably take much longer for most of us, I
think people should take this into account too, especially if those
2~22 hours are compared to the time spent making two search engine
implementations talk to each other... merging results, or doing some
funny stuff with the result set


roman


>
> Best regards
> --
> Tibor Simko
>
