Re: Lucene indexing questions

Brooks, Travis C. Wed, 6 Oct 2010 22:36:15 +0200

On Oct 6, 2010, at 1:08 PM, Roman Chyla wrote:

> Hi,
> 
> It will be interesting to hear what Grant Ingersoll tells you about the 2nd
> order queries, but let me muse about the same query


yes

> 
> author:ellis AND citedby:author:witten NOT refersto:author:witten
>  AND cited:10->20 AND refersto:keyword:muon
> 
> 
> first, let's assume we built the index with this necessary information
> 
> doc: 10
>  cited: 3,6,80,90,89...
>  citing_author: witten, frank, lagra, ngeyen, chu, thuey...
>  year: 2000
>  title: something something
>  authors: witten,ellis
> 
> doc: 90
>  cites: 3,8,90....
> 
> 
> the lucene query with the same effect then is:
> 
> ((author:ellis +citedby:witten -author:witten) +keyword:muon) -->
> cluster_by(len(cited))
> 
> 
> notes:
> - citedby:author:witten -- it doesn't make sense to me that it could
> be somebody else than other author

I'm not sure I know what you mean here.    Recall that citedby:author:witten 
means "find me the papers that are cited by the papers that are written by 
witten"   

One can equally well do citedby:reportnumber:hep-th* meaning "find me all the 
papers that are cited by papers with a hep-th reportnumber"   

Or even better citedby:"topic:neutrino cited:500->99999" (i.e. the papers cited 
by the highly cited papers in neutrino physics...a very interesting list)
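
To make the semantics concrete, here is a minimal Python sketch of citedby as a set operation over a citation map. The `cites` dict, the `citedby` helper and the record ids are all made up for illustration; this is not the actual Invenio or SPIRES code, just the idea that the operator takes the hits of any inner query and follows their references.

```python
# Hypothetical citation map: recid -> set of recids that this record cites.
cites = {
    1: {10, 11},
    2: {11, 12},
    3: {13},
}


def citedby(hits, cites):
    """Return the records cited by any record in `hits`."""
    result = set()
    for recid in hits:
        # Records without references contribute nothing.
        result |= cites.get(recid, set())
    return result


# Suppose the inner query author:witten matched records 1 and 2.
witten_papers = {1, 2}
print(sorted(citedby(witten_papers, cites)))
```

The inner hit set could just as well come from reportnumber:hep-th* or from "topic:neutrino cited:500->99999"; the second step does not care which query produced it.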


In order to fully reproduce the behavior one needs to have citing_author   AND 
citing_reportnumber AND citing_year AND .... in fact all indexes need to be 
reproduced as "citing".   Since there are something like 10 million 
citing<->cited pairs in the DB, we've just swelled the ranks of the indexes by 
a substantial factor, I think.
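
A rough Python sketch of what that denormalization would look like, assuming a toy set of records and (citing, cited) pairs that I made up for the example. The `denormalize` helper is hypothetical; the point is only that every citation pair copies every indexed field of the citing record onto the cited one.

```python
# Hypothetical records and citation pairs, for illustration only.
records = {
    1: {"author": ["witten"], "reportnumber": ["hep-th/0001"], "year": ["2000"]},
    2: {"author": ["ellis"], "reportnumber": ["hep-ph/0002"], "year": ["2001"]},
    10: {"author": ["chu"], "reportnumber": ["hep-ex/0003"], "year": ["1999"]},
}
citation_pairs = [(1, 10), (2, 10)]  # (citing recid, cited recid)


def denormalize(records, citation_pairs):
    """Copy each field of a citing record into citing_<field> on the cited record."""
    # Start from a copy so the input records are left untouched.
    docs = {
        recid: {field: list(values) for field, values in fields.items()}
        for recid, fields in records.items()
    }
    for citing, cited in citation_pairs:
        for field, values in records[citing].items():
            docs[cited].setdefault("citing_" + field, []).extend(values)
    return docs


docs = denormalize(records, citation_pairs)
print(docs[10]["citing_author"])
```

With two pairs and three fields the cited record already picks up six extra values, which is the growth I am worried about when the number of pairs is in the millions.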

> 
> limitations:
> - 2nd order links must be carefully prepared (but honestly, how many
> of those 2nd order relations are really needed, and really used? this
> number is probably low...)

See above.   OTOH we haven't had this ability in SPIRES, so it is theoretically 
possible to live without the 2nd order stuff.   But I think it is becoming 
more important, not less.   The citation relationship is central to 
the utility of these systems, as I think ADS would agree, and this stuff has 
immense power for metrics and discovering relationships.    The first physicist 
I showed "citedby" to referred to it immediately as "crack cocaine".


> - the index grows (but you can compare its size with the size of
> current in-memory dictionary, which is effectively doubled and holds
> the precious RAM - because of cited<->citing)
> 

I think the index grows a lot...

> opportunities:
> - it is extremely easy to put any field/relation into the index (and
> reindex, which is both easy and fast)

I agree, and that's an opportunity, though since we have a few admins and 50K 
users, I'm not sure it is the right priority.   I.e. we may not care so 
much about indexing speed, preferring to deliver ultra-fast search, or 
semi-fast search with lots of added value (2nd order, summary, etc).

> - it allows to combine the full power of the search engine (but
> inevitably, things are done differently)

This is clearly an advantage, plus the maintenance advantage of not 
writing/maintaining code that has already been written and is being maintained 
by someone else.

> - the assumption that it will be slower than a python in-memory dictionary
> is an assumption (and should be _recognized_ as such)

Agreed

> - it is just a different paradigm than rdbms

Yep, interestingly enough, so was SPIRES.

> 
> Thanks, Jay, for the offer of questions, it would be great if you
> could ask also about these two:
> 
> 1)
> -- is it possible to use payload for search? [i know it can influence
> scoring and be useful for display, but as i understand it, it is a
> metadata about the given position]
> 
> example, if we assume situation when we index authors <-- and add
> payload to them
> 
> field:author | payload [affiliation,field_of_study,email]
> ------------------------------
> ellis            | cern,umi  hep-theory [email protected]
> swank        | umi  hep-ex  [email protected]
> 
> is it possible to query this structure directly? ex.
> 
> "author:swink~4 and author:affiliation:cern"
> 
> (I want to find all names similar to swink, schwink, sink... and i
> also know the person is working at cern -- but i am not interested in
> a record which was written by swink@umi, and ellis@cern --> i want
> only swink@cern and for that i need payload)
> 
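
The matching rule being asked for here can be sketched in plain Python, independent of whether Lucene payloads can do it. This is only an illustration of the intended semantics: the records, the `similar` and `matches` helpers and the 0.7 threshold are assumptions, and difflib stands in for Lucene's fuzzy matching.

```python
from difflib import SequenceMatcher

# Hypothetical records: each author entry carries its own affiliations.
record_a = [("swank", {"umi"}), ("ellis", {"cern", "umi"})]
record_b = [("swink", {"cern"})]


def similar(a, b, threshold=0.7):
    """Crude fuzzy name match, standing in for author:swink~4."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


def matches(authors, name, affiliation):
    """True only if ONE author entry matches both the name and the affiliation."""
    return any(
        similar(author, name) and affiliation in affiliations
        for author, affiliations in authors
    )


print(matches(record_a, "swink", "cern"))  # name and affiliation sit on different authors
print(matches(record_b, "swink", "cern"))  # both sit on the same author
```

Two independent fields (author, affiliation) would accept record_a, because a similar name and the affiliation cern both occur somewhere in the record; tying the affiliation to the author entry is what rules it out.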
> 
> 2)
> What would be the best strategy to have several separate indexes? Ie.
> to have a separate index for metadata, for recently-changed-metadata,
> fulltext, citation-pairs?
> 
> presumably, all those indexes contain only records (so the results
> from them are mergeable on the recid match), but obviously the scoring
> function makes sense only inside the index; but if one would like to
> combine results (in a meaningful way) from the several indexes, what
> would be the best strategy?
> 
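
One possible way to combine such results, sketched in Python under heavy assumptions: scores are min-max scaled inside each index so they become comparable, then summed per recid with a weight per index. The index names, scores, weights and the `normalize` and `merge` helpers are all hypothetical, and this is not a recommendation from the Lucene side, just an illustration of the problem.

```python
def normalize(scores):
    """Min-max scale one index's scores to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All hits scored the same, so treat them as equally good.
        return {recid: 1.0 for recid in scores}
    return {recid: (s - lo) / (hi - lo) for recid, s in scores.items()}


def merge(results_by_index, weights):
    """Sum the weighted, normalized scores per recid; best record first."""
    combined = {}
    for name, scores in results_by_index.items():
        for recid, score in normalize(scores).items():
            combined[recid] = combined.get(recid, 0.0) + weights[name] * score
    # Sort by descending score, breaking ties by recid for a stable order.
    return sorted(combined, key=lambda recid: (-combined[recid], recid))


# Hypothetical raw scores from two separate indexes, keyed by recid.
results = {
    "metadata": {1: 2.0, 2: 4.0, 3: 3.0},
    "fulltext": {2: 10.0, 3: 30.0, 4: 20.0},
}
weights = {"metadata": 1.0, "fulltext": 1.0}
print(merge(results, weights))
```

Records found in only one index still appear in the merged list, just with a contribution from that index alone; whether that is the right behavior depends on whether the indexes are meant to be combined with AND or OR.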
> thanks and cheers,
> 
>  roman
> 
> On Wed, Oct 6, 2010 at 9:14 PM, Jay Luker <[email protected]> wrote:
>> On Wed, Oct 6, 2010 at 12:40 PM, Tibor Simko <[email protected]> wrote:
>>> 
>>> To sum up, everything could be reproduced in Solr, but Solr would have
>>> to have direct access to the raw ranking data (=citation map), not only
>>> to ranked values (=citation counts), otherwise generation of things like
>>> cite summaries (which is one of the most used feature) would be slow.
>>> And we would have to port everything that operates on these raw data
>>> sets to Solr/Java, which is a very time consuming project when compared
>>> to alternative options such as dispatching only certain index (such as
>>> full-text) to Solr/Lucene and combining results back in Invenio.
>> 
>> OK, yes, the 2nd order stuff is tricky. Sometimes when you're just trying to
>> get an answer about how to do something from these "experts" you have to
>> first get past the phase where they try to convince you that you don't need
>> to do what you're trying to do.
>> 
>> 
>> --
>> ******************************************************
>> Jay Luker               Astrophysics Data System (ADS)
>> [email protected]  Center for Astrophysics
>> 617-495-4588            60 Garden Street  MS 67
>> 617-495-7356 fax        Cambridge, MA  02138
>> ******************************************************
>> 
>> 

Travis C. Brooks
Manager of Information Systems & SPIRES/INSPIRE
SLAC National Accelerator Laboratory Library
http://www.slac.stanford.edu/spires/
