On Oct 6, 2010, at 1:08 PM, Roman Chyla wrote:

> Hi,
>
> It will be interesting, what Grant Ingersoll tells you for the 2nd
> order queries, but let me muse about the same query
yes
>
> author:ellis AND citedby:author:witten NOT refersto:author:witten
> AND cited:10->20 AND refersto:keyword:muon
>
>
> first, let's assume we built the index with this necessary information
>
> doc: 10
> cited: 3,6,80,90,89...
> citing_author: witten, frank, lagra, ngeyen, chu, thuey...
> year: 2000
> title: something something
> authors: witten,ellis
>
> doc: 90
> cites: 3,8,90....
>
>
> the lucene query with the same effect then is:
>
> ((author:ellis +citedby:witten -author:witten) +keyword:muon) -->
> cluster_by(len(cited))
>
>
> notes:
> - citedby:author:witten -- it doesn't make sense to me that it could
> be sb else than other author

I'm not sure I know what you mean here. Recall that citedby:author:witten means "find me the papers that are cited by the papers that are written by witten"

One can equally well do

citedby:reportnumber:hep-th*

meaning "find me all the papers that are cited by papers with a hep-th reportnumber"

Or even better

citedby:"topic:neutrino cited:500->99999"

(i.e. the papers cited by the highly cited papers in neutrino physics...a very interesting list)

In order to fully reproduce the behavior one needs to have citing_author AND citing_reportnumber AND citing_year AND .... in fact all indexes need to be reproduced as "citing". Since there are something like 10 million citing<->cited pairs in the DB, we've just swelled the ranks of the indexes by a substantial factor, I think.

>
> limitations:
> - 2nd order links must be carefully prepared (but honestly, how many
> of those 2nd order relations are really needed, and really used? this
> number is probably low...)

See above. OTOH we haven't had this ability in SPIRES, so it is theoretically possible to live without the 2nd order stuff. But I think it is more and more, not less and less important. The citation relationship is central to the utility of these systems, as I think ADS would agree, and this stuff has immense power for metrics and discovering relationships.
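To make the second-order semantics above concrete, here is a minimal Python sketch over an in-memory citation map. It is not SPIRES, Invenio or Solr code: the toy records and the helper names (RECORDS, search_author, citedby, refersto) are made up for illustration, and refersto is assumed to mean "papers that cite the papers matched by the inner query". The point is that the inner query can be any search at all, which is why a single pre-built citing_author field does not cover the general case.

```python
# Hypothetical toy collection: recid -> authors and outgoing citations.
RECORDS = {
    1: {"authors": {"witten"}, "cites": {2, 3}},
    2: {"authors": {"ellis"}, "cites": {3}},
    3: {"authors": {"ellis", "witten"}, "cites": set()},
    4: {"authors": {"frank"}, "cites": {2}},
}


def search_author(name):
    """First-order search: recids of papers written by `name`."""
    return {recid for recid, rec in RECORDS.items() if name in rec["authors"]}


def citedby(hits):
    """Second-order: papers cited by any paper in `hits`."""
    cited = set()
    for recid in hits:
        cited |= RECORDS[recid]["cites"]
    return cited


def refersto(hits):
    """Second-order (assumed meaning): papers that cite any paper in `hits`."""
    return {recid for recid, rec in RECORDS.items() if rec["cites"] & hits}


# author:ellis AND citedby:author:witten NOT refersto:author:witten
witten = search_author("witten")
result = (search_author("ellis") & citedby(witten)) - refersto(witten)
print(sorted(result))
```

Because citedby and refersto take an arbitrary hit set, the same two functions serve citedby:reportnumber:hep-th* or citedby:"topic:neutrino cited:500->99999" once the inner search exists; the cost is keeping the raw citation map available at query time.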
The first physicist I showed "citedby" to referred to it immediately as "crack cocaine"

> - the index grows (but you can compare its size with the size of
> current in-memory dictionary, which is effectively doubled and holds
> the precious RAM - because of cited<->citing)
>

I think the index grows a lot...

> opportunities:
> - it is extremely easy to put any field/relation into the index (and
> reindex, which is both easy and fast)

I agree, and that's an opportunity, though since we have a few admins and 50K users, I'm not sure it is the right prioritization. I.e. we may not care so much about indexing speed, preferring to deliver ultra-fast search, or semi-fast search with lots of added value (2nd order, summary, etc).

> - it allows to combine the full power of the search engine (but
> inevitably, things are done differently)

This is clearly an advantage, plus the maintenance advantage of not writing/maintaining code that has already been written/is being maintained

> - assumption that it will be slower than python in-memory dictionary
> is assumption (and should be _recognized_ as such)

Agreed

> - it is just a different paradigm than rdbms

Yep, interestingly enough, so was SPIRES.

>
> Thanks, Jay, for the offer of questions, it would be great if you
> could ask also about these two:
>
> 1)
> -- is it possible to use payload for search? [i know it can influence
> scoring and be useful for display, but as i understand it, it is a
> metadata about the given position]
>
> example, if we assume situation when we index authors <-- and add
> payload to them
>
> field:author | payload [affiliation,field_of_study,email]
> ------------------------------
> ellis | cern,umi hep-theory [email protected]
> swank | umi hep-ex [email protected]
>
> is it possible to query this structure directly? ex.
>
> "author:swink~4 and author:affiliation:cern"
>
> (I want to find all names similar to swink, schwink, sink...
> and i also know the person is working at cern -- but i am not interested in
> a record which was written by swink@umi, and ellis@cern --> i want
> only swink@cern and for that i need payload)
>
>
> 2)
> What would be the best strategy to have several separate indexes? Ie.
> to have a separate index for metadata, for recently-changed-metadata,
> fulltext, citation-pairs?
>
> presumably, all those indexes contain only records (so the results
> from them are mergeable on the recid match), but obviously the scoring
> function makes sense only inside the index; but if one would like to
> combine results (in a meaningful way) from the several indexes, what
> would be the best strategy?
>
> thanks and cheers,
>
> roman
>
> On Wed, Oct 6, 2010 at 9:14 PM, Jay Luker <[email protected]> wrote:
>> On Wed, Oct 6, 2010 at 12:40 PM, Tibor Simko <[email protected]> wrote:
>>>
>>> To sum up, everything could be reproduced in Solr, but Solr would have
>>> to have direct access to the raw ranking data (=citation map), not only
>>> to ranked values (=citation counts), otherwise generation of things like
>>> cite summaries (which is one of the most used feature) would be slow.
>>> And we would have to port everything that operates on these raw data
>>> sets to Solr/Java, which is a very time consuming project when compared
>>> to alternative options such as dispatching only certain index (such as
>>> full-text) to Solr/Lucene and combining results back in Invenio.
>>
>> OK, yes, the 2nd order stuff is tricky. Sometimes when you're just trying to
>> get an answer about how to do something from these "experts" you have to
>> first get past the phase where they try to convince you that you don't need
>> to do what you're trying to do.
>>
>>
>> --
>> ******************************************************
>> Jay Luker                  Astrophysics Data System (ADS)
>> [email protected]          Center for Astrophysics
>> 617-495-4588               60 Garden Street MS 67
>> 617-495-7356 fax           Cambridge, MA 02138
>> ******************************************************
>>
>>

Travis C. Brooks
Manager of Information Systems & SPIRES/INSPIRE
SLAC National Accelerator Laboratory Library
http://www.slac.stanford.edu/spires/
