#420: Citesummary quirks
------------------------+----------------------
Reporter: tbrooks | Owner: valkyrie
Type: defect | Status: assigned
Priority: major | Milestone:
Component: WebSearch | Version:
Resolution: | Keywords: INSPIRE
------------------------+----------------------
Changes (by jblayloc):
* keywords: INSPIRE onDevPleasReview => INSPIRE
* status: in_merge => assigned
Comment:
Hi Heath,
These have been tricky to track down. Strangely fun though.
> Looking at find a beacom and topcite 1+ gives 77 records
>
http://inspirehepdev.cern.ch/search?ln=en&ln=en&p=find+a+j+beacom+and+topcite+1%2B&action_search=Search&sf=&so=d&rm=&rg=100&sc=0&of=hb
> citesummary only shows 69. Why aren't they the same?
This is because citesummary is configured such that the 'All Papers'
column does a logical AND against collection:citeable. I presume this is
to avoid citesummarizing things like people's HEPNames entries or
conferences or other records that shouldn't be counted as part of their
scholarly output.
The eight papers that differ between the set {{{author:beacom AND
cited:1->999999}}} and the set {{{author:beacom AND cited:1->999999 AND
collection:citeable}}} are:
* 494016 - talk
* 724598 - talk
* 506481 - report
* 752274 - I don't know what's wrong with this one
* 799126 - I don't know what's wrong with this one
* 752983 - lecture
* 827674 - temporary entry
* 798458 - report
I tried to determine what these were by doing {{{find a beacom and topcite
1+ and not collection:citeable}}}, selecting "detailed record" and paging
back and forth through them with the green arrows. Whether these items
should be citeable or not I leave to you. If policy changes we should
revisit current documents in these classes. If policy doesn't change this
should probably be added to the burgeoning FAQ.
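For anyone who wants to repeat this check on another author, the comparison is just a set difference between the two result sets. Here is a minimal sketch; the helper name and the placeholder IDs are mine, not anything in the Invenio code, and only the eight recids come from the list above.

```python
def excluded_records(all_hits, citeable_hits):
    """Return the record IDs in all_hits that are missing from citeable_hits, sorted."""
    return sorted(set(all_hits) - set(citeable_hits))


# The eight record IDs listed above.
non_citeable = {494016, 724598, 506481, 752274, 799126, 752983, 827674, 798458}
# Placeholder IDs standing in for the 69 citeable records (not real data).
citeable = set(range(1, 70))
all_hits = citeable | non_citeable

diff = excluded_records(all_hits, citeable)
print(len(all_hits), len(citeable), len(diff))
```

With real data, the two inputs would be the recid lists returned by the two searches quoted above.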
> Mode gives 0 (seems wrong).
For "and topcite 1+" a mode of 0 certainly would be wrong. :) But are
you sure you're looking at the same thing I am? When I look at
http://inspirehepdev.cern.ch/search?ln=en&ln=en&p=find+a+beacom+and+topcite+1%2B&action_search=Search&sf=&so=d&rm=&rg=25&sc=0&of=hcs
I see a mode of 21, which looks reasonable to me?
One potential point of strangeness with the mode calculation is that when
several citation counts tie for most frequent (as John has, with three
21's and three 26's), the selection of which one to report as the mode is
essentially random.
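To illustrate the tie, here is a small sketch of how one could list every tied value and then pick one deterministically. This is not how citesummary computes the mode today; the function name and the filler values in the sample are assumptions for illustration.

```python
from collections import Counter


def all_modes(citation_counts):
    """Return every value tied for the highest frequency, in ascending order."""
    tally = Counter(citation_counts)
    if not tally:
        return []
    top = max(tally.values())
    return sorted(value for value, freq in tally.items() if freq == top)


# Illustrative list only: three 21's and three 26's as described above,
# plus made-up filler values so the tie is visible.
sample = [21, 26, 21, 4, 26, 21, 26, 57]
print(all_modes(sample))       # every tied value
print(min(all_modes(sample)))  # one possible tie-break: smallest value wins
```

Reporting the smallest (or largest) tied value would at least make the displayed mode stable from one run to the next.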
> Clicking on links (both sorting and removing RPP) seems to work well.
That's good. I noticed that if you click to remove RPP the "Remove RPP"
link still lingers. I'll try to get the remove to remove its remove yet
today, and will update this ticket again when I've got that sorted out.
> Looking at Michael Barnett:
http://inspirehepdev.cern.ch/search?p=find%20a%20r%20m%20barnett%20not%20t%20rpp&of=hcs
> Clicking on "published only" number 63, takes you to 55 papers, which
> doesn't match the link number.
Ah, now this one is interesting. If you observe the search box, the 55
papers you get from clicking on that published only number are the result
of the search:
{{{find a r m barnett not t rpp AND collection:citeable AND
collection:published}}}
To get the original 63, you need to say:
{{{find a r m barnett not t rpp AND collection:published}}} (i.e., remove
the citeable restriction). So here we're seeing the opposite problem of
the one with the beacom search above: the 'All Papers' column is not
restricting to citeable papers for its counting when it should be. This
is because you used the 'remove PDG' link, and it didn't carry through the
'collection:citeable' restriction when it recalculated.
If you try something like {{{find a r m barnett not t rpp and
collection:citeable}}} you'll see (what I think is) the correct
number, 55. I'll add this to my "remove RPP link strangeness" post-it
note to try to fix Real Soon Now, and will comment back on this ticket
when I have.
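The mismatch is easier to see if the link query and the count query are built from the same list of restrictions. The following is only a sketch of that idea; `build_query` is a hypothetical helper, not a function in the citesummary code.

```python
def build_query(base, collections):
    """Join a base search with collection restrictions using AND."""
    parts = [base] + ["collection:%s" % name for name in collections]
    return " AND ".join(parts)


base = "find a r m barnett not t rpp"

# Query behind the "published only" link (the 55 papers).
link_query = build_query(base, ["citeable", "published"])
# Query that reproduces the displayed count of 63: citeable was dropped.
count_query = build_query(base, ["published"])

print(link_query)
print(count_query)
```

If the displayed number and the link both came from the same call, dropping a restriction in one place could not leave the other out of step.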
For Barnett, the 8 unciteable papers are:
* 224839 - talk
* 104617 - talk
* 308177 - talk
* 348179 - talk
* 310452 - talk
* 308281 - talk
* 433501 - book
* 388766 - talk
and lastly
>
http://inspirehepdev.cern.ch/search?p=find%20a%20r%20m%20barnett%20not%20t%20rpp&of=hcs
> gives a median of 3, which seems wrong.
It may be surprising, but I doubt it's wrong. Recall the method of
calculating median: list all the N scores in order, then take whichever
score sits at exactly floor(N/2)+1. Then observe that Barnett's "All
Papers" column shows:
* 108 total citeable papers
* 18 papers with 1-9 citations
* 44 papers with 0 citations
So the median is the 55th entry in his sorted citation list (floor(108/2)+1
= 55), which is item 11 of 18 in the 1-9 citation set. Having noticed that
cites display a long tail, I'd expect most of the cites in the 1-9 range
to be 1's, and 3 isn't very far off from that.
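For anyone who wants to check other authors the same way, here is a short sketch of the floor(N/2)+1 rule and of locating that rank within the citation ranges. The function names are mine, and the "10+" bucket size of 46 is derived as 108 - 44 - 18 rather than read off the page.

```python
def median_rank(n):
    """1-based rank of the median under the floor(N/2)+1 rule."""
    return n // 2 + 1


def rank_median(scores):
    """Median of a list of scores using the rank rule above."""
    ordered = sorted(scores)
    return ordered[median_rank(len(ordered)) - 1]


def locate_rank(rank, buckets):
    """Find which bucket holds a rank; buckets are (label, size), lowest first."""
    seen = 0
    for label, size in buckets:
        if rank <= seen + size:
            return label, rank - seen
        seen += size
    raise ValueError("rank exceeds the total number of papers")


# Barnett's "All Papers" column; the last size is derived, not displayed.
buckets = [("0", 44), ("1-9", 18), ("10+", 46)]
total = sum(size for _, size in buckets)
print(locate_rank(median_rank(total), buckets))
```

Since the zero-citation papers fill the first 44 ranks, any median rank near the middle of 108 papers has to fall inside the 1-9 range.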
So while I encourage everyone to check more cases to make sure that things
seem reasonable, I'm satisfied that the math parts are right.
So I believe my current todo list with this ticket is:
* "Remove RPP" link should remove itself
* "Remove RPP" link should maintain the intersection against
collection:citeable
Thanks,
Joe
--
Ticket URL: <http://invenio-software.org/ticket/420#comment:13>
Invenio <http://invenio-software.org>