#420: Citesummary quirks
------------------------+----------------------
  Reporter:  tbrooks    |      Owner:  valkyrie
      Type:  defect     |     Status:  assigned
  Priority:  major      |  Milestone:
 Component:  WebSearch  |    Version:
Resolution:             |   Keywords:  INSPIRE
------------------------+----------------------
Changes (by jblayloc):

 * keywords:  INSPIRE onDevPleasReview => INSPIRE
 * status:  in_merge => assigned


Comment:

 Hi Heath,

 These have been tricky to track down.  Strangely fun though.

 > Looking at find a beacom and topcite 1+ gives 77 records
 > http://inspirehepdev.cern.ch/search?ln=en&ln=en&p=find+a+j+beacom+and+topcite+1%2B&action_search=Search&sf=&so=d&rm=&rg=100&sc=0&of=hb
 > citesummary only shows 69. Why aren't they the same?

 This is because citesummary is configured such that the 'All Papers'
 column does a logical AND against collection:citeable.  I presume this is
 to avoid citesummarizing things like people's HEPNames entries or
 conferences or other records that shouldn't be counted as part of their
 scholarly output.

 The eight papers that differ between the set {{{author:beacom AND
 cited:1->999999}}} and the set {{{author:beacom AND cited:1->999999 AND
 collection:citeable}}} are:
 * 494016 - talk
 * 724598 - talk
 * 506481 - report
 * 752274 - I don't know what's wrong with this one
 * 799126 - I don't know what's wrong with this one
 * 752983 - lecture
 * 827674 - temporary entry
 * 798458 - report

 I tried to determine what these were by doing {{{find a beacom and topcite
 1+ and not collection:citeable}}}, selecting "detailed record" and paging
 back and forth through them with the green arrows.  Whether these items
 should be citeable or not I leave to you.  If policy changes we should
 revisit current documents in these classes.  If policy doesn't change this
 should probably be added to the burgeoning FAQ.
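
 As a rough illustration of the comparison above: the two searches differ
 by exactly the records that match the author and citation filters but are
 not in the citeable collection.  The sketch below models each result set
 as a Python set of record IDs.  The eight IDs are the ones listed above;
 the IDs in `shared` are made-up placeholders standing in for the 69
 citeable records, and `non_citeable_hits` is a hypothetical helper, not
 part of Invenio.

```python
# Record IDs listed above as cited but not in collection:citeable.
NON_CITEABLE = {494016, 724598, 506481, 752274, 799126, 752983, 827674, 798458}

def non_citeable_hits(all_hits, citeable_hits):
    """Return records found without the citeable restriction but dropped by it."""
    return set(all_hits) - set(citeable_hits)

# Placeholder IDs (assumption): stand-ins for the 69 citeable records.
shared = set(range(1, 70))
all_hits = shared | NON_CITEABLE      # author:beacom AND cited:1->999999
citeable_hits = shared                # ... AND collection:citeable

missing = non_citeable_hits(all_hits, citeable_hits)
print(len(all_hits), len(citeable_hits), sorted(missing))
```

 The same difference is what the {{{and not collection:citeable}}} search
 returns directly.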

 > Mode gives 0 (seems wrong).

 For "and topcite 1+" a mode of 0 certainly would be wrong.  :)  But are
 you sure you're looking at the same thing I am?  When I look at
 
http://inspirehepdev.cern.ch/search?ln=en&ln=en&p=find+a+beacom+and+topcite+1%2B&action_search=Search&sf=&so=d&rm=&rg=25&sc=0&of=hcs
 I see a mode of 21, which looks reasonable to me?

 One potential point of strangeness with the mode calculation is that when
 several citation counts tie for most frequent, as John's do with three
 21's and three 26's, the selection of which to report for the mode is
 essentially random.
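
 To make the tie concrete, here is a minimal sketch of finding every tied
 value and then picking one by a fixed rule.  This is not the citesummary
 code; `all_modes` and `stable_mode` are hypothetical names, choosing the
 smallest tied value is just one possible rule, and apart from the three
 21's and three 26's the sample counts are invented.

```python
from collections import Counter

def all_modes(counts):
    """Return every value tied for most frequent, in ascending order."""
    tally = Counter(counts)
    top = max(tally.values())
    return sorted(value for value, n in tally.items() if n == top)

def stable_mode(counts):
    """Pick the smallest tied value so repeated runs agree (one possible rule)."""
    return all_modes(counts)[0]

# Three 21's and three 26's as described above; the other values are made up.
sample = [21, 26, 4, 21, 26, 9, 21, 26, 57]
print(all_modes(sample))
print(stable_mode(sample))
```

 Any fixed rule (smallest, largest, or reporting all tied values) would
 make the reported mode repeatable.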

 > Clicking on links (both sorting and removing RPP) seems to work well.

 That's good.  I noticed that if you click to remove RPP the "Remove RPP"
 link still lingers.  I'll try to get the remove to remove its remove yet
 today, and will update this ticket again when I've got that sorted out.

 > Looking at Michael Barnett:
 > http://inspirehepdev.cern.ch/search?p=find%20a%20r%20m%20barnett%20not%20t%20rpp&of=hcs
 > Clicking on "published only" number 63, takes you to 55 papers, which
 > doesn't match the link number.

 Ah, now this one is interesting.  If you observe the search box, the 55
 papers you get from clicking on that published only number are the result
 of the search:
 {{{find a r m barnett not t rpp AND collection:citeable AND
 collection:published}}}
 To get the original 63, you need to say:
 {{{find a r m barnett not t rpp AND collection:published}}} (i.e., remove
 the citeable restriction).  So here we're seeing the opposite problem of
 the one with the beacom search above: the 'All Papers' column is here not
 restricting to citeable papers for its counting when it should be.  This
 is because you used the "Remove RPP" link, and it didn't carry through the
 'collection:citeable' restriction when it recalculated.

 If you try something like {{{find a r m barnett not t rpp and
 collection:citeable}}} you'll see (what I think is) the correct
 number, 55.  I'll add this to my "remove RPP link strangeness" post-it
 note to try to fix Real Soon Now, and will comment back on this ticket
 when I have.
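
 The intended behaviour of the link can be sketched as follows.  This is
 only an illustration under assumptions, not the WebSearch code: the
 function names are hypothetical, the checks are naive substring matches
 on the query text, and the URL simply reuses the host and the `p` and
 `of=hcs` parameters seen in the links above.

```python
from urllib.parse import urlencode

SEARCH_URL = "http://inspirehepdev.cern.ch/search"  # host taken from the links above
RPP_EXCLUSION = "not t rpp"
CITEABLE = "collection:citeable"

def has_rpp_exclusion(query):
    """True if the query already excludes RPP titles."""
    return RPP_EXCLUSION in query.lower()

def remove_rpp(query):
    """Exclude RPP titles while keeping the citeable restriction in place."""
    if not has_rpp_exclusion(query):
        query = f"{query} {RPP_EXCLUSION}"
    if CITEABLE not in query.lower():
        query = f"{query} and {CITEABLE}"
    return query

def show_remove_rpp_link(query):
    """The link is only worth showing while RPP is still included."""
    return not has_rpp_exclusion(query)

def citesummary_url(query):
    """Build a citesummary (of=hcs) search URL for the query."""
    return SEARCH_URL + "?" + urlencode({"p": query, "of": "hcs"})

query = remove_rpp("find a r m barnett")
print(query)
print(show_remove_rpp_link(query))
print(citesummary_url(query))
```

 Applying `remove_rpp` a second time leaves the query unchanged, which is
 the property the lingering link currently lacks.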

 For Barnett, the 8 unciteable papers are:
 * 224839 - talk
 * 104617 - talk
 * 308177 - talk
 * 348179 - talk
 * 310452 - talk
 * 308281 - talk
 * 433501 - book
 * 388766 - talk

 And lastly:
 > http://inspirehepdev.cern.ch/search?p=find%20a%20r%20m%20barnett%20not%20t%20rpp&of=hcs
 > gives a median of 3, which seems wrong.

 It may be surprising, but I doubt it's wrong.  Recall the method of
 calculating median: list all the N scores in order, then take whichever
 score sits at position floor(N/2)+1.  Then observe that Barnett's "All
 Papers" column shows:
 * 108 total citeable papers
 * 18 papers with 1-9 citations
 * 44 papers with 0 citations
 So the median is the 55th entry (floor(108/2)+1) in his citation list,
 which is item 11 of 18 in the 1-9 citation set.  Having noticed cites
 display a long tail, I'd expect most of the cites in the 1-9 range to be
 1's, and 3 isn't very far off from that.
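
 The positional rule described above can be checked with a short sketch.
 Only the bucket sizes (108 papers, 44 with 0 citations, 18 with 1-9) come
 from the column; the individual citation values are invented, so the
 result only illustrates that the median lands inside the 1-9 bucket.
 `positional_median` is a hypothetical name, not the citesummary function.

```python
def positional_median(scores):
    """Return the score at 1-based position floor(N/2)+1 of the sorted list."""
    ordered = sorted(scores)
    position = len(ordered) // 2 + 1
    return ordered[position - 1]

zero_cites = [0] * 44                                   # from the column above
# Made-up 1-9 values (assumption), skewed toward small counts.
low_cites = [1] * 6 + [2] * 3 + [3] * 4 + [4, 5, 6, 8, 9]
# Made-up values for the remaining 108 - 44 - 18 = 46 papers.
high_cites = list(range(10, 56))

scores = zero_cites + low_cites + high_cites
median = positional_median(scores)
print(len(scores), median)
```

 Because 44 of the 108 entries are zeros, the middle of the sorted list
 falls a little way into the 1-9 group regardless of how large the highly
 cited papers' counts are.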

 So while I encourage everyone to check more cases to make sure that things
 seem reasonable, I'm satisfied that the math parts are right.

 So I believe my current todo list with this ticket is:
 * "Remove RPP" link should remove itself
 * "Remove RPP" link should maintain the intersection against
 collection:citeable

 Thanks,
 Joe

-- 
Ticket URL: <http://invenio-software.org/ticket/420#comment:13>
Invenio <http://invenio-software.org>
