#420: Citesummary quirks
------------------------+----------------------
  Reporter:  tbrooks    |      Owner:  jblayloc
      Type:  defect     |     Status:  assigned
  Priority:  major      |  Milestone:
 Component:  WebSearch  |    Version:
Resolution:             |   Keywords:  INSPIRE
------------------------+----------------------

Comment (by hoc):

 Replying to [comment:13 jblayloc]:
 > Hi Heath,
 >
 > These have been tricky to track down.  Strangely fun though.
 >
 > > Looking at find a beacom and topcite 1+ gives 77 records
 > > http://inspirehepdev.cern.ch/search?ln=en&ln=en&p=find+a+j+beacom+and+topcite+1%2B&action_search=Search&sf=&so=d&rm=&rg=100&sc=0&of=hb
 > > citesummary only shows 69. Why aren't they the same?
 >
 > This is because citesummary is configured such that the 'All Papers'
 column does a logical AND against collection:citeable.  I presume this is
 to avoid citesummarizing things like peoples' HEPNames entries or
 conferences or other records that shouldn't be counted as part of their
 scholarly output.
 >
 > The eight papers that differ between the set {{{author:beacom AND
 cited:1->999999}}} and the set {{{author:beacom AND cited:1->999999 AND
 collection:citeable}}} are:
 > * 494016 - talk
 > * 724598 - talk
 > * 506481 - report
 > * 752274 - I don't know what's wrong with this one
 > * 799126 - I don't know what's wrong with this one
 > * 752983 - lecture
 > * 827674 - temporary entry
 > * 798458 - report
 >
 OK, in my mind "citeable" means published in a journal or an eprint (the
 TC categories above don't exclude eprints from being citeable). We're
 maybe starting to push the envelope a bit with pseudojournals like
 Conf.Proc. and citeable report numbers but for the time being let's be
 conservative and stick with true journals and eprints. The "published
 only" column should then limit itself to TC=P.
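
 To make the proposal concrete for the EVO, here is a rough sketch of the
 two column rules as I've described them. The Record class and its field
 names are hypothetical stand-ins, not the Invenio data model, and the
 sample records are made up purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """Hypothetical stand-in for a bibliographic record (not Invenio's model)."""
    recid: int
    type_codes: set = field(default_factory=set)  # e.g. {"P"} published, {"T"} talk
    has_journal_ref: bool = False
    has_eprint: bool = False


def in_all_column(rec):
    # Proposed "all" column: true journal articles and eprints only.
    # A TC such as talk does not exclude a record that also has an eprint.
    return rec.has_journal_ref or rec.has_eprint


def in_published_column(rec):
    # Proposed "published only" column: restrict to TC=P.
    return "P" in rec.type_codes


# Illustrative records only; the recids are placeholders.
records = [
    Record(1, {"P"}, has_journal_ref=True, has_eprint=True),
    Record(2, set(), has_eprint=True),   # eprint only
    Record(3, {"T"}),                    # talk with no eprint
    Record(4, {"T"}, has_eprint=True),   # talk that is also an eprint
]

all_column = [r.recid for r in records if in_all_column(r)]
published_column = [r.recid for r in records if in_published_column(r)]
print(all_column, published_column)
```

 Under these rules the talk without an eprint drops out of "all", while
 the talk with an eprint stays in.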

 > I tried to determine what these were by doing {{{find a beacom and
 topcite 1+ and not collection:citeable}}}, selecting "detailed record" and
 paging back and forth through them with the green arrows.  Whether these
 items should be citeable or not I leave to you.  If policy changes we
 should revisit current documents in these classes.  If policy doesn't
 change this should probably be added to the burgeoning FAQ.
 >
 Let's open this up to discussion in an EVO: how should we construct the
 "all" column (i.e. restricted to citeable; I've suggested eprints and
 journal articles) and how should we construct the published only column
 (I've suggested TC=P)? I don't really know enough about the collections to
 say what they're doing but in this beacom example it doesn't quite seem to
 be what I'd expect.

 > > Mode gives 0 (seems wrong).
 >
 > For "and topcite 1+" a mode of 0 certainly would be wrong.  :)  But are
 you sure you're looking at the same thing I am?  When I look at
 http://inspirehepdev.cern.ch/search?ln=en&ln=en&p=find+a+beacom+and+topcite+1%2B&action_search=Search&sf=&so=d&rm=&rg=25&sc=0&of=hcs
 I see a mode of 21, which looks reasonable to me?
 >
 This is weird: when I looked at it later in the day, it no longer said 0.

 > One potential point of strangeness with the mode calculation is that
 when there are several similar citation count clusters - as John has, with
 3 21's and 3 26's, the selection of which to report for the mode is
 essentially random.
 >
 I'm wondering whether the mode concept is really helpful for citations. If
 he had 10 papers and his citations were 1,1,21,22,23,25,26,30,45,59, saying
 mode=1 seems silly. It would make more sense to bin his citations: 0-4:2,
 5-9:0, 10-14:0, 15-19:0, 20-24:3, etc., giving mode=20-24, but you already
 see this in the citesummary display (more or less).
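
 As a quick check of the idea, here is a small sketch comparing the raw
 mode with a binned one for the ten-paper example. The helper names and the
 fixed bin width of five (0-4, 5-9, ...) are my own assumptions, not what
 citesummary does. Returning every tied value, rather than picking one,
 would also sidestep the random choice between clusters like 21 and 26.

```python
from collections import Counter


def raw_modes(cites):
    """Return every citation count that occurs most often (all ties, sorted)."""
    counts = Counter(cites)
    top = max(counts.values())
    return sorted(value for value, n in counts.items() if n == top)


def binned_counts(cites, width=5):
    """Count papers per fixed-width bin; empty bins are simply omitted."""
    bins = Counter((c // width) * width for c in cites)
    return {f"{lo}-{lo + width - 1}": n for lo, n in sorted(bins.items())}


def modal_bins(cites, width=5):
    """Return the label(s) of the most populated bin(s)."""
    counts = binned_counts(cites, width)
    top = max(counts.values())
    return [label for label, n in counts.items() if n == top]


cites = [1, 1, 21, 22, 23, 25, 26, 30, 45, 59]
print(raw_modes(cites))      # raw mode
print(binned_counts(cites))  # papers per bin
print(modal_bins(cites))     # most populated bin
```

 With these bins the raw mode is 1, while the most populated bin is 20-24
 with three papers.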

 > > Clicking on links (both sorting and removing RPP) seems to work well.
 >
 > That's good.  I noticed that if you click to remove RPP the "Remove RPP"
 link still lingers.  I'll try to get the remove to remove its remove yet
 today, and will update this ticket again when I've got that sorted out.
 >
 > > Looking at Michael Barnett:
 > > http://inspirehepdev.cern.ch/search?p=find%20a%20r%20m%20barnett%20not%20t%20rpp&of=hcs
 > > Clicking on "published only" number 63, takes you to 55 papers, which
 doesn't match
 > > the link number.
 >
 > Ah, now this one is interesting.  If you observe the search box, the 55
 papers you get from clicking on that published only number are the result
 of the search:
 > {{{find a r m barnett not t rpp AND collection:citeable AND
 collection:published}}}
 > to get the original 63, you need to say:
 > {{{find a r m barnett not t rpp AND collection:published}}} (ie, remove
 the citeable restriction.)  So here we're seeing the opposite problem of
 the one with the beacom search above: the all papers column is here not
 restricting to citeable papers for its counting when it should be.  This
 is because you used the 'remove PDG' link, and it didn't carry through the
 'collection:citeable' restriction when it recalculated.
 >
 > If you try something like {{{find a r m barnett not t rpp and
 collection:citeable}}} you'll see (what I think is) the correct
 number, 55.  I'll add this to my "remove RPP link strangeness" post-it-
 note to try to fix Real Soon Now, and will comment back on this ticket
 when I have.
 >
 > For Barnett, the 8 unciteable papers are:
 > * 224839 - talk
 > * 104617 - talk
 > * 308177 - talk
 > * 348179 - talk
 > * 310452 - talk
 > * 308281 - talk
 > * 433501 - book
 > * 388766 - talk

 Yes, these look pretty unciteable.

 > and lastly
 > > http://inspirehepdev.cern.ch/search?p=find%20a%20r%20m%20barnett%20not%20t%20rpp&of=hcs
 > > gives a median of 3, which seems wrong.
 >
 > It may be surprising, but I doubt it's wrong.  Recall the method of
 calculating median: list all the N scores in order, then take whichever
 score sits at exactly floor(N/2)+1.  Then observe that Barnett's "All
 Papers" column shows:
 > * 108 total citeable papers
 > * 18 papers with 1-9 citations
 > * 44 papers with 0 citations
 > So the median is the 55th entry in his citation list, which is item 11
 of 18 in the 1-9 citation set.  Having noticed cites display a long tail,
 I'd expect most of the cites in the 1-9 range to be 1's, and 3 isn't very
 far off from that.
 >
 > So while I encourage everyone to check more cases to make sure that
 things seem reasonable, I'm satisfied that the math parts are right.
 >
 Yes it is, mathematically; I should have paid closer attention to the
 numbers. Intuition, though, told me it was suspect and I think the 44
 papers in the 0 citations bin mostly shouldn't be there (unciteables). So
 the mathematics is working just fine but we're back to this collections
 problem. Beacom is younger than Barnett and has always had eprints, so his
 unciteable problem is far less noticeable. Barnett started well before
 eprints and so has a lot of unciteable papers, which are numerous enough
 to mess up his stats.
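
 To illustrate how much the zero bin matters, here is a sketch that applies
 the floor(N/2)+1 rule quoted above to the bin counts alone. The function
 names are hypothetical, the "10+" bucket just lumps together the remaining
 108 - 44 - 18 = 46 papers, and dropping all 44 zero-cite papers is an
 extreme what-if rather than a claim about which records are unciteable.

```python
def median_position(n):
    """1-based position of the median under the floor(N/2)+1 rule."""
    return n // 2 + 1


def median_bin(bin_counts):
    """Return the label of the bin holding the median paper.

    bin_counts is a list of (label, count) pairs ordered by increasing
    citation count.
    """
    total = sum(count for _, count in bin_counts)
    if total == 0:
        raise ValueError("no papers to summarise")
    target = median_position(total)
    seen = 0
    for label, count in bin_counts:
        seen += count
        if seen >= target:
            return label


# Counts from the "All Papers" column; "10+" is the lumped remainder.
barnett_all = [("0", 44), ("1-9", 18), ("10+", 46)]
# What-if: the same list with the whole zero-citation bin removed.
barnett_no_zeros = [("1-9", 18), ("10+", 46)]

print(median_bin(barnett_all))
print(median_bin(barnett_no_zeros))
```

 With the zero bin included the median falls in the 1-9 bin; without it,
 the median moves up into the papers with 10 or more citations.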

 > So I believe my current todo list with this ticket is:
 > * "Remove RPP" link should remove itself
 > * "Remove RPP" link should maintain its intersection against
 collection:citeable

 Sounds good. We'll see what we can find out about the citeable collection
 tomorrow.
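
 For the two todo items, something along these lines is what I'd expect
 the link logic to do. This is only a sketch: the helper names are invented
 and I'm assuming the restriction is applied by appending to the search
 string, which may not be how the citesummary code actually builds it.

```python
CITEABLE = "collection:citeable"
PUBLISHED = "collection:published"
RPP_EXCLUSION = "not t rpp"


def citesummary_query(base, published_only=False, remove_rpp=False):
    """Build the search string behind a citesummary cell (hypothetical helper)."""
    query = base
    if remove_rpp and RPP_EXCLUSION not in query:
        query += " " + RPP_EXCLUSION
    # Always keep the citeable restriction, even after removing RPP.
    query += f" AND {CITEABLE}"
    if published_only:
        query += f" AND {PUBLISHED}"
    return query


def show_remove_rpp_link(query):
    """The "Remove RPP" link should vanish once RPP is already excluded."""
    return RPP_EXCLUSION not in query


q = citesummary_query("find a r m barnett", published_only=True, remove_rpp=True)
print(q)
print(show_remove_rpp_link(q))
```

 For Barnett this reproduces the search behind the 55-paper result, with
 the citeable restriction carried through.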

 Cheers,

 Heath.

-- 
Ticket URL: <https://invenio-software.org/ticket/420#comment:15>
Invenio <http://invenio-software.org>
