#420: Citesummary quirks
------------------------+----------------------
Reporter: tbrooks | Owner: jblayloc
Type: defect | Status: assigned
Priority: major | Milestone:
Component: WebSearch | Version:
Resolution: | Keywords: INSPIRE
------------------------+----------------------
Comment (by hoc):
Replying to [comment:13 jblayloc]:
> Hi Heath,
>
> These have been tricky to track down. Strangely fun though.
>
> > Looking at find a beacom and topcite 1+ gives 77 records
> >
http://inspirehepdev.cern.ch/search?ln=en&ln=en&p=find+a+j+beacom+and+topcite+1%2B&action_search=Search&sf=&so=d&rm=&rg=100&sc=0&of=hb
> > citesummary only shows 69. Why aren't they the same?
>
> This is because citesummary is configured such that the 'All Papers'
column does a logical AND against collection:citeable. I presume this is
to avoid citesummarizing things like people's HEPNames entries or
conferences or other records that shouldn't be counted as part of their
scholarly output.
>
> The eight papers that differ between the set {{{author:beacom AND
cited:1->999999}}} and the set {{{author:beacom AND cited:1->999999 AND
collection:citeable}}} are:
> * 494016 - talk
> * 724598 - talk
> * 506481 - report
> * 752274 - I don't know what's wrong with this one
> * 799126 - I don't know what's wrong with this one
> * 752983 - lecture
> * 827674 - temporary entry
> * 798458 - report
>
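The comparison above is a set difference between the two result sets. A minimal Python sketch of that check follows. The function name `non_citeable` and the stand-in IDs 1..69 for the citeable records are hypothetical (only the eight differing record IDs come from this ticket), and nothing here calls an Invenio API.

```python
def non_citeable(all_hits, citeable_hits):
    """Record IDs matched by the search but absent from the citeable set."""
    return sorted(set(all_hits) - set(citeable_hits))

# The eight record IDs listed above as falling outside collection:citeable.
listed = {494016, 724598, 506481, 752274, 799126, 752983, 827674, 798458}

# Hypothetical stand-ins for the 69 citeable record IDs (real IDs not given here).
citeable = set(range(1, 70))
all_hits = citeable | listed

diff = non_citeable(all_hits, citeable)
print(len(all_hits), len(citeable), len(diff))  # 77 69 8
```

The 77 vs. 69 gap is exactly these eight records, so the two counts disagree only because of the citeable restriction.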
OK, in my mind "citeable" means published in a journal or an eprint (the
TC categories above don't exclude eprints from being citeable). We're
maybe starting to push the envelope a bit with pseudojournals like
Conf.Proc. and citeable report numbers but for the time being let's be
conservative and stick with true journals and eprints. The "published
only" column should then limit itself to TC=P.
> I tried to determine what these were by doing {{{find a beacom and
topcite 1+ and not collection:citeable}}}, selecting "detailed record" and
paging back and forth through them with the green arrows. Whether these
items should be citeable or not I leave to you. If policy changes we
should revisit current documents in these classes. If policy doesn't
change this should probably be added to the burgeoning FAQ.
>
Let's open this up to discussion in an EVO: how should we construct the
"all" column (i.e. restricted to citeable; I've suggested eprints and
journal articles), and how should we construct the "published only" column
(I've suggested TC=P)? I don't really know enough about the collections to
say what they're doing, but in this beacom example it doesn't quite seem to
be what I'd expect.
> > Mode gives 0 (seems wrong).
>
> For "and topcite 1+" a mode of 0 certainly would be wrong. :) But are
you sure you're looking at the same thing I am? When I look at
http://inspirehepdev.cern.ch/search?ln=en&ln=en&p=find+a+beacom+and+topcite+1%2B&action_search=Search&sf=&so=d&rm=&rg=25&sc=0&of=hcs
I see a mode of 21, which looks reasonable to me?
>
This is weird: when I looked at it later in the day, it no longer said 0.
> One potential point of strangeness with the mode calculation is that
when several citation counts tie for most frequent (as John has, with
three 21's and three 26's), the selection of which to report for the mode
is essentially random.
>
I'm wondering for citations if the mode concept is really helpful. If he
had 10 papers and his citations were 1,1,21,22,23,25,26,30,45,59, saying
mode=1 seems silly. It would make more sense to bin his citations (0-4:2,
5-9:0, 10-14:0, 15-19:0, 20-24:3, 25-29:2, etc.) and report mode=20-24,
but you already see this in the citesummary display (more or less).
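To make the binning idea concrete, here is a rough Python sketch. It is not how citesummary computes anything. The function name `binned_mode`, the bin width of 5, and the tie-break toward the lower bin are all assumptions chosen for illustration. With strict width-5 bins (0-4, 5-9, ...) the fullest bin for the ten example counts is 20-24, holding 3 papers.

```python
from collections import Counter

def binned_mode(cites, width=5):
    """Return ((low, high), count) for the fullest fixed-width bin."""
    bins = Counter((c // width) * width for c in cites)
    # Ties go to the lower bin, so the answer is deterministic rather
    # than depending on the order the counts happen to arrive in.
    low, count = min(bins.items(), key=lambda kv: (-kv[1], kv[0]))
    return (low, low + width - 1), count

cites = [1, 1, 21, 22, 23, 25, 26, 30, 45, 59]

raw_mode = Counter(cites).most_common(1)[0][0]
print(raw_mode)            # 1
print(binned_mode(cites))  # ((20, 24), 3)
```

A fixed tie-break like this would also address the "essentially random" choice between equally frequent counts mentioned above.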
> > Clicking on links (both sorting and removing RPP) seems to work well.
>
> That's good. I noticed that if you click to remove RPP the "Remove RPP"
link still lingers. I'll try to get the remove to remove its remove yet
today, and will update this ticket again when I've got that sorted out.
>
> > Looking at Michael Barnett:
http://inspirehepdev.cern.ch/search?p=find%20a%20r%20m%20barnett%20not%20t%20rpp&of=hcs
> > Clicking on "published only" number 63, takes you to 55 papers, which
doesn't match
> > the link number.
>
> Ah, now this one is interesting. If you observe the search box, the 55
papers you get from clicking on that published only number are the result
of the search:
> {{{find a r m barnett not t rpp AND collection:citeable AND
collection:published}}}
> to get the original 63, you need to say:
> {{{find a r m barnett not t rpp AND collection:published}}} (ie, remove
the citeable restriction.) So here we're seeing the opposite problem of
the one with the beacom search above: the all papers column is here not
restricting to citeable papers for its counting when it should be. This
is because you used the 'remove PDG' link, and it didn't carry through the
'collection:citeable' restriction when it recalculated.
>
> If you try something like {{{find a r m barnett not t rpp and
collection:citeable}}} you'll see (what I think is) the correct
number, 55. I'll add this to my "remove RPP link strangeness" post-it
note to try to fix Real Soon Now, and will comment back on this ticket
when I have.
>
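The two "Remove RPP" behaviours discussed above (the link should disappear once RPP is excluded, and the recalculated search should keep the citeable restriction) could be sketched roughly as follows. This is only an illustration on plain query strings: the helper names `without_rpp` and `show_remove_rpp_link` are hypothetical, and the real citesummary code may well apply the restriction somewhere other than the query text.

```python
CITEABLE = "collection:citeable"  # restriction named in this ticket
NOT_RPP = "not t rpp"             # exclusion used in the Barnett search

def without_rpp(query):
    """Add the RPP exclusion and keep (or add) the citeable restriction."""
    q = query.strip()
    if NOT_RPP not in q.lower():
        q += " " + NOT_RPP
    if CITEABLE not in q.lower():
        q += " and " + CITEABLE
    return q

def show_remove_rpp_link(query):
    """Offer the link only while RPP has not yet been excluded."""
    return NOT_RPP not in query.lower()

q = without_rpp("find a r m barnett")
print(q)
print(show_remove_rpp_link(q))
```

Applying `without_rpp` a second time leaves the query unchanged, so repeated clicks would not stack up restrictions.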
> For Barnett, the 8 unciteable papers are:
> * 224839 - talk
> * 104617 - talk
> * 308177 - talk
> * 348179 - talk
> * 310452 - talk
> * 308281 - talk
> * 433501 - book
> * 388766 - talk
Yes, these look pretty unciteable.
> and lastly
> >
http://inspirehepdev.cern.ch/search?p=find%20a%20r%20m%20barnett%20not%20t%20rpp&of=hcs
> > gives a median of 3, which seems wrong.
>
> It may be surprising, but I doubt it's wrong. Recall the method of
calculating median: list all the N scores in order, then take whichever
score sits at exactly floor(N/2)+1. Then observe that Barnett's "All
Papers" column shows:
> * 108 total citeable papers
> * 18 papers with 1-9 citations
> * 44 papers with 0 citations
> So the median is the 55th entry in his citation list (floor(108/2)+1 =
55), which is item 11 of 18 in the 1-9 citation set. Since cites display
a long tail, I'd expect most of the cites in the 1-9 range to be 1's, and
3 isn't very far off from that.
>
> So while I encourage everyone to check more cases to make sure that
things seem reasonable, I'm satisfied that the math parts are right.
>
Yes it is, mathematically; I should have paid closer attention to the
numbers. Intuition, though, told me it was suspect and I think the 44
papers in the 0 citations bin mostly shouldn't be there (unciteables). So
the mathematics is working just fine but we're back to this collections
problem. Beacom is younger than Barnett and has always had eprints, so his
unciteable problem is far less noticeable. Barnett started well before
eprints and so has a lot of unciteable papers, which are numerous enough
to mess up his stats.
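For anyone who wants to check the arithmetic, here is a small Python sketch of the floor(N/2)+1 rule applied to binned counts. The function name `median_bin` is made up for this example, and the "10+" bin simply lumps together the remaining 108 - 44 - 18 papers, whose individual bins are not listed in this ticket. Taking the rule literally for N = 108 gives position 55, which is the 11th of the 18 papers in the 1-9 bin.

```python
def median_bin(bins):
    """bins: ordered (label, count) pairs, lowest citation range first.

    Returns (position, label, rank_in_bin) for the entry at
    floor(N/2)+1 in the sorted citation list (1-based).
    """
    total = sum(count for _, count in bins)
    position = total // 2 + 1
    seen = 0
    for label, count in bins:
        if position <= seen + count:
            return position, label, position - seen
        seen += count
    raise ValueError("no papers to take a median of")

# Counts from the "All Papers" column quoted above; "10+" is derived.
barnett = [("0", 44), ("1-9", 18), ("10+", 108 - 44 - 18)]
print(median_bin(barnett))  # (55, '1-9', 11)
```

As an extreme what-if, dropping the whole 0-citation bin (`median_bin(barnett[1:])`) moves the median into the "10+" bin, which shows how much those 44 records pull the statistic down.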
> So I believe my current todo list with this ticket is:
> * "Remove RPP" link should remove itself
> * "Remove RPP" link should maintain its intersection against
collection:citeable
Sounds good. We'll see what we can find out about the citeable collection
tomorrow.
Cheers,
Heath.
--
Ticket URL: <https://invenio-software.org/ticket/420#comment:15>
Invenio <http://invenio-software.org>