Re: Interoperability - subject classification/terminology

2003-11-20 Thread Barry Mahon
20/11/2003 09:59:05, Stevan Harnad  wrote:

>
>
>(1) The discussion was about whether there is any need for
>a human-generated subject index when a full-text inverted index is
>available for boolean search. With abstract/indexing services only
>article titles and abstracts are available for searching, not article
>full-texts.

Agreed, but I would beg to differ that we are speaking about the same thing when
comparing inverted file full text with something like Medline, where indexing
terms are allocated from a carefully crafted thesaurus where the medical
terminology is rationally and logically structured.
>
>(2) Even with abstract/indexing services it would be interesting to
>find out which users and how many do and do not use the subject index,
>and why and why not (and how long the subject index will continue to
>be a human-generated one -- if it still is at all -- in the era of
>automatic tools such as latent semantic indexing and the other new
>similarity and classification metrics).

Again agreed, but most of the studies done so far comparing human-indexed
with machine-indexed texts show that the latter is not (yet?) at the point
where synonyms and the like can be dealt with. In any case I think the
original entry on which this thread has been predicated shows that the
language in many areas is not precise enough to make unambiguous indexing
possible.
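
To make the synonym point concrete, here is a minimal sketch of what a
controlled vocabulary adds: free-text variants are mapped onto one preferred
term before matching. It is an illustration only, not a description of how
Medline or any real service works; the three-entry THESAURUS table and the
function names are hypothetical.

```python
# Hypothetical toy thesaurus: variant term -> preferred (controlled) term.
THESAURUS = {
    "heart attack": "myocardial infarction",
    "cardiac infarction": "myocardial infarction",
    "myocardial infarction": "myocardial infarction",
}


def preferred_terms(text):
    """Return the set of preferred terms whose variants occur in the text."""
    lowered = text.lower()
    return {preferred for variant, preferred in THESAURUS.items() if variant in lowered}


def matches(query, document):
    """True if query and document share at least one preferred term."""
    return bool(preferred_terms(query) & preferred_terms(document))


doc = "Outcomes after heart attack in elderly patients"
print("myocardial infarction" in doc.lower())  # literal string match
print(matches("myocardial infarction", doc))   # match via the thesaurus
```

The literal match fails because the document never uses the query's wording,
while the thesaurus lookup succeeds because both resolve to the same preferred
term.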

We are in a changing world; progress is undoubtedly being made, but we are not
'there' (whatever that is) yet.


Barry Mahon


Re: Interoperability - subject classification/terminology

2003-11-20 Thread Stevan Harnad
On Thu, 20 Nov 2003, Barry Mahon wrote:

> 19/11/2003 19:29:32, Stevan Harnad  wrote:
> >
> >But please don't forget that journals never had (and never needed) subject
> >indices, the way books do
>
> What about abstracting and indexing services??

(1) The discussion was about whether there is any need for
a human-generated subject index when a full-text inverted index is
available for boolean search. With abstract/indexing services only
article titles and abstracts are available for searching, not article
full-texts. (Open access to the 2,500,000 annual articles in the 24,000
peer-reviewed journals means open access to the full text.)

(2) Even with abstract/indexing services it would be interesting to
find out which users and how many do and do not use the subject index,
and why and why not (and how long the subject index will continue to
be a human-generated one -- if it still is at all -- in the era of
automatic tools such as latent semantic indexing and the other new
similarity and classification metrics).
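
As a rough illustration of what a machine "similarity metric" looks like, the
sketch below ranks documents by cosine similarity over raw term counts. This
is deliberately the simplest such measure and is not latent semantic indexing
(which additionally applies a singular value decomposition to the
term-document matrix). The sample texts and function names are invented for
the example.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Bag-of-words term counts for a text (lowercased, letters only)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


query = term_vector("citation analysis of journal articles")
docs = {
    "d1": term_vector("co-citation analysis for refereed journal articles"),
    "d2": term_vector("a history of medieval manuscripts"),
}

# Rank document ids from most to least similar to the query.
ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked)
```

Note that this measure still depends on shared words, so by itself it does
not settle the synonym question either way.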

Stevan Harnad


Re: Interoperability - subject classification/terminology

2003-11-20 Thread Barry Mahon
19/11/2003 19:29:32, Stevan Harnad  wrote:
>
>But please don't forget that journals never had (and never needed) subject
>indices, the way books do, and one of the reasons is probably that,
>apart from happening to have been accepted for the same issue, their
>articles don't have much to do with one another -- and if they *were*
>gathered into a book as a collection, they wouldn't be in the *same*
>book, but many different, topic-specific ones.

What about abstracting and indexing services??

Bye, Barry


Re: Interoperability - subject classification/terminology

2003-11-19 Thread Stevan Harnad
On Wed, 19 Nov 2003, Chris Korycinski wrote:

> > But we are not talking here about books or book-indexing! We are
> > talking about the annual 2.5 million full-text refereed-journal
> > articles.
>
> ... in subjects outside science, remember.

I understand fully. My bet (that inverted full-text boolean search is
all that is needed to navigate the entire refereed-journal corpus, all
24,000 journals' worth, and would beat human classification any day)
does apply to all disciplines, both science and non-science.

But my bet does not apply to books and book indexes (although -- without
wagering! -- I do believe that software-based indexing and navigation
will prevail there too, as "semantic-web" tools grow and improve for
navigating large text corpora -- in any/every domain).

> My original comments apply just as well to articles as to books - many
> of the 'books' she works on are papers or conference proceedings.

If a set of articles is gathered together and published as an indexed
book, then that is a book! Nolo contendere. Book users don't have online
powers over the book; moreover, navigating just a local book is a much
narrower and more focussed task than navigating the whole of the journal
literature in a field.

But please don't forget that journals never had (and never needed) subject
indices, the way books do, and one of the reasons is probably that,
apart from happening to have been accepted for the same issue, their
articles don't have much to do with one another -- and if they *were*
gathered into a book as a collection, they wouldn't be in the *same*
book, but many different, topic-specific ones.

Journal article space never had or needed a subject index in paper days;
a fortiori, it needs it even less in online days, with the possibility of
boolean inverted-text searching (as well as other digital prestidigitation,
such as similarity matching, latent semantic indexing, citation-linking,
citation-ranking, download-ranking, co-citation analysis, etc.).
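
For readers unfamiliar with the term, boolean inverted-text searching can be
sketched in a few lines: every word is mapped to the set of documents that
contain it, and AND/OR/NOT become set operations. This is a toy under stated
assumptions (three invented titles stand in for full texts, and the function
names are made up for the example), not a description of any real search
service.

```python
import re
from collections import defaultdict


def build_index(docs):
    """Map each term to the set of document ids whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            index[term].add(doc_id)
    return index


def search(index, all_ids, must=(), any_of=(), must_not=()):
    """Boolean search: AND over `must`, OR over `any_of`, NOT over `must_not`.

    Query terms are expected in lowercase, matching the index.
    """
    hits = set(all_ids)
    for term in must:
        hits &= index.get(term, set())
    if any_of:
        hits &= set().union(*(index.get(t, set()) for t in any_of))
    for term in must_not:
        hits -= index.get(term, set())
    return hits


docs = {
    1: "Citation ranking in open access archives",
    2: "Open access and peer review",
    3: "Peer review of book indexes",
}
index = build_index(docs)
print(sorted(search(index, docs, must=["open", "access"], must_not=["citation"])))
```

The index is built once from the texts themselves; no human assigns any
subject term at any point.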

> The reality is that these areas are intrinsically different as they often
> (I'm by no means saying always!) deal with concepts/points-of-view
> rather than facts. And concepts lie closer to the realms of metadata
> and hence are intractable by naive and simplistic schemes such as
> keyword/inverted file indexing.

I'm not sure I disagree. I agree that book-space, especially in some subjects,
needs something more than just keyword and inverted full-text searching: But my
guess is that that something more will turn out to consist of further
text-analytic software tools.

But if by metadata you mean that human judgment will have to do the tagging and
sorting, as in human indexing days, I doubt it (though I make no bets, outside
the one area I am pretty sure about: the annual 2,500,000 articles in the
planet's 24,000 refereed journals -- across all disciplines and languages).

>  Have a look at the example I gave... it was edited out of my posting!

Apologies. Here it is again:

> It is concepts, not words people want. The same concept is often expressed in
> different words, or, to take another example: "Major announced in Westminster
> that Maastricht was totally unacceptable".
>
> Is this about Westminster? Majors? The Netherlands? No. Try "British foreign
> policy" or something similar (depending on the thrust of the book).

My guess? This particular example, and countless others like it, are already a
piece of cake for some of the more sophisticated digital-text processors I
mentioned above.

> Believe me, this is a simple example compared to many sociological or
> philosophical texts and any inverted-file style of 'indexing' would produce
> complete rubbish.

Indexing, yes, but software text-analyzers? I don't suggest you make any wagers!

But my bet about the refereed journal corpus stands!

Cheers, Stevan

Some references on LSI and SW:

http://lsa.colorado.edu/
http://www.cs.utk.edu/~lsi/
http://javelina.cet.middlebury.edu/lsa/out/lsa_definition.htm
http://www.w3.org/2001/sw/
http://citebase.eprints.org/cgi-bin/search
http://citeseer.nj.nec.com/cs
http://citeseer.nj.nec.com/white96similarity.html


Re: Interoperability - subject classification/terminology

2003-11-19 Thread Chris Korycinski
On 19 Nov 2003 at 15:17, Stevan Harnad wrote:


> But we are not talking here about books or book-indexing! We are
> talking about the annual 2.5 million full-text refereed-journal
> articles.
>

... in subjects outside science, remember.

My original comments apply just as well to articles as to books - many
of the 'books' she works on are papers or conference proceedings.
The reality is that these areas are intrinsically different as they often
(I'm by no means saying always!) deal with concepts/points-of-view
rather than facts. And concepts lie closer to the realms of metadata
and hence are intractable by naïve and simplistic schemes such as
keyword/inverted file indexing. Have a look at the example I gave -
sorry - you can't do that, as it was edited out of my posting!

Regards

Chris Korycinski

St Andrews eprints administrator, Main Library
===
phone: external 01334 462302 : internal x 2302
office hours: 9-5 Tue & Wed. 9-12 Th.


Re: Interoperability - subject classification/terminology

2003-11-19 Thread Stevan Harnad
On Wed, 19 Nov 2003, Chris Korycinski wrote:

> On 18 Nov 2003 at 17:48, Stevan Harnad wrote:
>
> > I don't know of any evidence that inverted full-text boolean search is
> > any less effective in one field than another. (Does anyone have any
> > such evidence?)

> I don't know about hard evidence, but my wife works as a professional
> indexer in the humanities (all the top scholars ask for her ! She isn't
> ever short of work) and indexing this style of book makes a joke out of
> keyword searching or any form of inverted file.

But we are not talking here about books or book-indexing! We are talking
about the annual 2.5 million full-text refereed-journal articles.

http://www.eprints.org/self-faq/#26.Classification

Stevan Harnad


Re: Interoperability - subject classification/terminology

2003-11-19 Thread Chris Korycinski
On 18 Nov 2003 at 17:48, Stevan Harnad wrote:

> On Tue, 18 Nov 2003, Franklin, Rosemary (franklra) wrote:
>
> > How do you search nuances?
>
> I don't know of any evidence that inverted full-text boolean search is
> any less effective in one field than another. (Does anyone have any
> such evidence?)
>

I don't know about hard evidence, but my wife works as a professional
indexer in the humanities (all the top scholars ask for her ! She isn't
ever short of work) and indexing this style of book makes a joke out of
keyword searching or any form of inverted file. It is concepts, not words
people want. The same concept is often expressed in different words,
or, to take another example:
"Major announced  in Westminster that Maastricht was totally
unacceptable".

Is this about Westminster? Majors? The Netherlands?
No.
Try "British foreign policy" or something similar (depending on the
thrust of the book).

Believe me, this is a simple example compared to many sociological or
philosophical texts and any inverted-file style of 'indexing' would
produce complete rubbish.

Regards

Chris Korycinski

St Andrews eprints administrator, Main Library
===
phone: external 01334 462302 : internal x 2302
office hours: 9-5 Tue & Wed. 9-12 Th.


Re: Interoperability - subject classification/terminology

2003-11-15 Thread Stevan Harnad
On Sat, 15 Nov 2003, Prof. Tom Wilson wrote:

> Stevan Harnad says:
>
>s> Please remember that most researchers currently search their abstracts
>s> databases and their toll-access journal content databases without the help
>s> of any subject classification taxonomies. This will continue to be the case
>s> for the open-access full-text database, once it grows to a significant size.
>s> Journal articles -- especially when they include inverted full-text -- are
>s> not, and never were, searched via prepackaged subject classifications or
>s> taxonomies or aggregations.

> I think that Stevan is a little too sweeping in his generalisation here. In
> the days before machine searching, pretty well all abstracting journals were
> organized according to some subject specific classification scheme: Chemical
> Abstracts, Metallurgical Abstracts, Nuclear Science Abstracts are among those
> I searched on behalf of scientists in that dim and distant past. At that time
> users certainly relied upon those classification schemes to help them to
> reduce the volume of material they needed to search.

I agree completely. But we are now in the days of machine searching, done by the
researchers themselves, for themselves, google-style. When search is restricted
to the inverted full-text corpus of the annual 2.5 million articles published in
the planet's 24,000 refereed journals, there is no need whatsoever to rely on
classification schemes.
http://www.eprints.org/self-faq/#26.Classification

> Those classification schemes
> continue today in the print versions and online versions generally offer the
> possibility of a search by class.

Yes, but does anyone bother to use them (online)?

> The debate about the cheapness of simplistic
> Boolean searching (which puts the costs on the user to disentangle the useful
> from the useless) versus the cost (to the producer) of high quality subject
> indexing and classification has never been settled - and doubtless never will
> be.

But one thing is sure: It is irrelevant to the issue of open access, and
certainly not something to wait for!

Stevan Harnad

> ___
> Professor T.D. Wilson, PhD
> Publisher/Editor in Chief
> Information Research
> InformationR.net
> University of Sheffield
> Sheffield S10 2TN,  UK
> e-mail: t.d.wil...@shef.ac.uk
> Web site: http://InformationR.net/
> ___
>


Re: Interoperability - subject classification/terminology

2003-11-14 Thread Stevan Harnad
On Thu, 13 Nov 2003, Franklin, Rosemary (franklra) wrote:

> Generally you are searching in natural language, depending on the fields
> tagged and how the file is organized.  Portals such as the HUMBUL site and
> others organized around broad subject areas are value-added OAI searching
> and have controlled vocabulary added, or they are in the process of adding.

I would like to make a bet about values that will prove to be worth and not
worth adding to a full-text corpus of refereed research journal articles. (Note
that this bet pertains *only* to the refereed journal article corpus, but that
does include all disciplines, including the humanities):

Until and unless XML tagging of the full-texts themselves prevails -- a
desirable outcome that is largely independent of the urgent goal of open
access -- nothing will come even close to matching (let alone beating)
the power of boolean search over the inverted full-texts, google-style
(but restricted to the OAI-compliant domain).

Please remember that most researchers currently search their abstracts databases
and their toll-access journal content databases without the help of any subject
classification taxonomies. This will continue to be the case for the open-access
full-text database, once it grows to a significant size. Journal articles --
especially when they include inverted full-text -- are not, and never
were, searched via prepackaged subject classifications or taxonomies
or aggregations. And even those taxonomies and aggregations that exist
were generated by machine analysis of the database rather than by human
classification. (In other words, they were generated by "semantic-web"
-- i.e., syntactic-web! -- computations on the full-text database.)

See Subject Thread:
    "Interoperability - subject classification/terminology"
http://www.ecs.soton.ac.uk/~harnad/Hypermail/Amsci/2384.html
and:
http://www.eprints.org/self-faq/#26.Classification

I know that especially in the humanities, many scholars and librarians are
betting otherwise. It will be interesting to see what the outcome turns out to
be.

But let it be stressed again: This has nothing to do with open access, except
inasmuch as it is extremely important not to hold back open access for even one
microsecond in order to wait for classification/taxonomy values to be added --
any more than open access should be delayed in any way to wait for preservation
values to be added.
http://www.eprints.org/self-faq/#1.Preservation

The intuitive point to keep in mind is that we are talking about OAI
eprint space, not google space. Needle/haystack problems in google space
vanish when it is contracted to just the OAI eprint subspace. OAI eprint space
consists of the yearly 2,500,000 articles in the planet's 24,000 peer-reviewed
journals in all fields and languages, before (preprints) and after peer
review (postprints).

http://www.eprints.org/self-faq/#What-is-Eprint

Stevan Harnad

NOTE: Complete archive of the ongoing discussion of providing open
access to the peer-reviewed research literature online is available at
the American Scientist September Forum (98 & 99 & 00 & 01 & 02 & 03):

http://amsci-forum.amsci.org/archives/American-Scientist-Open-Access-Forum.html
http://www.ecs.soton.ac.uk/~harnad/Hypermail/Amsci/index.html
Posted discussion to: american-scientist-open-access-fo...@amsci.org

Dual Open-Access Strategy:
BOAI-2 ("gold"): Publish your article in a suitable open-access
journal whenever one exists.
BOAI-1 ("green"): Otherwise, publish your article in a suitable
toll-access journal and also self-archive it.
http://www.soros.org/openaccess/read.shtml
http://www.eprints.org/signup/sign.php
http://www.ecs.soton.ac.uk/~harnad/Temp/self-archiving_files/Slide0026.gif
http://www.ecs.soton.ac.uk/~harnad/Temp/self-archiving_files/Slide0021.gif
http://www.ecs.soton.ac.uk/~harnad/Temp/self-archiving_files/Slide0024.gif
http://www.ecs.soton.ac.uk/~harnad/Temp/self-archiving_files/Slide0028.gif


Re: Interoperability - subject classification/terminology

2003-03-28 Thread David Goodman
An example of an indexing service that uses self-assigned
field descriptors is Dissertation Abstracts. Furthermore, it uses a
controlled list: when you submit your thesis you select your field from a
list. It need not match the name of your department.

Anyone who thinks that this provides reliable subject indexing or scanning
should give it a try. Since recent theses have abstracts, it is possible
to perform a decent subject search using free text techniques to the
extent the abstracts are descriptive. But a scanning service based only on
the category would miss most of the work in most subjects.

I fully expect the academics using the new servers to produce a subject
indexing quagmire, which the librarians and information people will then
have to devise access routes for. It's happened before, and we've done
it before. All we can hope for is that they not make things worse by
reinventing indexing, or recycling inadequate or outdated systems.
We expect nothing better from any group of academic authors--after all,
we do just as badly with our own publications.

On Thu, 27 Mar 2003, Hussein Suleman wrote:

> hi
>
> well, sure, i agree in principle ... if arXiv and similar projects agree
> to bunch all physics into a single category and use google for
> searching, with no browsing capabilities, it wouldnt be a problem at all.
>
> similarly, if we grouped together computer science, electrical
> engineering and information systems, that would be ok for gross-level
> interoperability ... once again, assuming searching is the only service
> required. frankly, i think this is a little simplistic and assumes
> digital libraries are no more than submission+search systems.
>
> [aside: why does eprints support browsing by categories ?]
>
> besides, who decides what constitutes a discipline anyway ? has anyone
> ever been able to decide if computer science is engineering or science ?
>
> i think we have more questions than answers here and it isnt as simple
> as you point out or we wouldnt even be discussing this :)
>
> ttfn,
> hussein


Dr. David Goodman

Princeton University Library
and
Palmer School of Library and Information Science, LIU

dgood...@princeton.edu


Re: Interoperability - subject classification/terminology

2003-03-27 Thread Stevan Harnad
I suppose it was useful, overall, that Chris Gutteridge (unintentionally,
as it turned out) branched his original posting, which began this line
of discussion, to a number of other lists, apart from the OAI-general
list for which it was intended. I have since gradually phased out the
other lists from the discussion, but the value of the exercise was,
I think, in illustrating both the overlap and the distinctness between
the two "OA" movements: (1) the OAI (Open Archives Initiative), with its
technical mandate being to provide digital interoperability, and its
target being the entire digital library, and (2) the BOAI (Budapest Open
Access Initiative), with its activist mandate being to provide open,
free full-text access, and its target being the peer-reviewed research
literature only (and particularly its authors and their institutions).

What is relevant, even central, to the one, can accordingly be not only
irrelevant but downright misleading to the other. I reply below
accordingly:

On Thu, 27 Mar 2003, Hussein Suleman wrote:

> well, sure, i agree in principle ... if arXiv and similar projects agree
> to bunch all physics into a single category and use google for
> searching, with no browsing capabilities, it wouldnt be a problem at all.

The Physics ArXiv is a growing centralized subset of the Physics
literature, with its own native search capability, including a taxonomy.
It is the only such archive -- in size, scale and use -- and started
well before the OAI and BOAI. The OAI (also partly inspired by ArXiv)
has now introduced the possibility of distributed archiving, across
disciplines, integrated through interoperability. As such, it has
augmented or even replaced the notion of (1) centralized, discipline-based
archiving, with native search capabilities, by the notion of (2)
distributed, institution-based archiving, with separate OAI search services.

How (and whether) to preserve ArXiv's native search and taxonomy
functions is a technical question I leave to the experts. (One naive
thought is that the taxonomic descriptors applying to a paper are rather
like keywords, so a flat string of them could be preserved as a free-text
keyword-field, which would then be searchable in the usual boolean way;
there are probably tricks for preserving their hierarchical structure too,
if need be.)
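
One way the "naive thought" above might look in practice is sketched below:
each hierarchical descriptor is expanded into one keyword token per ancestor
path, so the flat keyword field stays searchable in the usual boolean way
while a search on a parent class still finds its children. The "/" separator,
the record layout and the function name are assumptions made for the
illustration; nothing here reflects ArXiv's actual data model.

```python
def descriptor_keywords(descriptors):
    """Flatten hierarchical descriptors into keyword tokens.

    A descriptor such as "Physics/Astrophysics" yields one token per
    ancestor path ("Physics", "Physics/Astrophysics"), without duplicates
    and in first-seen order.
    """
    keywords = []
    for descriptor in descriptors:
        parts = descriptor.split("/")
        for depth in range(1, len(parts) + 1):
            token = "/".join(parts[:depth])
            if token not in keywords:
                keywords.append(token)
    return keywords


# Hypothetical record; descriptor names must not contain spaces here,
# because the keyword field is a space-separated string.
record = {
    "title": "An example paper",
    "descriptors": ["Physics/Astrophysics", "Physics/Geophysics"],
}
record["keywords"] = " ".join(descriptor_keywords(record["descriptors"]))
print(record["keywords"])
```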

But the point is that there are no "similar projects" -- at least not
among the preprint/postprint corpora covered by BOAI. There might be
among the broader kinds of digital collections covered by OAI, but that
is another matter, and it has to be kept distinct from the BOAI, whose
concern is with getting full-text open access to the entire
preprint/postprint literature, across all disciplines and institutions,
and as soon as possible, and hence with a minimum of obstacles (of which
the design and application of discipline-specific taxonomies, by way
of a prerequisite or constraint, would indeed be one).

To set one's intuitions, it is best to imagine searching the ISI
(Institute for Scientific Information) database, which is
multidisciplinary and covers the metadata, abstracts, and references for
about 7500 journals. Imagine this augmented to all disciplines,
all journals (about 20,000) and full-text. ISI has some very general
discipline classifiers, but that's all. And that's all that's needed to
confer a wealth of searching/navigation power, especially once augmented
by google-style full-text boolean search. No doubt such a corpus could
and would be augmented by further metaclassification schemes, but those
will be derived algorithmically, a posteriori, from the corpus itself,
rather than as a human pre-tagging, pre-classification process, applied
to each article as it is entered.

(Alerting, for example, would be a customized boolean rule, and probably
agent-based, applied across archives, rather than being a local-archive,
taxonomy-based function.)
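
A minimal sketch of alerting as a customized boolean rule: the user's rule is
a predicate over a record's text, and the same rule is run over the new
records of every archive rather than being tied to one archive's taxonomy.
The archive names, the record fields and the sample records are all
hypothetical.

```python
import re


def make_rule(must=(), must_not=()):
    """Build a predicate that is true when a record's text satisfies the rule."""
    def rule(record):
        text = (record["title"] + " " + record["abstract"]).lower()
        words = set(re.findall(r"[a-z0-9]+", text))
        has_all = all(term in words for term in must)
        has_banned = any(term in words for term in must_not)
        return has_all and not has_banned
    return rule


def alerts(rule, archives):
    """Apply one user's rule to the new records of every archive."""
    found = []
    for archive_name, records in archives.items():
        for record in records:
            if rule(record):
                found.append((archive_name, record["title"]))
    return found


# Hypothetical newly harvested records from two archives.
archives = {
    "archive-a": [
        {"title": "Agent-based auctions", "abstract": "A study of bidding agents."},
    ],
    "archive-b": [
        {"title": "Auctions in art history", "abstract": "No software agents involved."},
        {"title": "Protein folding", "abstract": "Unrelated."},
    ],
}

rule = make_rule(must=["auctions"], must_not=["history"])
print(alerts(rule, archives))
```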

> similarly, if we grouped together computer science, electrical
> engineering and information systems, that would be ok for gross-level
> interoperability ... once again, assuming searching is the only service
> required. frankly, i think this is a little simplistic and assumes
> digital libraries are no more than submission+search systems.

Digital libraries are no doubt more than that. But for the special
subset of the digital corpus that is the sole focus of the open-access
movement (the peer-reviewed research literature) and its main users
(researchers), searching is indeed the only service required. (This
of course includes scientometric as well as agent-based search.)
http://www.ecs.soton.ac.uk/~harnad/Temp/Ariadne-RAE.htm

> [aside: why does eprints support browsing by categories ?]

Good question! My answer would be that it is merely to support local
functionality. Whereas no one else on the planet may wish to search
only the Southampton ECS department's archive for work on agent-based
auctions, someone here at Southampton might. (But even this could be
done by

Re: Interoperability - subject classification/terminology

2003-03-27 Thread Claus Schroeter
Hi OAIers,

I agree with Hussein so far. Building taxonomies is a lot of
effort. Just for inspiration I'd like to tell what we're doing at this
point:

We created a specialized search engine that is not only full-text boolean
based but also enables users to go deeper into thematic fields. We do this
with trainable classifiers that are able to learn which field is discussed
in a text. We're not using the OAI classifications since the underlying
taxonomy varies from archive to archive and we also have material from the
web that is not categorized.

The basic idea behind this system is to enable users to explore
fields of knowledge just by selecting the fields of interest. As a
positive side effect the taxonomy of this system is freely configurable, so
you may configure a taxonomy for physics, one for chemistry or whatever.
All taxonomies use the same texts in the background.

In technical terms, the taxonomy binding of a text is not fixed.
Perhaps it could be a good idea to implement this loose binding for OAI so
that taxonomy bindings can be used as an exchangeable schema for archive
providers.
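
To give a feel for what a trainable classifier with a loosely bound taxonomy
could look like, here is a small sketch. The classifier type used by the
system described above is not stated, so a multinomial naive Bayes model with
add-one smoothing is swapped in purely for illustration; the class name, the
field labels and the training snippets are all invented.

```python
import math
import re
from collections import Counter, defaultdict


def tokens(text):
    """Lowercased word tokens of a text."""
    return re.findall(r"[a-z]+", text.lower())


class FieldClassifier:
    """Multinomial naive Bayes over word counts, with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # field -> word -> count
        self.doc_counts = Counter()              # field -> number of training texts
        self.vocabulary = set()

    def train(self, field, text):
        words = tokens(text)
        self.word_counts[field].update(words)
        self.doc_counts[field] += 1
        self.vocabulary.update(words)

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        best_field, best_score = None, -math.inf
        for field in self.doc_counts:
            total_words = sum(self.word_counts[field].values())
            score = math.log(self.doc_counts[field] / total_docs)
            for word in tokens(text):
                count = self.word_counts[field][word]
                score += math.log((count + 1) / (total_words + len(self.vocabulary)))
            if score > best_score:
                best_field, best_score = field, score
        return best_field


# The taxonomy is only the set of labels used in training, so another
# taxonomy can be trained separately without changing the stored texts.
physics = FieldClassifier()
physics.train("Astrophysics", "stars galaxies redshift telescope observations")
physics.train("Geophysics", "seismic waves earthquake mantle crust")
print(physics.classify("new telescope observations of distant galaxies"))
```

Because the labels live in the classifier rather than in the documents, a
second classifier with a different set of fields could be trained and applied
to the same texts, which is the sense of "loose binding" used above.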

If you're interested please feel free to take a look at:

http://findemaschine.pro-physik.de/?language=e

Try for example the category ->Physics/Astrophysics and use the "further
restrictions" selector to restrict on ->Physics/History of Physics or
->Physics/Geophysics. You will see that the ranking will switch to
exactly the subtopic of interest.

Just my 0.3 cent...
Best Regards from Berlin
Claus

Chemie.DE Information Service GmbH  http://www.chemie.de/
Seydelstr. 28   mailto:schroe...@chemie.de
D-10117 Berlin  Tel  +49 (0)30 204 568 - 0
Fax  +49 (0)30 204 568 - 70


On Thu, 27 Mar 2003, Hussein Suleman wrote:

> well, sure, i agree in principle ... if arXiv and similar projects agree
> to bunch all physics into a single category and use google for
> searching, with no browsing capabilities, it wouldnt be a problem at all.
>
> similarly, if we grouped together computer science, electrical
> engineering and information systems, that would be ok for gross-level
> interoperability ... once again, assuming searching is the only service
> required. frankly, i think this is a little simplistic and assumes
> digital libraries are no more than submission+search systems.
>
> [aside: why does eprints support browsing by categories ?]
>
> besides, who decides what constitutes a discipline anyway ? has anyone
> ever been able to decide if computer science is engineering or science ?
>
> i think we have more questions than answers here and it isnt as simple
> as you point out or we wouldnt even be discussing this :)
>
> Stevan Harnad wrote:
> > On Thu, 27 Mar 2003, Hussein Suleman wrote:
> >
> >>...why not use sets for the separate
> >>disciplines, aimed at particular service providers?...
> >>some disciplines are not well-defined (namely, computer science)
> >>so such archives may want to play ball with multiple service providers
> >>and hence may need different sets.
> >
> > The question of taxonomic classification sets and version-control for
> > Open Archives is a technical one, so I will not presume to comment on it
> > except from the point of view of the potential *users* of one particular
> > kind of Archive Content, namely, unrefereed preprints and refereed
> > postprints of research papers from one or many or all disciplines: This
> > -- in the google-age of boolean inverted full-text searchability --
> > does not require a detailed a-priori taxonomy, as book metadata or the
> > metadata for other kinds of material might. A fairly general sorting by
> > discipline should suffice.
> > http://www.eprints.org/self-faq/#26.Classification
> > http://www.ecs.soton.ac.uk/~harnad/Hypermail/Amsci/2385.html
> >
> >>...the service provider can provide an
> >>interface for potential data providers to self-register.
> >
> > I hope that once the number and contents of Open-Access Eprint Archives
> > for research preprints and postprints have scaled up toward something
> > closer to universality, the simple metadata descriptors "pre-refereeing
> > preprint" and "refereed journal article" plus perhaps "discipline name"
> > will be enough to guide relevant service-providers in automatically
> > harvesting their relevant metadata. Multiple self-registration seems a
> > tedious and unnecessary constraint. (Possibly a master-registry of valid
> > institutions and disciplinary archives will also help, but may not be
> > necessary unless commercial spamming invades this sector too.)
> >
> >>what remains a difficult problem, however, is how to recreate the
> >>metadata used by the service provider as its native format. so, for a
> >>typical example, if arXiv classifies items using a specific set

Re: Interoperability - subject classification/terminology

2003-03-27 Thread Hussein Suleman
hi

well, sure, i agree in principle ... if arXiv and similar projects agree
to bunch all physics into a single category and use google for
searching, with no browsing capabilities, it wouldnt be a problem at all.

similarly, if we grouped together computer science, electrical
engineering and information systems, that would be ok for gross-level
interoperability ... once again, assuming searching is the only service
required. frankly, i think this is a little simplistic and assumes
digital libraries are no more than submission+search systems.

[aside: why does eprints support browsing by categories ?]

besides, who decides what constitutes a discipline anyway ? has anyone 
ever been able to decide if computer science is engineering or science ?

i think we have more questions than answers here and it isnt as simple 
as you point out or we wouldnt even be discussing this :)

ttfn,
hussein

Stevan Harnad wrote:
> On Thu, 27 Mar 2003, Hussein Suleman wrote:
> 
>>...why not use sets for the separate 
>>disciplines, aimed at particular service providers?...
>>some disciplines are not well-defined (namely, computer science) 
>>so such archives may want to play ball with multiple service providers 
>>and hence may need different sets.
> 
> The question of taxonomic classification sets and version-control for
> Open Archives is a technical one, so I will not presume to comment on it
> except from the point of view of the potential *users* of one particular
> kind of Archive Content, namely, unrefereed preprints and refereed
> postprints of research papers from one or many or all disciplines: This
> -- in the google-age of boolean inverted full-text searchability --
> does not require a detailed a-priori taxonomy, as book metadata or the
> metadata for other kinds of material might. A fairly general sorting by
> discipline should suffice.
> http://www.eprints.org/self-faq/#26.Classification
> http://www.ecs.soton.ac.uk/~harnad/Hypermail/Amsci/2385.html
> 
>>...the service provider can provide an 
>>interface for potential data providers to self-register.
> 
> I hope that once the number and contents of Open-Access Eprint Archives
> for research preprints and postprints have scaled up toward something
> closer to universality, the simple metadata descriptors "pre-refereeing
> preprint" and "refereed journal article" plus perhaps "discipline name"
> will be enough to guide relevant service-providers in automatically
> harvesting their relevant metadata. Multiple self-registration seems a
> tedious and unnecessary constraint. (Possibly a master-registry of valid
> institutions and disciplinary archives will also help, but may not be
> necessary unless commercial spamming invades this sector too.)
> 
>>what remains a difficult problem, however, is how to recreate the 
>>metadata used by the service provider as its native format. so, for a 
>>typical example, if arXiv classifies items using a specific set 
>>structure, this is certainly not going to be the default for an 
>>institutional archive. does the service provider automatically or 
>>manually reclassify? or does it not allow browsing by categories? 
> 
> Worrying about "recreating the categories" in this boolean full-text age
> is, I believe, a waste of time (for research preprints/postprints). Just
> harness google's harvested full-text to your engine's search capability,
> if it is incapable of contending with boolean full-text search on its
> own. (Manual reclassification! Heaven forfend! Don't bother classifying
> this material in the first place, beyond the simplest of first-cuts,
> such as discipline. Any further classification should be algorithmic and
> text-data-driven, not manual.)
> 
>>in either event, the quality of the metadata from the perspective of the 
>>service provider may be an impetus for potential users to want to 
>>replicate their effort rather than rely on the automated submission from 
>>their own institutions ... this needs more thought ...
> 
> Again, I speak only for research preprints/postprints, but please let's
> not inject any further credibility into the notion that self-archiving
> author/institutions will also have to self-advertise by multiple
> self-archiving of the same paper. Surely that is one headache that
> OAI-interoperability should eradicate from the planet! Self-archiving
> itself is self-advertising (and effort) enough. Please let us not
> now -- when the momentum is still not big enough -- saddle would-be
> self-archivers with needless extra worries, and tasks!
> http://www.ecs.soton.ac.uk/~harnad/Temp/tim-arch.htm
> 
> Stevan Harnad


Re: Interoperability - subject classification/terminology

2003-03-27 Thread Stevan Harnad
On Thu, 27 Mar 2003, Hussein Suleman wrote:

> ...why not use sets for the separate
> disciplines, aimed at particular service providers?...
> some disciplines are not well-defined (namely, computer science)
> so such archives may want to play ball with multiple service providers
> and hence may need different sets.

The question of taxonomic classification sets and version-control for
Open Archives is a technical one, so I will not presume to comment on it
except from the point of view of the potential *users* of one particular
kind of Archive Content, namely, unrefereed preprints and refereed
postprints of research papers from one or many or all disciplines: This
-- in the google-age of boolean inverted full-text searchability --
does not require a detailed a-priori taxonomy, as book metadata or the
metadata for other kinds of material might. A fairly general sorting by
discipline should suffice.
http://www.eprints.org/self-faq/#26.Classification
http://www.ecs.soton.ac.uk/~harnad/Hypermail/Amsci/2385.html

> ...the service provider can provide an
> interface for potential data providers to self-register.

I hope that once the number and contents of Open-Access Eprint Archives
for research preprints and postprints have scaled up toward something
closer to universality, the simple metadata descriptors "pre-refereeing
preprint" and "refereed journal article" plus perhaps "discipline name"
will be enough to guide relevant service-providers in automatically
harvesting their relevant metadata. Multiple self-registration seems a
tedious and unnecessary constraint. (Possibly a master-registry of valid
institutions and disciplinary archives will also help, but may not be
necessary unless commercial spamming invades this sector too.)
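
A minimal sketch of how a service provider might act on those few descriptors, assuming harvested records arrive as plain dictionaries. The field names ("status", "discipline") and the sample records are hypothetical and are not taken from any real archive's metadata format.

```python
def select_for_harvest(records, discipline, statuses=("preprint", "refereed")):
    """Keep only records of the wanted discipline and refereeing status."""
    return [
        record
        for record in records
        if record.get("discipline") == discipline
        and record.get("status") in statuses
    ]

# Hypothetical harvested metadata, for illustration only.
harvested = [
    {"id": "a1", "status": "preprint", "discipline": "physics"},
    {"id": "a2", "status": "refereed", "discipline": "physics"},
    {"id": "a3", "status": "refereed", "discipline": "history"},
    {"id": "a4", "status": "thesis", "discipline": "physics"},
]

physics_ids = [r["id"] for r in select_for_harvest(harvested, "physics")]
```

With only two or three descriptors the selection is a one-line filter, which is the point being argued: nothing here needs a detailed taxonomy.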

> what remains a difficult problem, however, is how to recreate the
> metadata used by the service provider as its native format. so, for a
> typical example, if arXiv classifies items using a specific set
> structure, this is certainly not going to be the default for an
> institutional archive. does the service provider automatically or
> manually reclassify? or does it not allow browsing by categories?

Worrying about "recreating the categories" in this boolean full-text age
is, I believe, a waste of time (for research preprints/postprints). Just
harness google's harvested full-text to your engine's search capability,
if it is incapable of contending with boolean full-text search on its
own. (Manual reclassification! Heaven forfend! Don't bother classifying
this material in the first place, beyond the simplest of first-cuts,
such as discipline. Any further classification should be algorithmic and
text-data-driven, not manual.)

> in either event, the quality of the metadata from the perspective of the
> service provider may be an impetus for potential users to want to
> replicate their effort rather than rely on the automated submission from
> their own institutions ... this needs more thought ...

Again, I speak only for research preprints/postprints, but please let's
not inject any further credibility into the notion that self-archiving
author/institutions will also have to self-advertise by multiple
self-archiving of the same paper. Surely that is one headache that
OAI-interoperability should eradicate from the planet! Self-archiving
itself is self-advertising (and effort) enough. Please let us not
now -- when the momentum is still not big enough -- saddle would-be
self-archivers with needless extra worries, and tasks!
http://www.ecs.soton.ac.uk/~harnad/Temp/tim-arch.htm

Stevan Harnad


Re: Interoperability - subject classification/terminology

2003-03-27 Thread Hussein Suleman
this may be stating the obvious, but why not use sets for the separate
disciplines, aimed at particular service providers? i say it that way
because some disciplines are not well-defined (namely, computer science)
so such archives may want to play ball with multiple service providers
and hence may need different sets.
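
A minimal sketch of what harvesting one discipline set could look like on the wire, assuming the archive exposes an OAI-PMH interface. The verb and argument names (ListRecords, metadataPrefix, set, from) come from the OAI-PMH protocol; the base URL and the set name "physics" are made up for illustration.

```python
from urllib.parse import urlencode

def list_records_url(base_url, set_spec, metadata_prefix="oai_dc", from_date=None):
    """Build an OAI-PMH ListRecords request restricted to a single set."""
    params = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "set": set_spec,
    }
    if from_date is not None:
        # Optional lower bound, so periodic harvests only fetch newer records.
        params["from"] = from_date
    return base_url + "?" + urlencode(params)

# Hypothetical institutional archive and set name.
url = list_records_url("http://archive.example.org/oai", "physics",
                       from_date="2003-03-01")
```

An archive that wants to serve several service providers could expose one set per provider, and each provider would only ever request its own set.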

in any event, for something like physics, a simple set might do the
trick at the source. then, somewhat in keeping with the Kepler model (as
published in DLib a while back), the service provider can provide an
interface for potential data providers to self-register. i know this
sounds dodgy, but think of it as an alternative mechanism for
contribution. either individual users submit individual papers or groups
submit baseURLS - both go through some kind of review and while one
leads to once-off storage, the other leads to periodic harvesting.

what remains a difficult problem, however, is how to recreate the
metadata used by the service provider as its native format. so, for a
typical example, if arXiv classifies items using a specific set
structure, this is certainly not going to be the default for an
institutional archive. does the service provider automatically or
manually reclassify? or does it not allow browsing by categories? in
either event, the quality of the metadata from the perspective of the
service provider may be an impetus for potential users to want to
replicate their effort rather than rely on the automated submission from
their own institutions ... this needs more thought ...

hussein

Christopher Gutteridge wrote:
> Disciplinary/subject archives vs. Institutional/Organisation/Region based
> archives. This is going to be a key challenge now open archives begin
> to gain momentum.
>
> For example; we are planning a University-wide eprints archive. I am
> concerned that some physicists will want to place their items in both
> the university eprints service AND the arXiv physics archive. They may
> be required to use the university service, but want to use arXiv as it
> is the primary source for their discipline. This is a duplication of
> effort and a potential irritation.
>
> Ultimately, of course, I'd hope that disciplinary archives will be replaced
> with subject-specific OAI service providers harvesting from the institutional
> archives. But there is going to be a very long transition period in which
> the solution evolves from our experience.
>
> What I'm asking is; has anyone given consideration to ways of smoothing
> over this duplication of effort? Possibly some negotiated automated process
> for institutional archives uploading to the subject archive, or at least
> assisting the author in the process.
>
> This isn't the biggest issue, but it'd be good to address it before it
> becomes more of a problem.
>
>   Christopher Gutteridge
>   GNU EPrints Head Developer
>   http://software.eprints.org/


Re: Interoperability - subject classification/terminology

2003-03-12 Thread David Goodman
I'd amend it slightly, and I hope Tim would agree:
let the humans worry about access, and impact, and making sure
appropriate specialists -- AI people, librarians, and others -- continue
working on getting the computers to do yet better indexing.


On Wed, 12 Mar 2003, Tim Brody wrote:

> - Original Message -
> From: "Guy Aron" 
>
> > ... I would still
> > like clarification as to whether the original poster
> > was being asked to contribute LC classification
> > numbers or LC headings.
>
> I think the thread started with a concern that choosing LC headings when
> submitting an article to an e-print (E-Prints.org?) archive is burdensome.
>
> From my limited experience categorisation serves two purposes:
> 1) Allowing the location of items within a physical system (e.g. library
> shelves)
> 2) Locating items on a common theme/subject, as a means of discovery
>
> 1) is replaced by URIs (URLs, DOIs, etc) in the online age. For the research
> literature citations (and citation analysis) should cover 2), supplemented
> with free-text searches.
>
> As the Web has grown it has gone from classification (Yahoo directory), to
> boolean-search (Altavista), to graph-based search (Google). Each of these
> steps has come about because of a factor increase in the information to
> search across, a process that requires decreasing human effort, and as the
> classification schemes themselves become so large as to be meaningless.
>
> I suspect the same will be true of the research literature.
>
> So the moral of my story is: let the computers worry about indexing, and let
> the humans worry about access and impact.
>
> All the best,
> Tim.
>


Dr. David Goodman

Princeton University Library
and
Palmer School of Library and Information Science, LIU

dgood...@princeton.edu


Re: Interoperability - subject classification/terminology

2003-03-12 Thread Tim Brody
- Original Message -
From: "Guy Aron" 

> ... I would still
> like clarification as to whether the original poster
> was being asked to contribute LC classification
> numbers or LC headings.

I think the thread started with a concern that choosing LC headings when
submitting an article to an e-print (E-Prints.org?) archive is burdensome.

From my limited experience categorisation serves two purposes:
1) Allowing the location of items within a physical system (e.g. library
shelves)
2) Locating items on a common theme/subject, as a means of discovery

1) is replaced by URIs (URLs, DOIs, etc) in the online age. For the research
literature citations (and citation analysis) should cover 2), supplemented
with free-text searches.

As the Web has grown it has gone from classification (Yahoo directory), to
boolean-search (Altavista), to graph-based search (Google). Each of these
steps has come about because of a factor increase in the information to
search across, a process that requires decreasing human effort, and as the
classification schemes themselves become so large as to be meaningless.

I suspect the same will be true of the research literature.
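
A crude sketch of discovery driven by the citation graph rather than by a classification scheme: rank papers by how many distinct papers cite them. This is plain in-degree counting, assumed here only for illustration. It is not Google's algorithm or any particular citation service's method, and the paper ids are made up.

```python
from collections import Counter

def rank_by_citations(citations):
    """Rank cited papers by number of distinct citing papers.

    `citations` is an iterable of (citing_id, cited_id) pairs; duplicate
    pairs are counted once. Ties are broken alphabetically by id.
    """
    counts = Counter(cited for _citing, cited in set(citations))
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical citation links between four papers.
links = [("a", "c"), ("b", "c"), ("c", "d"), ("a", "d"), ("b", "d"), ("a", "c")]
ranking = rank_by_citations(links)
```

Papers that are never cited do not appear in the ranking at all, which is one reason such a ranking would need to be supplemented with free-text search.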

So the moral of my story is: let the computers worry about indexing, and let
the humans worry about access and impact.

All the best,
Tim.


Re: Interoperability - subject classification/terminology

2003-03-12 Thread Guy Aron
David Goodman wrote

> LC subject headings -- and the LC classification -- were never meant to
> apply to individual articles.
> The systems were devised, and have been maintained, to describe and
> classify entire books. (There do exist other systems that were intended to
> be suitable for both books and articles, such as BSO and UDC.)
>
> As a librarian, I consider both LC systems to be of only slight
> usefulness in science for books, even if one is an
> expert in the system, and totally useless and misleading if one is not.
> They are more valuable for books in other fields--I am told LC subject
> headings work nicely in history, and the LC classification for
> literature.
>
> As applied to journal articles, either of them is absurd altogether. 
> The existing indexing and classifying systems for journal articles,
> such as Inspec's or Chem Abs's have enough difficulties,
> without using a system not even intended for the purpose.
>
> In science, if you write a descriptive title
> and an informative abstract that includes all the important
> keywords, free text searching and citation linking will do much better
> than LC.

Dr Goodman is entitled to his opinion. Before more electrons get spilt
on this discussion, I would still like clarification as to whether the
original poster was being asked to contribute LC classification numbers
or LC headings. (I understand that Dr Goodman thinks they're both equally
useless.) When this has been resolved maybe we'll have a better idea
about what we're debating.

Guy Aron



Re: Interoperability - subject classification/terminology

2003-03-12 Thread David Goodman
LC subject headings -- and the LC classification --  were never meant to
apply to individual articles.
The systems were devised, and have been maintained, to describe and
classify entire books. (There do exist other systems that were intended to
be suitable for both books and articles, such as BSO and UDC.)

As a librarian, I consider both LC systems to be of only slight
usefulness in science for books, even if one is an expert in the system,
and totally useless and misleading if one is not. They are more valuable
for books in other fields--I am told LC subject headings work nicely
in history, and the LC classification for literature.

As applied to journal articles, either of them is absurd altogether. The
existing indexing and classifying systems for journal articles,
such as Inspec's or Chem Abs's, have enough difficulties
without using a system not even intended for the purpose.

In science, if you write a descriptive title
and an informative abstract that includes all the important
keywords, free text searching and citation linking will do much better
than LC.


Guy Aron wrote:

> I think we need to be careful to identify just what
> activity the original poster was being required to do.
> The discussion so far seems to assume that Eprints was
> requiring input of Library of Congress classification
> number. From his original post, however, it seems more
> likely that the input being required was Library of
> Congress subject headings:
> > >
> > >  On Fri, 7 Mar 2003, W F Clocksin wrote:
> > >
> > > > Hi. I am a beginning user of Eprints, and am entering metadata on the
> > > > default test archive interface. It is a real nuisance to have to
> > > > specify the Subject (which uses the Library of Congress system). For
> > > > books this makes sense because the catalog information is in the front
> > > > matter of the book, but it is unclear to me why I should have to do
> > > > this for journal articles.
>
> I would not particularly support the use of a
> classification system like LC or Dewey in an eprint
> archive. The use of a controlled vocabulary like LC,
> however, seems much more appropriate. Controlled
> vocabulary versus free text is still controversial;
> it's a different issue, however, from classification.
> Before we spend more time on this discussion I think
> we need to be clear just what it is we're debating.
> Perhaps the original poster could clarify this point?
>
> Guy Aron
>
>


Dr. David Goodman

Princeton University Library
and
Palmer School of Library and Information Science, LIU

dgood...@princeton.edu


Re: Interoperability - subject classification/terminology

2003-03-11 Thread Guy Aron
I think we need to be careful to identify just what
activity the original poster was being required to do.
The discussion so far seems to assume that Eprints was
requiring input of Library of Congress classification
number. From his original post, however, it seems more
likely that the input being required was Library of
Congress subject headings:
> >
> >  On Fri, 7 Mar 2003, W F Clocksin wrote:
> >
> > > Hi. I am a beginning user of Eprints, and am entering metadata on the
> > > default test archive interface. It is a real nuisance to have to
> > > specify the Subject (which uses the Library of Congress system). For
> > > books this makes sense because the catalog information is in the front
> > > matter of the book, but it is unclear to me why I should have to do
> > > this for journal articles.

I would not particularly support the use of a
classification system like LC or Dewey in an eprint
archive. The use of a controlled vocabulary like LC,
however, seems much more appropriate. Controlled
vocabulary versus free text is still controversial;
it's a different issue, however, from classification.
Before we spend more time on this discussion I think
we need to be clear just what it is we're debating.
Perhaps the original poster could clarify this point?

Guy Aron



Re: Interoperability - subject classification/terminology

2003-03-11 Thread David Goodman
Stevan, I agree with your conclusion-- it would be both confusing and
wasteful to develop local elaborate classification schemes.

As for the more specific difficulties of access, you simply do not appear
to realize how difficult much of this can be in practice, especially to
beginners.
My journal examples were meant to indicate the difficulty of classifying
even roughly papers in those disciplines, not just journals.
Resolving journal names with a fuzzy match will work -- 95% of the time.
A good library intends to locate and retrieve not 95%, but 100%,
and can usually accomplish 98 or 99% for journal articles.
(Considerably less for informally published items, depending on area.)
Google and other citation-index-based systems are very good, but not
that good.

Actually, there is a partial solution to locating known items
that we both know about:
to eliminate primary reliance on journal names,
indexes, and so forth and go with links and DOIs. The real difficulty
comes when you want to identify items that you haven't already known
about.

There is a certain tendency that we all have to underestimate other
people's problems.


Stevan Harnad wrote:
>
> On Mon, 10 Mar 2003, David Goodman wrote:
>
> > The reason I suggested classification is that various people in the
> > subjects covered have told me that they use this archive by checking
> > everything in their subject classification each day, and that the current
> > rather straight-forward classification suits them fine.
>
> I assume that "this archive" refers to the Physics ArXiv, which is a
> global, discipline-based archive. Some users monitor some topics daily
> or weekly, and there are ways to accommodate their needs that include a
> subject taxonomy. (Whether that taxonomy, and the classification of the
> papers within it, is best done, in our online digital era, by human
> classifiers and/or authors, rather than by a text-processing algorithm,
> is another question.)
>
> I was not referring, however, to global, discipline-based archives,
> but to local, institutional archives. For local search and use they
> certainly don't need a global taxonomy; and as bits of a harvested
> distributed worldwide "virtual archive" they are surely better sorted
> and navigated globally by cross-archive search tools than by local
> classification schemes.
>
> > People work in various ways, especially for current awareness. One of the
> > many virtues of systems such as this is that they can be designed to be
> > adaptable to individuals.
>
> The current-awareness alerting system (likewise probably better if based
> on text-processing algorithms rather than human classifiers and/or
> authors) is not the same issue as the question of whether or not there
> is any need to develop a classification system for local institutional
> refereed-research output archives. (The Eprints software, for example,
> has an alerting capability but no elaborate classification system.)
>
> > I did not mention Boolean full-text searching, only because I assumed it.
> > Stevan, would anyone design such a system without it--still, now?
>
> Not only is the boolean capability there with all inverted digital
> full-text, but (I'm betting) it can beat any human classification scheme
> (with the help of the right text-processing algorithms).
>
> > And I remain much less sanguine than you about the ability to accommodate
> > all  the fields of science -- let alone all academic knowledge -- in a
> > single relatively simple system.
>
> In one (local, institutional) archiving system or in one classification
> system? I am sanguine about the first (though not necessarily all squashed
> into a single university archive: many interoperable departmental ones
> will probably work better) whereas I consider the second unnecessary and
> a waste of time (beyond a very rudimentary, first-cut classification
> scheme): Computational algorithms on the full-text should do the rest. Not
> human classifiers (including the authors). Remember: we are talking about
> journal articles, not books or other works. Who ever searched the journal
> literature on the basis of a fixed human classification of it? (And if
> they did, how much mileage did they really get out of that taxonomy,
> compared to computational sorting based on full-text analysis?)
>
> > Anyone who has ever worked in a library can tell you about the
> > unreliability of a rough arrangement by discipline and journal name.
>
> Unreliability for what? Ambulatory, analog search? Of course. But we
> are talking about digital data and digital search. Who searches the
> journal system by taxonomy rather than, say, boolean word-search?
>
> > What subject is Phys Rev B (Condensed Matter)? or J Chem Phys? or Brain
> > Research?
>
> Who cares?
>
> If I am looking for stuff on neuropeptides, my boolean search will
> retrieve any papers from the latter two journals regardless, as long as
> they contain the indicators my algorithm specifies.
>
> > And if you alw

Re: Interoperability - subject classification/terminology

2003-03-11 Thread Stevan Harnad
On Mon, 10 Mar 2003, David Goodman wrote:

> The reason I suggested classification is that various people in the
> subjects covered have told me that they use this archive by checking
> everything in their subject classification each day, and that the current
> rather straight-forward classification suits them fine.

I assume that "this archive" refers to the Physics ArXiv, which is a
global, discipline-based archive. Some users monitor some topics daily
or weekly, and there are ways to accommodate their needs that include a
subject taxonomy. (Whether that taxonomy, and the classification of the
papers within it, is best done, in our online digital era, by human
classifiers and/or authors, rather than by a text-processing algorithm,
is another question.)

I was not referring, however, to global, discipline-based archives,
but to local, institutional archives. For local search and use they
certainly don't need a global taxonomy; and as bits of a harvested
distributed worldwide "virtual archive" they are surely better sorted
and navigated globally by cross-archive search tools than by local
classification schemes.

> People work in various ways, especially for current awareness. One of the
> many virtues of systems such as this is that they can be designed to be
> adaptable to individuals.

The current-awareness alerting system (likewise probably better if based
on text-processing algorithms rather than human classifiers and/or
authors) is not the same issue as the question of whether or not there
is any need to develop a classification system for local institutional
refereed-research output archives. (The Eprints software, for example,
has an alerting capability but no elaborate classification system.)

> I did not mention Boolean full-text searching, only because I assumed it.
> Stevan, would anyone design such a system without it--still, now?

Not only is the boolean capability there with all inverted digital
full-text, but (I'm betting) it can beat any human classification scheme
(with the help of the right text-processing algorithms).

> And I remain much less sanguine than you about the ability to accommodate
> all  the fields of science -- let alone all academic knowledge -- in a
> single relatively simple system.

In one (local, institutional) archiving system or in one classification
system? I am sanguine about the first (though not necessarily all squashed
into a single university archive: many interoperable departmental ones
will probably work better) whereas I consider the second unnecessary and
a waste of time (beyond a very rudimentary, first-cut classification
scheme): Computational algorithms on the full-text should do the rest. Not
human classifiers (including the authors). Remember: we are talking about
journal articles, not books or other works. Who ever searched the journal
literature on the basis of a fixed human classification of it? (And if
they did, how much mileage did they really get out of that taxonomy,
compared to computational sorting based on full-text analysis?)

> Anyone who has ever worked in a library can tell you about the
> unreliability of a rough arrangement by discipline and journal name.

Unreliability for what? Ambulatory, analog search? Of course. But we
are talking about digital data and digital search. Who searches the
journal system by taxonomy rather than, say, boolean word-search?

> What subject is Phys Rev B (Condensed Matter)? or J Chem Phys? or Brain
> Research?

Who cares?

If I am looking for stuff on neuropeptides, my boolean search will
retrieve any papers from the latter two journals regardless, as long as
they contain the indicators my algorithm specifies.
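
A toy sketch of the boolean inverted full-text search being referred to, assuming the full texts are already available as strings. The paper identifiers and texts are invented for illustration; a real engine would add stemming, phrase search and ranking.

```python
import re
from collections import defaultdict

def build_index(docs):
    """Map each lower-cased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return index

def search_all(index, *terms):
    """Boolean AND: ids of documents containing every term."""
    postings = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*postings) if postings else set()

def search_any(index, *terms):
    """Boolean OR: ids of documents containing at least one term."""
    found = set()
    for term in terms:
        found |= index.get(term.lower(), set())
    return found

# Hypothetical full texts, keyed by made-up identifiers.
papers = {
    "jchemphys-1": "Neuropeptides binding kinetics in solution",
    "brainres-7": "Neuropeptides and receptor expression in rat cortex",
    "prb-3": "Phonon scattering in condensed matter",
}
index = build_index(papers)
hits = search_all(index, "neuropeptides")
```

The query for "neuropeptides" finds both the chemistry and the neuroscience paper without either having been assigned to a subject category.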

> And if you always remember journal names correctly, I congratulate you but
> wish you weren't unique. All your plans--as is inevitable--are shaped by
> your own preferences. So would mine be, but at least I realize
> it--sometimes.

No need to remember journal names correctly (fuzzy matches can be
fine-tuned -- see http://paracite.eprints.org/) and (in my
opinion) no longer any need for any prefabricated a-priori human
taxonomies (in searching the refereed research journal literature) --
though a-posteriori algorithmic ones can be generated on the fly.
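
A minimal sketch of fuzzy journal-name resolution using Python's standard difflib module. This is only an illustration of the idea and is not how ParaCite works; the list of known journals and the cutoff value are assumptions.

```python
import difflib

# Hypothetical authority list of full journal titles.
KNOWN_JOURNALS = [
    "Physical Review B",
    "Journal of Chemical Physics",
    "Brain Research",
]

def resolve_journal(name, known=KNOWN_JOURNALS, cutoff=0.6):
    """Return the closest known journal title, or None if nothing is close."""
    by_lowercase = {title.lower(): title for title in known}
    matches = difflib.get_close_matches(
        name.lower(), list(by_lowercase), n=1, cutoff=cutoff
    )
    return by_lowercase[matches[0]] if matches else None

resolved = resolve_journal("Brain Reserch")
```

The cutoff is the knob that gets fine-tuned: raising it reduces wrong matches at the cost of more misremembered names going unresolved.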

Stevan Harnad


Re: Interoperability - subject classification/terminology

2003-03-11 Thread David Goodman
The reason I suggested classification is that various people in the
subjects covered have told me that they use this archive by checking
everything in their subject classification each day, and that the current
rather straight-forward classification suits them fine.

People work in various ways, especially for current awareness. One of the
many virtues of systems such as this is that they can be designed to be
adaptable to individuals. One of the pitfalls in designing a system, any
system, is to set it up to suit oneself alone.
You mention LoC, an excellent case in point.
It works beautifully--for catalogers.

I did not mention Boolean full-text searching, only because I assumed it.
Stevan, would anyone design such a system without it--still, now?

And I remain much less sanguine than you about the ability to accommodate
all  the fields of science -- let alone all academic knowledge -- in a
single relatively simple system.

Anyone who has ever worked in a library can tell you about the
unreliability of a rough arrangement by discipline and journal name.
What subject is Phys Rev B (Condensed Matter)? or J Chem Phys? or Brain
Research?
And if you always remember journal names correctly, I congratulate you but
wish you weren't unique. All your plans--as is inevitable--are shaped by
your own preferences. So would mine be, but at least I realize
it--sometimes.


On Sat, 8 Mar 2003, Stevan Harnad wrote:

> On Fri, 7 Mar 2003, David Goodman wrote:
>
> > I agree that a
> > decentralized archive, as distinguished from arXiV, does not need
> > much in the way of classification
>
> Not even ArXiv needs it: Those are physics articles, not books. They don't
> need LoC classification, only full-text boolean search, with
> scientometric ranking along the lines of:
> http://citebase.eprints.org/cgi-bin/search
>
> Moreover, if ever a useful taxonomy is generated for the refereed research
> article literature, it will be one that is scientometrically (i.e.,
> computationally) generated *from* such a digital database, not an
> old-style a-priori human classification.
>
> > I suspect the practical access for the immediate future will be
> > by known author, supplemented by the citation network.
>
> and boolean full-text search.
>
> > On the other hand, to rely on OAI harvesters and automated search tools
> > for accessing the union of all such collections is premature.
>
> Yes, but not for the reason I think you have in mind! It is premature
> because the union of all such collections is still so empty! As it
> grows, the associated tools will grow (they are the easy part!).
>
> > I am not certain whether it is within human capabilities to design
> > this--certainly none of the extensive efforts at automatic document
> > retrieval are really adequate--it's a problem of the same magnitude
> > as AI in general.
>
> For the human written word corpus as a whole. But not for the 20,000
> refereed research journals, classified, as a first cut, by their
> discipline and journal name. The rest most definitely *is* within human
> capabilities to design (along the lines mentioned above).
>
> > I would love to see this solved, of course, because the
> > known manual methods, as they are applied in libraries and
> > indexing services, are almost equally unsatisfactory.
>
> In the case of the refereed journal corpus (the only corpus at issue
> here), they are not only unsatisfactory, but completely unnecessary.
> Let us not conflate this very special (and small and tractable) part
> with the (possibly intractable) whole.
>
> Stevan Harnad
>


Dr. David Goodman

Princeton University Library
and
Palmer School of Library and Information Science, LIU

dgood...@princeton.edu


Re: Interoperability - subject classification/terminology

2003-03-08 Thread Stevan Harnad
On Fri, 7 Mar 2003, David Goodman wrote:

> I agree that a
> decentralized archive, as distinguished from arXiV, does not need
> much in the way of classification

Not even ArXiv needs it: Those are physics articles, not books. They don't
need LoC classification, only full-text boolean search, with
scientometric ranking along the lines of:
http://citebase.eprints.org/cgi-bin/search

Moreover, if ever a useful taxonomy is generated for the refereed research
article literature, it will be one that is scientometrically (i.e.,
computationally) generated *from* such a digital database, not an
old-style a-priori human classification.

> I suspect the practical access for the immediate future will be
> by known author, supplemented by the citation network.

and boolean full-text search.

> On the other hand, to rely on OAI harvesters and automated search tools
> for accessing the union of all such collections is premature.

Yes, but not for the reason I think you have in mind! It is premature
because the union of all such collections is still so empty! As it
grows, the associated tools will grow (they are the easy part!).

> I am not certain whether it is within human capabilities to design
> this--certainly none of the extensive efforts at automatic document
> retrieval are really adequate--it's a problem of the same magnitude
> as AI in general.

For the human written word corpus as a whole. But not for the 20,000
refereed research journals, classified, as a first cut, by their
discipline and journalname. The rest most definitely *is* within human
capabilities to design (along the lines mentioned above).

> I would love to see this solved, of course, because the
> known manual methods, as they are applied in libraries and
> indexing services, are almost equally unsatisfactory.

In the case of the refereed journal corpus (the only corpus at issue
here), they are not only unsatisfactory, but completely unnecessary.
Let us not conflate this very special (and small and tractable) part
with the (possibly intractable) whole.

Stevan Harnad


Re: Interoperability - subject classification/terminology

2003-03-08 Thread David Goodman
The matter is not quite so simple, Stevan. I agree that a
decentralized archive, as distinguished from arXiV, does not need
much in the way of classification--especially if its
classification will be different from that of every other such
archive. I suspect the practical access for the immediate future will be
by known author, supplemented by the citation network.

On the other hand, to rely on OAI harvesters and automated search tools
for accessing the union of all such collections is premature.
I am not certain whether it is within human capabilities to design
this--certainly none of the extensive efforts at automatic document
retrieval are really adequate--it's a problem of the same magnitude
as AI in general. I would love to see this solved, of course, because the
known manual methods, as they are applied in libraries and indexing services,
are almost equally unsatisfactory.
(All the above is a 3-sentence summary of decades of work of many good
researchers, and I am the first to admit that it is an inexpert summary at that)

But it does seem that we will be adopting a policy of getting it all
accumulated, and hoping that the next (intellectual) generation will be
smart enough to get it organized.
It should be obvious that this is not an argument against making our
material  available, which must be done while
the material can still be captured.

On Fri, 7 Mar 2003, Stevan Harnad wrote:

> [Thread: http://www.ecs.soton.ac.uk/~harnad/Hypermail/Amsci/2384.html ]
>
> I agree 100% with the point made by the commentator below:
> Institutional Eprint Archives for refereed research papers do *not*
> require an elaborate classification system (such as Library of
> Congress). These are not books. And the OAI harvesters and search
> engines will be the real, cross-archive search tools; elaborate
> pre-classification is not needed just for searching within one's own
> university's local research output (and creating such an elaborate
> classification system is, in my opinion, a waste of time). (And in any
> case, I would put my money on boolean inverted full-text search, with
> scientometric impact ranking, over any prefabricated human taxonomy in
> this online age.)
>
> Reply to comment below: Just pick one default subject and forget
> about the rest.
>
> Stevan Harnad
>
>
>  On Fri, 7 Mar 2003, W F Clocksin wrote:
>
> > Hi. I am a beginning user of Eprints, and am entering metadata on the
> > default test archive interface. It is a real nuisance to have to
> > specify the Subject (which uses the Library of Congress system).  For
> > books this makes sense because the catalog information is in the front
> > matter of the book, but it is unclear to me why I should have to do
> > this for journal articles. For multidisciplinary articles, it might
> > mean specifying a number of Subjects using the scrolling textbox, which
> > would take longer than copy/pasting the rest of the metadata. I would
> > rather just leave out the Subject. To what extent is a required Subject
> > built into ePrints, or is it simply feature of the test interface that
> > I could omit from a custom interface?
> >
> > William Clocksin
> > www.clocksin.com
>


Dr. David Goodman

Princeton University Library
and
Palmer School of Library and Information Science, LIU

dgood...@princeton.edu


Re: Interoperability - subject classification/terminology

2003-03-07 Thread Stevan Harnad
[Thread: http://www.ecs.soton.ac.uk/~harnad/Hypermail/Amsci/2384.html ]

I agree 100% with the point made by the commentator below:
Institutional Eprint Archives for refereed research papers do *not*
require an elaborate classification system (such as Library of
Congress). These are not books. And the OAI harvesters and search
engines will be the real, cross-archive search tools; elaborate
pre-classification is not needed just for searching within one's own
university's local research output (and creating such an elaborate
classification system is, in my opinion, a waste of time). (And in any
case, I would put my money on boolean inverted full-text search, with
scientometric impact ranking, over any prefabricated human taxonomy in
this online age.)

Reply to comment below: Just pick one default subject and forget
about the rest.

Stevan Harnad


 On Fri, 7 Mar 2003, W F Clocksin wrote:

> Hi. I am a beginning user of Eprints, and am entering metadata on the
> default test archive interface. It is a real nuisance to have to
> specify the Subject (which uses the Library of Congress system).  For
> books this makes sense because the catalog information is in the front
> matter of the book, but it is unclear to me why I should have to do
> this for journal articles. For multidisciplinary articles, it might
> mean specifying a number of Subjects using the scrolling textbox, which
> would take longer than copy/pasting the rest of the metadata. I would
> rather just leave out the Subject. To what extent is a required Subject
> built into ePrints, or is it simply a feature of the test interface that
> I could omit from a custom interface?
>
> William Clocksin
> www.clocksin.com


Re: Interoperability - subject classification/terminology

2003-01-16 Thread Guy Aron
On Wednesday 15 Steve Hitchcock wrote
> [ ... ]Full text indexing can begin to tell us what a
> text is *about*, rather than simply where it is located, the classical
> purpose of classification. Through knowing what a text is about, we can
> make connections with other works in ways that are much more flexible than
> is offered by classification.

I don't know that I quite go along with this. If I classify a book in
the 330s in Dewey, this tells us more than that the book is in the 330s -
it tells us that the book belongs in that class. A classification number
is more than just "marking and parking", eg Shelf 20, Row 2. To me the
"classical purpose" of classification is to group similar things together.
This grouping has to be based on some analysis of what topics the thing
covers (or, in Dewey, the discipline from which it emanated).

Guy Aron
RMIT University Library


Re: Interoperability - subject classification/terminology

2003-01-15 Thread Steve Hitchcock
At 13:31 15/01/03 +, Pauline Simpson wrote:

>Following on from the OAI Geneva meeting  - to open the discussion  please see
>http://tardis.eprints.org/discussion/

Pauline, A thought-provoking page that helpfully outlines all the
issues. A few points below, but first we need to make a distinction between
works where the full text is not available digitally, and those where it
is. So the question whether there is a need for classification boils down
to: Yes for the former, and (mostly) No for the latter.

By (mostly) I mean let's make it optional. That means, in the case of
institutional repositories of research papers (the latter category), don't
burden the repository with the need to maintain categorization as a core
task. Leave that to services. If it's worth doing, then people will find
the resources to do it, but it must not compromise the task of
repositories, which is to make the texts available.

If full texts are available, we have the chance to automate search and
indexing, say full-text indexing or citation indexing. This is vastly more
powerful and cost-effective, but we have to recognise it is not the same
thing as classification. Full text indexing can begin to tell us what a
text is *about*, rather than simply where it is located, the classical
purpose of classification. Through knowing what a text is about, we can
make connections with other works in ways that are much more flexible than
is offered by classification.

You ask: Can we rely on web search engines like Google to search deeply or
accurately enough?

At the moment, simply, yes. It's not the fault of Google that it can't
index most of the journal literature.

Where I think classification may continue to have a role is in interface
design - you give examples. Classification can inform browsing. This brings
us back to services. Services will produce interfaces. In principle,
repositories do not need to produce user (as opposed to author or
management) interfaces, although in practice there will be few
institutional repositories that will be able to resist doing so, for good
reasons, but again, they don't have to, and it should be optional and minimal.

When you ask if the 'push' scenario should replace harvesting, that's
interesting because it is counter to the framework OAI has put in place.
That is, to reduce the burden on data providers at the expense of service
providers, recognising that we have to make the entry threshold for authors
and repositories as low as possible. That can make it difficult for service
providers, see Liu et al.
http://www.dlib.org/dlib/april01/liu/04liu.html
but overall it probably remains the best approach, especially if
repositories concentrate on optimising the submitted metadata within the
OAI framework.
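
To make the harvesting ("pull") model concrete, here is a minimal sketch of how a service provider might build an OAI-PMH ListRecords request. The repository address and the helper name list_records_url are assumptions made up for illustration; only the verb and the metadataPrefix, set and from arguments come from the protocol itself.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None, from_date=None):
    """Build an OAI-PMH ListRecords request URL (hypothetical helper)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        # Optional: restrict the harvest to one set defined by the repository.
        params["set"] = set_spec
    if from_date:
        # Optional: only records added or changed since this date.
        params["from"] = from_date
    return base_url + "?" + urlencode(params)


# The base URL below is a placeholder, not a real repository.
url = list_records_url("http://archive.example.org/oai", from_date="2003-01-01")
print(url)
```

The point of the sketch is that the data provider only has to answer such requests; any subject grouping or browsing interface can be built later by whoever harvests the records.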

Steve Hitchcock
Open Citation (OpCit) Project 
IAM Research Group, Department of Electronics and Computer Science
University of Southampton SO17 1BJ,  UK
Email: sh...@ecs.soton.ac.uk
Tel:  +44 (0)23 8059 3256 Fax: +44 (0)23 8059 2865


___
OAI-eprints mailing list
oai-epri...@lists.openlib.org
http://lists.openlib.org/mailman/listinfo/oai-eprints


Re: Interoperability - subject classification/terminology

2003-01-15 Thread Pauline Simpson
Following on from the OAI Geneva meeting  - to open the discussion  please see

http://tardis.eprints.org/discussion/

Pauline


Pauline Simpson,
Head of Information Services, Southampton Oceanography Centre
and
Faculty Liaison Leader for Science, Engineering and Math
and
TARDIS (Univ Southampton e-Prints) Project Manager

University of Southampton Waterfront Campus,  European Way,
Southampton, SO14 3ZH, England

  Tel:  +44-(0)23 8059 6111: Fax  +44-(0)23 8059 6115
  email:  p...@soc.soton.ac.uk ; p...@soton.ac.uk
  Web :  http://www.soc.soton.ac.uk

OAI-eprints mailing list
oai-epri...@lists.openlib.org
http://lists.openlib.org/mailman/listinfo/oai-eprints


Re: Interoperability - subject classification/terminology

2002-11-25 Thread Subbiah Arunachalam
Thanks very much Peter, Stevan, Johnson and others who have given your
valuable comments. Let me pose my question in another form:

The new information and communication technologies have tremendous
potential to facilitate communication flow among scientists (researchers)
and between scientists and their 'clients' (in the case of agricultural
research, the clients are the farmers and policymakers). At present,
physicists (especially high energy physicists and astronomers) and
computer scientists are taking considerable advantage of ICTs.
Agricultural scientists are among the poorest users of ICTs. How can
we bring the benefits of ICTs to agricultural researchers? How can
we make the transition from 'poor use today' to 'much better use
tomorrow'? If I am able to find the funds, how can I go about actually
making the transition to the better tomorrow? It is one thing to say
that different subjects/ fields have different cultures, but another
to do something about it. I am interested in changing the culture in
agriculture. In my opinion, agriculture is a key area today. There is so
much needless poverty and hunger in the world. Most developing countries
depend on agriculture for their survival. We need to act quickly in
that area.

Regards.

Arun


Re: Interoperability - subject classification/terminology

2002-11-25 Thread Stevan Harnad
On Sun, 24 Nov 2002, Thomas Krichel wrote:

>sh> (2) The University Eprint Archive as a means of providing open access
>sh> to all of the university's peer-reviewed research output (before and
>sh> after peer review). Almost without exception, this is the work that
>sh> also appears in the peer-reviewed journals sooner or later (indeed,
>sh> that is how it gets peer-reviewed).
>sh>
>sh> It should be clear that (2) is a very special subset of (1). But
>sh> it should be equally clear that that special subset does not have any
>sh> particular or pressing classification problem!
>
>   I beg to differ. Scholars are subject to herd behavior. You will not
>   get scholars to deposit papers in the local archive if their colleagues
>   in other universities don't do it.

Agreed.

>   Thus you have to approach scholars by community.

Agreed.

>   To do that, you need to classify the
>   material that you have per discipline,

You just lost me! Isn't a university a scholarly community? Moreover,
the scholar's university is a scholarly community with which the scholar
shares some rather vital interests: They employ the scholar, the scholar's
research funding pays some of their overhead, hence they have a shared
interest in each of their scholars' maximizing their research impact.

There is no such shared interest with a "discipline," distributed
worldwide. (If anything, there is competition for impact within a
discipline!)

>   in order to build
>   discipline-specific aggregators, such as the (pioneering)
>   RePEc project for economics.

I admire RePEc http://repec.org/, and apologize for having failed to
mention it, along with ArXiv, ResearchIndex, and the Institutional
Eprint Archives. RePEc is a very valuable and important contributor to
open access and self-archiving (although it is not all full-text and not
all open-access). It is a collaborative effort among institutions.

But it is not at all clear that as institutional self-archiving
(in all university departments and disciplines) gains momentum there
will be any need for classification (though it would not hurt to
tag papers from economics departments "economics" too, especially
for preprints). The classification will be amply accomplished by the
journal-names along with boolean search through the inverted indices
of the articles' titles, keywords, and full-texts. When the time comes,
a master-classification of the planet's 20,000 peer-reviewed journals
can be added as a supplement, along with any further taxonomies that
analyses of the inverted full-text corpus itself generate, supplemented
by citation and co-citation analyses and other scientometric goodies.
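
As a rough illustration of what a co-citation analysis computes, here is a minimal sketch. The paper identifiers and reference lists are invented for the example, and the function name cocitation_counts is hypothetical; real scientometric tools work on much larger citation databases.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(reference_lists):
    """Count how often each pair of works is cited together by the same paper."""
    counts = Counter()
    for refs in reference_lists.values():
        # Deduplicate and sort so each pair has one canonical ordering.
        for pair in combinations(sorted(set(refs)), 2):
            counts[pair] += 1
    return counts


# Invented data: citing paper -> works it cites.
reference_lists = {
    "p1": ["x", "y", "z"],
    "p2": ["x", "y"],
    "p3": ["y", "z"],
}
counts = cocitation_counts(reference_lists)
for pair, n in counts.most_common():
    print(pair, n)
```

Pairs that are co-cited often can be treated as related, which is one way a grouping could be derived from the corpus itself rather than assigned in advance.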

>sh> can beat google-style boolean search on an inverted full-text index,
>sh> especially if aided by citation-frequency, hit-based, recency-based,
>sh> or relevance-based ranking of search output, as done, for example,
>sh> by http://citebase.eprints.org/help/index.php ).
>
>   Yes but all those services require discipline based,
>   relational dataset to be precise.

google? a discipline-based relational dataset?

Stevan

NOTE: A complete archive of the ongoing discussion of providing open
access to the peer-reviewed research literature online is available at
the American Scientist September Forum (98 & 99 & 00 & 01 & 02):


http://amsci-forum.amsci.org/archives/American-Scientist-Open-Access-Forum.html
or
http://www.ecs.soton.ac.uk/~harnad/Hypermail/Amsci/index.html

Discussion can be posted to: american-scientist-open-access-fo...@amsci.org

See also the Budapest Open Access Initiative:
http://www.soros.org/openaccess

the Free Online Scholarship Movement:
http://www.earlham.edu/~peters/fos/timeline.htm

the OAI site:
http://www.openarchives.org

and the free OAI institutional archiving software site:
http://www.eprints.org/


Re: Interoperability - subject classification/terminology

2002-11-25 Thread Thomas Krichel
  Stevan Harnad writes

> (2) The University Eprint Archive as a means of providing open access
> to all of the university's peer-reviewed research output (before and
> after peer review). Almost without exception, this is the work that
> also appears in the peer-reviewed journals sooner or later (indeed,
> that is how it gets peer-reviewed).
> 
> It should be clear that (2) is a very special subset of (1). But
> it should be equally clear that that special subset does not have any
> particular or pressing classification problem!

  I beg to differ. Scholars are subject to herd behavior. You will not
  get scholars to deposit papers in the local archive if their colleagues
  in other universities don't do it. Thus you have to approach 
  scholars by community. To do that, you need to classify the 
  material that you have per discipline, in order to build
  discipline-specific aggregators, such as the (pioneering)
  RePEc project for economics. 

> can beat google-style boolean search on an inverted full-text index,
> especially if aided by citation-frequency, hit-based, recency-based,
> or relevance-based ranking of search output, as done, for example,
> by http://citebase.eprints.org/help/index.php ).

  Yes but all those services require discipline based, 
  relational dataset to be precise. 


  Cheers,

  Thomas Krichel   mailto:kric...@openlib.org
  http://openlib.org/home/krichel
  RePEc:per:1965-06-05:thomas_krichel



Re: Interoperability - subject classification/terminology

2002-11-23 Thread Peter Suber
At 04:19 AM 11/23/2002 Subbiah Arunachalam wrote:

> Why is it that Open Archives/ E-prints works well in
> some fields (physics, astronomy, computer science) and
> not in other fields (say, agriculture)? I would like
> to hear from members of the list.

Arun: Here's my list of the FOS-relevant differences among the
disciplines.  Some are effects rather than causes of archive use, and some
are relevant to aspects of FOS other than archive use. But it's a start.

Different disciplines have different needs:

Some have superb print indices, online indices, or search engines,
and some don't.

Some have online preprint exchanges, and some don't.

The literature in some fields is pure text, perhaps with an occasional
table or illustration, while in others it relies heavily on images
or even multi-media presentations.

In some, journal literature is the primary literature, while in
others it only reports on the history and interpretation of the
primary literature.

In some fields, both truth and money are at stake in the results
reported in scholarly literature, while in others, only truth is
at stake.

In some fields, most published research is funded, while in others
very little is.

In some disciplines, the cost of research is greater than the cost
of publication, while in others, the reverse is true.

In some fields, most journal publishers are for-profit corporations,
while in other fields most are non-profit universities, libraries,
or professional societies.

In some fields, nearly all publishing researchers are employed by
universities, while in others the fraction is significantly smaller.

In some fields, the sets of journal readers and journal authors are
nearly identical, while in others they overlap only slightly.

In some fields, research will be impeded if access to journal
literature is not timely, while in others timeliness matters much
less.

In some fields, the percentage of published literature which is online
is comparatively high and growing fast; in others it is negligible
and growing glacially.

We should not expect, then, that a solution which fits all disciplines
will occur early in this evolution, or that a solution with this potential
will apply to all disciplines at roughly the same time.

http://www.earlham.edu/~peters/fos/index.htm#disciplines

I'm continually revising this list and in any case look forward to other,
more specific answers to Arun's question.

  Peter
--
Peter Suber, Professor of Philosophy
Earlham College, Richmond, Indiana, 47374
Email pet...@earlham.edu
Web http://www.earlham.edu/~peters

Editor, Free Online Scholarship Newsletter
http://www.earlham.edu/~peters/fos/
Editor, FOS News blog
http://www.earlham.edu/~peters/fos/fosblog.html


Re: Interoperability - subject classification/terminology

2002-11-23 Thread Stevan Harnad
On Sat, 23 Nov 2002, Subbiah Arunachalam wrote:

> Why is it that Open Archives/ E-prints works well in
> some fields (physics, astronomy, computer science) and
> not in other fields (say, agriculture)? I would like
> to hear from members of the list.

Others are invited to reply too. Here is my own candidate explanation:

(1) It is not that physics or astronomy or computer science are
different from other fields with regard to the benefits or feasibility of
self-archiving and open access in their fields. All fields can benefit
from it and it is feasible in all fields. There are reasons, however,
why self-archiving *began* in physics/astronomy, and why it came early in
computer science too.

(2) Self-archiving began in physics (and soon generalized to astronomy)
because physics already had, in paper days, a "preprint culture."
Physicists had already learned, well before the online era, that they
could accelerate the pace and interactivity of research if they did
not wait till published versions of papers appeared in print. Especially
in high-energy physics, they adopted the practise of mailing preprints
of their work to one another, to routing lists, and to a number of
central depositories.

(3) This practise simply generalized, in the beginning of the '90s,
quite naturally, as the technology became available, first to email
routing lists, and then to a web depository. Given the existing preprint
culture, this subsequent development requires no special explanation.
The physicists were smarter than the rest of us in having already
discovered the benefits to research progress of sharing preprints as
early as possible. They would have had to be rather thick to just keep
doing that in paper once email and the web were available!

(4) The practise of self-archiving immediately began to spread to other
areas of physics and allied fields (astronomy, mathematics), but the
important fact has to be noted that from the very beginning in August
1991 to the present day, over a decade later, that growth has been
merely linear (which means, currently, 3500 deposits per month)
http://arxiv.org/show_monthly_submissions

(5) At that linear growth rate, it would take 10 years before everything
being published in physics (in that year, 2012) was being self-archived.
Physics/astronomy/maths are still ahead of all disciplines, but their
lead is not dramatic enough, and another decade would be far, far too
long a wait. What is needed is something that will not only (i) accelerate
self-archiving in those head-start fields to a curvilinear upward
growth-rate that will capture their total current research output much
sooner, but also something that will (ii) universalize the practise
of self-archiving to all the other late-comer disciplines, and capture
their full research output too (currently about 2,000,000 articles per
year, appearing in the approximately 20,000 peer-reviewed journals 
that exist today in all disciplines and languages worldwide).

(6) My own hypothesis is that distributed, institutional self-archiving
will be the critical factor that will induce this acceleration and
universalization of self-archiving, as centralized, discipline-based
self-archiving alone has so far failed to do.

(7) The reason is that the rationale for institutional self-archiving
makes the benefits of open access explicit for all
researchers. Researchers and their own institutions (not their
disciplines) are the co-beneficiaries of the maximized research
visibility, accessibility, usage, citation and impact that are provided
by maximizing research access (i.e., universal, open access) through
self-archiving. It is researchers and their institutions whose research
output and research impact, and the indirect rewards that they bring --
in the form of research funding, income and standing, prizes and prestige
-- benefit from open access.

(8) In addition, research institutions have the further motivation to
try to relieve their serials subscription/license crises by doing whatever
they can to promote open access through self-archiving: Distributed
self-archiving is reciprocal.

(9) And the motivation for institutional reciprocity in self-archiving
is not just based on (a) the potential to maximize the impact of
institutional research output, nor on the possibility of eventually
(b) relieving institutional serials budget burdens. Access itself -- (c)
access to the peer-reviewed research output of all other universities --
can only enhance the quality and productivity of their own researchers'
work, for in the current toll-access system no institution, not even
the biggest or wealthiest institution, can afford to provide access
to anywhere near the total peer-reviewed research literature for its
researchers (in any field).  

(10) The fourth reason that distributed institutional self-archiving may
well prove to be the way to accelerate and universalize open access is
that (d) internal and external research assessment (to reward researc

Re: Interoperability - subject classification/terminology

2002-11-23 Thread Subbiah Arunachalam
Why is it that Open Archives/ E-prints works well in
some fields (physics, astronomy, computer science) and
not in other fields (say, agriculture)? I would like
to hear from members of the list.

Arun
[Subbiah Arunachalam]

--- Pauline Simpson wrote:
> Dear All
>
> At the OAI Geneva I undertook to do the following:
>
> 2. Investigate OAI and OAF email archives for prior
> discussion and
> synthesize
>
> 3. Open the discussion with the intention of
> constructing a model/s to
> address perceived problems. We will need a statement
> of the problem/s and
> suggested solutions (some already articulated on
> Saturday)
>
> At present we have completed item 2 and are now
> compiling a table of
> all e-Print archives (that we can find!) with an
> annotation of what subject
> classification they 'appear' to be using  :  LOC;
> DDC;  In House
> Classification (possibly based on LOC or another);
> In House
> terminology;  Faculty/Dept/Group; None.
>
> We hope to have completed this work by the end of
> next week and will be
> placing it on the web for validation and additions.
> This will then form
> the basis of discussion on the way forward.  I will
> email the list again
> when we have done this.
>
> If any of you know of the existence of such a table
> already please let me
> know  (and send it to me) so that we do not
> duplicate effort.
>
> It has taken longer than I thought but I believe
> this evidence gathering
> exercise will be a worthwhile tool in our
> deliberations concerning
> harvesting and interoperability between
> institutional and discipline based
> e-Print archives.
>
> best wishes
>
> Pauline
>
>
--
> Pauline Simpson,
> Head of Information Services, Southampton
> Oceanography Centre
> and
> Faculty Liaison Leader for Science, Engineering and
> Math
> and
> TARDIS (Univ Southampton e-Prints) Project Manager
>
> University of Southampton Waterfront Campus,
> European Way,
> Southampton, SO14 3ZH, England
>
>   Tel:  +44-(0)23 8059 6111: Fax  +44-(0)23 8059
> 6115
>   email:  p...@soc.soton.ac.uk ; p...@soton.ac.uk
>   Web :  http://www.soc.soton.ac.uk
>
>


Re: Interoperability - subject classification/terminology

2002-11-22 Thread Stevan Harnad
I would like to raise a query on a point of information regarding the
problem of subject classification for University Eprint Archives.

Let us first clarify a potential point of misunderstanding. There are
(at least) two ways to think of University Eprint Archives, both of
them important and valid, but most decidedly not both the same. Hence
conflating these two aspects of Institutional Archiving and assuming one
size shoe is needed for both risks creating podiatric problems for both!

(1) The University Eprint Archive as the University Digital Library --
or, more specifically, the University Digital Library for all of the
University's own scholarly, scientific and pedagogic output. (This
includes journal articles, books, teaching materials, and any other
digital content the University produces and wishes to include in its
Eprint Archive.)

There is no question whatsoever that a rigorous system of classification
and tagging -- to make such a total university digital output navigable,
and integrable and interoperable with corresponding digital output from
other universities, in similar University Eprint Archives -- is extremely
important to have, indeed a prerequisite for the usefulness and usability
of such an Archive.

(2) The University Eprint Archive as a means of providing open access
to all of the university's peer-reviewed research output (before and
after peer review). Almost without exception, this is the work that
also appears in the peer-reviewed journals sooner or later (indeed,
that is how it gets peer-reviewed).

It should be clear that (2) is a very special subset of (1). But
it should be equally clear that that special subset does not have any
particular or pressing classification problem! These are not books. They
are journal articles. Our journal articles are not indexed in our
university library card catalogues (only the journals in which they appear
are). When we want to search the journal literature, we do not look
to any university classification system, we go to indexing services
such as INSPEC, MEDLINE, ISI, etc. (These have their own classification
systems, but I am willing to bet that for this corpus not one of those
can beat google-style boolean search on an inverted full-text index,
especially if aided by citation-frequency, hit-based, recency-based,
or relevance-based ranking of search output, as done, for example,
by http://citebase.eprints.org/help/index.php ).
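
For readers unfamiliar with the terms, boolean search on an inverted full-text index, with the hits ranked by citation frequency, can be sketched in a few lines. The documents, citation counts and function names below are invented for illustration and are not taken from Citebase.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case a text and split it into a set of alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(docs):
    """Inverted index: each term maps to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search_and(index, terms):
    """Boolean AND: documents containing every term."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()


def search_or(index, terms):
    """Boolean OR: documents containing at least one term."""
    return set().union(*(index.get(t.lower(), set()) for t in terms))


def rank_by_citations(doc_ids, citations):
    """Most-cited first; ties broken by document id."""
    return sorted(doc_ids, key=lambda d: (-citations.get(d, 0), d))


# Invented corpus and citation counts.
docs = {
    "a1": "Preprint culture in high energy physics",
    "a2": "Self-archiving and open access in physics",
    "a3": "Open access for agricultural research",
}
citations = {"a1": 12, "a2": 40, "a3": 3}

index = build_index(docs)
hits = search_and(index, ["open", "access"])
print(rank_by_citations(hits, citations))
```

No subject code is assigned to any document here; the index is built entirely from the words of the texts, and the ordering comes from the citation counts.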

I think it is extremely important to make it crystal clear that the
peer-reviewed research corpus -- and those University Eprint Archives
for which this is the main target literature at this time -- do not have
a classification problem, and need not and should not wait for any
solution to any classification problem before getting on with the
infinitely more pressing task of filling those archives with their
university's research output!

Now some specific comments and queries:

On Fri, 22 Nov 2002, Pauline Simpson wrote:

> At the OAI Geneva I undertook to do the following:
> 2. Investigate OAI and OAF email archives for prior discussion and
> synthesize
> 3. Open the discussion with the intention of constructing a model/s to
> address perceived problems. We will need a statement of the problem/s and
> suggested solutions (some already articulated on Saturday)

I was unfortunately unable to attend the Geneva OAI Meeting, so I would
like to address a question to Pauline:

Are the perceived problems in question the classification problems of
University Eprint Archives conceived in sense (1), i.e., as university
digital libraries for all university scholarly and pedagogic output,
or conceived in sense (2), i.e., as a means of providing open access to
university research output?

And if the two were not distinguished formally and explicitly in this
way, was it made clear to all concerned at least informally that the
classification problem applies only to (1) and not to (2)?

> At present we have completed item 2 and are now compiling a table of
> all e-Print archives (that we can find!) with an annotation of what subject
> classification they 'appear' to be using  :  LOC;  DDC;  In House
> Classification (possibly based on LOC or another);  In House
> terminology;  Faculty/Dept/Group; None.

Again, I wonder whether you could make it clear what the objective of
this exercise would be for those University Eprint Archives that have
been created exclusively, or primarily, to provide open access to
university research output (i.e., 2), hence having no need whatsoever
to adopt or use any classification system?

> I believe this evidence gathering
> exercise will be a worthwhile tool in our deliberations concerning
> harvesting and interoperability between institutional and discipline based
> e-Print archives.

Again, I think it would be immensely helpful, and would help both agenda
(1) and agenda (2) along their respective paths if the two agendas were
clearly distinguished and it were made clear that the classification
pro

Interoperability - subject classification/terminology

2002-11-22 Thread Pauline Simpson

Dear  All

At the OAI Geneva I undertook to do the following:

2. Investigate OAI and OAF email archives for prior discussion and
synthesize

3. Open the discussion with the intention of constructing a model/s to
address perceived problems. We will need a statement of the problem/s and
suggested solutions (some already articulated on Saturday)

At present we have completed item 2 and are now compiling a table of
all e-Print archives (that we can find!) with an annotation of what subject
classification they 'appear' to be using  :  LOC;  DDC;  In House
Classification (possibly based on LOC or another);  In House
terminology;  Faculty/Dept/Group; None.

We hope to have completed this work by the end of next week and will be
placing it on the web for validation and additions.  This will then form
the basis of discussion on the way forward.  I will email the list again
when we have done this.

If any of you know of the existence of such a table already please let me
know  (and send it to me) so that we do not duplicate effort.

It has taken longer than I thought but I believe this evidence gathering
exercise will be a worthwhile tool in our deliberations concerning
harvesting and interoperability between institutional and discipline based
e-Print archives.

best wishes

Pauline

--
Pauline Simpson,
Head of Information Services, Southampton Oceanography Centre
and
Faculty Liaison Leader for Science, Engineering and Math
and
TARDIS (Univ Southampton e-Prints) Project Manager

University of Southampton Waterfront Campus,  European Way,
Southampton, SO14 3ZH, England

 Tel:  +44-(0)23 8059 6111: Fax  +44-(0)23 8059 6115
 email:  p...@soc.soton.ac.uk ; p...@soton.ac.uk
 Web :  http://www.soc.soton.ac.uk


___
OAI-eprints mailing list
oai-epri...@lists.openlib.org
http://lists.openlib.org/mailman/listinfo/oai-eprints