Inline below is the response from Ms. Ellen Voorhees (person in
charge of TREC) concerning my inquiry about gaining access to TREC
data. I suggest reading from the bottom and working your way up. I
edited out some of the copies of old messages to shorten the length
here.
As you can read, there is some opportunity in here for us to gain
access to TREC data. The bigger opportunity (and the bigger job), I feel, may be the chance, going forward, to help NIST create and distribute collections under an open source license and make them freely available to anyone to use.
My suggestion at this point would be to figure out whether there are ways we as a community could help, and also to think about whether it is worthwhile to find a way to purchase one or more collections for use by committers (we could make the data available on zones).
So, what do people think?
Begin forwarded message:
From: Ellen Voorhees <[EMAIL PROTECTED]>
Date: August 24, 2007 2:43:35 PM EDT
To: Grant Ingersoll <[EMAIL PROTECTED]>
Cc: [EMAIL PROTECTED]
Subject: Re: TREC Evaluation and Open Source
So, I think the current scheme could work for Lucene committers/
active contributors provided there is
a central machine that all have access to. (I admit to pretty
much total ignorance in the actual
practice of an open source project.) If the cost of getting the
documents is too great, Lucene as
a project could sign up to participate in TREC and obtain many of
the document sets for free.
Hmmm, we do have a machine that only committers have access to.
So, if the ASF were to purchase a copy of the data, we could put
it on this machine and use it, correct? And individual committers/
contributors could be given access to it as long as they sign the
individual forms?
Yes, precisely.
This is not a solution for the Lucene community, which is too
large and far-flung to count
as an "organization" in the spirit of the Data Use forms. For
collections that are already
created under existing agreements, I see no alternative to
community members obtaining the document
sets on their own. On the other hand, if for community members
you are mostly interested in distributing some
data set that will show whether Lucene is installed correctly
(i.e., not a test collection that IR research should be
done on), the subset of the TREC ad hoc collections containing
just the Federal Register documents
can be used since those documents have no copyright restrictions
and we have some topics and relevance judgments for them. (But the documents are wonky
enough and there are few enough
topics that I do not consider this a good test collection for
research.)
We have demos and data sets for testing installation. Mostly, we
are looking for feedback in the traditional TREC spirit, i.e.
running experiments, testing relevance algorithms, etc. Also,
testing scalability, etc. Plus, it helps users make direct
comparisons when choosing a search system.
Yes, that is what I originally figured you wanted the collections
for, and the Federal Register subset is not
a viable candidate for that. The standard ad hoc collections are
probably not sufficient for testing scalability---
they are only 800,000-1,000,000 documents and about 2GB of text.
The collections built in the 'terabyte'
track used a crawl of .GOV that is about .5TB of text (this is one
of the document sets distributed by
the University of Glasgow). Note that we (NIST TREC staff and
terabyte track organizers)
have some reservations about the completeness of the relevance
judgments for the terabyte
collections.
I think it would be fantastic for the community if there were a
good document set that
could be distributed under an open source license only.
I'd be happy to use TREC to get
topics and relevance judgments for such a document set so there
could be a readily-available,
basic, ad hoc retrieval test collection. But our (TREC staff)
experience to date with trying to find
such document sets has been very negative.
Can I have your permission to share our conversation with the
larger Lucene development community on the java-[EMAIL PROTECTED] mailing list? If you would like, I can
summarize it instead and report back to the group. I can run the
summary by you first if you would like to edit it.
Perhaps we can help with the collection task, although I can't
promise it. Your staff already does a great job and its members are
undoubtedly the experts on it, but there may very well be
individuals who are willing to help, under the proper guidance.
Also, I wonder if groups like iBiblio, Creative Commons or
Wikipedia might be able to help out. I have met with Paul Jones
at iBiblio before and they have an extensive collection of open
source documents. Just not sure if they fit the TREC criteria.
Are the criteria publicly documented somewhere?
Cheers,
Grant
You may share the conversation with the Lucene community, either
summarized or straight.
We do not have a specific list of criteria for document sets.
Since a full (TREC) test collection is built
by using the document set in a TREC track, the "vetting" process
has generally happened through
the track proposal process. In general, a track is focused on a
particular task, and the document
set needs to be a reasonable surrogate for the types of documents
that are typical for that task.
So, the genomics track has used subsets of the medical literature,
the web track used crawls of the web.
We also want the document sets to be large enough to be
interesting---there is no point putting resources
into building a test collection that no one believes is
representative of anything real. If the relevance judgments
are to be created by NIST assessors, then they have to be "general
information" sorts of things since
we do not have a body of assessors with subject matter expertise in
any one area. So, the genomics
track judgments are not made at NIST, and the original TREC ad
hoc collections were
mostly newswire. We also want to make sure that the document
collection will be available
for a (relatively) long time. Again, there is no point committing
resources to create topics
and relevance judgments unless the documents will be available for
a significant time.
This latter point also implies taking a snapshot of dynamic
collections. That is (one of)
the reasons TREC has not used the live web or live Wikipedia as a
document collection---
to have a standard test collection you need a frozen document set.
FYI, the call for track proposals for TREC 2008 is currently open
until mid-September; see
http://trec.nist.gov/tracks.html. In my comments above about a
basic, ad hoc collection,
I was basically envisioning a newsy collection, but that's probably
just lack of
imagination on my part. Of course, there is no requirement for you
to go through
TREC to create a collection, and there would probably be little
point in doing so if the
document set (or task) are such that NIST assessors can't do the
assessing. There are other
evaluation venues (NTCIR, CLEF), or the Lucene community may
decide to just build it
yourselves. In the latter case, TREC staff can offer our advice on
what we've learned about
collection building through the years.
A second FYI, if you want to get more of an idea of the
considerations that went
into building the early TREC collections, chapter 2 of the TREC book
( http://mitpress.mit.edu/catalog/item/default.asp?ttype=2&tid=10667&mode=toc )
is "The TREC Test Collections" authored by Donna Harman.
Ellen
Begin forwarded message:
From: Ellen Voorhees <[EMAIL PROTECTED]>
Date: August 24, 2007 11:17:17 AM EDT
To: Grant Ingersoll <[EMAIL PROTECTED]>
Cc: [EMAIL PROTECTED]
Subject: Re: TREC Evaluation and Open Source
The way the TREC Data Use licenses currently work, the
"organization" that requests the
data is the legal entity that owns the machine on which the data is
put. (An example form is at http://www.nist.gov/srd/trec_org.htm .) That organization defines who
it is that may access the data, with the expectation that access
would require a person-specific
account on the machine. Each such person is to sign an Individual
form. (The
intent of the Individual forms is that, officially, a copyright
owner of the data may
ask for a list of all individuals that have (or have had) access to
the data. No one has ever
asked for such a list, but the language is in the forms to allow
this.) Individuals may access the data
remotely, provided they do so through the account on the host
machine. Individuals at
a remote location may not make a copy of the data for their local
machine, because that would
be redistributing the documents.
So, I think the current scheme could work for Lucene committers/
active contributors provided there is
a central machine that all have access to. (I admit to pretty much
total ignorance in the actual
practice of an open source project.) If the cost of getting the
documents is too great, Lucene as
a project could sign up to participate in TREC and obtain many of
the document sets for free.
This is not a solution for the Lucene community, which is too large
and far-flung to count
as an "organization" in the spirit of the Data Use forms. For
collections that are already
created under existing agreements, I see no alternative to
community members obtaining the document
sets on their own. On the other hand, if for community members you
are mostly interested in distributing some
data set that will show whether Lucene is installed correctly
(i.e., not a test collection that IR research should be
done on), the subset of the TREC ad hoc collections containing just
the Federal Register documents
can be used since those documents have no copyright restrictions
and we have some topics and relevance judgments for them. (But the documents are wonky
enough and there are few enough
topics that I do not consider this a good test collection for
research.)
I think it would be fantastic for the community if there were a
good document set that
could be distributed under an open source license only.
I'd be happy to use TREC to get
topics and relevance judgments for such a document set so there
could be a readily-available,
basic, ad hoc retrieval test collection. But our (TREC staff)
experience to date with trying to find
such document sets has been very negative.
Ellen
Grant Ingersoll wrote:
Thank you for the detailed response. By the way, this whole
discussion, I figured, just falls under the category of: it can't
hurt to ask. I know the answer may very well be no and I
completely understand why it should be for the reasons you have
cited: creating these collections takes a lot of work and requires
a lot of storage and bandwidth. So, I hope I am not coming across
as being critical of the current state of TREC. I very much value
what TREC does. I have participated in it in the past and really
enjoyed it (other than the long hours I put in running
experiments :-) ). The high quality of TREC is one of the reasons
why I wanted to ask in the first place!
I think what I am trying to find out more about is if there is any
possibility that the Lucene community (or maybe just the
committers or active contributors who are not prohibited from
contributing based on where they live) could gain access to these
documents. That is, could the collections (or future collections)
be licensed under an open source license and hosted somewhere that
is publicly available and does not require a fee to be paid to LDC
or the like. Perhaps the ASF or iBiblio would do this, or maybe
some company would, I don't know, but I am willing to ask the
appropriate people. There are plenty of places out there that
provide mirrors, etc. for Apache and iBiblio for free such that
storage or cost should not be an issue.
I guess some of the difficulty lies in how open source is
developed versus how commercial/research systems are developed.
We don't have a pool of money that we can use to purchase document
collections. Right now, the best we can make publicly available
to our users is Wikipedia, which they download and use with some
of our tools. It also isn't even clear to me what defines the
organization that would be buying the collection if there were
money. For instance, if the Apache Soft. Found. purchased the
document collection, would that mean that anyone at the ASF could
use it? The problem is, other than one full-time system admin,
all of the ASF is a volunteer organization (and a rather large,
global one at that.) So, how do you define how a project like
Lucene as a whole can use TREC if the ASF were to pay the fee? It
would be the equivalent of total redistribution to anyone.
I think there are a couple of options that might work:
1. We restrict usage to committers on the project who agree not to
redistribute, etc. just like any other researcher/organization
2. We make future collections available under an open source license
Perhaps there may be a way in the future for Lucene members or the
ASF to contribute to making the collection. Knowing the Lucene
community and ASF, I would bet Lucene people would volunteer.
However, I am not in a position to volunteer the ASF or others at
this point, but I am in a position to see if others are interested
in doing so.
Thanks,
Grant
On Aug 22, 2007, at 3:40 PM, Ellen Voorhees wrote:
I am unclear as to what, precisely, you see as the issues. In
particular, I would claim that
TREC is an evaluation for the retrieval community as a whole.
Participation in TREC is open to (almost) anyone*. There is no
charge for participation itself,
though participants are responsible for the registration fee and
travel expenses
if they attend the meeting held in November. It is also true
that participants must purchase
the document sets used in some of the tasks, though the majority
of document sets are free for participants.
Individuals who do not participate in TREC can (and do) obtain
the TREC test collections.
The topics and relevance judgments can be downloaded directly
from the appropriate
pages in the Data section of the TREC web site. Non-participants
must purchase most of the
document sets.
We have made a very concerted effort to obtain document sets as
free from restrictions
as possible. Nonetheless, good (i.e., representative of content
people might actually search)
documents tend to be the intellectual property of some
organization and thus subject to copyright.
There are also administrative and distribution costs that must be
covered. The majority of the
document sets used in TREC are covered by a license that 1)
allows the data to be used
for research purposes only and 2) prohibits the redistribution of
the documents by anyone
other than the organization originally granted that right. So,
some of the TREC document
sets must be obtained from the Linguistic Data Consortium
(www.ldc.upenn.edu), some from NIST,
some from the University of Glasgow (http://ir.dcs.gla.ac.uk/test_collections/). Since the agreements
are already in place with the original sources of documents for
the current collections, we cannot
change the license agreements after the fact. The least
expensive document sets are US $180;
the most expensive are 400 pounds.
I am very much interested in knowing what specific obstacles keep
you from participating
in TREC and any suggestions you may have for eliminating/
minimizing those. We are well aware that
the fewer restrictions (of any kind) there are on data sets the
more use they receive and the
more novel uses are made of them. But we are equally aware of
the difficulties of obtaining
large, representative, appropriate document sets that may be
distributed world-wide with
no restrictions.
Ellen Voorhees
* The qualification is there because as federal employees NIST
staff members are prohibited from corresponding
with certain countries. Citizens of those countries are
therefore unable to participate in TREC.
Begin forwarded message:
From: Ellen Voorhees <[EMAIL PROTECTED]>
Date: August 22, 2007 3:40:15 PM EDT
To: Grant Ingersoll <[EMAIL PROTECTED]>
Cc: [EMAIL PROTECTED]
Subject: Re: TREC Evaluation and Open Source
I am unclear as to what, precisely, you see as the issues. In
particular, I would claim that
TREC is an evaluation for the retrieval community as a whole.
Participation in TREC is open to (almost) anyone*. There is no
charge for participation itself,
though participants are responsible for the registration fee and
travel expenses
if they attend the meeting held in November. It is also true that
participants must purchase
the document sets used in some of the tasks, though the majority of
document sets are free for participants.
Individuals who do not participate in TREC can (and do) obtain the
TREC test collections.
The topics and relevance judgments can be downloaded directly from
the appropriate
pages in the Data section of the TREC web site. Non-participants
must purchase most of the
document sets.
We have made a very concerted effort to obtain document sets as
free from restrictions
as possible. Nonetheless, good (i.e., representative of content
people might actually search)
documents tend to be the intellectual property of some organization
and thus subject to copyright.
There are also administrative and distribution costs that must be
covered. The majority of the
document sets used in TREC are covered by a license that 1) allows
the data to be used
for research purposes only and 2) prohibits the redistribution of
the documents by anyone
other than the organization originally granted that right. So,
some of the TREC document
sets must be obtained from the Linguistic Data Consortium
(www.ldc.upenn.edu), some from NIST,
some from the University of Glasgow (http://ir.dcs.gla.ac.uk/test_collections/). Since the agreements
are already in place with the original sources of documents for the
current collections, we cannot
change the license agreements after the fact. The least expensive
document sets are US $180;
the most expensive are 400 pounds.
I am very much interested in knowing what specific obstacles keep
you from participating
in TREC and any suggestions you may have for eliminating/minimizing
those. We are well aware that
the fewer restrictions (of any kind) there are on data sets the
more use they receive and the
more novel uses are made of them. But we are equally aware of the
difficulties of obtaining
large, representative, appropriate document sets that may be
distributed world-wide with
no restrictions.
Ellen Voorhees
* The qualification is there because as federal employees NIST
staff members are prohibited from corresponding
with certain countries. Citizens of those countries are therefore
unable to participate in TREC.
Grant Ingersoll wrote:
Dear Ms. Voorhees,
My name is Grant Ingersoll and I am a committer on the Lucene Java
search library (http://lucene.apache.org) at the Apache Software
Foundation (ASF). I am not, however, writing in any official
capacity as a representative of the ASF. Perhaps at a later date,
this will change, but for now I just want to keep things informal.
I am, however, interested in starting a discussion about how open
source projects like Lucene could participate in future TREC
evaluations, or at least gain access to TREC data resources.
While the people involved in Lucene feel we have built a top notch
search system, one of the things the community as a whole lacks is
the ability to do formal evaluations like TREC offers, and thus
research and development of new algorithms is hindered. Granted,
individuals may perform TREC evaluations given they have purchased
a license to the data, but the community as a whole does not have
this ability.
I am wondering if there is some way in which we can arrange for
open source projects to obtain access to the TREC collections.
The biggest barrier for projects like Lucene, obviously, is the
fee that needs to be paid. Furthermore, there are undoubtedly
distribution and copyright concerns. Yet, a part of me feels that
we can work something out through creative licensing or some other
novel approach that protects the appropriate interests, furthers
TREC's mission and supports the vibrant Open Source community
around Lucene and other search engines. Perhaps it would be
possible to require that any participant who wants the TREC data
must prove that they are appropriately affiliated with an official
open source project, as defined by the Open Source Initiative
(http://www.opensource.org). Many tool vendors have similar
licenses that allow open source participants to use their tool
while working on open source projects. Perhaps we could provide a
similar approach to the TREC data.
I feel this would benefit TREC substantially, by providing an
open, baseline system for all the world to see and I see that it
fits very much with the motto of TREC "...to encourage research
in information retrieval from large text collections."
Naturally, it benefits Lucene by allowing Lucene to undertake more
formal evaluation of relevance, etc.
If you are interested in more background on this on the Lucene
Java developers mailing list, please refer to
http://www.gossamer-threads.com/lists/lucene/java-dev/52022?search_string=TREC;#52022
I look forward to hearing back from you and I would be more than
happy to answer any questions you have.
Sincerely,
Grant Ingersoll