Inline below is the response from Ms. Ellen Voorhees (person in charge of TREC) concerning my inquiry about gaining access to TREC data. I suggest reading from the bottom and working your way up. I edited out some of the copies of old messages to shorten the length here.

As you can read, there is some opportunity here for us to gain access to TREC data. The bigger opportunity (and the bigger job), I feel, may be the chance to help NIST, going forward, create and distribute collections under an open source license, making them freely available for anyone to use.

My suggestion at this point would be to figure out whether there are ways we as a community could help, and also to consider whether it is worthwhile to purchase one or more collections for use by committers (we could make the data available on zones).

So, what do people think?

Begin forwarded message:

From: Ellen Voorhees <[EMAIL PROTECTED]>
Date: August 24, 2007 2:43:35 PM EDT
To: Grant Ingersoll <[EMAIL PROTECTED]>
Cc: [EMAIL PROTECTED]
Subject: Re: TREC Evaluation and Open Source




So, I think the current scheme could work for Lucene committers/active contributors provided there is a central machine that all have access to. (I admit to pretty much total ignorance in the actual practice of an open source project.) If the cost of getting the documents is too great, Lucene as a project could sign up to participate in TREC and obtain many of the document sets for free.


Hmmm, we do have a machine that only committers have access to. So, if the ASF were to purchase a copy of the data, we could put it on this machine and use it, correct? And individual committers/contributors could be given access to it as long as they sign the individual forms?

Yes, precisely.



This is not a solution for the Lucene community, which is too large and far-flung to count as an "organization" in the spirit of the Data Use forms. For collections that are already created under existing agreements, I see no alternative to community members obtaining the document sets on their own. On the other hand, if for community members you are mostly interested in distributing some data set that will show whether Lucene is installed correctly (i.e., not a test collection that IR research should be done on), the subset of the TREC ad hoc collections containing just the Federal Register documents can be used since those documents have no copyright restrictions and we have some topics and relevance judgments for them. (But the documents are wonky enough and there are few enough topics that I do not consider this a good test collection for research.)

We have demos and data sets for testing installation. Mostly, we are looking for feedback in the traditional TREC spirit, i.e., running experiments, testing relevance algorithms, and so on. Also, testing scalability. Plus, it helps users make direct comparisons when choosing a search system.

Yes, that is what I originally figured you wanted the collections for, and the Federal Register subset is not a viable candidate for that. The standard ad hoc collections are probably not sufficient for testing scalability---they are only 800,000-1,000,000 documents and about 2GB of text. The collections built in the 'terabyte' track used a crawl of .GOV that is about 0.5TB of text (this is one of the document sets distributed by the University of Glasgow). Note that we (NIST TREC staff and terabyte track organizers) have some reservations about the completeness of the relevance judgments for the terabyte collections.


I think it would be fantastic for the community if there were a good document set that was able to be distributed through an open source license only. I'd be happy to use TREC to get topics and relevance judgments for such a document set so there could be a readily-available, basic, ad hoc retrieval test collection. But our (TREC staff) experience to date with trying to find such document sets has been very negative.



Can I have your permission to share our conversation with the larger Lucene development community on the java-[EMAIL PROTECTED] mailing list? If you would like, I can summarize it instead and report back to the group. I can run the summary by you first if you would like to edit it.

Perhaps we can help with the collection task, although I can't promise it. Your staff does a great job already and its members are undoubtedly the experts on it, but there may very well be individuals who are willing to help, under the proper guidance. Also, I wonder if groups like iBiblio, Creative Commons, or Wikipedia might be able to help out. I have met with Paul Jones at iBiblio before, and they have an extensive collection of open source documents. I am just not sure whether they fit the TREC criteria. Are the criteria publicly documented somewhere?

Cheers,
Grant


You may share the conversation with the Lucene community, either summarized or straight.

We do not have a specific list of criteria for document sets. Since a full (TREC) test collection is built by using the document set in a TREC track, the "vetting" process has generally happened through the track proposal process. In general, a track is focused on a particular task, and the document set needs to be a reasonable surrogate for the types of documents that are typical for that task. So, the genomics track has used subsets of the medical literature, the web track used crawls of the web. We also want the document sets to be large enough to be interesting---there is no point putting resources into building a test collection that no one believes is representative of anything real. If the relevance judgments are to be created by NIST assessors, then they have to be "general information" sorts of things since we do not have a body of assessors with subject matter expertise in any one area. So, the genomics track judgments are not made at NIST, and the original TREC ad hoc collections were mostly newswire. We also want to make sure that the document collection will be available for a (relatively) long time. Again, there is no point committing resources to create topics and relevance judgments unless the documents will be available for a significant time. This latter point also implies taking a snapshot of dynamic collections. That is (one of) the reasons TREC has not used the live web or live Wikipedia as a document collection---to have a standard test collection you need a frozen document set.

FYI, the call for track proposals for TREC 2008 is currently open until mid-September; see http://trec.nist.gov/tracks.html. In my comments above about a basic, ad hoc collection, I was basically envisioning a newsy collection, but that's probably just lack of imagination on my part. Of course, there is no requirement for you to go through TREC to create a collection, and there would probably be little point in doing so if the document set (or task) is such that NIST assessors can't do the assessing. There are other evaluation venues (NTCIR, CLEF), or the Lucene community may decide to just build it yourselves. In the latter case, TREC staff can offer our advice on what we've learned about collection building through the years.

A second FYI: if you want to get more of an idea of the considerations that went into building the early TREC collections, chapter 2 of the TREC book (http://mitpress.mit.edu/catalog/item/default.asp?ttype=2&tid=10667&mode=toc) is "The TREC Test Collections", authored by Donna Harman.

Ellen

Begin forwarded message:

From: Ellen Voorhees <[EMAIL PROTECTED]>
Date: August 24, 2007 11:17:17 AM EDT
To: Grant Ingersoll <[EMAIL PROTECTED]>
Cc: [EMAIL PROTECTED]
Subject: Re: TREC Evaluation and Open Source

The way the TREC Data Use licenses currently work, the "organization" that requests the data is the legal entity that owns the machine on which the data is put. (An example form is at http://www.nist.gov/srd/trec_org.htm.) That organization defines who it is that may access the data, with the expectation that access would require a person-specific account on the machine. Each such person is to sign an Individual form. (The intent of the Individual forms is that, officially, a copyright owner of the data may ask for a list of all individuals that have (or have had) access to the data. No one has ever asked for such a list, but the language is in the forms to allow this.) Individuals may access the data remotely, provided they do so through the account on the host machine. Individuals at a remote location may not make a copy of the data for their local machine, because that would be redistributing the documents.

So, I think the current scheme could work for Lucene committers/active contributors provided there is a central machine that all have access to. (I admit to pretty much total ignorance in the actual practice of an open source project.) If the cost of getting the documents is too great, Lucene as a project could sign up to participate in TREC and obtain many of the document sets for free.

This is not a solution for the Lucene community, which is too large and far-flung to count as an "organization" in the spirit of the Data Use forms. For collections that are already created under existing agreements, I see no alternative to community members obtaining the document sets on their own. On the other hand, if for community members you are mostly interested in distributing some data set that will show whether Lucene is installed correctly (i.e., not a test collection that IR research should be done on), the subset of the TREC ad hoc collections containing just the Federal Register documents can be used since those documents have no copyright restrictions and we have some topics and relevance judgments for them. (But the documents are wonky enough and there are few enough topics that I do not consider this a good test collection for research.)

I think it would be fantastic for the community if there were a good document set that was able to be distributed through an open source license only. I'd be happy to use TREC to get topics and relevance judgments for such a document set so there could be a readily-available, basic, ad hoc retrieval test collection. But our (TREC staff) experience to date with trying to find such document sets has been very negative.


Ellen




Grant Ingersoll wrote:
Thank you for the detailed response. By the way, I figured this whole discussion just falls under the category of "it can't hurt to ask." I know the answer may very well be no, and I completely understand why it should be for the reasons you have cited: creating these collections takes a lot of work and requires a lot of storage and bandwidth. So, I hope I am not coming across as being critical of the current state of TREC. I very much value what TREC does; I have participated in it in the past and really enjoyed it (other than the long hours I put in running experiments :-) ). The high quality of TREC is one of the reasons why I wanted to ask in the first place!

I think what I am trying to find out more about is if there is any possibility that the Lucene community (or maybe just the committers or active contributors who are not prohibited from contributing based on where they live) could gain access to these documents. That is, could the collections (or future collections) be licensed under an open source license and hosted somewhere that is publicly available and does not require a fee to be paid to LDC or the like. Perhaps the ASF or iBiblio would do this, or maybe some company would, I don't know, but I am willing to ask the appropriate people. There are plenty of places out there that provide mirrors, etc. for Apache and iBiblio for free such that storage or cost should not be an issue.

I guess some of the difficulty lies in how open source is developed versus how commercial/research systems are developed. We don't have a pool of money that we can use to purchase document collections. Right now, the best we can make publicly available to our users is Wikipedia, which they download and use with some of our tools. It also isn't even clear to me what defines the organization that would be buying the collection if there were money. For instance, if the Apache Software Foundation purchased the document collection, would that mean that anyone at the ASF could use it? The problem is, other than one full-time system admin, the ASF is entirely a volunteer organization (and a rather large, global one at that). So, how do you define how a project like Lucene as a whole can use TREC data if the ASF were to pay the fee? It would be the equivalent of total redistribution to anyone.

I think there are a couple of options that might work:
1. We restrict usage to committers on the project who agree not to redistribute, etc. just like any other researcher/organization
2. We make future collections available under an open source license

Perhaps there may be a way in the future for Lucene members or the ASF to contribute to making the collection. Knowing the Lucene community and ASF, I would bet Lucene people would volunteer. However, I am not in a position to volunteer the ASF or others at this point, but I am in a position to see if others are interested in doing so.

Thanks,
Grant

On Aug 22, 2007, at 3:40 PM, Ellen Voorhees wrote:

I am unclear as to what, precisely, you see as the issues. In particular, I would claim that
TREC is an evaluation for the retrieval community as a whole.

Participation in TREC is open to (almost) anyone*. There is no charge for participation itself, though participants are responsible for the registration fee and travel expenses if they attend the meeting held in November. It is also true that participants must purchase the document sets used in some of the tasks, though the majority of document sets are free for participants.

Individuals who do not participate in TREC can (and do) obtain the TREC test collections. The topics and relevance judgments can be downloaded directly from the appropriate pages in the Data section of the TREC web site. Non-participants must purchase most of the document sets.

We have made a very concerted effort to obtain document sets as free from restrictions as possible. Nonetheless, good (i.e., representative of content people might actually search) documents tend to be the intellectual property of some organization and thus subject to copyright. There are also administrative and distribution costs that must be covered. The majority of the document sets used in TREC are covered by a license that 1) allows the data to be used for research purposes only and 2) prohibits the redistribution of the documents by anyone other than the organization originally granted that right. So, some of the TREC document sets must be obtained from the Linguistic Data Consortium (www.ldc.upenn.edu), some from NIST, some from the University of Glasgow (http://ir.dcs.gla.ac.uk/test_collections/). Since the agreements are already in place with the original sources of documents for the current collections, we cannot change the license agreements after the fact. The least expensive document sets are US $180; the most expensive are 400 pounds.

I am very much interested in knowing what specific obstacles keep you from participating in TREC and any suggestions you may have for eliminating/minimizing those. We are well aware that the fewer restrictions (of any kind) there are on data sets the more use they receive and the more novel uses are made of them. But we are equally aware of the difficulties of obtaining large, representative, appropriate document sets that may be distributed world-wide with no restrictions.

Ellen Voorhees

* The qualification is there because as federal employees NIST staff members are prohibited from corresponding with certain countries. Citizens of those countries are therefore unable to participate in TREC.





Grant Ingersoll wrote:
Dear Ms. Voorhees,

My name is Grant Ingersoll and I am a committer on the Lucene Java search library (http://lucene.apache.org) at the Apache Software Foundation (ASF). I am not, however, writing in any official capacity as a representative of the ASF. Perhaps at a later date this will change, but for now I just want to keep things informal.

I am, however, interested in starting a discussion about how open source projects like Lucene could participate in future TREC evaluations, or at least gain access to TREC data resources. While the people involved in Lucene feel we have built a top-notch search system, one of the things the community as a whole lacks is the ability to do formal evaluations like those TREC offers, and thus research and development of new algorithms is hindered. Granted, individuals may perform TREC evaluations provided they have purchased a license to the data, but the community as a whole does not have this ability.

I am wondering if there is some way in which we can arrange for open source projects to obtain access to the TREC collections. The biggest barrier for projects like Lucene, obviously, is the fee that needs to be paid. Furthermore, there are undoubtedly distribution and copyright concerns. Yet, a part of me feels that we can work something out through creative licensing or some other novel approach that protects the appropriate interests, furthers TREC's mission and supports the vibrant Open Source community around Lucene and other search engines. Perhaps it would be possible to require that any participant who wants the TREC data must prove that they are appropriately affiliated with an official open source project, as defined by the Open Source Initiative (http://www.opensource.org). Many tool vendors have similar licenses that allow open source participants to use their tool while working on open source projects. Perhaps we could provide a similar approach to the TREC data.

I feel this would benefit TREC substantially by providing an open, baseline system for all the world to see, and it fits very much with the motto of TREC: "...to encourage research in information retrieval from large text collections." Naturally, it benefits Lucene by allowing us to undertake more formal evaluation of relevance, etc.

If you are interested in more background on this from the Lucene Java developers mailing list, please refer to http://www.gossamer-threads.com/lists/lucene/java-dev/52022?search_string=TREC;#52022

I look forward to hearing back from you and I would be more than happy to answer any questions you have.

Sincerely,
Grant Ingersoll






---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
