Inline below is the response from Ms. Ellen Voorhees (person in
charge of TREC) concerning my inquiry about gaining access to TREC
data. I suggest reading from the bottom and working your way up. I
edited out some of the copies of old messages to shorten the length
here.
As you can read, there is some opportunity in here for us to gain
access to TREC data. The bigger opportunity (and the bigger job), I feel, may be the chance, going forward, to help NIST create and distribute collections under an open source license and make them freely available to anyone to use.
My suggestion at this point would be to figure out whether there are ways we as a community could help, and also to think about whether it is worthwhile to find a way to purchase one or more collections for use by committers (we could make the data available on zones).
So, what do people think?
Begin forwarded message:
From: Ellen Voorhees <[EMAIL PROTECTED]>
Date: August 24, 2007 2:43:35 PM EDT
To: Grant Ingersoll <[EMAIL PROTECTED]>
Cc: [EMAIL PROTECTED]
Subject: Re: TREC Evaluation and Open Source
So, I think the current scheme could work for Lucene committers/
active contributors provided there is
a central machine that all have access to. (I admit to pretty
much total ignorance in the actual
practice of an open source project.) If the cost of getting the
documents is too great, Lucene as
a project could sign up to participate in TREC and obtain many of
the document sets for free.
Hmmm, we do have a machine that only committers have access to.
So, if the ASF were to purchase a copy of the data, we could put
it on this machine and use it, correct? And individual committers/
contributors could be given access to it as long as they sign the
individual forms?
Yes, precisely.
This is not a solution for the Lucene community, which is too
large and far-flung to count
as an "organization" in the spirit of the Data Use forms. For
collections that are already
created under existing agreements, I see no alternative to
community members obtaining the document
sets on their own. On the other hand, if for community members
you are mostly interested in distributing some
data set that will show whether Lucene is installed correctly
(i.e., not a test collection that IR research should be
done on), the subset of the TREC ad hoc collections containing
just the Federal Register documents
can be used since those documents have no copyright restrictions
and we have some topics and relevance judgments for them. (But the documents are wonky
enough and there are few enough
topics that I do not consider this a good test collection for
research.)
We have demos and data sets for testing installation. Mostly, we
are looking for feedback in the traditional TREC spirit, i.e.
running experiments, testing relevance algorithms, etc. Also,
testing scalability, etc. Plus, it helps users make direct
comparisons when choosing a search system.
Yes, that is what I originally figured you wanted the collections
for, and the Federal Register subset is not
a viable candidate for that. The standard ad hoc collections are
probably not sufficient for testing scalability---
they are only 800,000-1,000,000 documents and about 2GB of text.
The collections built in the 'terabyte'
track used a crawl of .GOV that is about .5TB of text (this is one
of the document sets distributed by
the University of Glasgow). Note that we (NIST TREC staff and
terabyte track organizers)
have some reservations about the completeness of the relevance
judgments for the terabyte
collections.
I think it would be fantastic for the community if there were a
good document set that
could be distributed under an open source license only.
I'd be happy to use TREC to get
topics and relevance judgments for such a document set so there
could be a readily-available,
basic, ad hoc retrieval test collection. But our (TREC staff)
experience to date with trying to find
such document sets has been very negative.
Can I have your permission to share our conversation with the
larger Lucene development community on the java-[EMAIL PROTECTED] mailing list? If you would like, I can
summarize it instead and report back to the group. I can run the
summary by you first if you would like to edit it.
Perhaps we can help with the collection task, although I can't
promise it. Your staff already does a great job and its members are
undoubtedly the experts on it, but there may very well be
individuals who are willing to help, under the proper guidance.
Also, I wonder if groups like iBiblio, Creative Commons or
Wikipedia might be able to help out. I have met with Paul Jones
at iBiblio before and they have an extensive collection of open
source documents. Just not sure if they fit the TREC criteria.
Are the criteria publicly documented somewhere?
Cheers,
Grant
You may share the conversation with the Lucene community, either
summarized or straight.
We do not have a specific list of criteria for document sets.
Since a full (TREC) test collection is built
by using the document set in a TREC track, the "vetting" process
has generally happened through
the track proposal process. In general, a track is focused on a
particular task, and the document
set needs to be a reasonable surrogate for the types of documents
that are typical for that task.
So, the genomics track has used subsets of the medical literature,
the web track used crawls of the web.
We also want the document sets to be large enough to be
interesting---there is no point putting resources
into building a test collection that no one believes is
representative of anything real. If the relevance judgments
are to be created by NIST assessors, then they have to be "general
information" sorts of things since
we do not have a body of assessors with subject matter expertise in
any one area. So, the genomics
track judgments are not made at NIST, and the original TREC ad
hoc collections were
mostly newswire. We also want to make sure that the document
collection will be available
for a (relatively) long time. Again, there is no point committing
resources to create topics
and relevance judgments unless the documents will be available for
a significant time.
This latter point also implies taking a snapshot of dynamic
collections. That is (one of)
the reasons TREC has not used the live web or live Wikipedia as a
document collection---
to have a standard test collection you need a frozen document set.
FYI, the call for track proposals for TREC 2008 is currently open
until mid-September; see
http://trec.nist.gov/tracks.html. In my comments above about a
basic, ad hoc collection,
I was basically envisioning a newsy collection, but that's probably
just lack of
imagination on my part. Of course, there is no requirement for you
to go through
TREC to create a collection, and there would probably be little
point in doing so if the
document set (or task) are such that NIST assessors can't do the
assessing. There are other
evaluation venues (NTCIR, CLEF), or the Lucene community may
decide to just build it
yourselves. In the latter case, TREC staff can offer our advice on
what we've learned about
collection building through the years.
A second FYI, if you want to get more of an idea of the
considerations that went
into building the early TREC collections, chapter 2 of the TREC book
( http://mitpress.mit.edu/catalog/item/default.asp?ttype=2&tid=10667&mode=toc )
is "The TREC Test Collections" authored by Donna Harman.
Ellen
Begin forwarded message:
From: Ellen Voorhees <[EMAIL PROTECTED]>
Date: August 24, 2007 11:17:17 AM EDT
To: Grant Ingersoll <[EMAIL PROTECTED]>
Cc: [EMAIL PROTECTED]
Subject: Re: TREC Evaluation and Open Source
The way the TREC Data Use licenses currently work, the
"organization" that requests the
data is the legal entity that owns the machine on which the data is
put. (An example form is at http://www.nist.gov/srd/trec_org.htm .) That organization defines who
it is that may access the data, with the expectation that access
would require a person-specific
account on the machine. Each such person is to sign an Individual
form. (The
intent of the Individual forms is that, officially, a copyright
owner of the data may
ask for a list of all individuals that have (or have had) access to
the data. No one has ever
asked for such a list, but the language is in the forms to allow
this.) Individuals may access the data
remotely, provided they do so through the account on the host
machine. Individuals at
a remote location may not make a copy of the data for their local
machine, because that would
be redistributing the documents.
So, I think the current scheme could work for Lucene committers/
active contributors provided there is
a central machine that all have access to. (I admit to pretty much
total ignorance in the actual
practice of an open source project.) If the cost of getting the
documents is too great, Lucene as
a project could sign up to participate in TREC and obtain many of
the document sets for free.
This is not a solution for the Lucene community, which is too large
and far-flung to count
as an "organization" in the spirit of the Data Use forms. For
collections that are already
created under existing agreements, I see no alternative to
community members obtaining the document
sets on their own. On the other hand, if for community members you
are mostly interested in distributing some
data set that will show whether Lucene is installed correctly
(i.e., not a test collection that IR research should be
done on), the subset of the TREC ad hoc collections containing just
the Federal Register documents
can be used since those documents have no copyright restrictions
and we have some topics and relevance judgments for them. (But the documents are wonky
enough and there are few enough
topics that I do not consider this a good test collection for
research.)
I think it would be fantastic for the community if there were a
good document set that
could be distributed under an open source license only.
I'd be happy to use TREC to get
topics and relevance judgments for such a document set so there
could be a readily-available,
basic, ad hoc retrieval test collection. But our (TREC staff)
experience to date with trying to find
such document sets has been very negative.
Ellen
Grant Ingersoll wrote:
Thank you for the detailed response. By the way, this whole
discussion, I figured, just falls under the category of: it can't
hurt to ask. I know the answer may very well be no and I
completely understand why it should be for the reasons you have
cited: creating these collections takes a lot of work and requires
a lot of storage and bandwidth. So, I hope I am not coming across
as being critical of the current state of TREC. I very much value
what TREC does. I have participated in it in the past and really
enjoyed it (other than the long hours I put in running
experiments :-) ). The high quality of TREC is one of the reasons
why I wanted to ask in the first place!
I think what I am trying to find out more about is if there is any
possibility that the Lucene community (or maybe just the
committers or active contributors who are not prohibited from
contributing based on where they live) could gain access to these
documents. That is, could the collections (or future collections)
be licensed under an open source license and hosted somewhere that
is publicly available and does not require a fee to be paid to LDC
or the like. Perhaps the ASF or iBiblio would do this, or maybe
some company would, I don't know, but I am willing to ask the
appropriate people. There are plenty of places out there that
provide mirrors, etc. for Apache and iBiblio for free such that
storage or cost should not be an issue.
I guess some of the difficulty lies in how open source is
developed versus how commercial/research systems are developed.
We don't have a pool of money that we can use to purchase document
collections. Right now, the best we can make publicly available
to our users is Wikipedia, which they download and use with some
of our tools. It also isn't even clear to me what defines the
organization that would be buying the collection if there were
money. For instance, if the Apache Soft. Found. purchased the
document collection, would that mean that anyone at the ASF could
use it? The problem is, other than one full-time system admin,
all of the ASF is a volunteer organization (and a rather large,
global one at that.) So, how do you define how a project like
Lucene as a whole can use TREC if the ASF were to pay the fee? It
would be the equivalent of total redistribution to anyone.
I think there are a couple of options that might work:
1. We restrict usage to committers on the project who agree not to
redistribute, etc. just like any other researcher/organization
2. We make future collections available under an open source license
Perhaps there may be a way in the future for Lucene members or the
ASF to contribute to making the collection. Knowing the Lucene
community and ASF, I would bet Lucene people would volunteer.
However, I am not in a position to volunteer the ASF or others at
this point, but I am in a position to see if others are interested
in doing so.
Thanks,
Grant
On Aug 22, 2007, at 3:40 PM, Ellen Voorhees wrote:
I am unclear as to what, precisely, you see as the issues. In
particular, I would claim that
TREC is an evaluation for the retrieval community as a whole.
Participation in TREC is open to (almost) anyone*. There is no
charge for participation itself,
though participants are responsible for the registration fee and
travel expenses
if they attend the meeting held in November. It is also true
that participants must purchase
the document sets used in some of the tasks, though the majority
of document sets are free for participants.
Individuals who do not participate in TREC can (and do) obtain
the TREC test collections.
The topics and relevance judgments can be downloaded directly
from the appropriate
pages in the Data section of the TREC web site. Non-participants
must purchase most of the
document sets.
We have made a very concerted effort to obtain document sets as
free from restrictions
as possible. Nonetheless, good (i.e., representative of content
people might actually search)
documents tend to be the intellectual property of some
organization and thus subject to copyright.
There are also administrative and distribution costs that must be
covered. The majority of the
document sets used in TREC are covered by a license that 1)
allows the data to be used
for research purposes only and 2) prohibits the redistribution of
the documents by anyone
other than the organization originally granted that right. So,
some of the TREC document
sets must be obtained from the Linguistic Data Consortium
(www.ldc.upenn.edu), some from NIST,
some from the University of Glasgow (http://ir.dcs.gla.ac.uk/test_collections/). Since the agreements
are already in place with the original sources of documents for
the current collections, we cannot
change the license agreements after the fact. The least
expensive document sets are US $180;
the most expensive are 400 pounds.
I am very much interested in knowing what specific obstacles keep
you from participating
in TREC and any suggestions you may have for eliminating/
minimizing those. We are well aware that
the fewer restrictions (of any kind) there are on data sets the
more use they receive and the
more novel uses are made of them. But we are equally aware of
the difficulties of obtaining
large, representative, appropriate document sets that may be
distributed world-wide with
no restrictions.
Ellen Voorhees
* The qualification is there because as federal employees NIST
staff members are prohibited from corresponding
with certain countries. Citizens of those countries are
therefore unable to participate in TREC.
Begin forwarded message:
From: Ellen Voorhees <[EMAIL PROTECTED]>
Date: August 22, 2007 3:40:15 PM EDT
To: Grant Ingersoll <[EMAIL PROTECTED]>
Cc: [EMAIL PROTECTED]
Subject: Re: TREC Evaluation and Open Source
I am unclear as to what, precisely, you see as the issues. In
particular, I would claim that
TREC is an evaluation for the retrieval community as a whole.
Participation in TREC is open to (almost) anyone*. There is no
charge for participation itself,
though participants are responsible for the registration fee and
travel expenses
if they attend the meeting held in November. It is also true that
participants must purchase
the document sets used in some of the tasks, though the majority of
document sets are free for participants.
Individuals who do not participate in TREC can (and do) obtain the
TREC test collections.
The topics and relevance judgments can be downloaded directly from
the appropriate
pages in the Data section of the TREC web site. Non-participants
must purchase most of the
document sets.
We have made a very concerted effort to obtain document sets as
free from restrictions
as possible. Nonetheless, good (i.e., representative of content
people might actually search)
documents tend to be the intellectual property of some organization
and thus subject to copyright.
There are also administrative and distribution costs that must be
covered. The majority of the
document sets used in TREC are covered by a license that 1) allows
the data to be used
for research purposes only and 2) prohibits the redistribution of
the documents by anyone
other than the organization originally granted that right. So,
some of the TREC document
sets must be obtained from the Linguistic Data Consortium
(www.ldc.upenn.edu), some from NIST,
some from the University of Glasgow (http://ir.dcs.gla.ac.uk/test_collections/). Since the agreements
are already in place with the original sources of documents for the
current collections, we cannot
change the license agreements after the fact. The least expensive
document sets are US $180;
the most expensive are 400 pounds.
I am very much interested in knowing what specific obstacles keep
you from participating
in TREC and any suggestions you may have for eliminating/minimizing
those. We are well aware that
the fewer restrictions (of any kind) there are on data sets the
more use they receive and the
more novel uses are made of them. But we are equally aware of the
difficulties of obtaining
large, representative, appropriate document sets that may be
distributed world-wide with
no restrictions.
Ellen Voorhees
* The qualification is there because as federal employees NIST
staff members are prohibited from corresponding
with certain countries. Citizens of those countries are therefore
unable to participate in TREC.
Grant Ingersoll wrote:
Dear Ms. Voorhees,
My name is Grant Ingersoll and I am a committer on the Lucene Java
search library (http://lucene.apache.org) at the Apache Software
Foundation (ASF). I am not, however, writing in any official
capacity as a representative of the ASF. Perhaps at a later date,
this will change, but for now I just want to keep things informal.
I am, however, interested in starting a discussion about how open
source projects like Lucene could participate in future TREC
evaluations, or at least gain access to TREC data resources.
While the people involved in Lucene feel we have built a top notch
search system, one of the things the community as a whole lacks is
the ability to do formal evaluations like TREC offers, and thus
research and development of new algorithms is hindered. Granted,
individuals may perform TREC evaluations given they have purchased
a license to the data, but the community as a whole does not have
this ability.
I am wondering if there is some way in which we can arrange for
open source projects to obtain access to the TREC collections.
The biggest barrier for projects like Lucene, obviously, is the
fee that needs to be paid. Furthermore, there are undoubtedly
distribution and copyright concerns. Yet, a part of me feels that
we can work something out through creative licensing or some other
novel approach that protects the appropriate interests, furthers
TREC's mission and supports the vibrant Open Source community
around Lucene and other search engines. Perhaps it would be
possible to require that any participant who wants the TREC data
must prove that they are appropriately affiliated with an official
open source project, as defined by the Open Source Initiative
(http://www.opensource.org). Many tool vendors have similar
licenses that allow open source participants to use their tool
while working on open source projects. Perhaps we could provide a
similar approach to the TREC data.
I feel this would benefit TREC substantially, by providing an
open, baseline system for all the world to see and I see that it
fits very much with the motto of TREC "...to encourage research
in information retrieval from large text collections."
Naturally, it benefits Lucene by allowing Lucene to undertake more
formal evaluation of relevance, etc.
If you are interested in more background on this on the Lucene
Java developers mailing list, please refer to
http://www.gossamer-threads.com/lists/lucene/java-dev/52022?search_string=TREC;#52022
I look forward to hearing back from you and I would be more than
happy to answer any questions you have.
Sincerely,
Grant Ingersoll