Re: Can we use TREC data set in open source?

2013-09-20 Thread Han Jiang
 I read here http://lemurproject.org/clueweb09/ that there is a hosted
 version of ClueWeb09 (the latest is ClueWeb12, for which I can't find a
 hosted version). To get access to it, someone from the ASF will need to
 sign an Organizational Agreement with them, and each individual in the
 project will need to sign an Individual Agreement (retained by the ASF).
 Perhaps this can be available only to committers.

This is nice! I'll try to ask the ASF about this.

 To this day, I think the only way it will happen is for the community
 to build a completely open system, perhaps based off of Common Crawl or
 our own crawl and host it ourselves and develop judgments, etc.

Yeah, this is what we need in ORP.

 Most people like the idea, but are not sure how to distribute it in an
 open way (ClueWeb comes as 4 1TB disks right now) and I am also not sure
 how they would handle any copyright/redaction claims against it.  There
 is, of course, little incentive for those involved to solve these, either,
 as most people who are interested sign the form and pay the $600 for the
 disks.

Sigh, yes, it is hard to make a data set totally public. Actually, one of
my purposes in raising this question is to see whether it is acceptable
for our community (i.e. Lucene/Solr only) to obtain a data set that is not
open to all people. When expanded to a larger scope, the license issue
gets somewhat hairy...


And since Shai has found a possible 'free' data set, I think it is possible
for the ASF to obtain an Organizational Agreement for this. I'll try to
contact the ASF and CMU about how they define a person with the authority
to sign on behalf of an open-source project.



Re: Can we use TREC data set in open source?

2013-09-16 Thread Grant Ingersoll
Inline below

On Sep 9, 2013, at 10:53 PM, Han Jiang jiangha...@gmail.com wrote:

 I'm quite curious, has anyone explored getting an open-source license for
 one of those data sets? And is our community still interested in this
 issue after all these years?
 

It continues to be of interest to me.  I've had various conversations 
throughout the years on it.  Most people like the idea, but are not sure how to 
distribute it in an open way (ClueWeb comes as 4 1TB disks right now) and I am 
also not sure how they would handle any copyright/redaction claims against it.  
There is, of course, little incentive for those involved to solve these, 
either, as most people who are interested sign the form and pay the $600 for 
the disks.  I've had a number of conversations about how I view this as a 
significant barrier to open research, especially in under-served countries, 
and to open source.  People sympathize with me, but then move on.
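
(As an aside on the mechanics: the ClueWeb collections, like Common Crawl,
ship as gzipped WARC files, so consuming the data once you have it is the
easy part. Below is a rough sketch of walking the records in one file and
printing each target URI, assuming well-formed records with CRLF-delimited
headers; a production pipeline should use a proper WARC reader library.)

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

public class WarcPeek {
  public static void main(String[] args) throws IOException {
    // args[0] = path to a .warc.gz file (placeholder)
    InputStream in = new BufferedInputStream(
        new GZIPInputStream(new FileInputStream(args[0])));
    String line;
    while ((line = readLine(in)) != null) {
      if (!line.startsWith("WARC/")) continue;   // start of a record
      long length = -1;
      String uri = null;
      // header block ends at the first empty line
      while ((line = readLine(in)) != null && line.length() > 0) {
        if (line.startsWith("Content-Length:")) {
          length = Long.parseLong(line.substring(15).trim());
        } else if (line.startsWith("WARC-Target-URI:")) {
          uri = line.substring(16).trim();
        }
      }
      if (uri != null) System.out.println(uri);
      for (long skipped = 0; skipped < length; ) { // skip the payload
        long n = in.skip(length - skipped);
        if (n <= 0) break;
        skipped += n;
      }
    }
    in.close();
  }

  // Reads one CRLF-terminated header line; returns null at end of stream.
  private static String readLine(InputStream in) throws IOException {
    StringBuilder sb = new StringBuilder();
    int b;
    while ((b = in.read()) != -1 && b != '\n') {
      if (b != '\r') sb.append((char) b);
    }
    return (b == -1 && sb.length() == 0) ? null : sb.toString();
  }
}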

To this day, I think the only way it will happen is for the community to 
build a completely open system, perhaps based off of Common Crawl or our own 
crawl, and host it ourselves and develop judgments, etc.  We tried to get 
this off the ground with the Open Relevance Project, but there was never a 
sustainable effort, and thus I have little hope for it at this point (but I 
would love to be proven wrong).  For it to succeed, I think we would need 
the backing of a university with students interested in curating such a 
collection, the judgments, etc.  I think we could figure out how to 
distribute the data either as an AWS public data set or possibly via the 
ASF or similar (although I am pretty sure the ASF would balk at multi-TB 
sized downloads).
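
On the judgments piece: once a collection and its qrels exist, anyone can
reproduce an evaluation with trec_eval or a few lines of code. Here is a
minimal sketch, assuming the standard TREC qrels format ("qid iter docid
rel") and a trec_eval-style run file ("qid Q0 docid rank score tag"); the
file paths are placeholders taken from the command line.

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class PrecisionAt10 {
  public static void main(String[] args) throws Exception {
    // args[0] = qrels file, args[1] = run file
    Map<String, Set<String>> relevant = new HashMap<String, Set<String>>();
    BufferedReader qrels = new BufferedReader(new FileReader(args[0]));
    String line;
    while ((line = qrels.readLine()) != null) {
      String[] f = line.split("\\s+");          // qid iter docid rel
      if (Integer.parseInt(f[3]) > 0) {         // judged relevant
        Set<String> docs = relevant.get(f[0]);
        if (docs == null) relevant.put(f[0], docs = new HashSet<String>());
        docs.add(f[2]);
      }
    }
    qrels.close();

    Map<String, List<String>> top10 = new HashMap<String, List<String>>();
    BufferedReader run = new BufferedReader(new FileReader(args[1]));
    while ((line = run.readLine()) != null) {
      String[] f = line.split("\\s+");          // qid Q0 docid rank score tag
      List<String> docs = top10.get(f[0]);
      if (docs == null) top10.put(f[0], docs = new ArrayList<String>());
      if (docs.size() < 10) docs.add(f[2]);     // assumes lines in rank order
    }
    run.close();

    double sum = 0;
    for (Map.Entry<String, List<String>> e : top10.entrySet()) {
      Set<String> rel = relevant.get(e.getKey());
      int hits = 0;
      for (String d : e.getValue()) {
        if (rel != null && rel.contains(d)) hits++;
      }
      sum += hits / 10.0;
    }
    System.out.printf("mean P@10 over %d topics: %.4f%n",
        top10.size(), sum / top10.size());
  }
}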

Happy to hear other ideas.


Grant Ingersoll | @gsingers
http://www.lucidworks.com







Can we use TREC data set in open source?

2013-09-09 Thread Han Jiang
Back in 2007, Grant contacted NIST about making the TREC collections
available to our community:

http://mail-archives.apache.org/mod_mbox/lucene-dev/200708.mbox/browser

I think another attempt at this is really important to our project and to
people who use Lucene. For years, performance has mainly been tuned for
speed on Wikipedia, but that corpus is not a 'standard' benchmark:

* it doesn't represent how real-world search works;
* it cannot be used to evaluate the relevance of our scoring models;
* researchers tend to run their experiments on other data sets, so it is
  usually hard to know whether Lucene is performing at its best.

And personally I agree with this line:

 I think it would encourage Lucene users/developers to think about
 relevance as much as we think about speed.

There's been much work to make Lucene's scoring models pluggable in 4.0,
and it would be great to explore this further. It is very appealing to see
a high-performance library work alongside state-of-the-art ranking
methods.
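
To make "pluggable" concrete, here is a minimal sketch against the 4.x
API; the index path is a placeholder, and note that whichever Similarity
you search with should also be set on IndexWriterConfig at index time so
that norms are encoded consistently.

import java.io.File;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.similarities.BM25Similarity;
import org.apache.lucene.store.FSDirectory;

public class SwapSimilarity {
  public static void main(String[] args) throws Exception {
    // Open an existing index; the path is a placeholder.
    DirectoryReader reader =
        DirectoryReader.open(FSDirectory.open(new File(args[0])));
    IndexSearcher searcher = new IndexSearcher(reader);
    // The 4.x default is TF-IDF (DefaultSimilarity); swapping in BM25,
    // DFR, or a language model is one call:
    searcher.setSimilarity(new BM25Similarity());
    // ... run queries as usual: only ranking changes, not matching.
    reader.close();
  }
}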


As for the TREC data sets, the problems we face are:

1. NIST/TREC does not own the original collections, so it may be necessary
   to contact the organizations that do, such as:

   http://ir.dcs.gla.ac.uk/test_collections/access_to_data.html
   http://lemurproject.org/clueweb12/

2. Currently, there is no open-source license for any of the data sets, so
   they won't be as 'open' as Wikipedia is.

   As Grant proposed, one possibility is to make the data sets accessible
   only to committers rather than to all users. That is not very open
   source, but the TREC data sets are public and usually available to
   researchers, so people could still reproduce performance tests.

I'm quite curious, has anyone explored getting an open-source license for
one of those data sets? And is our community still interested in this
issue after all these years?



-- 
Han Jiang

Team of Search Engine and Web Mining,
School of Electronic Engineering and Computer Science,
Peking University, China


Re: Can we use TREC data set in open source?

2013-09-09 Thread Shai Erera
I read here http://lemurproject.org/clueweb09/ that there is a hosted
version of ClueWeb09 (the latest is ClueWeb12, for which I can't find a
hosted version). To get access to it, someone from the ASF will need to
sign an Organizational Agreement with them, and each individual in the
project will need to sign an Individual Agreement (retained by the ASF).
Perhaps this can be available only to committers.

That said, we would need access to ClueWeb12 if we want to publish Lucene
results on the latest data set; TREC papers are already based on that
version.

But if we just want to measure performance, relevance, etc., ClueWeb09
could be a good start.

Shai
