I think the biggest hurdle we have in front of us is curating a data set that we can redistribute. I'm in the process of uploading all the ASF public mail archives as of Sept. 13 to Amazon S3. I also have some tools (thanks to Chris Rhodes) for processing this into Solr XML. I think this would give us a standard corpus to start with, and it would mimic some enterprise search/eDiscovery tasks fairly well.
At any rate, as with any community, the proof is in people stepping up to help out. I like that so many people suggested we keep going. As for what to do, I think the options are pretty wide open and there is opportunity for people to define the project w/o any previous encumbrances.

Some ideas that have been kicked around in the past:

1. Creative-commons data set, judgments, queries
2. Open Street Map (spatial search)
3. Mail archives
4. A crowd sourcing application. Given a set of documents and queries, have people provide judgments. Ideally, this runs in a web container and we could probably even find resources to host it here. Combining that with one of the items above, we would be on our way. The app could also solicit queries by providing users an open search box and opportunities to browse the data.

I know much of this is simplistic, but it is a start.

-Grant

On Sep 13, 2010, at 9:04 PM, Dan Cardin wrote:

> Hello,
>
> I am new to ORP. I would like to contribute to the project. I do not have a
> lot of experience in this field of IR, crowd sourcing or AI. If someone
> could take the lead and set forward path I would be willing to contribute my
> skill set to ORP.
>
> How can I help? I have a lot of experience doing software development and
> system administration.
>
> Cheers,
> --Dan
>
> On Mon, Sep 13, 2010 at 1:36 PM, Omar Alonso <[email protected]> wrote:
>
>> I think ORP is a great candidate for crowdsourcing/human computation. In
>> the last year or so there's been quite a bit of research and applications on
>> this. See the page for the SIGIR workshop on using crowdsourcing for IR
>> evaluation:
>> http://www.ischool.utexas.edu/~cse2010/
>>
>> Omar
>>
>> --- On Mon, 9/13/10, Itamar Syn-Hershko <[email protected]> wrote:
>>
>>> From: Itamar Syn-Hershko <[email protected]>
>>> Subject: Re: Whither ORP?
>>> To: [email protected]
>>> Date: Monday, September 13, 2010, 9:33 AM
>>> With the proper two-way open-source
>>> development process (taking and then giving) I think it can
>>> become an important part of open-IR technologies, just like
>>> what Lucene did to the search engines world. What ORP has to
>>> offer is of great interest to HebMorph, an open-source
>>> project of mine trying to decide on what is the best way to
>>> index and search Hebrew texts.
>>>
>>> To this end I decided to put some of the development
>>> efforts of the HebMorph project into making tools for the
>>> ORP. I have announced this before, but unfortunately I had
>>> to attend to more pressing tasks before I could complete
>>> this (and there was no response from the community
>>> anyway...). Just in case you're interested in seeing what I
>>> came up with so far: http://github.com/synhershko/Orev.
>>>
>>> IMHO, the ORP should stand by itself, and relate to
>>> Lucene/Solr only as its basis framework for these initial
>>> stages. Perhaps also try to attract more people who could
>>> find an interest in what it has to offer, so it can really
>>> start growing.
>>>
>>> Itamar.
>>>
>>> On 12/9/2010 1:29 PM, Grant Ingersoll wrote:
>>>> On Sep 11, 2010, at 8:51 PM, Robert Muir wrote:
>>>>
>>>>> i propose we take what we have and import into lucene-java's benchmark
>>>>> contrib. it already has integration with wikipedia and reuters for perf
>>>>> purposes, and the quality package is actually there anyways. later, maybe
>>>>> more people have time and contrib/benchmark evolves naturally... e.g. to
>>>>> modules/benchmark with solr support as a first big step.
>>>>>
>>>> Yeah, that seems reasonable. I have been thinking lately that it might be
>>>> useful to pull our DocMaker stuff out separately from benchmark so that
>>>> people have easy ways of generating content from things like Wikipedia, etc.
>>>>
>>>> Still, at the end of the day, I like what ORP _could_ bring to the table
>>>> and to some extent I think that is lost by folding it into Lucene benchmark.
>>>>
>>>>> On Sep 11, 2010 7:33 PM, "Grant Ingersoll"<[email protected]> wrote:
>>>>>
>>>>>> Seems ORP isn't really catching on with people. I know personally I don't
>>>>>> have the time I had hoped to have to get it going. At the same time, I
>>>>>> really think it could be a good project. We've got some tools put together,
>>>>>> but we still haven't done much about the bigger goal of a "self contained"
>>>>>> evaluation.
>>>>>>
>>>>>> Any thoughts on how we should proceed with ORP?
>>>>>>
>>>>>> -Grant
>>>>>>
>>>>
>>

--------------------------
Grant Ingersoll
http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8
