What about 200,000 Jeopardy questions in JSON format?
https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/
I downloaded the file in a few seconds, and it also has some structured 
content, e.g.

  {
    "category": "NOVELS",
    "air_date": "2005-01-27",
    "question": "'Even the epilogue is lengthy in this 1869 Tolstoy epic; it 
comes out in 2 parts &, in our copy, is 105 pages long'",
    "value": "$400",
    "answer": "War and Peace",
    "round": "Jeopardy!",
    "show_number": "4699"
  },
  {
    "category": "BRIGHT IDEAS",
    "air_date": "2005-01-27",
    "question": "'In 1948 scientists at Bristol-Meyers \"buffered\" this 
medicine for the first time'",
    "value": "$400",
    "answer": "aspirin",
    "round": "Jeopardy!",
    "show_number": "4699"
  },

Lots of docs. Enough free-text to learn some analysis, enough metadata for some 
meaningful facets / filters…
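As a minimal Python sketch of what loading that dump could look like, mapping one record to a flat document before posting it to Solr (the `value_i` field name and the quote-stripping are my assumptions about a sensible schema, not anything we ship):

```python
import re

def to_solr_doc(raw):
    """Map one record of the Jeopardy dump to a flat Solr document.
    The full file would be read with json.load(); here one inline record.
    The value_i field name is an assumption, not a shipped schema."""
    doc = dict(raw)
    # "value" arrives as "$400" (or None for Final Jeopardy); index it as an int
    m = re.match(r"\$([\d,]+)", raw.get("value") or "")
    if m:
        doc["value_i"] = int(m.group(1).replace(",", ""))
    # questions in the dump are wrapped in single quotes; strip them
    doc["question"] = raw["question"].strip("'")
    return doc

sample = {
    "category": "NOVELS",
    "air_date": "2005-01-27",
    "question": "'Even the epilogue is lengthy in this 1869 Tolstoy epic'",
    "value": "$400",
    "answer": "War and Peace",
    "round": "Jeopardy!",
    "show_number": "4699",
}
print(to_solr_doc(sample)["value_i"])  # → 400
```

A numeric value field would give us range facets out of the box, while question/answer cover the full-text side.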

As long as we only provide a URL and not re-distribute the content, licensing 
is less of a concern.

Jan

> On 1 Sep 2020, at 15:59, Alexandre Rafalovitch <[email protected]> wrote:
> 
> I've thought of providing instructions. But for good indexing, we
> should use the adoc format as the source rather than HTML (as Cassandra's
> presentation showed), which means the user needs build dependencies to
> get the Asciidoctor library. There is also the question of how to get
> the content: either git clone, or download the whole source, unpack it,
> and figure out the directory locations. It feels messy. Then it may as
> well be an external package, or even an independent external project,
> and it would therefore lose value as shipped tutorial material.
> 
> We could also discuss actually shipping the Solr Reference Guide with
> Solr now that the release cycles align, but that would actually not
> help my sub-project too much, again because of adoc vs. html formats.
> 
> In terms of other datasets:
> *) I could just stay with limited full-text in the one I am thinking
> of. The bulk download mode allows for fields such as Occupation,
> Company and Vehicle model which are 2-7 words long. That's about the
> same length as current examples we ship. It does not allow for a
> meaningful discussion about longer-text issues such as
> length-normalization, but we don't have those now anyway.
> *) I could use a public domain book and break it into parts. From
> somewhere like https://standardebooks.org/ . But there is a question
> about licensing and also whether we will be able to show interesting
> effects with that.
> *) I was also told that there is Wikipedia, but again, would we just
> include a couple of articles at random? What's the license?
> *) It is possible to index Stack Overflow questions, either from the
> feed (DIH was doing that) or as a download. I think the license was
> compatible.
> *) I could augment the dataset with some mix of the above, like a
> "favourite quote" field with random book sentences. This feels like
> fun, but possibly a whole separate project of its own.
> 
> Anyway, I am open to further thoughts. It is quite likely I missed something.
> 
> Regards,
>   Alex.
> 
> 
> On Tue, 1 Sep 2020 at 03:10, Jan Høydahl <[email protected]> wrote:
>> 
>> I’d rather ship a tutorial and tooling that explains how to index the 
>> ref-guide, than shipping a binary index.
>> What other full-text datasets have you considered as candidates for 
>> getting-started examples?
>> 
>> Jan
>> 
>> On 1 Sep 2020, at 05:53, Alexandre Rafalovitch <[email protected]> wrote:
>> 
>> I did not say it was trivial, but I also did not quite mention the previous 
>> research.
>> 
>> https://github.com/arafalov/solr-refguide-indexing/blob/master/src/com/solrstart/refguide/Indexer.java
>> 
>> It uses the official AsciidoctorJ library directly. Not sure if that's just the 
>> JRuby version of Asciidoctor we currently use to build. But this should only 
>> affect the development process, not the final built package.
>> 
>> I think I am more trying to figure out what people think about shipping an 
>> actual core with the distribution. That is something I haven't seen done 
>> before. And may have issues I did not think of.
>> 
>> Regards,
>>    Alex
>> 
>> On Mon., Aug. 31, 2020, 10:11 p.m. Gus Heck, <[email protected]> wrote:
>>> 
>>> Some background to consider before committing to that... it might not be as 
>>> trivial as you think. (I've often thought it ironic that we don't have real 
>>> search for our ref guide... )
>>> 
>>> https://www.youtube.com/watch?v=DixlnxAk08s
>>> 
>>> -Gus
>>> 
>>> On Mon, Aug 31, 2020 at 2:06 PM Ishan Chattopadhyaya 
>>> <[email protected]> wrote:
>>>> 
>>>> I love the idea of making the ref guide itself an example dataset. That 
>>>> way, we won't need to ship anything separately. Python's Beautiful Soup 
>>>> can extract text from the HTML pages. I'm sure there may be such things in 
>>>> Java too (can Tika do this?).
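[As a rough sketch of the HTML-extraction idea above: Python's standard-library parser alone is enough for a first pass, no Beautiful Soup or Tika required. The sample markup here is made up, not a real ref guide page.]

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text from an HTML page, skipping script/style."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self._skip -= 1

    def handle_data(self, data):
        # keep only non-blank text outside script/style blocks
        if not self._skip and data.strip():
            self.parts.append(data.strip())

p = TextExtractor()
p.feed("<html><head><script>x=1;</script></head>"
       "<body><h1>Faceting</h1><p>Faceting breaks results into "
       "categories.</p></body></html>")
print(" ".join(p.parts))  # → Faceting Faceting breaks results into categories.
```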
>>>> 
>>>> On Mon, 31 Aug, 2020, 11:18 pm Alexandre Rafalovitch, <[email protected]> 
>>>> wrote:
>>>>> 
>>>>> Hi,
>>>>> I need a sanity check.
>>>>> 
>>>>> I am in the planning stages for the new example datasets to ship with
>>>>> Solr 9. The one I am looking at is great for structured information,
>>>>> but is quite light on full-text content. So, I am thinking of how
>>>>> important that is and what other sources could be used.
>>>>> 
>>>>> One - only slightly - crazy idea is to use Solr Reference Guide itself
>>>>> as a document source. I am not saying we need to include the guide
>>>>> with Solr distribution, but:
>>>>> 1) I could include a couple of sample pages
>>>>> 2) I could index the whole guide (with custom Java-code) during the
>>>>> final build and we could ship the full index (with stored=false) with
>>>>> Solr, which then basically becomes a local search for the remote guide
>>>>> (with absolute URLs).
>>>>> 
>>>>> Either way would allow us to also explore what a good search
>>>>> configuration could look like for the Ref Guide for when we are
>>>>> actually ready to move beyond its current "headings-only" javascript
>>>>> search. Actually, done right, same/similar tool could also feed
>>>>> subheadings into the javascript search.
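[A sketch of what option 2) above could look like as a managed-schema fragment; the field names and types here are illustrative, not the shipped Solr schema.]

```xml
<!-- illustrative only: index the guide's body but do not store it,
     so the shipped index stays small and results link to the remote guide -->
<field name="url"   type="string"       indexed="true" stored="true" required="true"/>
<field name="title" type="text_general" indexed="true" stored="true"/>
<field name="body"  type="text_general" indexed="true" stored="false"/>
<uniqueKey>url</uniqueKey>
```

With stored="false" on body, the text stays searchable but is not redistributed; a result would surface only the title and the absolute URL into the online guide.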
>>>>> 
>>>>> Like I said, sanity check?
>>>>> 
>>>>> Regards,
>>>>>   Alex.
>>>>> 
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: [email protected]
>>>>> For additional commands, e-mail: [email protected]
>>>>> 
>>> 
>>> 
>>> --
>>> http://www.needhamsoftware.com (work)
>>> http://www.the111shift.com (play)
>> 
>> 
> 
> 
