It's tempting to accomplish two goals at once (tutorial & searchable ref
guide), but I think the realities of making a *good* searchable ref guide
would distract from the learning goal if we try to do both well. A
searchable ref guide could very well be its own project that we point
learners at once they move beyond some of the very early basics.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Tue, Sep 1, 2020 at 1:23 PM Alexandre Rafalovitch <arafa...@gmail.com>
wrote:

> That Jeopardy set reads as very dubious: content that was collected by
> scraping and made available on various sharing sites (including Mega!). I
> would not feel comfortable working with that in our context.
>
> There are other dataset sources. I like the ones that the Data is Plural
> newsletter collects: https://tinyletter.com/data-is-plural (full list at:
> https://docs.google.com/spreadsheets/d/1wZhPLMCHKJvwOkP4juclhjFgqIY8fQFMemwKL2c64vk/edit#gid=0
> ). Again, copyright is important, and I think having a local copy is
> important too, at least for tutorial purposes.
>
> But I wish we could figure out a way to include the RefGuide. It is
> just so much more of a triple-bottom-line solution than any other random
> dataset would be. We could build a graph of cross-references in the
> guide, figure out how to extract Java path references, etc.
>
> Anyway, it is not something that is super-urgent. I don't even know
> whether our new build processes can be augmented to do this. I guess
> it is a bit similar to how we run tests.
>
> I just wanted to get a strong yay/nay on the idea. So far it feels
> like I got one strong yay, one caution and one soft nay.
>
> Regards,
>    Alex.
>
>
>
> On Tue, 1 Sep 2020 at 12:28, Jan Høydahl <jan....@cominvent.com> wrote:
> >
> > What about 200,000 Jeopardy questions in JSON format?
> >
> > https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/
> > I downloaded the file in a few seconds, and it also has some structured content, e.g.
> >
> >   {
> >     "category": "NOVELS",
> >     "air_date": "2005-01-27",
> >     "question": "'Even the epilogue is lengthy in this 1869 Tolstoy epic; it comes out in 2 parts &, in our copy, is 105 pages long'",
> >     "value": "$400",
> >     "answer": "War and Peace",
> >     "round": "Jeopardy!",
> >     "show_number": "4699"
> >   },
> >   {
> >     "category": "BRIGHT IDEAS",
> >     "air_date": "2005-01-27",
> >     "question": "'In 1948 scientists at Bristol-Meyers \"buffered\" this medicine for the first time'",
> >     "value": "$400",
> >     "answer": "aspirin",
> >     "round": "Jeopardy!",
> >     "show_number": "4699"
> >   },
> >
> > Lots of docs. Enough free-text to learn some analysis, enough metadata
> > for some meaningful facets / filters…
> >
> > As long as we only provide a URL and do not re-distribute the content,
> > licensing is less of a concern.
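> >
> > If anyone wants to try it quickly, loading the file into Solr could look
> > roughly like the untested SolrJ sketch below (the "jeopardy" core name,
> > the file name, and the assumption that the download is one big JSON
> > array are mine):
> >
> > import java.io.File;
> > import java.util.List;
> > import java.util.Map;
> > import com.fasterxml.jackson.core.type.TypeReference;
> > import com.fasterxml.jackson.databind.ObjectMapper;
> > import org.apache.solr.client.solrj.SolrClient;
> > import org.apache.solr.client.solrj.impl.HttpSolrClient;
> > import org.apache.solr.common.SolrInputDocument;
> >
> > public class JeopardyIndexer {
> >   public static void main(String[] args) throws Exception {
> >     // Assumes the download is a single JSON array of objects shaped
> >     // like the samples above (file name is made up).
> >     List<Map<String, Object>> records = new ObjectMapper().readValue(
> >         new File("jeopardy.json"),
> >         new TypeReference<List<Map<String, Object>>>() {});
> >     try (SolrClient solr = new HttpSolrClient.Builder(
> >         "http://localhost:8983/solr/jeopardy").build()) {
> >       for (Map<String, Object> record : records) {
> >         SolrInputDocument doc = new SolrInputDocument();
> >         record.forEach(doc::addField); // field names match the JSON keys
> >         solr.add(doc);
> >       }
> >       solr.commit();
> >     }
> >   }
> > }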
> >
> > Jan
> >
> > 1. sep. 2020 kl. 15:59 skrev Alexandre Rafalovitch <arafa...@gmail.com>:
> >
> > I've thought of providing instructions. But for good indexing, we
> > should use the adoc format as the source, rather than HTML (as
> > Cassandra's presentation showed), and that means the user would need
> > build dependencies such as the Asciidoctor library. There is also the
> > question of how to get the content: either a git clone, or downloading
> > the whole source, unpacking it, and figuring out the directory
> > locations. It feels messy. Then it may as well be an external package,
> > or even an independent external project, and it would therefore lose
> > its value as shipped tutorial material.
> >
> > We could also discuss actually shipping the Solr Reference Guide with
> > Solr now that the release cycles align, but that would not help my
> > sub-project much, again because of the adoc vs. html formats.
> >
> > In terms of other datasets:
> > *) I could just stay with the limited full-text in the dataset I am
> > thinking of. Its bulk download mode allows for fields such as
> > Occupation, Company, and Vehicle model, which are 2-7 words long.
> > That's about the same length as the current examples we ship. It does
> > not allow for a meaningful discussion of longer-text issues such as
> > length normalization, but we don't have those now anyway.
> > *) I could use a public domain book, from somewhere like
> > https://standardebooks.org/ , and break it into parts. But there is a
> > question about licensing, and also whether we would be able to show
> > interesting effects with that.
> > *) I was also told that there is Wikipedia, but again, would we just
> > include a couple of articles at random? What's the license?
> > *) It is possible to index Stack Overflow questions, either from the
> > feed (DIH was doing that) or as a download. I think the license was
> > compatible.
> > *) I could augment the dataset with some mix of the above, like a
> > "favourite quote" field with random book sentences. This feels like
> > fun, but possibly a whole separate project of its own.
> >
> > Anyway, I am open to further thoughts. It is quite likely I missed
> > something.
> >
> > Regards,
> >   Alex.
> >
> >
> > On Tue, 1 Sep 2020 at 03:10, Jan Høydahl <jan....@cominvent.com> wrote:
> >
> >
> > I’d rather ship a tutorial and tooling that explain how to index the
> > ref-guide than ship a binary index.
> > What other full-text datasets have you considered as candidates for
> > getting-started examples?
> >
> > Jan
> >
> > 1. sep. 2020 kl. 05:53 skrev Alexandre Rafalovitch <arafa...@gmail.com>:
> >
> > I did not say it was trivial, but I also neglected to mention the
> > previous research:
> >
> >
> > https://github.com/arafalov/solr-refguide-indexing/blob/master/src/com/solrstart/refguide/Indexer.java
> >
> > It uses the official AsciidoctorJ library directly. I am not sure if
> > that is just the JRuby version of Asciidoctor that we currently use to
> > build. But this should only affect the development process, not the
> > final built package.
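> >
> > Roughly, loading an .adoc page and walking its text blocks with
> > AsciidoctorJ looks something like the untested sketch below (the
> > recursive walk and the file name are my own simplification, not
> > necessarily what that Indexer does):
> >
> > import java.io.File;
> > import java.util.Collections;
> > import org.asciidoctor.Asciidoctor;
> > import org.asciidoctor.ast.Document;
> > import org.asciidoctor.ast.StructuralNode;
> >
> > public class AdocTextExtractor {
> >   public static void main(String[] args) {
> >     Asciidoctor asciidoctor = Asciidoctor.Factory.create();
> >     Document doc = asciidoctor.loadFile(
> >         new File("about-this-guide.adoc"), Collections.emptyMap());
> >     System.out.println("Title: " + doc.getDoctitle());
> >     printText(doc);
> >   }
> >
> >   // Leaf blocks expose their (converted) content as a String;
> >   // recurse into container blocks otherwise.
> >   private static void printText(StructuralNode node) {
> >     for (StructuralNode child : node.getBlocks()) {
> >       if (child.getBlocks().isEmpty() && child.getContent() != null) {
> >         System.out.println(child.getContent());
> >       } else {
> >         printText(child);
> >       }
> >     }
> >   }
> > }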
> >
> > I think I am mostly trying to figure out what people think about
> > shipping an actual pre-built core with the distribution. That is
> > something I haven't seen done before, and it may have issues I did not
> > think of.
> >
> > Regards,
> >    Alex
> >
> > On Mon., Aug. 31, 2020, 10:11 p.m. Gus Heck, <gus.h...@gmail.com> wrote:
> >
> >
> > Some background to consider before committing to that... it might not
> > be as trivial as you think. (I've often thought it ironic that we don't
> > have real search for our ref guide...)
> >
> > https://www.youtube.com/watch?v=DixlnxAk08s
> >
> > -Gus
> >
> > On Mon, Aug 31, 2020 at 2:06 PM Ishan Chattopadhyaya
> > <ichattopadhy...@gmail.com> wrote:
> >
> >
> > I love the idea of making the ref guide itself an example dataset.
> > That way, we won't need to ship anything separately. Python's Beautiful
> > Soup can extract text from the HTML pages. I'm sure there may be such
> > things in Java too (can Tika do this?).
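> >
> > (It can: Tika's facade class does this kind of extraction in one call.
> > An untested sketch, with a made-up page name:)
> >
> > import java.io.File;
> > import org.apache.tika.Tika;
> >
> > public class HtmlTextExtractor {
> >   public static void main(String[] args) throws Exception {
> >     // Tika auto-detects the file type and returns the extracted plain text.
> >     String text = new Tika().parseToString(new File("searching.html"));
> >     System.out.println(text);
> >   }
> > }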
> >
> > On Mon, 31 Aug, 2020, 11:18 pm Alexandre Rafalovitch
> > <arafa...@gmail.com> wrote:
> >
> >
> > Hi,
> > I need a sanity check.
> >
> > I am in the planning stages for the new example datasets to ship with
> > Solr 9. The one I am looking at is great for structured information,
> > but is quite light on full-text content. So, I am thinking of how
> > important that is and what other sources could be used.
> >
> > One - only slightly - crazy idea is to use the Solr Reference Guide
> > itself as a document source. I am not saying we need to include the
> > guide with the Solr distribution, but:
> > 1) I could include a couple of sample pages
> > 2) I could index the whole guide (with custom Java code) during the
> > final build, and we could ship the full index (with stored=false) with
> > Solr; that then basically becomes a local search for the remote guide
> > (with absolute URLs), as in the sketch below.
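> >
> > To make 2) concrete, querying that shipped core could look something
> > like this hypothetical SolrJ snippet (the "refguide" core name and the
> > url/title/text field names are made up; only url and title would be
> > stored, with the page text indexed-only):
> >
> > import org.apache.solr.client.solrj.SolrQuery;
> > import org.apache.solr.client.solrj.impl.HttpSolrClient;
> > import org.apache.solr.client.solrj.response.QueryResponse;
> > import org.apache.solr.common.SolrDocument;
> >
> > public class RefGuideSearch {
> >   public static void main(String[] args) throws Exception {
> >     try (HttpSolrClient solr = new HttpSolrClient.Builder(
> >         "http://localhost:8983/solr/refguide").build()) {
> >       // The search runs against the indexed-only text field;
> >       // only the stored url and title fields come back.
> >       SolrQuery query = new SolrQuery("text:faceting");
> >       query.setFields("url", "title");
> >       QueryResponse response = solr.query(query);
> >       for (SolrDocument doc : response.getResults()) {
> >         System.out.println(doc.getFieldValue("title")
> >             + " -> " + doc.getFieldValue("url"));
> >       }
> >     }
> >   }
> > }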
> >
> > Either way would also allow us to explore what a good search
> > configuration could look like for the Ref Guide, for when we are
> > actually ready to move beyond its current "headings-only" javascript
> > search. In fact, done right, the same or a similar tool could also feed
> > subheadings into the javascript search.
> >
> > Like I said, sanity check?
> >
> > Regards,
> >   Alex.
> >
> >
> >
> >
> > --
> > http://www.needhamsoftware.com (work)
> > http://www.the111shift.com (play)
> >
> >
> >
> >
> >
> >
>
>
>
