Github user felixcheung commented on the issue:
https://github.com/apache/spark/pull/17161
Firstly, I see this as slightly different from Python: in R it is
common to have built-in datasets, and users are likely accustomed to having
them and to seeing examples that use them.
And as of now, many of our examples are not meant to be runnable, and
they are clearly marked as such.
I have done a pass on the changes in this PR and I'm happy with the change
from a non-existent JSON file to `mtcars`. I'm slightly concerned about the few
cases of artificial 3-row data (like
[here](https://github.com/apache/spark/pull/17161/files#diff-508641a8bd6c6b59f3e77c80cdcfa6a9R2483))
- more on small datasets below.
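For reference, a minimal sketch of what a `mtcars`-based example might look
like (the columns passed to `select` are just illustrative):

```r
library(SparkR)
sparkR.session()

# mtcars ships with base R, so the example has no external file dependency
df <- createDataFrame(mtcars)
head(select(df, "mpg", "cyl"))
```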
That said, I wonder about the verbosity of expanding examples like this,
as was raised in the Python discussions; and since we have more than 300 pages
of API doc, changing them all is not a simple task.
But I do agree that not having broken or incorrect examples is very
important.
My concerns are:
- how much work and change it would be to update all examples (this PR
covers only 1 .R file out of the 20-something we have, spanning 300+ methods
in total, which is on the high side for R packages)
- how much churn it will be to keep them up-to-date when the API changes
(eg. `sparkR.session()`); especially since, to keep examples self-contained,
we tend to add extra calls to manipulate data, thereby increasing the number
of API calls referenced
- perhaps more importantly, how practical or useful would it be to use
built-in datasets or native R data.frames (`mtcars`, `cars`, `Titanic`, `iris`,
or made-up ones; all of them tiny) on a scalable data platform like Spark?
Perhaps it is better for examples to demonstrate how to work with external
data sources, multiple file formats, etc. (see the sketch after this list)?
- and lastly, we still have about a dozen methods without examples that
are being flagged by CRAN checks (though not enough to fail them yet)
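To illustrate the external-data point above, a hedged sketch of what a more
Spark-like example could demonstrate instead (assuming `SPARK_HOME` points at
a distribution that ships the example files):

```r
library(SparkR)
sparkR.session()

# Read an external JSON file instead of converting a tiny local data.frame
path <- file.path(Sys.getenv("SPARK_HOME"),
                  "examples/src/main/resources/people.json")
df <- read.df(path, source = "json")
printSchema(df)

# The same call shape extends to other formats, eg.:
# read.df("/path/to/data.parquet", source = "parquet")
```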
A couple of *random* thoughts (I would be interested to see how they look
first!):
- group smaller functions into a single page sharing a longer, more
concrete example (need to check whether this messes up parameter documentation
or makes it more confusing, and how it might affect method help
discoverability, eg. with `?predict`) (btw, this is the approach we take for
ML methods; see the roxygen sketch after this list)
- reference external example files
- have examples using datasets that come with Spark (like [this
one](https://github.com/apache/spark/blob/master/examples/src/main/resources/people.json))
- have examples in templates and reuse them
- keep the existing page breakdown, but instead of scattering examples
across each page, link (via `@seealso`) to a dedicated group of pages with
longer, more concrete examples (eg. a column-manipulation set)
- make examples run (ie. remove `\dontrun`); this, of course, would require
making sure examples are self-contained and correct (this is a bigger effort;
it could extend build time and/or make builds fail more often, as examples
would then run as part of the CRAN check) (?!)
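For the grouping and `\dontrun` ideas above, a rough roxygen2 sketch — the
page name `column_string_functions` and the functions grouped onto it are
illustrative, not a proposal for specific methods:

```r
#' String functions for SparkDataFrame columns
#'
#' @name column_string_functions
#' @rdname column_string_functions
#' @examples
#' # One shared, concrete example serves several related methods; without
#' # \dontrun{} it would also execute as part of the CRAN check.
#' df <- createDataFrame(data.frame(name = c("Alice", "Bob")))
#' head(select(df, upper(df$name), lower(df$name)))
NULL

# Each grouped generic (upper, lower, trim, ...) would then carry only
#   #' @rdname column_string_functions
# above its setMethod() so all of them land on the shared help page.
```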
I suspect we would likely need a combination or subset of these techniques.
To me, the high-level priorities would be, in order: i) example
correctness; ii) example coverage - we should have some example for every
method; iii) better, richer, self-contained examples in strategic places.
Thoughts?