Danielle Navarro created ARROW-17887:
----------------------------------------
Summary: [R] [Doc] Improve readability of the "get started" page
Key: ARROW-17887
URL: https://issues.apache.org/jira/browse/ARROW-17887
Project: Apache Arrow
Issue Type: Improvement
Components: R
Reporter: Danielle Navarro
Assignee: Danielle Navarro
In its current form the pkgdown Get Started and Read Me pages are a little hard
for new users to follow. I would argue that both pages are written in a way
that makes sense to someone who is already familiar with core Arrow concepts,
but is potentially intimidating to an R user who is curious about Arrow but has
never used it. The issue is perhaps most severe on the main [README
page](https://arrow.apache.org/docs/r/index.html) and the [Get
Started](https://arrow.apache.org/docs/r/articles/arrow.html) page. A few
examples:
- The README page opens with the sentence **"Apache Arrow is a cross-language
development platform for in-memory data".** This is a problem for multiple
reasons. Firstly, it's not really true anymore, because we encourage users to
rely on `Dataset` for on-disk datasets. Secondly, the sentence simply *assumes*
the user has a clear mental model of the difference between in-memory and
on-disk data. I don't think that's true for data scientists in general. A data
engineer likely has a more precise mental model here, but R users are typically
focused on analytics. Unless they have extensive experience working with large
data sets, this isn't something we can assume. Thirdly, and maybe most
importantly, it doesn't explain to the user why they should care about arrow:
it doesn't say what the arrow package *does*. It's too vague.
- There are (IMO) too many boldfaced sections in the README page, and it's very
cluttered. It gives the page an intensity and feeling of "denseness" that I
think we should avoid at all costs. Arrow already has a reputation for being a
complicated project (because it is!) but we don't want our documentation to
have that feeling. I think we ought to be aiming for something gentler and
welcoming. If that means pushing more details into vignettes, that's totally
okay: readers don't need to be told everything on the very first page, and a
simpler description is probably the better starting point.
- The "get started" page has some of the same problems as the main README. The
"object hierarchy" and "data object" tables only make sense once you already
understand core Arrow concepts. In both cases, the tables need to be wrapped
in explanatory text that provides the missing context for users, with
additional details pushed out to vignettes that explain them more fully.
- The data types mapping section on the get started page has the same issue. A
novice user doesn't necessarily even have a clear understanding of how
fundamental types are represented in R, much less how they are represented in
Arrow. A section that simply assumes that these types are meaningful concepts
and gives a lookup table with various footnotes isn't at all helpful to that
kind of user. I think it makes more sense to again split the work: on the "get
started" page we should have something simple, and a longer discussion of these
mappings should be pushed to a vignette.
The concrete proposal here is to restructure the content of these two pages to
be more novice-friendly: specifically, to add more "Arrow 101" explanatory
notes to these pages, and to move more of the technical information to new
vignettes (e.g., there should be a new "data types" vignette).