Danielle Navarro created ARROW-17887:
----------------------------------------

             Summary: [R] [Doc] Improve readability of the "get started" page
                 Key: ARROW-17887
                 URL: https://issues.apache.org/jira/browse/ARROW-17887
             Project: Apache Arrow
          Issue Type: Improvement
          Components: R
            Reporter: Danielle Navarro
            Assignee: Danielle Navarro


In their current form the pkgdown Get Started and Read Me pages are a little hard
for new users to follow. I would argue that both pages are written in a way 
that makes sense to someone who is already familiar with core Arrow concepts, 
but is potentially intimidating to an R user who is curious about Arrow but has 
never used it. The issue is perhaps most severe on the main [README 
page](https://arrow.apache.org/docs/r/index.html) and the [Get 
Started](https://arrow.apache.org/docs/r/articles/arrow.html) page. A few 
examples:

- The README page opens with the sentence **"Apache Arrow is a cross-language 
development platform for in-memory data".** This is a problem for multiple 
reasons. Firstly, it's not really true anymore, because we encourage users to
rely on `Dataset` for on-disk datasets. Secondly, the sentence simply *assumes* 
the user has a clear mental model of the difference between in-memory and 
on-disk data. I don't think that's true for data scientists in general. A data 
engineer likely has a more precise mental model here, but R users are typically 
focused on analytics. Unless they have extensive experience working with large
data sets, this isn't something we can assume. Thirdly, and maybe most
importantly, it doesn't explain to the user why they should care about arrow: 
it doesn't say what the arrow package *does*. It's too vague.

- There are (IMO) too many boldfaced sections in the README page, and it's very 
cluttered. It gives the page an intensity and feeling of "denseness" that I 
think we should avoid at all costs. Arrow already has a reputation for being a 
complicated project (because it is!) but we don't want our documentation to 
have that feeling. I think we ought to be aiming for something gentler and
more welcoming. If that means pushing more details into vignettes, that's
totally okay. Readers don't need to be told everything on the very first page:
it's probably better to give a simpler description there.

- The "get started" page has some of the same problems as the main README. The 
"object hierarchy" and "data object" tables only make sense once you already 
understand core Arrow concepts. What needs to happen in both cases is that the
tables are wrapped with some explanatory text that provides the missing
context for users, and additional details are then pushed out to vignettes
that explain them more fully.

- The data types mapping section on the get started page has the same issue. A 
novice user doesn't necessarily even have a clear understanding of how 
fundamental types are represented in R, much less how they are represented in 
Arrow. A section that simply assumes that these types are meaningful concepts 
and gives a lookup table with various footnotes isn't at all helpful to that 
kind of user. I think it makes more sense to again split the work: on the "get 
started" page we should have something simple, and a longer discussion of these 
mappings should be pushed to a vignette.

The concrete proposal here is to restructure the content of these two pages to 
be more novice-friendly: specifically, to add more "Arrow 101" explanatory 
notes to these pages, and to move more of the technical information to new 
vignettes (e.g., there should be a new "data types" vignette).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
