[
https://issues.apache.org/jira/browse/ARROW-17887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated ARROW-17887:
-----------------------------------
Labels: pull-request-available (was: )
> [R] [Doc] Improve readability of the Get Started and README pages
> -----------------------------------------------------------------
>
> Key: ARROW-17887
> URL: https://issues.apache.org/jira/browse/ARROW-17887
> Project: Apache Arrow
> Issue Type: Improvement
> Components: R
> Reporter: Danielle Navarro
> Assignee: Danielle Navarro
> Priority: Major
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> In their current form, the pkgdown Get Started and README pages are a little
> hard for new users to follow. I would argue that both pages are written in a
> way that makes sense to someone who is already familiar with core Arrow
> concepts, but is potentially intimidating to an R user who is curious about
> Arrow but has never used it. The issue is perhaps most severe on the main
> [README page](https://arrow.apache.org/docs/r/index.html) and the [Get
> Started](https://arrow.apache.org/docs/r/articles/arrow.html) page. A few
> examples:
> - The README page opens with the sentence **"Apache Arrow is a cross-language
> development platform for in-memory data".** This is a problem for multiple
> reasons. Firstly, it's not really true anymore, because we encourage users to
> rely on `Dataset` for on-disk datasets. Secondly, the sentence simply
> *assumes* the user has a clear mental model of the difference between
> in-memory and on-disk data. I don't think that's true for data scientists in
> general. A data engineer likely has a more precise mental model here, but R
> users are typically focused on analytics. Unless they have extensive
> experience working with large data sets, this isn't something we can assume.
> Thirdly, and maybe most importantly, it doesn't explain to the user why they
> should care about arrow: it doesn't say what the arrow package *does*. It's
> too vague.
> - There are (IMO) too many boldfaced sections in the README page, and it's
> very cluttered. It gives the page an intensity and feeling of "denseness"
> that I think we should avoid at all costs. Arrow already has a reputation for
> being a complicated project (because it is!) but we don't want our
> documentation to have that feeling. I think we ought to be aiming for
> something gentler and welcoming. If that means pushing more details into
> vignettes, that's totally okay. Readers don't need to be told all the things
> on the very first page: it's probably better to give a simpler description
> and then push the details onto additional vignettes.
> - The "get started" page has some of the same problems as the main README.
> The "object hierarchy" and "data object" tables only make sense once you
> already understand core Arrow concepts. In both cases, the tables need to be
> wrapped in explanatory text that provides the missing context for users, with
> additional details pushed out to vignettes that cover them in more depth.
> - The data types mapping section on the get started page has the same issue.
> A novice user doesn't necessarily even have a clear understanding of how
> fundamental types are represented in R, much less how they are represented in
> Arrow. A section that simply assumes that these types are meaningful concepts
> and gives a lookup table with various footnotes isn't at all helpful to that
> kind of user. I think it makes more sense to again split the work: on the
> "get started" page we should have something simple, and a longer discussion
> of these mappings should be pushed to a vignette.
> The concrete proposal here is to restructure the content of these two pages
> to be more novice-friendly: specifically, to add more "Arrow 101" explanatory
> notes to these pages, and to move more of the technical information to new
> vignettes (e.g., there should be a new "data types" vignette).