Re: [R] Help request: Parsing docx files for key words and appending to a spreadsheet

Andy Thu, 04 Jan 2024 05:01:57 -0800

Hi folks

Thanks for your help and suggestions - very much appreciated.

I now have some working code, using this file, which I uploaded for public access: https://docs.google.com/document/d/1QwuaWZk6tYlWQXJ3WLczxC8Cda6zVERk/edit?usp=sharing&ouid=103065135255080058813&rtpof=true&sd=true


The small code segment that now works is as follows:

###########

# Load libraries
library(textreadr)
library(tcltk)
library(tidyverse)
#library(officer)
#library(stringr) #for splitting and trimming raw data
#library(tidyr) #for converting to wide format

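# I'd like to keep this as it enables more control over the selected directories.
# Note: setwd() returns the *previous* working directory, so store the chosen
# path first and then change to it.

filepath <- tk_choose.dir()
setwd(filepath)

# The following correctly lists the names of all 9 files in my test directory

files <- list.files(filepath, pattern = "\\.docx$")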
files
length(files)

# Ideally, I'd like to skip this step by being able to automatically read in the name of each file, but one step at a time:

filename <- "Now they want us to charge our electric cars from litter bins.docx"

# This produces the file content as output when run, and identifies the fields that I want to extract.

read_docx(filename) %>%
  str_split(",") %>%
  unlist() %>%
  str_trim()

###########
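
For picking up the file names automatically instead of typing each one, here is a minimal sketch in base R. list_docx() and title_from_path() are hypothetical helper names, and the second one assumes that the file name without its extension is the article title, as it is for the example file above.

```r
# Hypothetical helper: full paths of all .docx files in a directory.
# The pattern is a regular expression, so the dot is escaped and anchored.
list_docx <- function(dir) {
  list.files(dir, pattern = "\\.docx$", full.names = TRUE, ignore.case = TRUE)
}

# Assumption: the file name (minus ".docx") is the article title.
title_from_path <- function(path) {
  tools::file_path_sans_ext(basename(path))
}

# Usage (not run): loop over the files instead of hard-coding one name.
# for (f in list_docx(filepath)) {
#   txt <- read_docx(f)            # textreadr::read_docx, as used above
#   message(title_from_path(f))
# }
```

Using full.names = TRUE means each path can be passed straight to read_docx() without relying on the working directory.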

What I'd like to try and accomplish next is to extract the data from selected fields and append to a spreadsheet (Calc or Excel) under specific columns, or if it is easier to write a CSV which I can then use later.

The fields I want to extract are illustrated with reference to the above file, viz.:


The title: "Now they want us to charge our electric cars from litter bins"
The name of the newspaper: "Mail on Sunday (London)"

The publication date: "September 24, 2023" (in date format, preferably separated into month and year (day is not important))

The section: "NEWS"
The page number(s): "16" (as numeric)
The length: "515" (as numeric)
The author: "Anna Mikhailova"

The subject: taken from the Subject section, but only where a term meets a threshold, e.g. GREENWASHING >= 50% (here the value is 51%, so this file would be included). On a match, the code moves on to select the highest value under the "Industry" section (here it is ELECTRIC MOBILITY (91%)) and appends this text and % value. If there is no match with 'Greenwashing', it appends 'Null' and moves on to the next file in the directory.
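
The Subject/Industry rule described above could be sketched as follows in base R. This is only a sketch under assumptions: it assumes the text contains lines of the form "Subject: TERM (NN%); TERM (NN%)" and "Industry: ..." on a single line each, which may not match the real file layout. get_field(), parse_terms() and top_industry_if_greenwashing() are hypothetical names.

```r
# Hypothetical helper: text after "Label:" on the first matching line, else NA.
get_field <- function(lines, label) {
  pat <- paste0("^", label, ":[[:space:]]*")
  hit <- grep(pat, lines, value = TRUE)
  if (length(hit) == 0) return(NA_character_)
  trimws(sub(pat, "", hit[1]))
}

# Split "TERM A (51%); TERM B (91%)" into a data frame of term and percent.
parse_terms <- function(x) {
  parts <- trimws(strsplit(x, ";", fixed = TRUE)[[1]])
  parts <- parts[grepl("\\([0-9]+%\\)$", parts)]
  data.frame(
    term = trimws(sub("[[:space:]]*\\([0-9]+%\\)$", "", parts)),
    pct  = as.numeric(sub("^.*\\(([0-9]+)%\\)$", "\\1", parts)),
    stringsAsFactors = FALSE
  )
}

# If GREENWASHING is at or above the threshold, return the top Industry
# term with its percentage; otherwise return NA.
top_industry_if_greenwashing <- function(lines, threshold = 50) {
  subj <- parse_terms(get_field(lines, "Subject"))
  gw <- subj$pct[toupper(subj$term) == "GREENWASHING"]
  if (length(gw) == 0 || max(gw) < threshold) return(NA_character_)
  ind <- parse_terms(get_field(lines, "Industry"))
  if (nrow(ind) == 0) return(NA_character_)
  best <- ind[which.max(ind$pct), ]
  sprintf("%s (%d%%)", best$term, as.integer(best$pct))
}

# Made-up input in the assumed layout; only the two percentages quoted
# above come from the example file.
example_lines <- c(
  "Subject: GREENWASHING (51%)",
  "Industry: ELECTRIC MOBILITY (91%)"
)
top_industry_if_greenwashing(example_lines)
```

Returning NA rather than the string 'Null' keeps the column well-behaved in R; write.csv() can turn NA into "Null" on output through its na argument.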


###########

The theory I am working with is if I can figure out how to extract these fields and append correctly, then the rest should just be wrapping this up in a for loop.

However, I am struggling to get my head around the extraction and append part. If I can get it to work for one of these fields, I suspect that I can repeat the basic syntax to extract and append the remaining fields.

Therefore, if someone can either suggest a syntax or point me to a useful tutorial, that would be splendid.
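
One possible shape for the extract-and-append step, as a sketch in base R rather than a tested solution. It assumes the text has lines such as "Length: 515 words" and "Byline: Anna Mikhailova" and a line containing a date like "September 24, 2023"; those labels are assumptions about the file layout. All function and column names here are hypothetical.

```r
# Hypothetical helper: text after "Label:" on the first matching line, else NA.
get_field <- function(lines, label) {
  pat <- paste0("^", label, ":[[:space:]]*")
  hit <- grep(pat, lines, value = TRUE)
  if (length(hit) == 0) return(NA_character_)
  trimws(sub(pat, "", hit[1]))
}

# Month and year from the first line containing "Month D, YYYY".
# month.name holds base R's English month names, so the locale is not used.
parse_month_year <- function(lines) {
  pat <- "([A-Z][a-z]+) [0-9]{1,2}, ([0-9]{4})"
  hit <- grep(pat, lines, value = TRUE)
  if (length(hit) == 0) return(c(month = NA_integer_, year = NA_integer_))
  m <- regmatches(hit[1], regexec(pat, hit[1]))[[1]]
  c(month = match(m[2], month.name), year = as.integer(m[3]))
}

# One row per article; the column names are illustrative.
article_row <- function(lines, title) {
  my <- parse_month_year(lines)
  data.frame(
    title  = title,
    month  = my[["month"]],
    year   = my[["year"]],
    length = as.numeric(gsub("[^0-9]", "", get_field(lines, "Length"))),
    author = get_field(lines, "Byline"),
    stringsAsFactors = FALSE
  )
}

# The loop: one row per file, bound together into a single data frame.
# read_fun is any function returning the document text as a character
# vector, e.g. textreadr::read_docx.
build_table <- function(paths, read_fun) {
  rows <- lapply(paths, function(p) {
    article_row(read_fun(p), tools::file_path_sans_ext(basename(p)))
  })
  do.call(rbind, rows)
}

# Usage (not run):
# paths <- list.files(filepath, pattern = "\\.docx$", full.names = TRUE)
# out <- build_table(paths, read_docx)
# write.csv(out, "articles.csv", row.names = FALSE, na = "Null")
```

Collecting all rows first and writing the CSV once avoids having to manage headers when appending; the resulting file opens directly in Calc or Excel.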


Thank you in anticipation.

Best wishes
Andy

<snip>

