Hi folks

Thanks for your help and suggestions - very much appreciated.

I now have some working code, using this file I uploaded for public access: https://docs.google.com/document/d/1QwuaWZk6tYlWQXJ3WLczxC8Cda6zVERk/edit?usp=sharing&ouid=103065135255080058813&rtpof=true&sd=true

The small code segment that now works is as follows:

###########

# Load libraries
library(textreadr)
library(tcltk)
library(tidyverse)
#library(officer)
#library(stringr) #for splitting and trimming raw data
#library(tidyr) #for converting to wide format

# I'd like to keep this as it enables more control over the selected directories
# (setwd() returns the previous working directory, so store the chosen one first)
filepath <- tk_choose.dir()
setwd(filepath)

# The following correctly lists the names of all 9 files in my test directory
# (pattern is a regular expression, so escape the dot and anchor the extension)
files <- list.files(filepath, pattern = "\\.docx$")
files
length(files)

# Ideally, I'd like to skip this step by being able to automatically read in the name of each file, but one step at a time:
filename <- "Now they want us to charge our electric cars from litter bins.docx"

# This produces the file content as output when run, and identifies the fields that I want to extract.
read_docx(filename) %>%
  str_split(",") %>%
  unlist() %>%
  str_trim()

###########

What I'd like to accomplish next is to extract the data from selected fields and append it to a spreadsheet (Calc or Excel) under specific columns or, if it is easier, write a CSV which I can then use later.
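As a minimal sketch of the CSV route, assuming each article ends up as a one-row data frame: base R's write.table() can append rows to an existing file, writing the header only when the file is first created. The helper name append_record, the column names and the output path are all hypothetical, chosen only for illustration.

```r
# Hypothetical one-row record; the column names are illustrative only
record <- data.frame(
  title     = "Now they want us to charge our electric cars from litter bins",
  newspaper = "Mail on Sunday (London)",
  stringsAsFactors = FALSE
)

# Hypothetical output path (a temp file here so the demo leaves nothing behind)
out_file <- file.path(tempdir(), "articles.csv")
if (file.exists(out_file)) file.remove(out_file)  # start fresh for this demo only

# Append one record; write the header row only if the file does not exist yet
append_record <- function(record, path) {
  new_file <- !file.exists(path)
  write.table(record, path, sep = ",", row.names = FALSE,
              col.names = new_file, append = !new_file, qmethod = "double")
}

append_record(record, out_file)
append_record(record, out_file)  # second call adds a row without repeating the header

result <- read.csv(out_file, stringsAsFactors = FALSE)
print(result)
```

A CSV written this way opens directly in both Calc and Excel, which avoids needing a separate spreadsheet package for the append step.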

The fields I want to extract are illustrated with reference to the above file, viz.:

The title: "Now they want us to charge our electric cars from litter bins"
The name of the newspaper: "Mail on Sunday (London)"
The publication date: "September 24, 2023" (as a date, preferably split into month and year; the day is not important)
The section: "NEWS"
The page number(s): "16" (as numeric)
The length: "515" (as numeric)
The author: "Anna Mikhailova"
The subject: taken from the Subject section, but only if it matches a threshold, e.g. GREENWASHING >= 50% (here the value is 51%, so it would be included). On a match, select the highest-scoring entry under the "Industry" section (here ELECTRIC MOBILITY (91%)) and append its text and % value. If there is no match for 'Greenwashing', append 'Null' and move on to the next file in the directory.
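The subject rule above could be sketched roughly as follows, using base R only. This assumes the Subject and Industry sections arrive as single strings of the form "TERM (NN%); TERM (NN%)", which is a guess at the file layout rather than something confirmed. The function names parse_terms and classify_article are hypothetical, as are the "HYPOTHETICAL ..." entries in the sample strings; only the GREENWASHING (51%) and ELECTRIC MOBILITY (91%) values come from the example file.

```r
# Split "TERM (NN%); TERM (NN%)" into a data frame of terms and percentages
parse_terms <- function(x) {
  parts <- trimws(strsplit(x, ";", fixed = TRUE)[[1]])
  parts <- parts[grepl("\\([0-9]+%\\)$", parts)]   # keep only "TERM (NN%)" items
  data.frame(
    term = trimws(sub("\\([0-9]+%\\)$", "", parts)),
    pct  = as.numeric(sub("^.*\\(([0-9]+)%\\)$", "\\1", parts)),
    stringsAsFactors = FALSE
  )
}

# Return the top Industry entry if the subject term meets the threshold, else "Null"
classify_article <- function(subject, industry,
                             term = "GREENWASHING", threshold = 50) {
  subj <- parse_terms(subject)
  if (!any(subj$term == term & subj$pct >= threshold)) return("Null")
  ind <- parse_terms(industry)
  if (nrow(ind) == 0) return("Null")  # assumption: no Industry entries also gives "Null"
  top <- ind[which.max(ind$pct), ]
  sprintf("%s (%d%%)", top$term, as.integer(top$pct))
}

# Sample strings: the "HYPOTHETICAL" entries are made up for illustration
subject  <- "GREENWASHING (51%); HYPOTHETICAL TOPIC (40%)"
industry <- "ELECTRIC MOBILITY (91%); HYPOTHETICAL SECTOR (60%)"

matched   <- classify_article(subject, industry)
unmatched <- classify_article("GREENWASHING (30%)", industry)
print(matched)
print(unmatched)
```

Keeping the parsing separate from the threshold rule means the same parse_terms helper serves both sections, and the term and threshold can be changed without touching the parsing.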

###########

My working theory is that once I can extract these fields and append them correctly, the rest should just be a matter of wrapping this up in a for loop.
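The loop itself might look something like the sketch below. To keep it runnable without any .docx files, it reads small text files with readLines(); in the real script that call would presumably be replaced by read_docx() and the pattern by "\\.docx$". The extract_fields function is a hypothetical stand-in for the real extraction, and the demo directory and file contents are invented.

```r
# Set up a throwaway directory with two tiny stand-in files (demo only)
demo_dir <- file.path(tempdir(), "loop_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("First title", "Some paper"),     file.path(demo_dir, "a.txt"))
writeLines(c("Second title", "Another paper"), file.path(demo_dir, "b.txt"))

# Hypothetical extractor: turns the lines of one document into a one-row data frame
extract_fields <- function(lines) {
  data.frame(title = lines[1], newspaper = lines[2], stringsAsFactors = FALSE)
}

# full.names = TRUE returns complete paths, so no setwd() is needed
files <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)

# Collect one row per file in a pre-allocated list, then bind once at the end
rows <- vector("list", length(files))
for (i in seq_along(files)) {
  lines <- readLines(files[i])   # would be read_docx(files[i]) for .docx input
  rows[[i]] <- data.frame(file = basename(files[i]),
                          extract_fields(lines),
                          stringsAsFactors = FALSE)
}
results <- do.call(rbind, rows)
print(results)
```

Binding the rows once after the loop is generally cheaper than growing a data frame inside it, and the combined table can then be written out in a single step.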

However, I am struggling to get my head around the extraction and append part. If I can get it to work for one of these fields, I suspect that I can repeat the basic syntax to extract and append the remaining fields.
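One possible pattern for a single field, which then repeats for the others, is sketched below in base R. It assumes read_docx() yields one character string per line, that the title, newspaper and date are the first three lines, and that the remaining fields carry labels such as "Section:", "Length:" and "Byline:". All of those layout details are assumptions, and get_field is a hypothetical helper name. The values are the ones quoted from the example file.

```r
# Assumed layout of the lines read from one file (labels and order are guesses)
doc <- c(
  "Now they want us to charge our electric cars from litter bins",
  "Mail on Sunday (London)",
  "September 24, 2023",
  "Section: NEWS; Pg. 16",
  "Length: 515 words",
  "Byline: Anna Mikhailova"
)

# Return the text after "Label:" on the first matching line, or NA if absent
get_field <- function(lines, label) {
  hit <- grep(paste0("^", label, ":"), lines, value = TRUE)
  if (length(hit) == 0) return(NA_character_)
  trimws(sub(paste0("^", label, ":"), "", hit[1]))
}

sec      <- get_field(doc, "Section")   # "NEWS; Pg. 16"
date_str <- doc[3]

record <- data.frame(
  title     = doc[1],
  newspaper = doc[2],
  # month.name holds English month names, so this does not depend on the locale
  month     = match(sub(" .*", "", date_str), month.name),
  year      = as.numeric(sub(".*,[[:space:]]*", "", date_str)),
  section   = trimws(strsplit(sec, ";", fixed = TRUE)[[1]][1]),
  page      = as.numeric(sub(".*Pg\\.[[:space:]]*([0-9]+).*", "\\1", sec)),
  length    = as.numeric(sub("[^0-9]*([0-9]+).*", "\\1", get_field(doc, "Length"))),
  author    = get_field(doc, "Byline"),
  stringsAsFactors = FALSE
)
print(record)
```

The same get_field call works for any labelled line, so adding a field is mostly a matter of adding one more column to the data frame.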

Therefore, if someone can either suggest a syntax or point me to a useful tutorial, that would be splendid.

Thank you in anticipation.

Best wishes
Andy

<snip>

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
