Re: [CODE4LIB] scraping or extracting structured data from a pdf

2022-05-13 Thread MJ Ray
On 12 May 2022 20:44:22 GMT+01:00, "Hammer, Erich F" wrote: > Danielle, > > .DOCX files are just a collection of zipped xml and image files. You can see > this by changing the extension (on a copy) on the file and then exploring. It > should be possible to parse out the data from the XML

Re: [CODE4LIB] scraping or extracting structured data from a pdf

2022-05-12 Thread Dr Vinit Kumar
There are several citation parsers available; you may want to explore which one works best for you. I am listing some of them, based on my knowledge and experience:
1. Anystyle.io
2. GROBID
3. Excite
4. Outside
5. biblio-glutton
6. CERMINE
Hope it helps. On Fri, May 13, 2022 at

Re: [CODE4LIB] scraping or extracting structured data from a pdf

2022-05-12 Thread Kevin Hawkins
And for going beyond the bibliographic citations to include abstracts as well, https://grobid.readthedocs.io/en/latest/ might be useful.  --Kevin On 5/12/22 1:49 PM, Julia Bauder wrote: Hi, Danielle, Have you taken a look at https://text2bib.economics.utoronto.ca/ ? If it works for you,

Re: [CODE4LIB] scraping or extracting structured data from a pdf

2022-05-12 Thread Hammer, Erich F
Danielle, .DOCX files are just a collection of zipped XML and image files. You can see this by changing the extension to .zip (on a copy of the file) and then exploring the contents. It should be possible to parse out the data from the XML file(s) and build a structure from it. Erich On Thursday, May 12, 2022
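The zipped-XML approach described above can be sketched in Python using only the standard library. This is a minimal illustration, not a method anyone in the thread posted: the function name `docx_paragraphs` is hypothetical, and it assumes the main document body lives at `word/document.xml` inside the archive, which is where Word conventionally writes it. It returns plain paragraph text only; splitting each entry into citation and abstract would still be a separate step.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(source):
    """Return the text of each non-empty paragraph in a .docx file.

    `source` may be a file path or a binary file-like object.
    Assumption: the body is stored at word/document.xml (Word's usual layout).
    """
    with zipfile.ZipFile(source) as archive:
        xml_bytes = archive.read("word/document.xml")

    root = ET.fromstring(xml_bytes)
    paragraphs = []
    # Each <w:p> is a paragraph; its visible text sits in <w:t> runs.
    for para in root.iter(f"{{{W_NS}}}p"):
        text = "".join(node.text or "" for node in para.iter(f"{{{W_NS}}}t"))
        if text.strip():
            paragraphs.append(text)
    return paragraphs


if __name__ == "__main__":
    # Build a tiny stand-in archive in memory so the sketch runs without a real file.
    demo_xml = (
        f'<w:document xmlns:w="{W_NS}"><w:body>'
        "<w:p><w:r><w:t>Example citation line.</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>Example abstract text.</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", demo_xml)
    buffer.seek(0)
    for line in docx_paragraphs(buffer):
        print(line)
```

For a real file, `docx_paragraphs("bibliography.docx")` (a placeholder filename) would give one string per paragraph, which could then be passed to one of the citation parsers mentioned elsewhere in this thread.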

Re: [CODE4LIB] scraping or extracting structured data from a pdf

2022-05-12 Thread Joe Hourclé
Let’s try this again without my hitting ‘send’ when I want to send it to drafts. (Yay, mystery meat navigation in cell phone interfaces) >> On May 12, 2022, at 2:40 PM, Danielle Reay wrote: >> >> Hello, >> >> We have a faculty member looking to create a dataset from an annotated >>

Re: [CODE4LIB] scraping or extracting structured data from a pdf

2022-05-12 Thread Joe Hourclé
> > On May 12, 2022, at 2:40 PM, Danielle Reay wrote: > > Hello, > > We have a faculty member looking to create a dataset from an annotated > bibliography she compiled. Right now it exists as a word file and as a pdf. > The entries are relatively structured with a citation and an abstract,

Re: [CODE4LIB] scraping or extracting structured data from a pdf

2022-05-12 Thread Miles Fidelman
Danielle Reay wrote: Hello, We have a faculty member looking to create a dataset from an annotated bibliography she compiled. Right now it exists as a word file and as a pdf. The entries are relatively structured with a citation and an abstract, but the document is about 150 pages long with

Re: [CODE4LIB] scraping or extracting structured data from a pdf

2022-05-12 Thread Julia Bauder
Hi, Danielle, Have you taken a look at https://text2bib.economics.utoronto.ca/ ? If it works for you, that's likely to be one of the easiest methods to convert the list into structured data. Best, Julia _ Julia Bauder Social Studies and Data

[CODE4LIB] scraping or extracting structured data from a pdf

2022-05-12 Thread Danielle Reay
Hello, We have a faculty member looking to create a dataset from an annotated bibliography she compiled. Right now it exists as a Word file and as a PDF. The entries are relatively structured with a citation and an abstract, but the document is about 150 pages long with multiple entries per page.