[CODE4LIB] Spark in the Dark Call: Tues Nov 21 4 PM eastern / 1 PM Pacific

Christina Marie Harlow Thu, 16 Nov 2017 15:57:59 -0800

Hi all-

Just a ping to the list about the Code4Lib Slack's upcoming, informal, totally 
rad "Spark in the Dark" (#sparkinthedark) talk next week.


We have a wildly informal, super fun and mad informative call next week, 
Tuesday, November 21st at 4 PM Eastern / 1 PM Pacific, on Text Analysis at 
Scale by Corey Harper & Jessica Cox (see their blurb below). If you’re 
interested, you can join the call here: https://stanford.zoom.us/j/4167209074

And if you haven’t yet, join us on our Code4Lib Slack channel, #sparkinthedark.

— — Talk details — —

Spark at Elsevier: Tools for Text Analysis at Scale

This talk is a hybrid of a talk on Citing Sentences analysis given at PyGotham 
2017 and a second talk on AnnotationQuery Use Cases presented internally to 
Elsevier.

The first half of the talk will be focused on doing Natural Language Processing 
(NLP) in a Python-based Spark environment using PySpark. Examples will be drawn 
from a Citing Sentences project underway within Elsevier Labs 
(http://labs.elsevier.com/). The goal of this project is to build and analyze 
citation networks to understand the diffusion and flow of ideas through the 
scientific research landscape. Much like users of a social network, scientists 
want to understand how others are ‘talking’ about their papers. Are they 
supporting their work? Disagreeing with it? Is it being referred to as a 
discovery? PySpark code will be demoed using the Community Edition of 
Databricks, and the talk will cover using the Databricks environment to manage 
Spark clusters. A Databricks notebook and sample dataset will be provided at 
the end of the talk.

The second half of the talk will introduce AnnotationQuery. Recently 
open-sourced by Elsevier Labs, AnnotationQuery is a set of composable (and 
extensible) functions that allow users to query annotations generated from 
full-text content at scale. We will introduce our internal Content Analysis 
Toolbench (CAT3) annotation format. We will then use another set of Databricks 
notebooks, this time in Scala, to show how AnnotationQuery combines structural 
and natural language content to support powerful text mining pipelines. We 
will focus on a use case about extracting units and measures contained within 
article text. These measurements can then be used in a variety of analyses of 
experimental conditions and entity properties, from mouse bioterium 
temperatures to compressive strengths of concrete.

————

Thanks!
Christina

Christina Harlow
Data Operations Engineer
Digital Library Systems and Services
Stanford, CA 94305
[email protected]
