Description

The goal of the Coleridge Initiative at NYU is to use data to transform the way 
governments access and use data for the social good.  We are a fast-growing 
university-based startup that has already created dozens of pilot projects, 
worked with over 100 agencies – federal, state and local – and trained over 450 
agency staff.  Our program directors – Julia Lane, Rayid Ghani, and Frauke 
Kreuter – have designed and implemented training programs, research projects 
and a secure data facility that are attracting national attention, including 
from the Commission on Evidence-Based Policymaking and the Federal Data 
Strategy.



Our Team

Our team works with government agencies to break down data barriers around the 
secure use of confidential data.  We do this in two ways.  We have developed a 
secure environment for data (the Administrative Data Research Facility, or 
ADRF: https://coleridgeinitiative.org/computing), and are building new tools 
for data stewardship, data discovery and collaboration with some of the top 
scientists in the nation.  We work with government agencies to (1) identify 
critical agency problems, (2) train staff to solve them, and (3) create 
products that have value.  You can read more about our work at 
https://coleridgeinitiative.org.



Role & Responsibilities

We are seeking an enthusiastic, analytically minded Research Information 
Scientist with extensive experience working with data and research processes, 
as well as demonstrated experience in information or content management.   The 
Research Information Scientist will be the lead on the full life cycle of data 
ingestion and storage in the ADRF. This is detail-oriented work, and the 
successful candidate will have complementary technical skills in data 
management, programming, and user experience as well as knowledge of current 
technologies, metadata standards and encoding standards (e.g., XML).



The Research Information Scientist will design and develop highly robust, 
repeatable and scalable workflow patterns to ingest, integrate and publish a 
wide variety of data from internal and external sources. The successful 
candidate will be responsible for ensuring that the ADRF’s data workflows and 
pipelines are enterprise-grade – reliable, scalable and secure – and for 
maintaining infrastructure and operations to support data science activities.  
The Research Information Scientist will focus on performance tuning, quickly 
identifying bottlenecks through review of SQL execution plans to maximize ADRF 
resource utilization and system performance. The successful candidate will also 
work directly with ADRF development and operations team members, as well as 
collaborators and clients, to build out semi-automated approaches to data 
management, with an emphasis on data quality automation as the Coleridge 
Initiative builds to scale.



The Research Information Scientist’s responsibilities will include:



Managing the data ingestion process and troubleshooting and resolving any 
resulting issues, ensuring the integrity and security of data housed in the ADRF

Performing preliminary quality assessment on data files, correcting obvious 
issues and then formatting files for ingestion

Contributing, as part of a team, to ADRF platform enhancement projects using 
appropriate technologies in research and large-scale data management (e.g., 
Hadoop and contemporaries, parallel databases, cloud services), and/or 
interactive visualization and specialized data presentation interfaces

Implementing and documenting data ingestion best practices



Qualifications

Credentials

Master’s Degree in Information Science, Library Science, or Computer Science



Skills

Proven experience successfully managing the full ETL and data preparation life 
cycle of large datasets in a data warehouse

Proficiency in programming. Required: ETL, metadata harvesting, distributed ETL 
programming and debugging, PySpark, AWS Glue jobs, and AWS Glue development 
endpoints

Experience with relational and non-relational databases and other data storage 
and access technologies, such as MySQL, PostgreSQL, Aurora, Citus Data, Oracle, 
Hadoop, Spark, and/or AWS Athena

Strong communication skills and the ability to work well in a team



Additional Desired Experience & Skills

Proficiency in Java, JavaScript, and HTML

Experience with development of web applications and APIs using open source 
software

Experience working with large scale administrative datasets

Knowledge of key open source software resources

Prior experience with SQL and with database technologies such as PostgreSQL

Demonstrated ability to write analytical reports



Application Instructions

Please include a resume and cover letter. 


----
Brought to you by code4lib jobs: 
https://jobs.code4lib.org/jobs/42488-research-information-scientist