Context

The objective of the NanoBubbles ERC Synergy project 
(https://nanobubbles.hypotheses.org) is 
to understand how, when and why science fails to correct itself. The project 
focuses on claims made within the field of nanobiology. Project members combine 
approaches from the natural sciences, computer science, and the social sciences 
and humanities (Science and Technology Studies) to understand how error 
correction in science works and what obstacles it faces. For this purpose, we 
aim to trace claims and corrections through various channels of scientific 
communication (journals, social media, advertisements, conference programs, 
etc.) via both qualitative and digital methods.


Internship objectives

Entity recognition is an important step for downstream processing in natural 
language processing. It consists of identifying the entities in a corpus that 
belong to a specific domain and labeling them. Training methods relying on 
large annotated corpora are usually used for this purpose. However, such 
resources are not always available for specific domains, and alternative 
methods have to be employed (Hedderich 2020).
Distant supervision (Mintz 2009) is a technique used to automatically label 
textual data using external resources such as dictionaries (Shang 2018), 
gazetteers, ontologies (Wang 2021) and knowledge bases (Sun 2019). This enables 
the construction of a training corpus without the need for manual annotation. 
In specialized domains, this is especially useful for annotating complex and 
discontinuous entities, with which human annotators may struggle (Khandelwal 
2022).
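
As an illustration only, dictionary-based distant labeling can be sketched in 
a few lines of Python. This is a minimal sketch and not the method to be 
developed during the internship: the lexicon entries, the entity types 
(NANOMATERIAL, CELL_LINE) and the function name distant_label are hypothetical, 
whitespace tokenization is assumed, and only contiguous entities are handled 
(discontinuous entities would need a richer matching scheme).

```python
from typing import Dict, List, Tuple


def distant_label(
    tokens: List[str],
    lexicon: Dict[Tuple[str, ...], str],
    max_len: int = 4,
) -> List[str]:
    """Assign BIO tags by greedy longest-match lookup in a lexicon."""
    labels = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]  # case-insensitive matching
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span starting at position i first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            span = tuple(lowered[i:i + n])
            if span in lexicon:
                entity_type = lexicon[span]
                labels[i] = f"B-{entity_type}"
                for j in range(i + 1, i + n):
                    labels[j] = f"I-{entity_type}"
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return labels


# Hypothetical toy lexicon; a real one would be derived from a dictionary,
# gazetteer, ontology or knowledge base.
LEXICON = {
    ("gold", "nanoparticles"): "NANOMATERIAL",
    ("nanoparticles",): "NANOMATERIAL",
    ("hela", "cells"): "CELL_LINE",
}

tokens = "Gold nanoparticles were internalized by HeLa cells".split()
labels = distant_label(tokens, LEXICON)
for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")
```

Labels produced this way are noisy (missed entities, ambiguous matches), which 
is why they are typically used as training data for a model rather than as 
final annotations.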

The objective of this internship is to implement a method to automatically 
annotate a corpus of scientific documents in the nanobiology domain, using 
existing resources. The intern will then employ existing deep learning 
approaches (Liang 2020) to train an entity extraction model for this domain.


Skills

• Enrollment in a Master's program in natural language processing, computer 
science or data science.
• Good programming skills in Python, including experience with natural 
language processing tools and methods, and knowledge of machine learning and 
deep learning frameworks and of the semantic web.
• Ability to communicate and write in English is a plus.

Scientific environment

The work will be conducted within the Sigma team of the LIG laboratory 
(http://sigma.imag.fr). The recruited person will join a team that offers a 
stimulating, multinational and pleasant working environment.


Instructions for applying

Applications must contain a CV, a letter or message of motivation, Master's 
grades, and letter(s) of recommendation (or names of potential referees), and 
be addressed to Cyril Labbé ([email protected]) and Amira Barhoumi 
([email protected]). Applications will be considered as they 
arrive. It is therefore advisable to apply as soon as possible.


References

• Mintz, M., Bills, S., Snow, R., & Jurafsky, D. (2009, August). Distant 
supervision for relation extraction without labeled data. In Proceedings of the 
Joint Conference of the 47th Annual Meeting of the ACL and the 4th 
International Joint Conference on Natural Language Processing of the AFNLP (pp. 
1003-1011).
• Shang, J., Liu, L., Ren, X., Gu, X., Ren, T., & Han, J. (2018). Learning 
named entity tagger using domain-specific dictionary. arXiv preprint 
arXiv:1809.03599.
• Sun, Y., & Loparo, K. (2019, July). Information extraction from free text in 
clinical trials with knowledge-based distant supervision. In 2019 IEEE 43rd 
Annual Computer Software and Applications Conference (COMPSAC) (Vol. 1, pp. 
954-955). IEEE.
• Wang, X., Hu, V., Song, X., Garg, S., Xiao, J., & Han, J. (2021, November). 
CHEMNER: Fine-Grained Chemistry Named Entity Recognition with Ontology-Guided 
Distant Supervision. In Proceedings of the 2021 Conference on Empirical Methods 
in Natural Language Processing (pp. 5227-5240).
• Liang, C., Yu, Y., Jiang, H., Er, S., Wang, R., Zhao, T., & Zhang, C. (2020, 
August). BOND: BERT-assisted open-domain named entity recognition with distant 
supervision. In Proceedings of the 26th ACM SIGKDD International Conference on 
Knowledge Discovery & Data Mining (pp. 1054-1064).
• Hedderich, M. A., Lange, L., Adel, H., Strötgen, J., & Klakow, D. (2020). A 
survey on recent approaches for natural language processing in low-resource 
scenarios. arXiv preprint arXiv:2010.12309.
• Khandelwal, A., Kar, A., Chikka, V. R., & Karlapalem, K. (2022, May). 
Biomedical NER using Novel Schema and Distant Supervision. In Proceedings of 
the 21st Workshop on Biomedical Language Processing (pp. 155-160).
