Hello everyone,

Quick reminder that we'll be starting this month's showcase, focused on *Curation of Wikimedia AI Datasets*, in about 45 minutes at https://youtube.com/live/USzLGJ5LLC8?feature=share.
Best,
Kinneret

On Fri, Sep 13, 2024 at 1:56 PM Kinneret Gordon <[email protected]> wrote:

> Hi all,
>
> The next Research Showcase will be live-streamed next Wednesday, September 18, at 9:30 AM PST / 16:30 UTC. Find your local time here <https://zonestamp.toolforge.org/1726677000>. The theme for this showcase is *Curation of Wikimedia AI Datasets*.
>
> You are welcome to watch via the YouTube stream: https://youtube.com/live/USzLGJ5LLC8?feature=share. As usual, you can join the conversation in the YouTube chat as soon as the showcase goes live.
>
> This month's presentations:
>
> Supporting Community-Driven Data Curation for AI Evaluation on Wikipedia through Wikibench
> By *Tzu-Sheng Kuo, Carnegie Mellon University*
> AI tools are increasingly deployed in community contexts. However, datasets used to evaluate AI are typically created by developers and annotators outside a given community, which can yield misleading conclusions about AI performance. How might we empower communities to drive the intentional design and curation of evaluation datasets for AI that impacts them? We investigate this question on Wikipedia, an online community with multiple AI-based content moderation tools deployed. We introduce Wikibench, a system that enables communities to collaboratively curate AI evaluation datasets, while navigating ambiguities and differences in perspective through discussion. A field study on Wikipedia shows that datasets curated using Wikibench can effectively capture community consensus, disagreement, and uncertainty. Furthermore, study participants used Wikibench to shape the overall data curation process, including refining label definitions, determining data inclusion criteria, and authoring data statements. Based on our findings, we propose future directions for systems that support community-driven data curation.
>
> WikiContradict: A Benchmark for Evaluating LLMs on Real-World Knowledge Conflicts from Wikipedia
> By *Yufang Hou, IBM Research Europe - Ireland*
> Retrieval-augmented generation (RAG) has emerged as a promising solution to mitigate the limitations of large language models (LLMs), such as hallucinations and outdated information. However, it remains unclear how LLMs handle knowledge conflicts arising from different augmented retrieved passages, especially when these passages originate from the same source and have equal trustworthiness. In this work, we conduct a comprehensive evaluation of LLM-generated answers to questions that have varying answers based on contradictory passages from Wikipedia, a dataset widely regarded as a high-quality pre-training resource for most LLMs. Specifically, we introduce WikiContradict, a benchmark consisting of 253 high-quality, human-annotated instances designed to assess LLM performance when augmented with retrieved passages containing real-world knowledge conflicts. We benchmark a diverse range of both closed and open-source LLMs under different QA scenarios, including RAG with a single passage and RAG with two contradictory passages. Through rigorous human evaluations on a subset of WikiContradict instances involving 5 LLMs and over 3,500 judgements, we shed light on the behaviour and limitations of these models. For instance, when provided with two passages containing contradictory facts, all models struggle to generate answers that accurately reflect the conflicting nature of the context, especially for implicit conflicts requiring reasoning. Since human evaluation is costly, we also introduce an automated model that estimates LLM performance using a strong open-source language model, achieving an F-score of 0.8. Using this automated metric, we evaluate more than 1,500 answers from seven LLMs across all WikiContradict instances. To facilitate future work, we release WikiContradict at: https://ibm.biz/wikicontradict.
>
> Best,
> Kinneret
>
> --
>
> Kinneret Gordon
> Lead Research Community Officer
> Wikimedia Foundation <https://wikimediafoundation.org/>
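For readers curious what the "RAG with two contradictory passages" scenario in the WikiContradict abstract looks like in practice, here is a minimal sketch of building such a QA prompt. All field names, example passages, and prompt wording below are illustrative assumptions, not the benchmark's actual format:

```python
# Hypothetical sketch of the contradictory-passage QA setup described in the
# WikiContradict abstract. The instance fields and prompt wording are
# assumptions for illustration, not the benchmark's real schema.

def build_prompt(question, passages):
    """Assemble a RAG-style prompt that augments a question with passages."""
    context = "\n\n".join(
        f"Passage {i + 1}: {p}" for i, p in enumerate(passages)
    )
    return (
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer using only the passages above. If they conflict, "
        "say so and report both answers."
    )

# A made-up instance with an explicit factual conflict between passages.
instance = {
    "question": "In what year was the bridge completed?",
    "passage_a": "The bridge was completed in 1932.",
    "passage_b": "Construction of the bridge finished in 1934.",
}

# The two scenarios mentioned in the abstract: single-passage RAG vs.
# RAG with two contradictory passages.
single = build_prompt(instance["question"], [instance["passage_a"]])
conflict = build_prompt(
    instance["question"],
    [instance["passage_a"], instance["passage_b"]],
)
```

The interesting comparison is then whether a model answering `conflict` surfaces both 1932 and 1934 (and flags the disagreement), or silently picks one, which the abstract reports most models do for implicit conflicts.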
_______________________________________________
Analytics mailing list -- [email protected]
To unsubscribe send an email to [email protected]
