Hello everyone,

Quick reminder that we'll be starting this month's showcase, focused on *Curation of Wikimedia AI Datasets*, in about 45 minutes at https://youtube.com/live/USzLGJ5LLC8?feature=share.
Best,
Kinneret

On Fri, Sep 13, 2024 at 1:56 PM Kinneret Gordon <[email protected]> wrote:

> Hi all,
>
> The next Research Showcase will be live-streamed next Wednesday, September 18, at 9:30 AM PST / 16:30 UTC. Find your local time here <https://zonestamp.toolforge.org/1726677000>. The theme for this showcase is *Curation of Wikimedia AI Datasets*.
>
> You are welcome to watch via the YouTube stream: https://youtube.com/live/USzLGJ5LLC8?feature=share. As usual, you can join the conversation in the YouTube chat as soon as the showcase goes live.
>
> This month's presentations:
>
> Supporting Community-Driven Data Curation for AI Evaluation on Wikipedia through Wikibench
> By *Tzu-Sheng Kuo, Carnegie Mellon University*
> AI tools are increasingly deployed in community contexts. However, datasets used to evaluate AI are typically created by developers and annotators outside a given community, which can yield misleading conclusions about AI performance. How might we empower communities to drive the intentional design and curation of evaluation datasets for AI that impacts them? We investigate this question on Wikipedia, an online community with multiple AI-based content moderation tools deployed. We introduce Wikibench, a system that enables communities to collaboratively curate AI evaluation datasets, while navigating ambiguities and differences in perspective through discussion. A field study on Wikipedia shows that datasets curated using Wikibench can effectively capture community consensus, disagreement, and uncertainty. Furthermore, study participants used Wikibench to shape the overall data curation process, including refining label definitions, determining data inclusion criteria, and authoring data statements. Based on our findings, we propose future directions for systems that support community-driven data curation.
>
> WikiContradict: A Benchmark for Evaluating LLMs on Real-World Knowledge Conflicts from Wikipedia
> By *Yufang Hou, IBM Research Europe - Ireland*
> Retrieval-augmented generation (RAG) has emerged as a promising solution to mitigate the limitations of large language models (LLMs), such as hallucinations and outdated information. However, it remains unclear how LLMs handle knowledge conflicts arising from different augmented retrieved passages, especially when these passages originate from the same source and have equal trustworthiness. In this work, we conduct a comprehensive evaluation of LLM-generated answers to questions that have varying answers based on contradictory passages from Wikipedia, a dataset widely regarded as a high-quality pre-training resource for most LLMs. Specifically, we introduce WikiContradict, a benchmark consisting of 253 high-quality, human-annotated instances designed to assess LLM performance when augmented with retrieved passages containing real-world knowledge conflicts. We benchmark a diverse range of both closed and open-source LLMs under different QA scenarios, including RAG with a single passage and RAG with two contradictory passages. Through rigorous human evaluations on a subset of WikiContradict instances involving 5 LLMs and over 3,500 judgements, we shed light on the behaviour and limitations of these models. For instance, when provided with two passages containing contradictory facts, all models struggle to generate answers that accurately reflect the conflicting nature of the context, especially for implicit conflicts requiring reasoning. Since human evaluation is costly, we also introduce an automated model that estimates LLM performance using a strong open-source language model, achieving an F-score of 0.8. Using this automated metric, we evaluate more than 1,500 answers from seven LLMs across all WikiContradict instances. To facilitate future work, we release WikiContradict at: https://ibm.biz/wikicontradict.
>
> Best,
> Kinneret
>
> --
>
> Kinneret Gordon
> Lead Research Community Officer
> Wikimedia Foundation <https://wikimediafoundation.org/>
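For readers curious what the "RAG with two contradictory passages" scenario in the WikiContradict abstract looks like in practice, here is a minimal sketch of building such a QA prompt. All field names, example passages, and prompt wording below are illustrative assumptions, not the benchmark's actual format:

```python
# Hypothetical sketch of the contradictory-passage QA setup described in the
# WikiContradict abstract. The instance fields and prompt wording are
# assumptions for illustration, not the benchmark's real schema.

def build_prompt(question, passages):
    """Assemble a RAG-style prompt that augments a question with passages."""
    context = "\n\n".join(
        f"Passage {i + 1}: {p}" for i, p in enumerate(passages)
    )
    return (
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer using only the passages above. If they conflict, "
        "say so and report both answers."
    )

# A made-up instance with an explicit factual conflict between passages.
instance = {
    "question": "In what year was the bridge completed?",
    "passage_a": "The bridge was completed in 1932.",
    "passage_b": "Construction of the bridge finished in 1934.",
}

# The two scenarios mentioned in the abstract: single-passage RAG vs.
# RAG with two contradictory passages.
single = build_prompt(instance["question"], [instance["passage_a"]])
conflict = build_prompt(
    instance["question"],
    [instance["passage_a"], instance["passage_b"]],
)
```

The interesting comparison is then whether a model answering `conflict` surfaces both 1932 and 1934 (and flags the disagreement), or silently picks one, which the abstract reports most models do for implicit conflicts.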
_______________________________________________
Analytics mailing list -- [email protected]
To unsubscribe send an email to [email protected]
