Hello,

tl;dr: this is a proof of concept about using an LLM to enrich CVE feeds. Links to the results are further down so you can see the difference. If you think it's useful, please say so. If you think it's not useful, please say so too.
Motivation: the CVE checker associates CVEs with recipes based on the CPE information in the CVE entry. Unfortunately quite a few CVE entries are missing this information entirely, making it impossible to associate them with any recipe. Looking at this year so far, over 66000 CVEs have been opened, of which over 15000 are missing CPEs. Older entries seem to have a better CPE-to-CVE ratio, but for this PoC I'm mostly interested in the latest vulnerabilities.

The idea: when the CPE information is missing, try to derive it from the human-language description and the reference links of the CVE, using an LLM. The intuition is that a good portion of the derived data would be usable, and even though it wouldn't be perfect, it would catch more valid CVEs than we do without it. Below I describe my experiments and experiences with this idea.

The setup: I used a local LLM with ollama, running the llama3.1:8b model. The hardware is a Ryzen 5950X, 32GB RAM and an RTX 3060 with 12GB VRAM, running EndeavourOS. The prompt is part of the patch.

The initial load takes a very long time (that's why I restricted the amount of data for this PoC). Deriving a single missing CPE ID takes about a second. My initial approach was to process all post-2020 CVEs. It took about an hour to get to the start of 2024, but the run failed during testing (due to a processing bug in my code - the performance numbers were unaffected). After fixing my code I restricted processing to the entries from 2025 only - I finally wanted to see some results, and whether they make any difference. Processing the current year ultimately took 280 minutes. (I also skipped the CVEs associated with the Linux kernel to speed up the process.) The run fully used 2 CPU cores, suggesting it could be parallelized, but on closer inspection my GPU turned out to be the bottleneck: nvidia-smi showed 96-98% utilization, and ollama was using ~5.5GB VRAM.
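For illustration, the per-CVE derivation step could be sketched roughly like this. This is not the actual patch: it assumes ollama's default REST endpoint on localhost:11434, and the prompt here is a made-up placeholder (the real prompt ships with the patch):

```python
import json
import urllib.request

# Default endpoint of a locally running ollama instance (assumption).
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_prompt(cve_id, description, references):
    # Hypothetical prompt; the real one is part of the patch. We ask for
    # strict JSON with a single "CPE_ID" key so the reply is machine-parseable.
    return (
        f"Derive the CPE 2.3 identifier for {cve_id} from the description "
        f'and reference links below. Reply with JSON containing exactly one '
        f'key, "CPE_ID".\n\n'
        f"Description: {description}\n"
        f"References: {', '.join(references)}\n"
    )

def derive_cpe(cve_id, description, references, model="llama3.1:8b"):
    payload = json.dumps({
        "model": model,
        "prompt": build_prompt(cve_id, description, references),
        "stream": False,
        "format": "json",  # ask ollama to constrain the reply to valid JSON
    }).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        reply = json.loads(resp.read())["response"]
    return json.loads(reply).get("CPE_ID")
```

Even with constrained JSON output, the reply still needs validation before it can be trusted, as the next section shows.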
(Another idea would be to group the CVEs before sending them to the LLM, but I haven't tested that.)

While processing the entries I noticed that the model produced some strange hallucinations. I expected incorrect vendor/product details, but strangely it also produced incorrect data structures in its responses: instead of the "CPE_ID" key, it sometimes returns "OVP_ID". This affects only a very small portion of the entries, and currently I drop these. It also tends to drop asterisks from the final CPE ID. That happens quite frequently, in around 20% of the cases. I've heard that intimidating the LLM with some F-bombs in the prompt helps, but I haven't tried that yet - for now I try to salvage the responses that are missing only one asterisk. All this also means that the CPE derivation is not 100% reproducible - but the idea is that only a small portion is non-reproducible, and we are only interested in a small portion anyway. In the code you might also notice that I do some postprocessing of the response... well, LLMs are like that today, I guess. There is no free lunch.

Here are some links with the actual results, executed on a few-days-old oe-core. Only CVE-2025-* entries were processed with the LLM:

Without LLM, cve-summary.json (~30MB uncompressed):
https://sarvari.me/yocto/llm/without_llm.json.tar.gz

Without LLM, human-readable extract (unpatched):
https://sarvari.me/yocto/llm/no-llm.txt

With LLM, cve-summary.json (~30MB uncompressed):
https://sarvari.me/yocto/llm/with_llm.json.tar.gz

With LLM, human-readable extract:
https://sarvari.me/yocto/llm/llm.txt

With LLM + commentary about each LLM-guessed entry (LibreOffice Calc; I recommend this one):
https://sarvari.me/yocto/llm/llm_cves_with_comments.ods

This patch is just a proof of concept. I'm not sure if/how it could be integrated into the project's infrastructure - in particular, the initial load is very heavy, and the patch requires GPU(s).
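To make the postprocessing mentioned above concrete, here is a minimal sketch of the kind of salvaging described: drop responses with a hallucinated key name, and repair CPE IDs that are missing exactly one asterisk. The helper name is hypothetical, and it assumes the missing asterisk is a trailing one (a full CPE 2.3 string has 13 colon-separated fields):

```python
import json

# "cpe" + "2.3" + 11 components (part, vendor, product, version, update,
# edition, language, sw_edition, target_sw, target_hw, other) = 13 fields.
EXPECTED_FIELDS = 13

def extract_cpe(raw_response):
    """Salvage a CPE ID from a possibly malformed LLM reply (hypothetical helper)."""
    try:
        data = json.loads(raw_response)
    except json.JSONDecodeError:
        return None
    # The model sometimes invents key names ("OVP_ID" instead of "CPE_ID");
    # those responses are simply dropped, as described in the mail.
    cpe = data.get("CPE_ID")
    if not isinstance(cpe, str) or not cpe.startswith("cpe:2.3:"):
        return None
    fields = cpe.split(":")
    # Salvage replies that dropped exactly one asterisk (assumed trailing).
    if len(fields) == EXPECTED_FIELDS - 1:
        fields.append("*")
    if len(fields) != EXPECTED_FIELDS:
        return None  # too mangled, drop it
    return ":".join(fields)
```

This is only an illustration of the idea; a dropped asterisk in the middle of the string would need fuzzier repair logic.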
As a personal opinion, the final result turned out to be better than I initially expected. But I can't deny that it also brings noise, especially due to missing or incorrectly extracted version info. And of course it doesn't make the CVE info complete - it just extends what we have now.

If you would like to try it out, I recommend deleting/backing up the downloads/CVE_CHECK2/nvdfkie_1-1.db file before trying. (Delete it after trying too, because the patch changes the schema a bit.)

What do you think? If you are not completely averse to LLMs and this idea, I could spend more time on it and submit something that's more than a PoC. Looking forward to all kinds of feedback, opinions and recommendations, be they positive or negative.

Thank you for coming to my TED talk.

---
Gyorgy Sarvari (1):
  cve-update-db: bolt LLM on top of it

 meta/classes/cve-check.bbclass                |   6 +-
 .../recipes-core/meta/cve-update-db-native.bb | 138 ++++++++++++++++--
 2 files changed, 131 insertions(+), 13 deletions(-)
