[FFmpeg-devel] [PATCH v2 FFmpeg 0/20] Zero-Shot Classification Support for FFMPEG (CLIP and CLAP)

m.kaindl0208 Tue, 11 Mar 2025 08:52:04 -0700

Hi,

I'm excited to propose a series of patches adding support for modern zero-shot 
classification models to FFmpeg. These patches enable FFmpeg to leverage CLIP 
(Contrastive Language-Image Pre-training) and CLAP (Contrastive Language-Audio 
Pre-training) models for media classification.
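
For readers unfamiliar with zero-shot classification, here is a minimal sketch of the general idea, not the code from these patches. It assumes the model has already produced one embedding for the media and one per text prompt. The function names, the toy 3-dimensional embeddings and the logit_scale value of 100 (a value commonly associated with CLIP) are all assumptions made for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_classify(media_embedding, prompt_embeddings, logit_scale=100.0):
    """Score one media embedding against each text prompt embedding.

    prompt_embeddings maps a text prompt to its embedding (hypothetical layout).
    logit_scale is an assumed temperature, not a value taken from the patches.
    """
    labels = list(prompt_embeddings)
    logits = [
        logit_scale * cosine_similarity(media_embedding, prompt_embeddings[label])
        for label in labels
    ]
    return dict(zip(labels, softmax(logits)))

# Toy embeddings; real CLIP/CLAP embeddings have hundreds of dimensions.
scores = zero_shot_classify(
    [0.9, 0.1, 0.0],
    {
        "a photo of a dog": [1.0, 0.0, 0.0],
        "a photo of a cat": [0.0, 1.0, 0.0],
    },
)
best = max(scores, key=scores.get)
```

Because the labels are just text prompts, changing the set of classes means changing the prompts; no retraining is involved.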


Key Features:
- Zero-shot classification support: Use text prompts to classify media
  without training specific models
- Audio classification with CLAP: Extend FFmpeg's DNN capabilities to audio
  content
- Hierarchical classification: Group classification categories with a new
  category file format
- Stream classification averaging: New avgclass filter for averaging
  classification results
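
To illustrate what averaging classification results over a stream can mean, here is a small sketch. It is an assumption about the general technique, not a description of how the avgclass filter is implemented: the function name and the per-frame dictionary layout are hypothetical, and a label missing from a frame is simply counted as probability 0 for that frame.

```python
from collections import defaultdict

def average_classifications(per_frame_scores):
    """Average label probabilities over a sequence of per-frame results.

    per_frame_scores is a list of {label: probability} dicts (hypothetical
    layout). Labels absent from a frame contribute 0 for that frame.
    """
    count = len(per_frame_scores)
    if count == 0:
        return {}
    totals = defaultdict(float)
    for frame_scores in per_frame_scores:
        for label, prob in frame_scores.items():
            totals[label] += prob
    return {label: total / count for label, total in totals.items()}

frames = [
    {"dog": 0.8, "cat": 0.2},
    {"dog": 0.6, "cat": 0.4},
]
averaged = average_classifications(frames)
```

Averaging smooths out single-frame misclassifications, at the cost of hiding short events that only appear in a few frames.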

Implementation Details:
The implementation adds tokenizer support to the LibTorch backend using the 
tokenizers-cpp library. The existing dnn_classify filter has been changed 
from a video-only filter into a multimedia filter that supports both video and 
audio inputs, selected by a configuration flag.
For video, the implementation supports both standard/original classification 
(OpenVINO backend) and CLIP (Torch backend). For audio, it adds CLAP support 
via the Torch backend.

For further details, please refer to the documentation.

For model conversion/scripting or step-by-step installation, see my GitHub 
project: https://github.com/MaximilianKaindl/DeepFFMPEGVideoClassification

Regarding CLAP models, they unfortunately need to be traced due to NumPy weak 
references, which seems to lock in the device used for tracing.

For audio preprocessing, I've implemented two functions, handle_long_audio and 
handle_short_audio, which imitate the original CLAP preprocessor. These 
functions aren't used by default, since classify automatically buffers frames 
to the desired length, but they might improve performance, especially 
handle_short_audio, which repeats parts of the audio. That's why I've kept 
them in place.
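
As a rough illustration of the idea behind fitting audio to a fixed input length, here is a minimal sketch. It does not reproduce handle_short_audio or handle_long_audio from the patches; the function names, the plain-list sample representation and the choice to crop from the start are all assumptions, and the real functions may pick the crop or fill the remainder differently.

```python
def repeat_to_length(samples, target_len):
    """Repeat a short clip until it fills target_len samples."""
    if not samples:
        raise ValueError("cannot repeat an empty clip")
    repeats = -(-target_len // len(samples))  # ceiling division
    return (samples * repeats)[:target_len]

def crop_to_length(samples, target_len, offset=0):
    """Keep target_len samples starting at offset (assumed: start of clip)."""
    return samples[offset:offset + target_len]

def fit_to_length(samples, target_len):
    """Return exactly target_len samples by repeating or cropping."""
    if len(samples) < target_len:
        return repeat_to_length(samples, target_len)
    return crop_to_length(samples, target_len)

short_clip = fit_to_length([1, 2, 3], 7)
long_clip = fit_to_length(list(range(10)), 4)
```

Repeating a short clip keeps real signal in the whole input window, where padding with silence would leave part of it empty.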

I could use help ensuring my implementation doesn't interfere with the original 
dnn_classification or dnn_processing functionality. Thanks!

Furthermore, should I upload tests for this functionality? The models are 
large, upwards of 500 MB.

This time the patches should apply cleanly; I was able to apply them on my 
machine.

Signed-off-by: MaximilianKaindl <[email protected]>

