Hi all, Friendly ping on the patch below (sent 23 May, link in Patchwork: https://patchwork.ffmpeg.org/project/ffmpeg/patch/20250523215814.365246-1-f1k2fa...@gmail.com/ ).
Patch summary ------------- * Adds a CUDA implementation of the existing stack_* filter family (parallels stack_qsv / stack_vaapi). * Supports up to 16 inputs and both horizontal/vertical layouts. If anything needs adjustment (coding-style, fate naming, etc.) please let me know and I’ll resend an updated v2. Many thanks for your time! Best regards, Faeez Kadiri On Sat, May 24, 2025 at 3:28 AM Faeez Kadiri <f1k2fa...@gmail.com> wrote: > Add hardware-accelerated stack filters for CUDA that provide equivalent > functionality to the software stack filters but with GPU acceleration. > > Features: > - Support for hstack, vstack, and xstack operations > - Compatible pixel formats such as: > yuv420p, nv12, yuv444p, p010le, p016le, yuv444p16le, rgb0, bgr0, rgba, > bgra > - Fill color support with automatic RGB to YUV conversion for YUV formats > - Proper chroma subsampling handling for all supported formats > - Integration with existing stack filter infrastructure via > stack_internal.h > > The implementation follows the established CUDA filter pattern from > vf_scale_cuda.c, using PTX modules for kernel execution and proper > CUDA context management. Copy operations handle frame placement while > color operations fill background areas when using fill colors. > > This enables efficient video composition workflows entirely on GPU > without CPU-GPU memory transfers, significantly improving performance > for multi-input video processing pipelines. > > Examples: > $ ffmpeg -hwaccel cuda -i input.h265 -filter_complex > "[0:v][0:v]hstack_cuda" -c:v hevc_nvenc out.h265 > > $ ffmpeg \ > -hwaccel cuda -i input1.mp4 \ > -hwaccel cuda -i input2.mp4 \ > -hwaccel cuda -i input3.mp4 \ > -hwaccel cuda -i input4.mp4 \ > -filter_complex > "[0:v]hwupload_cuda[0v];[1:v]hwupload_cuda[1v];[2:v]hwupload_cuda[2v];[3:v]hwupload_cuda[3v];[0v][1v][2v][3v]xstack_cuda=inputs=4:fill=black:layout=0_0|w0_0|0_h0|w0_h0" > \ > -c:v hevc_nvenc out.mp4 > > Signed-off-by: Faeez Kadiri <f1k2fa...@gmail.com> > --- > Changelog | 1 + > configure | 6 + > doc/filters.texi | 78 +++++ > libavfilter/Makefile | 3 + > libavfilter/allfilters.c | 3 + > libavfilter/vf_stack_cuda.c | 589 +++++++++++++++++++++++++++++++++++ > libavfilter/vf_stack_cuda.cu | 389 +++++++++++++++++++++++ > 7 files changed, 1069 insertions(+) > create mode 100644 libavfilter/vf_stack_cuda.c > create mode 100644 libavfilter/vf_stack_cuda.cu > > diff --git a/Changelog b/Changelog > index 4217449438..0dec3443d4 100644 > --- a/Changelog > +++ b/Changelog > @@ -18,6 +18,7 @@ version <next>: > - APV encoding support through a libopenapv wrapper > - VVC decoder supports all content of SCC (Screen Content Coding): > IBC (Inter Block Copy), Palette Mode and ACT (Adaptive Color Transform > +- hstack_cuda, vstack_cuda and xstack_cuda filters > > > version 7.1: > diff --git a/configure b/configure > index 3730b0524c..5c2d6e132d 100755 > --- a/configure > +++ b/configure > @@ -4033,6 +4033,12 @@ xfade_vulkan_filter_deps="vulkan spirv_compiler" > yadif_cuda_filter_deps="ffnvcodec" > yadif_cuda_filter_deps_any="cuda_nvcc cuda_llvm" > yadif_videotoolbox_filter_deps="metal corevideo videotoolbox" > +hstack_cuda_filter_deps="ffnvcodec" > +hstack_cuda_filter_deps_any="cuda_nvcc cuda_llvm" > +vstack_cuda_filter_deps="ffnvcodec" > +vstack_cuda_filter_deps_any="cuda_nvcc cuda_llvm" > +xstack_cuda_filter_deps="ffnvcodec" > +xstack_cuda_filter_deps_any="cuda_nvcc cuda_llvm" > hstack_vaapi_filter_deps="vaapi_1" > vstack_vaapi_filter_deps="vaapi_1" > xstack_vaapi_filter_deps="vaapi_1" > diff --git a/doc/filters.texi b/doc/filters.texi > index 6d2df07508..1c9afac9eb 100644 > --- a/doc/filters.texi > +++ b/doc/filters.texi > @@ -26850,6 +26850,84 @@ Only deinterlace frames marked as interlaced. > The default value is @code{all}. > @end table > > +@section hstack_cuda > +Stack input videos horizontally. > + > +This is the CUDA variant of the @ref{vstack} filter, each input stream may > +have different width, this filter will scale down/up each input stream > while > +keeping the orignal aspect. > + > +It accepts the following options: > + > +@table @option > +@item inputs > +See @ref{hstack}. > + > +@item shortest > +See @ref{hstack}. > + > +@item height > +Set height of output. If set to 0, this filter will set height of output > to > +height of the first input stream. Default value is 0. > +@end table > + > +@section vstack_cuda > +Stack input videos vertically. > + > +This is the CUDA variant of the @ref{vstack} filter, each input stream may > +have different width, this filter will scale down/up each input stream > while > +keeping the orignal aspect. > + > +It accepts the following options: > + > +@table @option > +@item inputs > +See @ref{vstack}. > + > +@item shortest > +See @ref{vstack}. > + > +@item width > +Set width of output. If set to 0, this filter will set width of output to > +width of the first input stream. Default value is 0. > +@end table > + > +@section xstack_cuda > +Stack video inputs into custom layout. > + > +This is the CUDA variant of the @ref{xstack} filter, each input stream > may > +have different size, this filter will scale down/up each input stream to > the > +given output size, or the size of the first input stream. > + > +It accepts the following options: > + > +@table @option > +@item inputs > +See @ref{xstack}. > + > +@item shortest > +See @ref{xstack}. > + > +@item layout > +See @ref{xstack}. > +Moreover, this permits the user to supply output size for each input > stream. > +@example > > +xstack_cuda=inputs=4:layout=0_0_1920x1080|0_h0_1920x1080|w0_0_1920x1080|w0_h0_1920x1080 > +@end example > + > +@item grid > +See @ref{xstack}. > + > +@item grid_tile_size > +Set output size for each input stream when @option{grid} is set. If this > option > +is not set, this filter will set output size by default to the size of the > +first input stream. For the syntax of this option, check the > +@ref{video size syntax,,"Video size" section in the ffmpeg-utils > manual,ffmpeg-utils}. > + > +@item fill > +See @ref{xstack}. > +@end table > + > @anchor{CUDA NPP} > @section CUDA NPP > Below is a description of the currently available NVIDIA Performance > Primitives (libnpp) video filters. > diff --git a/libavfilter/Makefile b/libavfilter/Makefile > index 0effe4127f..ad876ccd53 100644 > --- a/libavfilter/Makefile > +++ b/libavfilter/Makefile > @@ -583,6 +583,9 @@ OBJS-$(CONFIG_YAEPBLUR_FILTER) += > vf_yaepblur.o > OBJS-$(CONFIG_ZMQ_FILTER) += f_zmq.o > OBJS-$(CONFIG_ZOOMPAN_FILTER) += vf_zoompan.o > OBJS-$(CONFIG_ZSCALE_FILTER) += vf_zscale.o > +OBJS-$(CONFIG_HSTACK_CUDA_FILTER) += vf_stack_cuda.o > framesync.o vf_stack_cuda.ptx.o cuda/load_helper.o > +OBJS-$(CONFIG_VSTACK_CUDA_FILTER) += vf_stack_cuda.o > framesync.o vf_stack_cuda.ptx.o cuda/load_helper.o > +OBJS-$(CONFIG_XSTACK_CUDA_FILTER) += vf_stack_cuda.o > framesync.o vf_stack_cuda.ptx.o cuda/load_helper.o > OBJS-$(CONFIG_HSTACK_VAAPI_FILTER) += vf_stack_vaapi.o > framesync.o vaapi_vpp.o > OBJS-$(CONFIG_VSTACK_VAAPI_FILTER) += vf_stack_vaapi.o > framesync.o vaapi_vpp.o > OBJS-$(CONFIG_XSTACK_VAAPI_FILTER) += vf_stack_vaapi.o > framesync.o vaapi_vpp.o > diff --git a/libavfilter/allfilters.c b/libavfilter/allfilters.c > index 5ea33cdf01..89a7fb9277 100644 > --- a/libavfilter/allfilters.c > +++ b/libavfilter/allfilters.c > @@ -545,6 +545,9 @@ extern const FFFilter ff_vf_yaepblur; > extern const FFFilter ff_vf_zmq; > extern const FFFilter ff_vf_zoompan; > extern const FFFilter ff_vf_zscale; > +extern const FFFilter ff_vf_hstack_cuda; > +extern const FFFilter ff_vf_vstack_cuda; > +extern const FFFilter ff_vf_xstack_cuda; > extern const FFFilter ff_vf_hstack_vaapi; > extern const FFFilter ff_vf_vstack_vaapi; > extern const FFFilter ff_vf_xstack_vaapi; > diff --git a/libavfilter/vf_stack_cuda.c b/libavfilter/vf_stack_cuda.c > new file mode 100644 > index 0000000000..002602b2bf > --- /dev/null > +++ b/libavfilter/vf_stack_cuda.c > @@ -0,0 +1,589 @@ > +/* > + * Copyright (c) 2025, Faeez Kadiri < f1k2faeez at gmail dot com> > + * > + * This file is part of FFmpeg. > + * > + * Permission is hereby granted, free of charge, to any person obtaining a > + * copy of this software and associated documentation files (the > "Software"), > + * to deal in the Software without restriction, including without > limitation > + * the rights to use, copy, modify, merge, publish, distribute, > sublicense, > + * and/or sell copies of the Software, and to permit persons to whom the > + * Software is furnished to do so, subject to the following conditions: > + * > + * The above copyright notice and this permission notice shall be > included in > + * all copies or substantial portions of the Software. > + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > EXPRESS OR > + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > MERCHANTABILITY, > + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT > SHALL > + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR > OTHER > + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING > + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER > + * DEALINGS IN THE SOFTWARE. > + */ > + > +/** > + * @file > + * Hardware accelerated hstack, vstack and xstack filters based on CUDA > + */ > + > +#include "config_components.h" > + > +#include "libavutil/opt.h" > +#include "libavutil/common.h" > +#include "libavutil/pixdesc.h" > +#include "libavutil/eval.h" > +#include "libavutil/hwcontext.h" > +#include "libavutil/hwcontext_cuda_internal.h" > +#include "libavutil/cuda_check.h" > +#include "libavutil/avstring.h" > +#include "libavutil/avassert.h" > +#include "libavutil/imgutils.h" > +#include "libavutil/mathematics.h" > +#include "libavutil/parseutils.h" > +#include "libavutil/colorspace.h" > +#include "libavutil/mem.h" > + > +#include "filters.h" > +#include "formats.h" > +#include "video.h" > + > +#include "framesync.h" > +#include "cuda/load_helper.h" > + > +static const enum AVPixelFormat supported_formats[] = { > + AV_PIX_FMT_YUV420P, > + AV_PIX_FMT_NV12, > + AV_PIX_FMT_YUV444P, > + AV_PIX_FMT_P010, > + AV_PIX_FMT_P016, > + AV_PIX_FMT_YUV444P16, > + AV_PIX_FMT_0RGB32, > + AV_PIX_FMT_0BGR32, > + AV_PIX_FMT_RGB32, > + AV_PIX_FMT_BGR32, > +}; > + > +#define DIV_UP(a, b) ( ((a) + (b) - 1) / (b) ) > +#define BLOCKX 32 > +#define BLOCKY 16 > + > +#define CHECK_CU(x) FF_CUDA_CHECK_DL(ctx, s->hwctx->internal->cuda_dl, x) > + > +typedef struct CUDAStackContext { > + AVCUDADeviceContext *hwctx; > + CudaFunctions *cuda_dl; > + > + CUcontext cu_ctx; > + CUmodule cu_module; > + CUstream cu_stream; > + > + // For copy operations > + CUfunction cu_func_copy; > + CUfunction cu_func_copy_uv; > + > + // For color operations > + CUfunction cu_func_color; > + CUfunction cu_func_color_uv; > + > + enum AVPixelFormat in_fmt; > + const AVPixFmtDescriptor *in_desc; > + int in_planes; > + int in_plane_depths[4]; > + int in_plane_channels[4]; > + > + uint8_t fillcolor_rgba[4]; > + uint8_t fillcolor_yuv[4]; > +} CUDAStackContext; > + > +#define HSTACK_NAME "hstack_cuda" > +#define VSTACK_NAME "vstack_cuda" > +#define XSTACK_NAME "xstack_cuda" > +#define HWContext CUDAStackContext > +#define StackHWContext StackCudaContext > +#include "stack_internal.h" > + > +typedef struct StackCudaContext { > + StackBaseContext base; > + CUDAStackContext cuda; > +} StackCudaContext; > + > +static int format_is_supported(enum AVPixelFormat fmt) > +{ > + int i; > + > + for (i = 0; i < FF_ARRAY_ELEMS(supported_formats); i++) > + if (supported_formats[i] == fmt) > + return 1; > + return 0; > +} > + > +static void rgb2yuv(float r, float g, float b, int *y, int *u, int *v, > int depth) > +{ > + *y = ((0.21260*219.0/255.0) * r + (0.71520*219.0/255.0) * g + > + (0.07220*219.0/255.0) * b) * ((1 << depth) - 1); > + *u = (-(0.11457*224.0/255.0) * r - (0.38543*224.0/255.0) * g + > + (0.50000*224.0/255.0) * b + 0.5) * ((1 << depth) - 1); > + *v = ((0.50000*224.0/255.0) * r - (0.45415*224.0/255.0) * g - > + (0.04585*224.0/255.0) * b + 0.5) * ((1 << depth) - 1); > +} > + > +static av_cold int cuda_stack_load_functions(AVFilterContext *ctx, enum > AVPixelFormat format) > +{ > + StackCudaContext *sctx = ctx->priv; > + CUDAStackContext *s = &sctx->cuda; > + CUcontext dummy, cuda_ctx = s->hwctx->cuda_ctx; > + CudaFunctions *cu = s->cuda_dl; > + int ret; > + char buf[128]; > + > + const char *fmt_name = av_get_pix_fmt_name(format); > + > + extern const unsigned char ff_vf_stack_cuda_ptx_data[]; > + extern const unsigned int ff_vf_stack_cuda_ptx_len; > + > + ret = CHECK_CU(cu->cuCtxPushCurrent(cuda_ctx)); > + if (ret < 0) > + return ret; > + > + ret = ff_cuda_load_module(ctx, s->hwctx, &s->cu_module, > + ff_vf_stack_cuda_ptx_data, > ff_vf_stack_cuda_ptx_len); > + if (ret < 0) > + goto fail; > + > + // Load copy functions > + snprintf(buf, sizeof(buf), "StackCopy_%s_%s", fmt_name, fmt_name); > + ret = CHECK_CU(cu->cuModuleGetFunction(&s->cu_func_copy, > s->cu_module, buf)); > + if (ret < 0) { > + av_log(ctx, AV_LOG_FATAL, "Unsupported format for copy: %s\n", > fmt_name); > + ret = AVERROR(ENOSYS); > + goto fail; > + } > + > + snprintf(buf, sizeof(buf), "StackCopy_%s_%s_uv", fmt_name, fmt_name); > + ret = CHECK_CU(cu->cuModuleGetFunction(&s->cu_func_copy_uv, > s->cu_module, buf)); > + if (ret < 0) > + goto fail; > + > + // Load color functions > + snprintf(buf, sizeof(buf), "SetColor_%s", fmt_name); > + ret = CHECK_CU(cu->cuModuleGetFunction(&s->cu_func_color, > s->cu_module, buf)); > + if (ret < 0) { > + av_log(ctx, AV_LOG_FATAL, "Unsupported format for color: %s\n", > fmt_name); > + ret = AVERROR(ENOSYS); > + goto fail; > + } > + > + snprintf(buf, sizeof(buf), "SetColor_%s_uv", fmt_name); > + ret = CHECK_CU(cu->cuModuleGetFunction(&s->cu_func_color_uv, > s->cu_module, buf)); > + if (ret < 0) > + goto fail; > + > +fail: > + CHECK_CU(cu->cuCtxPopCurrent(&dummy)); > + return ret; > +} > + > +static av_cold int cuda_stack_color_kernel(AVFilterContext *ctx, > CUfunction func, > + AVFrame *out_frame, const uint8_t *color, > + int width, int height, > + int dst_x, int dst_y, > + int dst_width, int dst_height, int dst_pitch) > +{ > + StackCudaContext *sctx = ctx->priv; > + CUDAStackContext *s = &sctx->cuda; > + CudaFunctions *cu = s->cuda_dl; > + > + CUdeviceptr dst_devptr[4] = { > + (CUdeviceptr)out_frame->data[0], (CUdeviceptr)out_frame->data[1], > + (CUdeviceptr)out_frame->data[2], (CUdeviceptr)out_frame->data[3] > + }; > + > + void *args[] = { > + &dst_devptr[0], &dst_devptr[1], &dst_devptr[2], &dst_devptr[3], > + &width, &height, &dst_pitch, > + &dst_x, &dst_y, > + (void *)&color[0], (void *)&color[1], (void *)&color[2], (void > *)&color[3], > + &dst_width, &dst_height, > + }; > + > + return CHECK_CU(cu->cuLaunchKernel(func, > + DIV_UP(width, BLOCKX), > DIV_UP(height, BLOCKY), 1, > + BLOCKX, BLOCKY, 1, > + 0, s->cu_stream, args, NULL)); > +} > + > +static av_cold int cuda_stack_copy_kernel(AVFilterContext *ctx, > CUfunction func, > + CUtexObject src_tex[4], > + AVFrame *out_frame, > + int width, int height, > + int dst_x, int dst_y, int dst_pitch, > + int src_width, int src_height) > +{ > + StackCudaContext *sctx = ctx->priv; > + CUDAStackContext *s = &sctx->cuda; > + CudaFunctions *cu = s->cuda_dl; > + > + CUdeviceptr dst_devptr[4] = { > + (CUdeviceptr)out_frame->data[0], (CUdeviceptr)out_frame->data[1], > + (CUdeviceptr)out_frame->data[2], (CUdeviceptr)out_frame->data[3] > + }; > + > + void *args[] = { > + &src_tex[0], &src_tex[1], &src_tex[2], &src_tex[3], > + &dst_devptr[0], &dst_devptr[1], &dst_devptr[2], &dst_devptr[3], > + &width, &height, &dst_pitch, > + &dst_x, &dst_y, > + &src_width, &src_height, > + &out_frame->width, &out_frame->height > + }; > + > + return CHECK_CU(cu->cuLaunchKernel(func, > + DIV_UP(width, BLOCKX), > DIV_UP(height, BLOCKY), 1, > + BLOCKX, BLOCKY, 1, > + 0, s->cu_stream, args, NULL)); > +} > + > +static int cuda_stack_color_op(AVFilterContext *ctx, StackItemRegion > *region, AVFrame *out, const uint8_t *color) { > + StackCudaContext *sctx = ctx->priv; > + CUDAStackContext *s = &sctx->cuda; > + CudaFunctions *cu = s->cuda_dl; > + int ret = 0; > + CUcontext dummy; > + > + // Push CUDA context > + ret = CHECK_CU(cu->cuCtxPushCurrent(s->hwctx->cuda_ctx)); > + if (ret < 0) > + return ret; > + > + ret = cuda_stack_color_kernel(ctx, s->cu_func_color, > + out, color, region->width, region->height, > + region->x, region->y, > + out->width, out->height, > + out->linesize[0]); > + if (ret < 0) { > + av_log(ctx, AV_LOG_ERROR, "Error during color operation: %d\n", > ret); > + goto fail; > + } > + > + if (s->in_planes > 1) { > + ret = cuda_stack_color_kernel(ctx, s->cu_func_color_uv, > + out, color, > + AV_CEIL_RSHIFT(region->width, > s->in_desc->log2_chroma_w), > + AV_CEIL_RSHIFT(region->height, > s->in_desc->log2_chroma_h), > + AV_CEIL_RSHIFT(region->x, > s->in_desc->log2_chroma_w), > + AV_CEIL_RSHIFT(region->y, > s->in_desc->log2_chroma_h), > + out->width, out->height, > + out->linesize[1]); > + if (ret < 0) > + av_log(ctx, AV_LOG_ERROR, "Error during color UV operation: > %d\n", ret); > + } > + > +fail: > + CHECK_CU(cu->cuCtxPopCurrent(&dummy)); > + return ret; > +} > + > +static int cuda_stack_copy_op(AVFilterContext *ctx, StackItemRegion > *region, AVFrame *in, AVFrame *out) { > + StackCudaContext *sctx = ctx->priv; > + CUDAStackContext *s = &sctx->cuda; > + CudaFunctions *cu = s->cuda_dl; > + CUtexObject tex[4] = { 0, 0, 0, 0 }; > + int ret = 0; > + int i; > + CUcontext dummy; > + > + // Push CUDA context > + ret = CHECK_CU(cu->cuCtxPushCurrent(s->hwctx->cuda_ctx)); > + if (ret < 0) > + return ret; > + > + for (i = 0; i < s->in_planes; i++) { > + CUDA_TEXTURE_DESC tex_desc = { > + .filterMode = CU_TR_FILTER_MODE_POINT, > + .flags = CU_TRSF_READ_AS_INTEGER, > + }; > + > + CUDA_RESOURCE_DESC res_desc = { > + .resType = CU_RESOURCE_TYPE_PITCH2D, > + .res.pitch2D.format = s->in_plane_depths[i] <= 8 ? > + CU_AD_FORMAT_UNSIGNED_INT8 : > + CU_AD_FORMAT_UNSIGNED_INT16, > + .res.pitch2D.numChannels = s->in_plane_channels[i], > + .res.pitch2D.pitchInBytes = in->linesize[i], > + .res.pitch2D.devPtr = (CUdeviceptr)in->data[i], > + }; > + > + if (i == 1 || i == 2) { > + res_desc.res.pitch2D.width = AV_CEIL_RSHIFT(in->width, > s->in_desc->log2_chroma_w); > + res_desc.res.pitch2D.height = AV_CEIL_RSHIFT(in->height, > s->in_desc->log2_chroma_h); > + } else { > + res_desc.res.pitch2D.width = in->width; > + res_desc.res.pitch2D.height = in->height; > + } > + > + ret = CHECK_CU(cu->cuTexObjectCreate(&tex[i], &res_desc, > &tex_desc, NULL)); > + if (ret < 0) > + goto fail; > + } > + > + ret = cuda_stack_copy_kernel(ctx, s->cu_func_copy, > + tex, out, region->width, region->height, > + region->x, region->y, out->linesize[0], > + in->width, in->height); > + if (ret < 0) { > + av_log(ctx, AV_LOG_ERROR, "Error during copy operation: %d\n", > ret); > + goto fail; > + } > + > + if (s->in_planes > 1) { > + ret = cuda_stack_copy_kernel(ctx, s->cu_func_copy_uv, tex, out, > + AV_CEIL_RSHIFT(region->width, > s->in_desc->log2_chroma_w), > + AV_CEIL_RSHIFT(region->height, > s->in_desc->log2_chroma_h), > + AV_CEIL_RSHIFT(region->x, > s->in_desc->log2_chroma_w), > + AV_CEIL_RSHIFT(region->y, > s->in_desc->log2_chroma_h), > + out->linesize[1], > + AV_CEIL_RSHIFT(in->width, > s->in_desc->log2_chroma_w), > + AV_CEIL_RSHIFT(in->height, > s->in_desc->log2_chroma_h)); > + if (ret < 0) > + av_log(ctx, AV_LOG_ERROR, "Error during copy UV operation: > %d\n", ret); > + } > + > +fail: > + for (i = 0; i < FF_ARRAY_ELEMS(tex); i++) > + if (tex[i]) > + CHECK_CU(cu->cuTexObjectDestroy(tex[i])); > + > + CHECK_CU(cu->cuCtxPopCurrent(&dummy)); > + return ret; > +} > + > +static int process_frame(FFFrameSync *fs) > +{ > + AVFilterContext *ctx = fs->parent; > + StackCudaContext *sctx = fs->opaque; > + CUDAStackContext *s = &sctx->cuda; > + AVFilterLink *outlink = ctx->outputs[0]; > + AVFrame *out_frame = NULL; > + AVFrame *in_frame = NULL; > + int ret = 0; > + > + out_frame = ff_get_video_buffer(outlink, outlink->w, outlink->h); > + if (!out_frame) > + return AVERROR(ENOMEM); > + > + // Fill the entire output frame with fill color if enabled > + if (sctx->base.fillcolor_enable) { > + StackItemRegion full_region = { > + .x = 0, > + .y = 0, > + .width = outlink->w, > + .height = outlink->h > + }; > + > + ret = cuda_stack_color_op(ctx, &full_region, out_frame, > s->fillcolor_yuv); > + if (ret < 0) { > + av_log(ctx, AV_LOG_ERROR, "Failed to fill background > color\n"); > + goto fail; > + } > + } > + > + for (int i = 0; i < ctx->nb_inputs; i++) { > + ret = ff_framesync_get_frame(fs, i, &in_frame, 0); > + if (ret) > + goto fail; > + > + if (i == 0) { > + ret = av_frame_copy_props(out_frame, in_frame); > + if (ret < 0) > + goto fail; > + } > + > + ret = cuda_stack_copy_op(ctx, &sctx->base.regions[i], in_frame, > out_frame); > + if (ret < 0) > + goto fail; > + } > + > + out_frame->pts = av_rescale_q(sctx->base.fs.pts, > sctx->base.fs.time_base, outlink->time_base); > + out_frame->sample_aspect_ratio = outlink->sample_aspect_ratio; > + > + return ff_filter_frame(outlink, out_frame); > + > +fail: > + av_frame_free(&out_frame); > + return ret; > +} > + > +static int config_output(AVFilterLink *outlink) > +{ > + AVFilterContext *ctx = outlink->src; > + StackCudaContext *sctx = ctx->priv; > + CUDAStackContext *s = &sctx->cuda; > + AVFilterLink *inlink0 = ctx->inputs[0]; > + FilterLink *inl0 = ff_filter_link(inlink0); > + FilterLink *outl = ff_filter_link(outlink); > + enum AVPixelFormat in_format; > + int depth = 8, ret; > + AVHWFramesContext *in_frames_ctx; > + AVBufferRef *hw_frames_ctx; > + AVHWFramesContext *out_frames_ctx; > + > + if (inlink0->format != AV_PIX_FMT_CUDA || !inl0->hw_frames_ctx || > !inl0->hw_frames_ctx->data) { > + av_log(ctx, AV_LOG_ERROR, "Software pixel format is not > supported.\n"); > + return AVERROR(EINVAL); > + } > + > + in_frames_ctx = (AVHWFramesContext*)inl0->hw_frames_ctx->data; > + in_format = in_frames_ctx->sw_format; > + > + if (!format_is_supported(in_format)) { > + av_log(ctx, AV_LOG_ERROR, "Unsupported input format: %s\n", > + av_get_pix_fmt_name(in_format)); > + return AVERROR(ENOSYS); > + } > + > + s->in_fmt = in_format; > + s->in_desc = av_pix_fmt_desc_get(s->in_fmt); > + s->in_planes = av_pix_fmt_count_planes(s->in_fmt); > + > + // Set up plane information > + for (int i = 0; i < s->in_desc->nb_components; i++) { > + int d = (s->in_desc->comp[i].depth + 7) / 8; > + int p = s->in_desc->comp[i].plane; > + s->in_plane_channels[p] = FFMAX(s->in_plane_channels[p], > s->in_desc->comp[i].step / d); > + s->in_plane_depths[p] = s->in_desc->comp[i].depth; > + } > + > + s->hwctx = in_frames_ctx->device_ctx->hwctx; > + s->cuda_dl = s->hwctx->internal->cuda_dl; > + s->cu_stream = s->hwctx->stream; > + > + for (int i = 1; i < sctx->base.nb_inputs; i++) { > + AVFilterLink *inlink = ctx->inputs[i]; > + FilterLink *inl = ff_filter_link(inlink); > + AVHWFramesContext *hwfc = NULL; > + > + if (inlink->format != AV_PIX_FMT_CUDA || !inl->hw_frames_ctx || > !inl->hw_frames_ctx->data) { > + av_log(ctx, AV_LOG_ERROR, "Software pixel format is not > supported.\n"); > + return AVERROR(EINVAL); > + } > + > + hwfc = (AVHWFramesContext *)inl->hw_frames_ctx->data; > + > + if (in_frames_ctx->sw_format != hwfc->sw_format) { > + av_log(ctx, AV_LOG_ERROR, "All inputs should have the same > underlying software pixel format.\n"); > + return AVERROR(EINVAL); > + } > + } > + > + if (in_format == AV_PIX_FMT_P010) > + depth = 10; > + > + if (sctx->base.fillcolor_enable) { > + // Check if this is an RGB format > + if (s->in_desc->flags & AV_PIX_FMT_FLAG_RGB) { > + // For RGB formats, use RGB values directly > + s->fillcolor_yuv[0] = sctx->base.fillcolor[0]; // R > + s->fillcolor_yuv[1] = sctx->base.fillcolor[1]; // G > + s->fillcolor_yuv[2] = sctx->base.fillcolor[2]; // B > + s->fillcolor_yuv[3] = sctx->base.fillcolor[3]; // A > + } else { > + // For YUV formats, convert RGB to YUV > + int Y, U, V; > + > + rgb2yuv(sctx->base.fillcolor[0] / 255.0, > sctx->base.fillcolor[1] / 255.0, > + sctx->base.fillcolor[2] / 255.0, &Y, &U, &V, depth); > + s->fillcolor_yuv[0] = Y; > + s->fillcolor_yuv[1] = U; > + s->fillcolor_yuv[2] = V; > + s->fillcolor_yuv[3] = sctx->base.fillcolor[3]; > + } > + } > + > + ret = config_comm_output(outlink); > + if (ret < 0) > + return ret; > + > + ret = cuda_stack_load_functions(ctx, in_format); > + if (ret < 0) > + return ret; > + > + // Initialize hardware frames context for output > + hw_frames_ctx = av_hwframe_ctx_alloc(in_frames_ctx->device_ref); > + if (!hw_frames_ctx) > + return AVERROR(ENOMEM); > + > + out_frames_ctx = (AVHWFramesContext*)hw_frames_ctx->data; > + out_frames_ctx->format = AV_PIX_FMT_CUDA; > + out_frames_ctx->sw_format = in_format; > + out_frames_ctx->width = outlink->w; > + out_frames_ctx->height = outlink->h; > + > + ret = av_hwframe_ctx_init(hw_frames_ctx); > + if (ret < 0) { > + av_buffer_unref(&hw_frames_ctx); > + return ret; > + } > + > + av_buffer_unref(&outl->hw_frames_ctx); > + outl->hw_frames_ctx = hw_frames_ctx; > + > + return 0; > +} > + > +static int cuda_stack_init(AVFilterContext *ctx) > +{ > + int ret; > + > + ret = stack_init(ctx); > + if (ret) > + return ret; > + > + return 0; > +} > + > +static av_cold void cuda_stack_uninit(AVFilterContext *ctx) > +{ > + StackCudaContext *sctx = ctx->priv; > + CUDAStackContext *s = &sctx->cuda; > + > + if (s->hwctx && s->cu_module) { > + CudaFunctions *cu = s->cuda_dl; > + CUcontext dummy; > + > + CHECK_CU(cu->cuCtxPushCurrent(s->hwctx->cuda_ctx)); > + CHECK_CU(cu->cuModuleUnload(s->cu_module)); > + s->cu_module = NULL; > + CHECK_CU(cu->cuCtxPopCurrent(&dummy)); > + } > + > + stack_uninit(ctx); > +} > + > +static const enum AVPixelFormat cuda_stack_pix_fmts[] = { > + AV_PIX_FMT_CUDA, > + AV_PIX_FMT_NONE, > +}; > + > +#include "stack_internal.c" > + > +#if CONFIG_HSTACK_CUDA_FILTER > + > +DEFINE_HSTACK_OPTIONS(cuda); > +DEFINE_STACK_FILTER(hstack, cuda, "CUDA", 0); > + > +#endif > + > +#if CONFIG_VSTACK_CUDA_FILTER > + > +DEFINE_VSTACK_OPTIONS(cuda); > +DEFINE_STACK_FILTER(vstack, cuda, "CUDA", 0); > + > +#endif > + > +#if CONFIG_XSTACK_CUDA_FILTER > + > +DEFINE_XSTACK_OPTIONS(cuda); > +DEFINE_STACK_FILTER(xstack, cuda, "CUDA", 0); > + > +#endif > \ No newline at end of file > diff --git a/libavfilter/vf_stack_cuda.cu b/libavfilter/vf_stack_cuda.cu > new file mode 100644 > index 0000000000..c19595e0a6 > --- /dev/null > +++ b/libavfilter/vf_stack_cuda.cu > @@ -0,0 +1,389 @@ > +/* > + * Copyright (c) 2025, Faeez Kadiri < f1k2faeez at gmail dot com> > + * > + * This file is part of FFmpeg. > + * > + * Permission is hereby granted, free of charge, to any person obtaining a > + * copy of this software and associated documentation files (the > "Software"), > + * to deal in the Software without restriction, including without > limitation > + * the rights to use, copy, modify, merge, publish, distribute, > sublicense, > + * and/or sell copies of the Software, and to permit persons to whom the > + * Software is furnished to do so, subject to the following conditions: > + * > + * The above copyright notice and this permission notice shall be > included in > + * all copies or substantial portions of the Software. > + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > EXPRESS OR > + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > MERCHANTABILITY, > + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT > SHALL > + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR > OTHER > + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING > + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER > + * DEALINGS IN THE SOFTWARE. > + */ > + > +#include "cuda/vector_helpers.cuh" > + > +// --- CONVERSION LOGIC --- > + > +static const ushort mask_10bit = 0xFFC0; > +static const ushort mask_16bit = 0xFFFF; > + > +static inline __device__ ushort conv_8to16(uchar in, ushort mask) > +{ > + return ((ushort)in | ((ushort)in << 8)) & mask; > +} > + > +// FFmpeg passes pitch in bytes, CUDA uses potentially larger types > +#define FIXED_PITCH \ > + (dst_pitch/sizeof(*dst[0])) > + > +#define DEFAULT_DST(n) \ > + dst[n][yo*FIXED_PITCH+xo] > + > +#define OFFSET_DST(n) \ > + dst[n][(yo+dst_y)*FIXED_PITCH+(xo+dst_x)] > + > +// --- COMMON BOUNDS CHECKING --- > + > +#define BOUNDS_CHECK() \ > + if (xo >= width || yo >= height) \ > + return; \ > + int target_x = xo + dst_x; \ > + int target_y = yo + dst_y; \ > + if (target_x < 0 || target_y < 0 || target_x >= frame_width || > target_y >= frame_height) \ > + return; > + > +#define BOUNDS_CHECK_UV(chroma_shift) \ > + if (xo >= width || yo >= height) \ > + return; \ > + int target_x = xo + dst_x; \ > + int target_y = yo + dst_y; \ > + int frame_uv_height = frame_height >> chroma_shift; \ > + if (target_x < 0 || target_y < 0 || target_x >= frame_width || > target_y >= frame_uv_height) \ > + return; > + > +#define COPY_BOUNDS_CHECK() \ > + int target_x = xo + dst_x; \ > + int target_y = yo + dst_y; \ > + if (target_x < 0 || target_y < 0 || target_x >= frame_width || > target_y >= frame_height) \ > + return; > + > +#define COPY_BOUNDS_CHECK_UV(chroma_shift) \ > + int target_x = xo + dst_x; \ > + int target_y = yo + dst_y; \ > + int frame_uv_width = frame_width >> chroma_shift; \ > + int frame_uv_height = frame_height >> chroma_shift; \ > + if (target_x < 0 || target_y < 0 || target_x >= frame_uv_width || > target_y >= frame_uv_height) \ > + return; > + > +// --- COLOR OPERATIONS --- > + > +#define COLOR_DEF_F(N, T) \ > + __device__ static inline void N(T *dst[4], int xo, int yo, \ > + int width, int height, int > dst_pitch, \ > + int dst_x, int dst_y, \ > + unsigned char y_color, unsigned char > u_color, \ > + unsigned char v_color, unsigned char > a_color, \ > + int frame_width, int frame_height) > + > +// Macro for YUV planar formats (420p, 444p, etc.) > +#define DEFINE_SETCOLOR_YUV_PLANAR(name, out_type, out_type_uv, y_assign, > uv_assign) \ > +struct SetColor_##name \ > +{ \ > + typedef out_type out_T; \ > + typedef out_type_uv out_T_uv; \ > + \ > + COLOR_DEF_F(SetColor, out_T) \ > + { \ > + BOUNDS_CHECK(); \ > + OFFSET_DST(0) = y_assign; \ > + } \ > + \ > + COLOR_DEF_F(SetColor_uv, out_T_uv) \ > + { \ > + BOUNDS_CHECK(); \ > + uv_assign; \ > + } \ > +}; > + > +// Macro for NV12-style formats (interleaved UV) > +#define DEFINE_SETCOLOR_NV(name, out_type, out_type_uv, y_assign, > uv_assign) \ > +struct SetColor_##name \ > +{ \ > + typedef out_type out_T; \ > + typedef out_type_uv out_T_uv; \ > + \ > + COLOR_DEF_F(SetColor, out_T) \ > + { \ > + BOUNDS_CHECK(); \ > + OFFSET_DST(0) = y_assign; \ > + } \ > + \ > + COLOR_DEF_F(SetColor_uv, out_T_uv) \ > + { \ > + BOUNDS_CHECK_UV(1); \ > + OFFSET_DST(1) = uv_assign; \ > + } \ > +}; > + > +// Macro for RGB formats > +#define DEFINE_SETCOLOR_RGB(name, out_type, color_assign) \ > +struct SetColor_##name \ > +{ \ > + typedef out_type out_T; \ > + typedef uchar out_T_uv; \ > + \ > + COLOR_DEF_F(SetColor, out_T) \ > + { \ > + BOUNDS_CHECK(); \ > + OFFSET_DST(0) = color_assign; \ > + } \ > + \ > + COLOR_DEF_F(SetColor_uv, out_T_uv) \ > + { \ > + /* No UV plane for RGB formats */ \ > + } \ > +}; > + > +// Define all SetColor structs using macros > +DEFINE_SETCOLOR_YUV_PLANAR(yuv420p, uchar, uchar, y_color, > + OFFSET_DST(1) = u_color; OFFSET_DST(2) = v_color) > + > +DEFINE_SETCOLOR_NV(nv12, uchar, uchar2, y_color, > + make_uchar2(u_color, v_color)) > + > +DEFINE_SETCOLOR_YUV_PLANAR(yuv444p, uchar, uchar, y_color, > + OFFSET_DST(1) = u_color; OFFSET_DST(2) = v_color) > + > +DEFINE_SETCOLOR_NV(p010le, ushort, ushort2, conv_8to16(y_color, > mask_10bit), > + make_ushort2(conv_8to16(u_color, mask_10bit), conv_8to16(v_color, > mask_10bit))) > + > +DEFINE_SETCOLOR_NV(p016le, ushort, ushort2, conv_8to16(y_color, > mask_16bit), > + make_ushort2(conv_8to16(u_color, mask_16bit), conv_8to16(v_color, > mask_16bit))) > + > +DEFINE_SETCOLOR_YUV_PLANAR(yuv444p16le, ushort, ushort, > conv_8to16(y_color, mask_16bit), > + OFFSET_DST(1) = conv_8to16(u_color, mask_16bit); OFFSET_DST(2) = > conv_8to16(v_color, mask_16bit)) > + > +DEFINE_SETCOLOR_RGB(rgb0, uchar4, make_uchar4(y_color, u_color, v_color, > 0)) > +DEFINE_SETCOLOR_RGB(bgr0, uchar4, make_uchar4(v_color, u_color, y_color, > 0)) > +DEFINE_SETCOLOR_RGB(rgba, uchar4, make_uchar4(y_color, u_color, v_color, > a_color)) > +DEFINE_SETCOLOR_RGB(bgra, uchar4, make_uchar4(v_color, u_color, y_color, > a_color)) > + > +// --- COPY OPERATIONS --- > + > +template<typename T> > +using copy_function_t = T (*)(cudaTextureObject_t tex, int xo, int yo, > + int dst_width, int dst_height, > + int src_width, int src_height, > + int bit_depth); > + > +#define COPY_DEF_F(N, T) \ > + template<copy_function_t<in_T> copy_func_y, > \ > + copy_function_t<in_T_uv> copy_func_uv> > \ > + __device__ static inline void N(cudaTextureObject_t src_tex[4], T > *dst[4], int xo, int yo, \ > + int dst_width, int dst_height, > int dst_pitch, \ > + int dst_x, int dst_y, > \ > + int src_width, int src_height, > \ > + int frame_width, int frame_height) > + > +#define COPY_SUB_F(m, plane) \ > + copy_func_##m(src_tex[plane], xo, yo, \ > + dst_width, dst_height, \ > + src_width, src_height, \ > + in_bit_depth) > + > +// Macro for YUV planar copy operations > +#define DEFINE_STACKCOPY_YUV_PLANAR(name, bit_depth, in_type, in_type_uv, > out_type, out_type_uv, chroma_shift) \ > +struct StackCopy_##name \ > +{ \ > + static const int in_bit_depth = bit_depth; \ > + typedef in_type in_T; \ > + typedef in_type_uv in_T_uv; \ > + typedef out_type out_T; \ > + typedef out_type_uv out_T_uv; \ > + \ > + COPY_DEF_F(StackCopy, out_T) \ > + { \ > + COPY_BOUNDS_CHECK(); \ > + OFFSET_DST(0) = COPY_SUB_F(y, 0); \ > + } \ > + \ > + COPY_DEF_F(StackCopy_uv, out_T_uv) \ > + { \ > + COPY_BOUNDS_CHECK_UV(chroma_shift); \ > + OFFSET_DST(1) = COPY_SUB_F(uv, 1); \ > + OFFSET_DST(2) = COPY_SUB_F(uv, 2); \ > + } \ > +}; > + > +// Macro for NV12-style copy operations > +#define DEFINE_STACKCOPY_NV(name, bit_depth, in_type, in_type_uv, > out_type, out_type_uv) \ > +struct StackCopy_##name \ > +{ \ > + static const int in_bit_depth = bit_depth; \ > + typedef in_type in_T; \ > + typedef in_type_uv in_T_uv; \ > + typedef out_type out_T; \ > + typedef out_type_uv out_T_uv; \ > + \ > + COPY_DEF_F(StackCopy, out_T) \ > + { \ > + COPY_BOUNDS_CHECK(); \ > + OFFSET_DST(0) = COPY_SUB_F(y, 0); \ > + } \ > + \ > + COPY_DEF_F(StackCopy_uv, out_T_uv) \ > + { \ > + COPY_BOUNDS_CHECK_UV(1); \ > + OFFSET_DST(1) = COPY_SUB_F(uv, 1); \ > + } \ > +}; > + > +// Macro for RGB copy operations > +#define DEFINE_STACKCOPY_RGB(name, bit_depth, in_type, out_type) \ > +struct StackCopy_##name \ > +{ \ > + static const int in_bit_depth = bit_depth; \ > + typedef in_type in_T; \ > + typedef uchar in_T_uv; \ > + typedef out_type out_T; \ > + typedef uchar out_T_uv; \ > + \ > + COPY_DEF_F(StackCopy, out_T) \ > + { \ > + COPY_BOUNDS_CHECK(); \ > + OFFSET_DST(0) = COPY_SUB_F(y, 0); \ > + } \ > + \ > + COPY_DEF_F(StackCopy_uv, out_T_uv) \ > + { \ > + /* No UV plane for RGB formats */ \ > + } \ > +}; > + > +// Define all StackCopy structs using macros > +DEFINE_STACKCOPY_YUV_PLANAR(yuv420p_yuv420p, 8, uchar, uchar, uchar, > uchar, 1) > +DEFINE_STACKCOPY_NV(nv12_nv12, 8, uchar, uchar2, uchar, uchar2) > +DEFINE_STACKCOPY_YUV_PLANAR(yuv444p_yuv444p, 8, uchar, uchar, uchar, > uchar, 0) > +DEFINE_STACKCOPY_NV(p010le_p010le, 10, ushort, ushort2, ushort, ushort2) > +DEFINE_STACKCOPY_NV(p016le_p016le, 16, ushort, ushort2, ushort, ushort2) > +DEFINE_STACKCOPY_YUV_PLANAR(yuv444p16le_yuv444p16le, 16, ushort, ushort, > ushort, ushort, 0) > +DEFINE_STACKCOPY_RGB(rgb0_rgb0, 8, uchar4, uchar4) > +DEFINE_STACKCOPY_RGB(bgr0_bgr0, 8, uchar4, uchar4) > +DEFINE_STACKCOPY_RGB(rgba_rgba, 8, uchar4, uchar4) > +DEFINE_STACKCOPY_RGB(bgra_bgra, 8, uchar4, uchar4) > + > +// --- COPY LOGIC --- > + > +template<typename T> > +__device__ static inline T StackCopyPixel(cudaTextureObject_t tex, > + int xo, int yo, > + int dst_width, int dst_height, > + int src_width, int src_height, > + int bit_depth) > +{ > + float hscale = (float)src_width / (float)dst_width; > + float vscale = (float)src_height / (float)dst_height; > + float xi = (xo + 0.5f) * hscale; > + float yi = (yo + 0.5f) * vscale; > + > + return tex2D<T>(tex, xi, yi); > +} > + > +/// --- FUNCTION EXPORTS --- > + > +#define COLOR_KERNEL_ARGS(T) \ > + T *dst_0, T *dst_1, T *dst_2, T *dst_3, \ > + int width, int height, int dst_pitch, \ > + int dst_x, int dst_y, \ > + unsigned char y_color, unsigned char u_color, \ > + unsigned char v_color, unsigned char a_color, \ > + int frame_width, int frame_height > + > +#define COLOR_FUNC(SetColorFunc, T) \ > + T *dst[4] = { dst_0, dst_1, dst_2, dst_3 }; \ > + int xo = blockIdx.x * blockDim.x + threadIdx.x; \ > + int yo = blockIdx.y * blockDim.y + threadIdx.y; \ > + if (yo >= height || xo >= width) return; \ > + SetColorFunc( \ > + dst, xo, yo, \ > + width, height, dst_pitch, \ > + dst_x, dst_y, y_color, u_color, v_color, a_color, \ > + frame_width, frame_height); > + > +#define COPY_KERNEL_ARGS(T) \ > + cudaTextureObject_t src_tex_0, cudaTextureObject_t src_tex_1, \ > + cudaTextureObject_t src_tex_2, cudaTextureObject_t src_tex_3, \ > + T *dst_0, T *dst_1, T *dst_2, T *dst_3, \ > + int dst_width, int dst_height, int dst_pitch, \ > + int dst_x, int dst_y, \ > + int src_width, int src_height, \ > + int frame_width, int frame_height > + > +#define COPY_FUNC(StackCopyFunc, T) \ > + cudaTextureObject_t src_tex[4] = \ > + { src_tex_0, src_tex_1, src_tex_2, src_tex_3 }; \ > + T *dst[4] = { dst_0, dst_1, dst_2, dst_3 }; \ > + int xo = blockIdx.x * blockDim.x + threadIdx.x; \ > + int yo = blockIdx.y * blockDim.y + threadIdx.y; \ > + if (yo >= dst_height || xo >= dst_width) return; \ > + StackCopyFunc( \ > + src_tex, dst, xo, yo, \ > + dst_width, dst_height, dst_pitch, \ > + dst_x, dst_y, \ > + src_width, src_height, \ > + frame_width, frame_height); > + > +extern "C" { > + > +#define COLOR_KERNEL(C, S) \ > + __global__ void SetColor_##C##S( \ > + COLOR_KERNEL_ARGS(SetColor_##C::out_T##S)) \ > + { \ > + COLOR_FUNC(SetColor_##C::SetColor##S, SetColor_##C::out_T##S) \ > + } > + > +#define COLOR_KERNEL_RAW(C) \ > + COLOR_KERNEL(C,) \ > + COLOR_KERNEL(C,_uv) > + > +// Define color kernels for all supported formats > +COLOR_KERNEL_RAW(yuv420p) > +COLOR_KERNEL_RAW(nv12) > +COLOR_KERNEL_RAW(yuv444p) > +COLOR_KERNEL_RAW(p010le) > +COLOR_KERNEL_RAW(p016le) > +COLOR_KERNEL_RAW(yuv444p16le) > +COLOR_KERNEL_RAW(rgb0) > +COLOR_KERNEL_RAW(bgr0) > +COLOR_KERNEL_RAW(rgba) > +COLOR_KERNEL_RAW(bgra) > + > +#define COPY_KERNEL(C, S) \ > + __global__ void StackCopy_##C##S( \ > + COPY_KERNEL_ARGS(StackCopy_##C::out_T##S)) \ > + { \ > + COPY_FUNC((StackCopy_##C::StackCopy##S< \ > + StackCopyPixel<StackCopy_##C::in_T>, \ > + StackCopyPixel<StackCopy_##C::in_T_uv> >), \ > + StackCopy_##C::out_T##S) \ > + } > + > +#define COPY_KERNEL_RAW(C) \ > + COPY_KERNEL(C,) \ > + COPY_KERNEL(C,_uv) > + > +// Define copy kernels for all supported formats > +COPY_KERNEL_RAW(yuv420p_yuv420p) > +COPY_KERNEL_RAW(nv12_nv12) > +COPY_KERNEL_RAW(yuv444p_yuv444p) > +COPY_KERNEL_RAW(p010le_p010le) > +COPY_KERNEL_RAW(p016le_p016le) > +COPY_KERNEL_RAW(yuv444p16le_yuv444p16le) > +COPY_KERNEL_RAW(rgb0_rgb0) > +COPY_KERNEL_RAW(bgr0_bgr0) > +COPY_KERNEL_RAW(rgba_rgba) > +COPY_KERNEL_RAW(bgra_bgra) > + > +} > \ No newline at end of file > -- > 2.34.1 > > _______________________________________________ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org https://ffmpeg.org/mailman/listinfo/ffmpeg-devel To unsubscribe, visit link above, or email ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".