Re: [FFmpeg-devel] [PATCH] Add NVENC encoder

Timo Rothenpieler Thu, 27 Nov 2014 03:37:16 -0800

Is it necessary to split the _api part in a separate file? The whole code is
a bit large, but still manageable, and merging the files would avoid some
headers overhead.

Did that primarily to keep things clean and easier to read. I can just as well put it all in one huge file.

I think moving the frei0r rules is supposed to belong in a separate patch.


Probably, will split that out when I get to it.

The usual coding style for structure members and variables in ffmpeg is
names_separated_with_underscores, not uglyCamelCase. (But I believe the
person who will end up maintaining the file should have last word on this.)

Most of this code is ported from a C++ project where everything had to be camel case, so those names and the C++ malloc casts are some of the leftovers.

+        (*head)->surface = surface;
+        return;
+    }
+
+    while ((*head)->next)
+        head = &((*head)->next);


This looks inefficient. Do you have an estimate of the usual size of the
queue?

The maximum is the number of adjacent B-frames plus one, so 3 in the case of NVENC, unless they change the supported GOP patterns.

I suggest you have a look at the dynarray (in libavutil/mem.h and
dynarray.h) API.

If you really need linked lists, you could probably keep the final pointer
to head in the structure to avoid walking the list every time.

I basically need a FIFO-like structure, where I can queue output surfaces until NVENC is done filling them. An array doesn't really reflect that usage, as new elements are added to the front and taken from the back.
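To illustrate the reviewer's tail-pointer suggestion together with the allocation check requested further down, here is a minimal sketch of such a FIFO. The `Surface`, `Node` and `SurfaceFifo` types and both function names are hypothetical stand-ins, not the patch's own, and plain `malloc()`/`free()` are used in place of `av_malloc()`/`av_free()` so the example compiles on its own.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the patch's NvencOutputSurface. */
typedef struct Surface { int id; } Surface;

typedef struct Node {
    Surface *surface;
    struct Node *next;
} Node;

typedef struct SurfaceFifo {
    Node *head; /* oldest element, dequeued first */
    Node *tail; /* newest element, enqueue appends here */
} SurfaceFifo;

/* Append in O(1). Returns 0 on success, -1 if the allocation fails. */
static int fifo_enqueue(SurfaceFifo *q, Surface *s)
{
    Node *n = malloc(sizeof(*n));
    if (!n)
        return -1;
    n->surface = s;
    n->next = NULL;
    if (q->tail)
        q->tail->next = n;
    else
        q->head = n;
    q->tail = n;
    return 0;
}

/* Remove the oldest element. Returns NULL when the queue is empty. */
static Surface *fifo_dequeue(SurfaceFifo *q)
{
    Node *n = q->head;
    Surface *s;
    if (!n)
        return NULL;
    s = n->surface;
    q->head = n->next;
    if (!q->head)
        q->tail = NULL;
    free(n);
    return s;
}
```

With at most three queued surfaces the list walk in the patch costs very little, so the tail pointer is more about clarity than speed. The error return is the part that changes behaviour, since the caller can report the failure instead of dereferencing NULL.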

+    (*head)->next = av_malloc(sizeof(NvencOutputSurfaceList));


av_malloc() return value needs to be checked. Other similar cases below.

Is a simple assert on the return value enough? The encoder can't continue in a sane way anyway if it ever fails.

+    (*head)->next->next = 0;
+    (*head)->next->surface = surface;
+}
+

+static NvencOutputSurface *out_surf_queue_pop(NvencOutputSurfaceList** head)


If you call this one pop instead of shift, people used to Perl will be very
confused.

push/pop is probably not ideal naming, as it reminds too much of a stack, which it isn't.

I renamed it to enqueue/dequeue.

+static void timestamp_list_insert_sorted(NvencTimestampList** head, int64_t timestamp)


Same as before: maybe dynarray would be more efficient, avoiding malloc()
with its huge overhead for every insertion.

Also, if the list is expected to be large, you may consider using a heap
instead of a sorted list.

This definitely can't be replaced. Its purpose isn't just to be a plain list, but to sort the input timestamps, so the dts is still monotonic after re-ordering for B-frames.
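A minimal sketch of the idea described here: input timestamps are kept in ascending order, and each output packet takes the smallest one still pending as its dts. The names `TsNode`, `ts_insert_sorted` and `ts_pop_lowest` are made up for this illustration, and `malloc()` stands in for `av_malloc()`.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct TsNode {
    int64_t ts;
    struct TsNode *next;
} TsNode;

/* Insert while keeping ascending order. Returns 0, or -1 if allocation fails. */
static int ts_insert_sorted(TsNode **head, int64_t ts)
{
    TsNode *n = malloc(sizeof(*n));
    if (!n)
        return -1;
    /* Advance the link pointer until the next node is larger than ts. */
    while (*head && (*head)->ts <= ts)
        head = &(*head)->next;
    n->ts = ts;
    n->next = *head;
    *head = n;
    return 0;
}

/* Remove and return the smallest timestamp. The list must not be empty. */
static int64_t ts_pop_lowest(TsNode **head)
{
    TsNode *n = *head;
    int64_t ts;
    assert(n);
    ts = n->ts;
    *head = n->next;
    free(n);
    return ts;
}
```

Because the walk goes through a pointer to the link rather than to the node, inserting at the front, in the middle and at the end all take the same code path, so no separate empty-list case or `prev` pointer is needed.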

+        av_log(avctx, AV_LOG_FATAL, "Failed creating CUDA context for NVENC\n");


Is there a chance of getting a more detailed error reason?


Only by adding the CUDA error code, which then has to be looked up manually.

+            av_log(avctx, AV_LOG_ERROR, "Preset \"%s\" is unknown!\n", ctx->preset);


Should return an error. And if you use a table with the list of presets, you
can dump the list.

What do you mean by that? Printing which presets are available in the error message?
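One possible reading of the reviewer's suggestion, sketched under assumptions: the preset names are the ones the patch accepts, but the `PresetEntry` type, the integer `id` (a placeholder for the real preset GUID) and both helper functions are invented for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef struct PresetEntry {
    const char *name;
    int id;          /* placeholder for the NVENC preset GUID */
    int low_latency; /* 1 for the ll* presets */
} PresetEntry;

static const PresetEntry presets[] = {
    { "default", 0, 0 },
    { "hp",      1, 0 },
    { "hq",      2, 0 },
    { "bd",      3, 0 },
    { "ll",      4, 1 },
    { "llhp",    5, 1 },
    { "llhq",    6, 1 },
};
#define NB_PRESETS (sizeof(presets) / sizeof(presets[0]))

/* Look a preset up by name. Returns NULL if the name is unknown. */
static const PresetEntry *find_preset(const char *name)
{
    size_t i;
    for (i = 0; i < NB_PRESETS; i++)
        if (!strcmp(presets[i].name, name))
            return &presets[i];
    return NULL;
}

/* Write a space-separated list of all preset names into buf. */
static void list_presets(char *buf, size_t size)
{
    size_t i, len = 0;
    assert(size > 0);
    buf[0] = '\0';
    for (i = 0; i < NB_PRESETS && len < size; i++) {
        int n = snprintf(buf + len, size - len, "%s%s",
                         i ? " " : "", presets[i].name);
        if (n < 0)
            break;
        len += (size_t)n;
    }
}
```

With a table like this, the error path can print the output of `list_presets()` and return an error code, and adding a preset means adding one row rather than another `else if` branch.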

+    ctx->initEncodeParams.darHeight = avctx->height;
+    ctx->initEncodeParams.darWidth = avctx->width;


Was this tested with anamorphic videos?


At least I didn't test that, so I don't think anyone did.
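For anamorphic input the display aspect ratio is the frame size scaled by the sample aspect ratio, which the quoted code does not take into account. The sketch below shows that arithmetic in isolation. `compute_dar` and `gcd64` are hypothetical helpers. In the encoder the SAR would presumably come from `avctx->sample_aspect_ratio`, and the exact meaning NVENC gives to `darWidth`/`darHeight` is an assumption here.

```c
#include <assert.h>
#include <stdint.h>

static int64_t gcd64(int64_t a, int64_t b)
{
    while (b) {
        int64_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* DAR = (width * sar_num) : (height * sar_den), reduced to lowest terms.
 * An unset or invalid SAR is treated as square pixels (1:1). */
static void compute_dar(int width, int height, int sar_num, int sar_den,
                        int *dar_w, int *dar_h)
{
    int64_t w, h, g;
    assert(width > 0 && height > 0);
    if (sar_num <= 0 || sar_den <= 0)
        sar_num = sar_den = 1;
    w = (int64_t)width  * sar_num;
    h = (int64_t)height * sar_den;
    g = gcd64(w, h);
    *dar_w = (int)(w / g);
    *dar_h = (int)(h / g);
}
```

For example, a 720x576 frame with a 16:15 sample aspect ratio reduces to a 4:3 display aspect ratio, whereas passing width and height directly would signal 5:4.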

+
+    if (!ctx->profile) {
+        switch (avctx->profile) {

+            case FF_PROFILE_H264_BASELINE:
+            ctx->profile = av_strdup("baseline");


Need to check the return value.

But it seems you have the private option "profile" conflicting with the
global option "profile", which is confusing, and possibly problematic, for
users.

Not entirely sure why I did it this way. I copied it straight from the libx264 encoder without thinking too much about it. I can set the profileGUID straight from that switch and remove the second profile variable (which the libx264 encoder has in exactly the same conflicting way) entirely.

+    switch (lockParams.pictureType) {
+        case NV_ENC_PIC_TYPE_IDR:
+        pkt->flags |= AV_PKT_FLAG_KEY;
+        case NV_ENC_PIC_TYPE_I:
+        avctx->coded_frame->pict_type = AV_PICTURE_TYPE_I;
+        break;
+
+        case NV_ENC_PIC_TYPE_P:
+        avctx->coded_frame->pict_type = AV_PICTURE_TYPE_P;
+        break;
+
+        case NV_ENC_PIC_TYPE_B:
+        avctx->coded_frame->pict_type = AV_PICTURE_TYPE_B;
+        break;
+
+        case NV_ENC_PIC_TYPE_BI:
+        avctx->coded_frame->pict_type = AV_PICTURE_TYPE_BI;
+        break;
+

+        default:
+        avctx->coded_frame->pict_type = AV_PICTURE_TYPE_NONE;


Does this happen normally?


Not that I'm aware of. But I don't know what else to assume in this case.

+        break;
+    }

+        for (i = 0; i < ctx->maxSurfaceCount; ++i)
+            if (!ctx->inputSurfaces[i].lockCount)
+                inSurf = &ctx->inputSurfaces[i];


Maybe a break here.

Yes
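A small sketch of the search with an early exit, assuming a simplified `InputSurface` type with only a lock counter. The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the patch's NvencInputSurface. */
typedef struct InputSurface { int lock_count; } InputSurface;

/* Return the first unlocked surface, or NULL if all are in use. */
static InputSurface *find_free_surface(InputSurface *surfaces, int count)
{
    int i;
    assert(surfaces || count == 0);
    for (i = 0; i < count; i++)
        if (!surfaces[i].lock_count)
            return &surfaces[i]; /* stop at the first match */
    return NULL;
}
```

Without the early exit the quoted loop keeps scanning and ends up with the last free surface rather than the first. Either result is a valid free surface, so the break saves iterations rather than fixing a fault.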

+        av_assert0(inSurf);


Are you positively sure that an input surface will always be available?

The maximum supported number of surfaces is allocated. If it ever ran out, there would be a bug in the code managing the surfaces.

+            uint8_t *buf = lockBufferParams.bufferDataPtr;
+
+            av_image_copy_plane(buf, lockBufferParams.pitch,
+                frame->data[0], frame->linesize[0],
+                avctx->width, avctx->height);
+
+            buf += inSurf->height * lockBufferParams.pitch;


Could be factored out, unless I am missing something.

I could do the absolute calculation in each copy_plane call, but it would be way harder to read then. The compiler should take care of optimizing this out.
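The running-pointer pattern under discussion can be sketched for an NV12 surface as follows. `copy_plane` imitates what `av_image_copy_plane()` does in the quoted code, `copy_nv12` and its parameter names are invented for this example, and even frame dimensions are assumed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copy `height` rows of `bytewidth` bytes between buffers with different pitches. */
static void copy_plane(uint8_t *dst, int dst_pitch,
                       const uint8_t *src, int src_pitch,
                       int bytewidth, int height)
{
    int y;
    assert(dst_pitch >= bytewidth && src_pitch >= bytewidth);
    for (y = 0; y < height; y++) {
        memcpy(dst, src, (size_t)bytewidth);
        dst += dst_pitch;
        src += src_pitch;
    }
}

/* Copy an NV12 frame: the luma plane, then the interleaved UV plane, which
 * starts surf_height rows into the destination surface. */
static void copy_nv12(uint8_t *dst, int dst_pitch, int surf_height,
                      const uint8_t *luma, int luma_pitch,
                      const uint8_t *chroma, int chroma_pitch,
                      int width, int height)
{
    assert(width % 2 == 0 && height % 2 == 0);
    copy_plane(dst, dst_pitch, luma, luma_pitch, width, height);
    dst += (size_t)surf_height * dst_pitch; /* advance past the luma plane */
    copy_plane(dst, dst_pitch, chroma, chroma_pitch, width, height / 2);
}
```

Advancing `dst` once per plane keeps each copy call readable, which is the argument made above. The offset uses the surface height rather than the frame height because the surface may be allocated taller than the picture.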

+        if (i == ctx->maxSurfaceCount) {
+            inSurf->lockCount = 0;

+            av_log(avctx, AV_LOG_ERROR, "No free output surface found!\n");
+            return 0;


Proper error code?


Not intended to be a fatal error, the frame would just be dropped.

This case should never happen anyway, and if it does, something is very wrong. So probably better a fatal error here.

+        }
+
+        ctx->outputSurfaces[i].inputSurface = inSurf;
+
+        picParams.inputBuffer = inSurf->inputSurface;
+        picParams.bufferFmt = inSurf->format;
+        picParams.inputWidth = avctx->width;
+        picParams.inputHeight = avctx->height;
+        picParams.outputBitstream = ctx->outputSurfaces[i].outputSurface;
+        picParams.completionEvent = 0;
+
+        if (avctx->flags & CODEC_FLAG_INTERLACED_DCT) {
+            if (frame->top_field_first) {
+                picParams.pictureStruct = NV_ENC_PIC_STRUCT_FIELD_TOP_BOTTOM;
+            } else {
+                picParams.pictureStruct = NV_ENC_PIC_STRUCT_FIELD_BOTTOM_TOP;
+            }
+        } else {
+            picParams.pictureStruct = NV_ENC_PIC_STRUCT_FRAME;
+        }
+
+        picParams.encodePicFlags = 0;
+        picParams.inputTimeStamp = frame->pts;
+        picParams.inputDuration = 0;
+        picParams.codecPicParams.h264PicParams.sliceMode = ctx->encodeConfig.encodeCodecConfig.h264Config.sliceMode;
+        picParams.codecPicParams.h264PicParams.sliceModeData = ctx->encodeConfig.encodeCodecConfig.h264Config.sliceModeData;
+        memcpy(&picParams.rcParams, &ctx->encodeConfig.rcParams, sizeof(NV_ENC_RC_PARAMS));
+
+        timestamp_list_insert_sorted(&ctx->timestampList, frame->pts);
+    } else {
+        picParams.encodePicFlags = NV_ENC_PIC_FLAG_EOS;
+    }
+
+    nvStatus = ff_pNvEnc->nvEncEncodePicture(ctx->nvencoder, &picParams);
+
+    if (frame && nvStatus == NV_ENC_ERR_NEED_MORE_INPUT) {
+        out_surf_queue_push(&ctx->outputSurfaceQueue, &ctx->outputSurfaces[i]);
+        ctx->outputSurfaces[i].busy = 1;
+    }
+
+    if (nvStatus != NV_ENC_SUCCESS && nvStatus != NV_ENC_ERR_NEED_MORE_INPUT) {
+        av_log(avctx, AV_LOG_ERROR, "EncodePicture failed!\n");
+        return AVERROR_EXTERNAL;
+    }
+
+    if (nvStatus != NV_ENC_ERR_NEED_MORE_INPUT) {
+        while (ctx->outputSurfaceQueue) {
+            tmpoutsurf = out_surf_queue_pop(&ctx->outputSurfaceQueue);
+            out_surf_queue_push(&ctx->outputSurfaceReadyQueue, tmpoutsurf);
+        }
+
+        if (frame) {
+            out_surf_queue_push(&ctx->outputSurfaceReadyQueue, &ctx->outputSurfaces[i]);
+            ctx->outputSurfaces[i].busy = 1;
+        }
+    }
+
+    if (ctx->outputSurfaceReadyQueue) {
+        tmpoutsurf = out_surf_queue_pop(&ctx->outputSurfaceReadyQueue);
+
+        *got_packet = process_output_surface(avctx, pkt, avctx->coded_frame, tmpoutsurf);
+
+        tmpoutsurf->busy = 0;
+        av_assert0(tmpoutsurf->inputSurface->lockCount);
+        tmpoutsurf->inputSurface->lockCount--;
+    }
+
+    return 0;
+}
+
+static int pix_fmts_nvenc_initialized;
+
+static enum AVPixelFormat pix_fmts_nvenc[] = {
+    AV_PIX_FMT_NV12,
+    AV_PIX_FMT_NONE,
+    AV_PIX_FMT_NONE,
+    AV_PIX_FMT_NONE
+};
+
+static av_cold void nvenc_init_static(AVCodec *codec)
+{
+    NV_ENC_OPEN_ENCODE_SESSION_EX_PARAMS stEncodeSessionParams = { 0 };
+    CUcontext cuctxcur = 0, cuctx = 0;
+    NVENCSTATUS nvStatus;
+    void *nvencoder = 0;
+    GUID encodeGuid = NV_ENC_CODEC_H264_GUID;
+    GUID license = dummy_license;
+    int i = 0, pos = 0;
+    int gotnv12 = 0, got420 = 0, got444 = 0;
+    uint32_t inputFmtCount = 32;
+    NV_ENC_BUFFER_FORMAT inputFmts[32];
+
+    for (i = 0; i < 32; ++i)
+        inputFmts[i] = (NV_ENC_BUFFER_FORMAT)0;
+    i = 0;
+
+    if (pix_fmts_nvenc_initialized) {
+        codec->pix_fmts = pix_fmts_nvenc;
+        return;
+    }
+
+    if (!ff_nvenc_dyload_nvenc(0)) {
+        pix_fmts_nvenc_initialized = 1;
+        return;
+    }
+
+    stEncodeSessionParams.version = NV_ENC_OPEN_ENCODE_SESSION_EX_PARAMS_VER;
+    stEncodeSessionParams.apiVersion = NVENCAPI_VERSION;
+    stEncodeSessionParams.clientKeyPtr = &license;
+
+    cuctx = 0;

+    if (ff_cuCtxCreate(&cuctx, 0, ff_pNvencDevices[ff_iNvencUseDeviceID]) != CUDA_SUCCESS) {


It would probably be better to get ff_cuCtxCreate() return an AVERROR code
instead of a CUDA error code. Same for all ff_ helper functions.


ff_cuCtxCreate is a library function loaded from the CUDA dll/so.

Same for all the other ff_cu* functions: there is no way to change what they return, as they're not my functions.

Some of these options are redundant with global ones; "profile" already
cited, "2pass" = -flags +pass1/+pass2; "cqp" = "global_quality".

Profile is entirely obsolete, just copied it from x264, which also has that redundant option.

The twopass for nvenc means something entirely different, so using the global option would be bad. It does not support the normal two-pass encoding that, for example, libx264 does. TWO_PASS is just another CBR rate control mode. I honestly have no idea what it does internally, but it boosts the quality quite a bit.


Didn't know about that global_quality option, will use that instead.

+#define ifav_log(...) if (avctx) { av_log(__VA_ARGS__); }


Looks strange: why no error message when there is no context?


Is it possible to call av_log without a context?

Only added that because of the static init function, which doesn't have an avctx yet.

+
+static int nvenc_dyload_cuda(AVCodecContext *avctx)
+{

+    if (cudaLib)
+        return 1;


Thread safe?

No, does it need to be? Can multiple threads create the codec at the same time?

If so, the nvenc init functions need a mutex.

+        ifav_log(avctx, AV_LOG_FATAL, ">> %s - failed with error code 0x%x\n", func, err);


The library does not provide error code -> string utility?


Unfortunately it doesn't. Same for CUDA and NVENC.
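If the libraries offer no such helper, a small local table is one way to make the log messages readable. The sketch below covers only a handful of CUDA driver result codes. The numeric values are given as recalled from `cuda.h` and should be checked against the header in use, and `cuda_error_name` is a hypothetical function, not part of CUDA or of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct ErrName {
    int code;
    const char *name;
} ErrName;

/* Partial list; values assumed from cuda.h, verify before relying on them. */
static const ErrName cuda_err_names[] = {
    {   0, "CUDA_SUCCESS" },
    {   1, "CUDA_ERROR_INVALID_VALUE" },
    {   2, "CUDA_ERROR_OUT_OF_MEMORY" },
    {   3, "CUDA_ERROR_NOT_INITIALIZED" },
    { 100, "CUDA_ERROR_NO_DEVICE" },
    { 101, "CUDA_ERROR_INVALID_DEVICE" },
};

/* Map a result code to its name, with a fallback for codes not in the table. */
static const char *cuda_error_name(int code)
{
    size_t i;
    for (i = 0; i < sizeof(cuda_err_names) / sizeof(cuda_err_names[0]); i++)
        if (cuda_err_names[i].code == code)
            return cuda_err_names[i].name;
    return "unknown CUDA error";
}
```

Logging both the name and the raw hex code keeps the message useful for codes the table does not cover.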

+    if (nvenc_init_count <= 0)
+        return;
+
+    nvenc_init_count--;


This looks not thread safe.

No, it isn't. The entire load/unload functions need to be locked if it has to be.
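A minimal sketch of locking the whole load/unload path around a reference count, assuming POSIX threads are available. `lib_acquire`, `lib_release`, `do_load` and `do_unload` are hypothetical names, and the `lib_loaded` flag stands in for the real `dlopen()`/`LoadLibrary()` handle. A Windows build would need a different lock primitive.

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t lib_lock = PTHREAD_MUTEX_INITIALIZER;
static int lib_refcount;
static int lib_loaded; /* stands in for the library handle */

/* Hypothetical stand-ins for the real load and unload work. */
static int  do_load(void)   { lib_loaded = 1; return 1; }
static void do_unload(void) { lib_loaded = 0; }

/* Load on first use and count the user. Returns 1 on success, 0 on failure. */
static int lib_acquire(void)
{
    int ok = 1;
    pthread_mutex_lock(&lib_lock);
    if (lib_refcount == 0)
        ok = do_load();
    if (ok)
        lib_refcount++;
    pthread_mutex_unlock(&lib_lock);
    return ok;
}

/* Drop one user and unload when the last one is gone. */
static void lib_release(void)
{
    pthread_mutex_lock(&lib_lock);
    assert(lib_refcount >= 0);
    if (lib_refcount > 0 && --lib_refcount == 0)
        do_unload();
    pthread_mutex_unlock(&lib_lock);
}
```

Holding the lock across both the check and the load closes the window in which two threads could each see a zero count and load the library twice.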

Fixed and refactored this quite a bit now. Not all issues are addressed yet. (I also haven't tested whether this still works, so don't use this patch for anything.)

commit bd4ad9018bc86096de1991068b60fdde16b19543
Author: Timo Rothenpieler <[email protected]>
Date:   Wed Nov 26 11:08:11 2014 +0100

    Add NVENC encoder

diff --git a/Changelog b/Changelog
index 7172d0c..d26b7fa 100644
--- a/Changelog
+++ b/Changelog
@@ -17,6 +17,7 @@ version <next>:
 - WebP muxer with animated WebP support
 - zygoaudio decoding support
 - APNG demuxer
+- nvenc encoder
 
 
 version 2.4:
diff --git a/configure b/configure
index 38619c4..05bce5d 100755
--- a/configure
+++ b/configure
@@ -261,6 +261,7 @@ External library support:
   --enable-libzvbi         enable teletext support via libzvbi [no]
   --disable-lzma           disable lzma [autodetect]
   --enable-decklink        enable Blackmagick DeckLink I/O support [no]
+  --enable-nvenc           enable NVIDIA NVENC support [no]
   --enable-openal          enable OpenAL 1.1 capture support [no]
   --enable-opencl          enable OpenCL code
   --enable-opengl          enable OpenGL rendering [no]
@@ -1393,6 +1394,7 @@ EXTERNAL_LIBRARY_LIST="
     libzmq
     libzvbi
     lzma
+    nvenc
     openal
     opencl
     opengl
@@ -2389,6 +2391,7 @@ libxvid_encoder_deps="libxvid"
 libutvideo_decoder_deps="libutvideo"
 libutvideo_encoder_deps="libutvideo"
 libzvbi_teletext_decoder_deps="libzvbi"
+nvenc_encoder_deps="nvenc"
 
 # demuxers / muxers
 ac3_demuxer_select="ac3_parser"
@@ -2569,9 +2572,7 @@ drawtext_filter_deps="libfreetype"
 ebur128_filter_deps="gpl"
 flite_filter_deps="libflite"
 frei0r_filter_deps="frei0r dlopen"
-frei0r_filter_extralibs='$ldl'
 frei0r_src_filter_deps="frei0r dlopen"
-frei0r_src_filter_extralibs='$ldl'
 geq_filter_deps="gpl"
 histeq_filter_deps="gpl"
 hqdn3d_filter_deps="gpl"
@@ -4344,6 +4345,7 @@ die_license_disabled gpl x11grab
 
 die_license_disabled nonfree libaacplus
 die_license_disabled nonfree libfaac
+die_license_disabled nonfree nvenc
 enabled gpl && die_license_disabled_gpl nonfree libfdk_aac
 enabled gpl && die_license_disabled_gpl nonfree openssl
 
@@ -4650,6 +4652,11 @@ elif check_func dlopen -ldl; then
     ldl=-ldl
 fi
 
+# set a few flags which depend on ldl and can't be set earlier
+nvenc_encoder_extralibs='$ldl'
+frei0r_filter_extralibs='$ldl'
+frei0r_src_filter_extralibs='$ldl'
+
 if ! disabled network; then
     check_func getaddrinfo $network_extralibs
     check_func getservbyport $network_extralibs
@@ -4913,6 +4920,7 @@ enabled libxavs           && require libxavs xavs.h xavs_encoder_encode -lxavs
 enabled libxvid           && require libxvid xvid.h xvid_global -lxvidcore
 enabled libzmq            && require_pkg_config libzmq zmq.h zmq_ctx_new
 enabled libzvbi           && require libzvbi libzvbi.h vbi_decoder_new -lzvbi
+enabled nvenc             && { check_header nvEncodeAPI.h || die "ERROR: nvEncodeAPI.h not found."; }
 enabled openal            && { { for al_libs in "${OPENAL_LIBS}" "-lopenal" "-lOpenAL32"; do
                                check_lib 'AL/al.h' alGetError "${al_libs}" && break; done } ||
                                die "ERROR: openal not found"; } &&
diff --git a/libavcodec/Makefile b/libavcodec/Makefile
index fa0f53d..cc393f9 100644
--- a/libavcodec/Makefile
+++ b/libavcodec/Makefile
@@ -347,6 +347,7 @@ OBJS-$(CONFIG_MXPEG_DECODER)           += mxpegdec.o
 OBJS-$(CONFIG_NELLYMOSER_DECODER)      += nellymoserdec.o nellymoser.o
 OBJS-$(CONFIG_NELLYMOSER_ENCODER)      += nellymoserenc.o nellymoser.o
 OBJS-$(CONFIG_NUV_DECODER)             += nuv.o rtjpeg.o
+OBJS-$(CONFIG_NVENC_ENCODER)           += nvenc.o
 OBJS-$(CONFIG_ON2AVC_DECODER)          += on2avc.o on2avcdata.o
 OBJS-$(CONFIG_OPUS_DECODER)            += opusdec.o opus.o opus_celt.o \
                                           opus_imdct.o opus_silk.o     \
diff --git a/libavcodec/allcodecs.c b/libavcodec/allcodecs.c
index 0d39d33..8ceee2f 100644
--- a/libavcodec/allcodecs.c
+++ b/libavcodec/allcodecs.c
@@ -223,6 +223,7 @@ void avcodec_register_all(void)
     REGISTER_DECODER(MVC2,              mvc2);
     REGISTER_DECODER(MXPEG,             mxpeg);
     REGISTER_DECODER(NUV,               nuv);
+    REGISTER_ENCODER(NVENC,             nvenc);
     REGISTER_DECODER(PAF_VIDEO,         paf_video);
     REGISTER_ENCDEC (PAM,               pam);
     REGISTER_ENCDEC (PBM,               pbm);
diff --git a/libavcodec/nvenc.c b/libavcodec/nvenc.c
new file mode 100644
index 0000000..79c2497
--- /dev/null
+++ b/libavcodec/nvenc.c
@@ -0,0 +1,1203 @@
+/*
+ * H.264 hardware encoding using nvidia nvenc
+ * Copyright (c) 2014 Timo Rothenpieler <[email protected]>
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#ifdef _WIN32
+#include <windows.h>
+#else
+#include <dlfcn.h>
+#endif
+
+#include <nvEncodeAPI.h>
+
+#include "libavutil/internal.h"
+#include "libavutil/imgutils.h"
+#include "libavutil/avassert.h"
+#include "libavutil/opt.h"
+#include "libavutil/mem.h"
+#include "avcodec.h"
+#include "internal.h"
+
+#ifdef _WIN32
+#define CUDAAPI __stdcall
+#else
+#define CUDAAPI
+#endif
+
+typedef enum cudaError_enum {
+    CUDA_SUCCESS = 0
+} CUresult;
+typedef int CUdevice;
+typedef void* CUcontext;
+
+typedef CUresult(CUDAAPI *PCUINIT)(unsigned int Flags);
+typedef CUresult(CUDAAPI *PCUDEVICEGETCOUNT)(int *count);
+typedef CUresult(CUDAAPI *PCUDEVICEGET)(CUdevice *device, int ordinal);
+typedef CUresult(CUDAAPI *PCUDEVICEGETNAME)(char *name, int len, CUdevice dev);
+typedef CUresult(CUDAAPI *PCUDEVICECOMPUTECAPABILITY)(int *major, int *minor, CUdevice dev);
+typedef CUresult(CUDAAPI *PCUCTXCREATE)(CUcontext *pctx, unsigned int flags, CUdevice dev);
+typedef CUresult(CUDAAPI *PCUCTXPOPCURRENT)(CUcontext *pctx);
+typedef CUresult(CUDAAPI *PCUCTXDESTROY)(CUcontext ctx);
+
+typedef NVENCSTATUS (NVENCAPI* PNVENCODEAPICREATEINSTANCE)(NV_ENCODE_API_FUNCTION_LIST *functionList);
+
+static const GUID dummy_license = { 0x0, 0x0, 0x0, { 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 } };
+
+static PCUINIT cu_init = 0;
+static PCUDEVICEGETCOUNT cu_device_get_count = 0;
+static PCUDEVICEGET cu_device_get = 0;
+static PCUDEVICEGETNAME cu_device_get_name = 0;
+static PCUDEVICECOMPUTECAPABILITY cu_device_compute_capability = 0;
+static PCUCTXCREATE cu_ctx_create = 0;
+static PCUCTXPOPCURRENT cu_ctx_pop_current = 0;
+static PCUCTXDESTROY cu_ctx_destroy = 0;
+
+static int nvenc_init_count;
+static NV_ENCODE_API_FUNCTION_LIST nvenc_funcs;
+static NV_ENCODE_API_FUNCTION_LIST *p_nvenc = 0;
+static int nvenc_device_count = 0;
+static CUdevice nvenc_devices[16];
+static unsigned int nvenc_use_device_id = 0;
+
+#ifdef _WIN32
+#define LOAD_FUNC(l, s) GetProcAddress(l, s)
+#define DL_CLOSE_FUNC(l) FreeLibrary(l)
+static HMODULE cuda_lib;
+static HMODULE nvenc_lib;
+#else
+#define LOAD_FUNC(l, s) dlsym(l, s)
+#define DL_CLOSE_FUNC(l) dlclose(l)
+static void *cuda_lib;
+static void *nvenc_lib;
+#endif
+
+#define ifav_log(...) if (avctx) { av_log(__VA_ARGS__); }
+
+typedef struct NvencInputSurface
+{
+    NV_ENC_INPUT_PTR input_surface;
+    int width;
+    int height;
+
+    int lockCount;
+
+    NV_ENC_BUFFER_FORMAT format;
+} NvencInputSurface;
+
+typedef struct NvencOutputSurface
+{
+    NV_ENC_OUTPUT_PTR output_surface;
+    int size;
+
+    NvencInputSurface *input_surface;
+
+    int busy;
+} NvencOutputSurface;
+
+typedef struct NvencOutputSurfaceList
+{
+    NvencOutputSurface *surface;
+    struct NvencOutputSurfaceList *next;
+} NvencOutputSurfaceList;
+
+typedef struct NvencTimestampList
+{
+    int64_t timestamp;
+    struct NvencTimestampList *next;
+} NvencTimestampList;
+
+typedef struct NvencContext
+{
+    AVClass *avclass;
+
+    NV_ENC_INITIALIZE_PARAMS init_encode_params;
+    NV_ENC_CONFIG encode_config;
+    CUcontext cu_context;
+
+    int max_surface_count;
+    NvencInputSurface *input_surfaces;
+    NvencOutputSurface *output_surfaces;
+
+    NvencOutputSurfaceList *output_surface_queue;
+    NvencOutputSurfaceList *output_surface_ready_queue;
+    NvencTimestampList *timestamp_list;
+    int64_t last_dts;
+
+    void *nvencoder;
+
+    char *preset;
+    int cbr;
+    int twopass;
+    int gobpattern;
+} NvencContext;
+
+#define CHECK_LOAD_FUNC(t, f, s) \
+do { \
+    f = (t)LOAD_FUNC(cuda_lib, s); \
+    if (!f) { \
+        ifav_log(avctx, AV_LOG_FATAL, "Failed loading %s from CUDA library\n", s); \
+        goto error; \
+    } \
+} while(0)
+
+static int nvenc_dyload_cuda(AVCodecContext *avctx)
+{
+    if (cuda_lib)
+        return 1;
+
+#if defined(_WIN32)
+    cuda_lib = LoadLibrary(TEXT("nvcuda.dll"));
+#elif defined(__CYGWIN__)
+    cuda_lib = dlopen("nvcuda.dll", RTLD_LAZY);
+#else
+    cuda_lib = dlopen("libcuda.so", RTLD_LAZY);
+#endif
+
+    if (!cuda_lib) {
+        ifav_log(avctx, AV_LOG_FATAL, "Failed loading CUDA library\n");
+        goto error;
+    }
+
+    CHECK_LOAD_FUNC(PCUINIT, cu_init, "cuInit");
+    CHECK_LOAD_FUNC(PCUDEVICEGETCOUNT, cu_device_get_count, "cuDeviceGetCount");
+    CHECK_LOAD_FUNC(PCUDEVICEGET, cu_device_get, "cuDeviceGet");
+    CHECK_LOAD_FUNC(PCUDEVICEGETNAME, cu_device_get_name, "cuDeviceGetName");
+    CHECK_LOAD_FUNC(PCUDEVICECOMPUTECAPABILITY, cu_device_compute_capability, "cuDeviceComputeCapability");
+    CHECK_LOAD_FUNC(PCUCTXCREATE, cu_ctx_create, "cuCtxCreate_v2");
+    CHECK_LOAD_FUNC(PCUCTXPOPCURRENT, cu_ctx_pop_current, "cuCtxPopCurrent_v2");
+    CHECK_LOAD_FUNC(PCUCTXDESTROY, cu_ctx_destroy, "cuCtxDestroy_v2");
+
+    return 1;
+
+error:
+
+    if (cuda_lib)
+        DL_CLOSE_FUNC(cuda_lib);
+
+    cuda_lib = NULL;
+
+    return 0;
+}
+
+static int check_cuda_errors(AVCodecContext *avctx, CUresult err, const char *func)
+{
+    if (err != CUDA_SUCCESS) {
+        ifav_log(avctx, AV_LOG_FATAL, ">> %s - failed with error code 0x%x\n", func, err);
+        return 0;
+    }
+    return 1;
+}
+#define check_cuda_errors(f) if (!check_cuda_errors(avctx, f, #f)) goto error
+
+static int nvenc_check_cuda(AVCodecContext *avctx)
+{
+    int deviceCount = 0;
+    CUdevice cuDevice = 0;
+    char gpu_name[128];
+    int smminor = 0, smmajor = 0;
+    int i, smver;
+
+    if (!nvenc_dyload_cuda(avctx))
+        return 0;
+
+    if (nvenc_device_count > 0)
+        return 1;
+
+    check_cuda_errors(cu_init(0));
+
+    check_cuda_errors(cu_device_get_count(&deviceCount));
+
+    if (!deviceCount) {
+        ifav_log(avctx, AV_LOG_FATAL, "No CUDA capable devices found\n");
+        goto error;
+    }
+
+    ifav_log(avctx, AV_LOG_VERBOSE, "%d CUDA capable devices found\n", deviceCount);
+
+    nvenc_device_count = 0;
+
+    for (i = 0; i < deviceCount; ++i) {
+        check_cuda_errors(cu_device_get(&cuDevice, i));
+        check_cuda_errors(cu_device_get_name(gpu_name, sizeof(gpu_name), cuDevice));
+        check_cuda_errors(cu_device_compute_capability(&smmajor, &smminor, cuDevice));
+
+        smver = (smmajor << 4) | smminor;
+
+        ifav_log(avctx, AV_LOG_VERBOSE, "[ GPU #%d - < %s > has Compute SM %d.%d, NVENC %s ]\n", i, gpu_name, smmajor, smminor, (smver >= 0x30) ? "Available" : "Not Available");
+
+        if (smver >= 0x30)
+            nvenc_devices[nvenc_device_count++] = cuDevice;
+    }
+
+    if (!nvenc_device_count) {
+        ifav_log(avctx, AV_LOG_FATAL, "No NVENC capable devices found\n");
+        goto error;
+    }
+
+    return 1;
+
+error:
+
+    nvenc_device_count = 0;
+
+    return 0;
+}
+
+static int nvenc_dyload_nvenc(AVCodecContext *avctx)
+{
+    PNVENCODEAPICREATEINSTANCE nvEncodeAPICreateInstance = 0;
+    NVENCSTATUS nvstatus;
+
+    if (!nvenc_check_cuda(avctx))
+        return 0;
+
+    if (p_nvenc) {
+        nvenc_init_count++;
+        return 1;
+    }
+
+#if defined(_WIN32)
+    if (sizeof(void*) == 8) {
+        nvenc_lib = LoadLibrary(TEXT("nvEncodeAPI64.dll"));
+    } else {
+        nvenc_lib = LoadLibrary(TEXT("nvEncodeAPI.dll"));
+    }
+#elif defined(__CYGWIN__)
+    if (sizeof(void*) == 8) {
+        nvenc_lib = dlopen("nvEncodeAPI64.dll", RTLD_LAZY);
+    } else {
+        nvenc_lib = dlopen("nvEncodeAPI.dll", RTLD_LAZY);
+    }
+#else
+    nvenc_lib = dlopen("libnvidia-encode.so.1", RTLD_LAZY);
+#endif
+
+    if (!nvenc_lib) {
+        ifav_log(avctx, AV_LOG_FATAL, "Failed loading the nvenc library\n");
+        goto error;
+    }
+
+    nvEncodeAPICreateInstance = (PNVENCODEAPICREATEINSTANCE)LOAD_FUNC(nvenc_lib, "NvEncodeAPICreateInstance");
+
+    if (!nvEncodeAPICreateInstance) {
+        ifav_log(avctx, AV_LOG_FATAL, "Failed to load nvenc entrypoint\n");
+        goto error;
+    }
+
+    p_nvenc = &nvenc_funcs;
+    memset(p_nvenc, 0, sizeof(NV_ENCODE_API_FUNCTION_LIST));
+    p_nvenc->version = NV_ENCODE_API_FUNCTION_LIST_VER;
+
+    nvstatus = nvEncodeAPICreateInstance(p_nvenc);
+
+    if (nvstatus != NV_ENC_SUCCESS) {
+        ifav_log(avctx, AV_LOG_FATAL, "Failed to create nvenc instance\n");
+        goto error;
+    }
+
+    ifav_log(avctx, AV_LOG_VERBOSE, "Nvenc initialized successfully\n");
+
+    nvenc_init_count = 1;
+
+    return 1;
+
+error:
+    if (nvenc_lib)
+        DL_CLOSE_FUNC(nvenc_lib);
+
+    nvenc_lib = 0;
+    p_nvenc = 0;
+    nvenc_init_count = 0;
+
+    return 0;
+}
+
+static void nvenc_unload_nvenc(AVCodecContext *avctx)
+{
+    if (nvenc_init_count <= 0)
+        return;
+
+    nvenc_init_count--;
+
+    if (nvenc_init_count > 0)
+        return;
+
+    DL_CLOSE_FUNC(nvenc_lib);
+    nvenc_lib = 0;
+    p_nvenc = 0;
+
+    nvenc_device_count = 0;
+
+    DL_CLOSE_FUNC(cuda_lib);
+    cuda_lib = 0;
+
+    cu_init = 0;
+    cu_device_get_count = 0;
+    cu_device_get = 0;
+    cu_device_get_name = 0;
+    cu_device_compute_capability = 0;
+    cu_ctx_create = 0;
+    cu_ctx_pop_current = 0;
+    cu_ctx_destroy = 0;
+
+    ifav_log(avctx, AV_LOG_VERBOSE, "Nvenc unloaded\n");
+}
+
+static void out_surf_queue_enqueue(NvencOutputSurfaceList** head, NvencOutputSurface *surface)
+{
+    if (!*head) {
+        *head = av_malloc(sizeof(NvencOutputSurfaceList));
+        (*head)->next = NULL;
+        (*head)->surface = surface;
+        return;
+    }
+
+    while ((*head)->next)
+        head = &((*head)->next);
+
+    (*head)->next = av_malloc(sizeof(NvencOutputSurfaceList));
+    (*head)->next->next = NULL;
+    (*head)->next->surface = surface;
+}
+
+static NvencOutputSurface *out_surf_queue_dequeue(NvencOutputSurfaceList** head)
+{
+    NvencOutputSurfaceList *tmp;
+    NvencOutputSurface *res;
+
+    if (!*head)
+        return 0;
+
+    tmp = *head;
+    res = tmp->surface;
+    *head = tmp->next;
+    av_free(tmp);
+
+    return res;
+}
+
+static void timestamp_list_insert_sorted(NvencTimestampList** head, int64_t timestamp)
+{
+    NvencTimestampList *newelem;
+    NvencTimestampList *prev;
+
+    if (!*head) {
+        *head = av_malloc(sizeof(NvencTimestampList));
+        (*head)->next = 0;
+        (*head)->timestamp = timestamp;
+        return;
+    }
+
+    prev = 0;
+    while (*head && timestamp >= (*head)->timestamp) {
+        prev = *head;
+        head = &((*head)->next);
+    }
+
+    newelem = av_malloc(sizeof(NvencTimestampList));
+    newelem->next = *head;
+    newelem->timestamp = timestamp;
+
+    if (*head) {
+        *head = newelem;
+    } else {
+        prev->next = newelem;
+    }
+}
+
+static int64_t timestamp_list_get_lowest(NvencTimestampList** head)
+{
+    NvencTimestampList *tmp;
+    int64_t res;
+
+    if (!*head)
+        return 0;
+
+    tmp = *head;
+    res = tmp->timestamp;
+    *head = tmp->next;
+    av_free(tmp);
+
+    return res;
+}
+
+static int nvenc_encode_init(AVCodecContext *avctx)
+{
+    NV_ENC_OPEN_ENCODE_SESSION_EX_PARAMS stEncodeSessionParams = { 0 };
+    NV_ENC_PRESET_CONFIG presetConfig = { 0 };
+    CUcontext cu_context_curr;
+    CUresult cu_res;
+    GUID encoderPreset = NV_ENC_PRESET_HQ_GUID;
+    GUID license = dummy_license;
+    NVENCSTATUS nvStatus = NV_ENC_SUCCESS;
+    int surfaceCount = 0;
+    int i, numMBs;
+    int isLL = 0;
+
+    NvencContext *ctx = avctx->priv_data;
+
+    if (!nvenc_dyload_nvenc(avctx))
+        return AVERROR_EXTERNAL;
+
+    avctx->coded_frame = av_frame_alloc();
+    if (!avctx->coded_frame)
+        return AVERROR(ENOMEM);
+
+    ctx->output_surface_queue = 0;
+    ctx->output_surface_ready_queue = 0;
+    ctx->timestamp_list = 0;
+    ctx->last_dts = AV_NOPTS_VALUE;
+    ctx->nvencoder = 0;
+
+    ctx->encode_config.version = NV_ENC_CONFIG_VER;
+    ctx->init_encode_params.version = NV_ENC_INITIALIZE_PARAMS_VER;
+    presetConfig.version = NV_ENC_PRESET_CONFIG_VER;
+    presetConfig.presetCfg.version = NV_ENC_CONFIG_VER;
+    stEncodeSessionParams.version = NV_ENC_OPEN_ENCODE_SESSION_EX_PARAMS_VER;
+    stEncodeSessionParams.apiVersion = NVENCAPI_VERSION;
+    stEncodeSessionParams.clientKeyPtr = &license;
+
+    ctx->cu_context = 0;
+    cu_res = cu_ctx_create(&ctx->cu_context, 0, nvenc_devices[nvenc_use_device_id]);
+
+    if (cu_res != CUDA_SUCCESS) {
+        av_log(avctx, AV_LOG_FATAL, "Failed creating CUDA context for NVENC: 0x%x\n", (int)cu_res);
+        goto error;
+    }
+
+    cu_res = cu_ctx_pop_current(&cu_context_curr);
+
+    if (cu_res != CUDA_SUCCESS) {
+        av_log(avctx, AV_LOG_FATAL, "Failed popping CUDA context: 0x%x\n", (int)cu_res);
+        goto error;
+    }
+
+    stEncodeSessionParams.device = ctx->cu_context;
+    stEncodeSessionParams.deviceType = NV_ENC_DEVICE_TYPE_CUDA;
+
+    nvStatus = p_nvenc->nvEncOpenEncodeSessionEx(&stEncodeSessionParams, &ctx->nvencoder);
+    if (nvStatus != NV_ENC_SUCCESS) {
+        ctx->nvencoder = 0;
+        av_log(avctx, AV_LOG_FATAL, "OpenEncodeSessionEx failed: 0x%x - invalid license key?\n", (int)nvStatus);
+        goto error;
+    }
+
+    if (ctx->preset) {
+        if (!strcmp(ctx->preset, "hp")) {
+            encoderPreset = NV_ENC_PRESET_HP_GUID;
+        } else if (!strcmp(ctx->preset, "hq")) {
+            encoderPreset = NV_ENC_PRESET_HQ_GUID;
+        } else if (!strcmp(ctx->preset, "bd")) {
+            encoderPreset = NV_ENC_PRESET_BD_GUID;
+        } else if (!strcmp(ctx->preset, "ll")) {
+            encoderPreset = NV_ENC_PRESET_LOW_LATENCY_DEFAULT_GUID;
+            isLL = 1;
+        } else if (!strcmp(ctx->preset, "llhp")) {
+            encoderPreset = NV_ENC_PRESET_LOW_LATENCY_HP_GUID;
+            isLL = 1;
+        } else if (!strcmp(ctx->preset, "llhq")) {
+            encoderPreset = NV_ENC_PRESET_LOW_LATENCY_HQ_GUID;
+            isLL = 1;
+        } else if (!strcmp(ctx->preset, "default")) {
+            encoderPreset = NV_ENC_PRESET_DEFAULT_GUID;
+        } else {
+            av_log(avctx, AV_LOG_FATAL, "Preset \"%s\" is unknown! Supported presets: hp, hq, bd, ll, llhp, llhq, default\n", ctx->preset);
+            goto error;
+        }
+    }
+
+    nvStatus = p_nvenc->nvEncGetEncodePresetConfig(ctx->nvencoder, NV_ENC_CODEC_H264_GUID, encoderPreset, &presetConfig);
+    if (nvStatus != NV_ENC_SUCCESS) {
+        av_log(avctx, AV_LOG_FATAL, "GetEncodePresetConfig failed: 0x%x\n", (int)nvStatus);
+        goto error;
+    }
+
+    ctx->init_encode_params.encodeGUID = NV_ENC_CODEC_H264_GUID;
+    ctx->init_encode_params.encodeHeight = avctx->height;
+    ctx->init_encode_params.encodeWidth = avctx->width;
+    ctx->init_encode_params.darHeight = avctx->height;
+    ctx->init_encode_params.darWidth = avctx->width;
+    ctx->init_encode_params.frameRateNum = avctx->time_base.den;
+    ctx->init_encode_params.frameRateDen = avctx->time_base.num * avctx->ticks_per_frame;
+
+    numMBs = ((avctx->width + 15) >> 4) * ((avctx->height + 15) >> 4);
+    ctx->max_surface_count = (numMBs >= 8160) ? 16 : 32;
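
For reference, the 8160 threshold corresponds to the macroblock count of a 1920x1080 frame, so frames of that size or larger get 16 surfaces and smaller ones get 32. The sketch below just reproduces the arithmetic from the patch, including the 32-pixel surface alignment used further down; the helper names `macroblock_count()`, `surface_count()` and `align32()` are made up for illustration.

```c
/* Number of 16x16 macroblocks needed to cover a width x height frame. */
static int macroblock_count(int width, int height)
{
    return ((width + 15) >> 4) * ((height + 15) >> 4);
}

/* Same rule as in the patch: fewer surfaces for frames of 8160 MBs or more. */
static int surface_count(int width, int height)
{
    return macroblock_count(width, height) >= 8160 ? 16 : 32;
}

/* Round v up to the next multiple of 32, as done for allocSurf.width/height. */
static int align32(int v)
{
    return (v + 31) & ~31;
}
```

For 1920x1080 this gives 120 * 68 = 8160 macroblocks and therefore 16 surfaces, with the surface height padded from 1080 to 1088. For 1280x720 it gives 80 * 45 = 3600 macroblocks and 32 surfaces.
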
+
+    ctx->init_encode_params.enableEncodeAsync = 0;
+    ctx->init_encode_params.enablePTD = 1;
+
+    ctx->init_encode_params.presetGUID = encoderPreset;
+
+    ctx->init_encode_params.encodeConfig = &ctx->encode_config;
+    memcpy(&ctx->encode_config, &presetConfig.presetCfg, sizeof(ctx->encode_config));
+    ctx->encode_config.version = NV_ENC_CONFIG_VER;
+
+    if (avctx->gop_size >= 0) {
+        ctx->encode_config.gopLength = avctx->gop_size;
+        ctx->encode_config.encodeCodecConfig.h264Config.idrPeriod = avctx->gop_size;
+    }
+
+    if (avctx->bit_rate > 0)
+        ctx->encode_config.rcParams.averageBitRate = avctx->bit_rate;
+
+    if (avctx->rc_max_rate > 0)
+        ctx->encode_config.rcParams.maxBitRate = avctx->rc_max_rate;
+
+    if (ctx->cbr) {
+        if (!ctx->twopass) {
+            ctx->encode_config.rcParams.rateControlMode = NV_ENC_PARAMS_RC_CBR;
+        } else if (ctx->twopass == 1 || isLL) {
+            ctx->encode_config.rcParams.rateControlMode = NV_ENC_PARAMS_RC_2_PASS_QUALITY;
+
+            ctx->encode_config.encodeCodecConfig.h264Config.adaptiveTransformMode = NV_ENC_H264_ADAPTIVE_TRANSFORM_ENABLE;
+            ctx->encode_config.encodeCodecConfig.h264Config.fmoMode = NV_ENC_H264_FMO_DISABLE;
+
+            if (!isLL)
+                av_log(avctx, AV_LOG_WARNING, "Twopass mode is only known to work with low latency (ll, llhq, llhp) presets.\n");
+        } else {
+            ctx->encode_config.rcParams.rateControlMode = NV_ENC_PARAMS_RC_CBR;
+        }
+    } else if (avctx->global_quality > 0) {
+        ctx->encode_config.rcParams.rateControlMode = NV_ENC_PARAMS_RC_CONSTQP;
+        ctx->encode_config.rcParams.constQP.qpInterB = avctx->global_quality;
+        ctx->encode_config.rcParams.constQP.qpInterP = avctx->global_quality;
+        ctx->encode_config.rcParams.constQP.qpIntra = avctx->global_quality;
+
+        avctx->qmin = -1;
+        avctx->qmax = -1;
+    } else if (avctx->qmin >= 0 && avctx->qmax >= 0) {
+        ctx->encode_config.rcParams.rateControlMode = NV_ENC_PARAMS_RC_VBR;
+
+        ctx->encode_config.rcParams.enableMinQP = 1;
+        ctx->encode_config.rcParams.enableMaxQP = 1;
+
+        ctx->encode_config.rcParams.minQP.qpInterB = avctx->qmin;
+        ctx->encode_config.rcParams.minQP.qpInterP = avctx->qmin;
+        ctx->encode_config.rcParams.minQP.qpIntra = avctx->qmin;
+
+        ctx->encode_config.rcParams.maxQP.qpInterB = avctx->qmax;
+        ctx->encode_config.rcParams.maxQP.qpInterP = avctx->qmax;
+        ctx->encode_config.rcParams.maxQP.qpIntra = avctx->qmax;
+    }
+
+    if (avctx->rc_buffer_size > 0)
+        ctx->encode_config.rcParams.vbvBufferSize = avctx->rc_buffer_size;
+
+    if (avctx->flags & CODEC_FLAG_INTERLACED_DCT) {
+        ctx->encode_config.frameFieldMode = NV_ENC_PARAMS_FRAME_FIELD_MODE_FIELD;
+    } else {
+        ctx->encode_config.frameFieldMode = NV_ENC_PARAMS_FRAME_FIELD_MODE_FRAME;
+    }
+
+    switch (avctx->profile) {
+    case FF_PROFILE_H264_BASELINE:
+        ctx->encode_config.profileGUID = NV_ENC_H264_PROFILE_BASELINE_GUID;
+        break;
+    case FF_PROFILE_H264_MAIN:
+        ctx->encode_config.profileGUID = NV_ENC_H264_PROFILE_MAIN_GUID;
+        break;
+    case FF_PROFILE_H264_HIGH:
+        ctx->encode_config.profileGUID = NV_ENC_H264_PROFILE_HIGH_GUID;
+        break;
+    default:
+        av_log(avctx, AV_LOG_WARNING, "Unsupported h264 profile requested, falling back to high\n");
+        ctx->encode_config.profileGUID = NV_ENC_H264_PROFILE_HIGH_GUID;
+        break;
+    }
+
+    if (ctx->gobpattern >= 0) {
+        ctx->encode_config.frameIntervalP = ctx->gobpattern;
+    }
+
+    ctx->encode_config.encodeCodecConfig.h264Config.h264VUIParameters.colourDescriptionPresentFlag = 1;
+    ctx->encode_config.encodeCodecConfig.h264Config.h264VUIParameters.videoSignalTypePresentFlag = 1;
+
+    ctx->encode_config.encodeCodecConfig.h264Config.h264VUIParameters.colourMatrix = avctx->colorspace;
+    ctx->encode_config.encodeCodecConfig.h264Config.h264VUIParameters.colourPrimaries = avctx->color_primaries;
+    ctx->encode_config.encodeCodecConfig.h264Config.h264VUIParameters.transferCharacteristics = avctx->color_trc;
+
+    ctx->encode_config.encodeCodecConfig.h264Config.h264VUIParameters.videoFullRangeFlag = avctx->color_range == AVCOL_RANGE_JPEG;
+
+    ctx->encode_config.encodeCodecConfig.h264Config.disableSPSPPS = (avctx->flags & CODEC_FLAG_GLOBAL_HEADER) ? 1 : 0;
+
+    nvStatus = p_nvenc->nvEncInitializeEncoder(ctx->nvencoder, &ctx->init_encode_params);
+    if (nvStatus != NV_ENC_SUCCESS) {
+        av_log(avctx, AV_LOG_FATAL, "InitializeEncoder failed: 0x%x\n", (int)nvStatus);
+        goto error;
+    }
+
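+    ctx->input_surfaces = av_malloc(ctx->max_surface_count * sizeof(*ctx->input_surfaces));
+    ctx->output_surfaces = av_malloc(ctx->max_surface_count * sizeof(*ctx->output_surfaces));
+    if (!ctx->input_surfaces || !ctx->output_surfaces) {
+        av_log(avctx, AV_LOG_FATAL, "Failed allocating surface lists\n");
+        av_freep(&ctx->input_surfaces);
+        av_freep(&ctx->output_surfaces);
+        goto error;
+    }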
+
+    for (surfaceCount = 0; surfaceCount < ctx->max_surface_count; ++surfaceCount) {
+        NV_ENC_CREATE_INPUT_BUFFER allocSurf = { 0 };
+        NV_ENC_CREATE_BITSTREAM_BUFFER allocOut = { 0 };
+        allocSurf.version = NV_ENC_CREATE_INPUT_BUFFER_VER;
+        allocOut.version = NV_ENC_CREATE_BITSTREAM_BUFFER_VER;
+
+        allocSurf.width = (avctx->width + 31) & ~31;
+        allocSurf.height = (avctx->height + 31) & ~31;
+
+        allocSurf.memoryHeap = NV_ENC_MEMORY_HEAP_SYSMEM_CACHED;
+
+        switch (avctx->pix_fmt) {
+        case AV_PIX_FMT_YUV420P:
+            allocSurf.bufferFmt = NV_ENC_BUFFER_FORMAT_YV12_PL;
+            break;
+
+        case AV_PIX_FMT_NV12:
+            allocSurf.bufferFmt = NV_ENC_BUFFER_FORMAT_NV12_PL;
+            break;
+
+        case AV_PIX_FMT_YUV444P:
+            allocSurf.bufferFmt = NV_ENC_BUFFER_FORMAT_YUV444_PL;
+            break;
+
+        default:
+            av_log(avctx, AV_LOG_FATAL, "Invalid input pixel format\n");
+            goto error;
+        }
+
+        nvStatus = p_nvenc->nvEncCreateInputBuffer(ctx->nvencoder, &allocSurf);
+        if (nvStatus != NV_ENC_SUCCESS) {
+            av_log(avctx, AV_LOG_FATAL, "CreateInputBuffer failed\n");
+            goto error;
+        }
+
+        ctx->input_surfaces[surfaceCount].lockCount = 0;
+        ctx->input_surfaces[surfaceCount].input_surface = allocSurf.inputBuffer;
+        ctx->input_surfaces[surfaceCount].format = allocSurf.bufferFmt;
+        ctx->input_surfaces[surfaceCount].width = allocSurf.width;
+        ctx->input_surfaces[surfaceCount].height = allocSurf.height;
+
+        /* 1MB is large enough to hold most output frames.
+           NVENC increases this automatically if it's not enough. */
+        allocOut.size = 1024 * 1024;
+
+        allocOut.memoryHeap = NV_ENC_MEMORY_HEAP_SYSMEM_CACHED;
+
+        nvStatus = p_nvenc->nvEncCreateBitstreamBuffer(ctx->nvencoder, &allocOut);
+        if (nvStatus != NV_ENC_SUCCESS) {
+            av_log(avctx, AV_LOG_FATAL, "CreateBitstreamBuffer failed\n");
+            ctx->output_surfaces[surfaceCount++].output_surface = 0;
+            goto error;
+        }
+
+        ctx->output_surfaces[surfaceCount].output_surface = allocOut.bitstreamBuffer;
+        ctx->output_surfaces[surfaceCount].size = allocOut.size;
+        ctx->output_surfaces[surfaceCount].busy = 0;
+    }
+
+    if (avctx->flags & CODEC_FLAG_GLOBAL_HEADER) {
+        uint32_t outSize = 0;
+        char tmpHeader[256];
+        NV_ENC_SEQUENCE_PARAM_PAYLOAD payload = { 0 };
+        payload.version = NV_ENC_SEQUENCE_PARAM_PAYLOAD_VER;
+
+        payload.spsppsBuffer = tmpHeader;
+        payload.inBufferSize = 256;
+        payload.outSPSPPSPayloadSize = &outSize;
+
+        nvStatus = p_nvenc->nvEncGetSequenceParams(ctx->nvencoder, &payload);
+        if (nvStatus != NV_ENC_SUCCESS) {
+            av_log(avctx, AV_LOG_FATAL, "GetSequenceParams failed\n");
+            goto error;
+        }
+
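+        avctx->extradata = av_mallocz(outSize + FF_INPUT_BUFFER_PADDING_SIZE);
+        if (!avctx->extradata) {
+            av_log(avctx, AV_LOG_FATAL, "Failed allocating extradata\n");
+            goto error;
+        }
+
+        avctx->extradata_size = outSize;
+        memcpy(avctx->extradata, tmpHeader, outSize);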
+    }
+
+    if (ctx->encode_config.frameIntervalP > 1)
+        avctx->has_b_frames = 2;
+
+    if (ctx->encode_config.rcParams.averageBitRate > 0)
+        avctx->bit_rate = ctx->encode_config.rcParams.averageBitRate;
+
+    return 0;
+
+error:
+
+    for (i = 0; i < surfaceCount; ++i) {
+        p_nvenc->nvEncDestroyInputBuffer(ctx->nvencoder, ctx->input_surfaces[i].input_surface);
+        if (ctx->output_surfaces[i].output_surface)
+            p_nvenc->nvEncDestroyBitstreamBuffer(ctx->nvencoder, ctx->output_surfaces[i].output_surface);
+    }
+
+    if (ctx->nvencoder)
+        p_nvenc->nvEncDestroyEncoder(ctx->nvencoder);
+
+    if (ctx->cu_context)
+        cu_ctx_destroy(ctx->cu_context);
+
+    nvenc_unload_nvenc(avctx);
+
+    ctx->nvencoder = 0;
+    ctx->cu_context = 0;
+
+    return AVERROR_EXTERNAL;
+}
+
+static av_cold int nvenc_encode_close(AVCodecContext *avctx)
+{
+    NvencContext *ctx = avctx->priv_data;
+    int i;
+
+    while (ctx->timestamp_list)
+        timestamp_list_get_lowest(&ctx->timestamp_list);
+
+    while (ctx->output_surface_ready_queue)
+        out_surf_queue_dequeue(&ctx->output_surface_ready_queue);
+
+    while (ctx->output_surface_queue)
+        out_surf_queue_dequeue(&ctx->output_surface_queue);
+
+    for (i = 0; i < ctx->max_surface_count; ++i) {
+        p_nvenc->nvEncDestroyInputBuffer(ctx->nvencoder, ctx->input_surfaces[i].input_surface);
+        p_nvenc->nvEncDestroyBitstreamBuffer(ctx->nvencoder, ctx->output_surfaces[i].output_surface);
+    }
+    ctx->max_surface_count = 0;
+
+    p_nvenc->nvEncDestroyEncoder(ctx->nvencoder);
+    ctx->nvencoder = 0;
+
+    cu_ctx_destroy(ctx->cu_context);
+    ctx->cu_context = 0;
+
+    nvenc_unload_nvenc(avctx);
+
+    av_frame_free(&avctx->coded_frame);
+
+    return 0;
+}
+
+static int process_output_surface(AVCodecContext *avctx, AVPacket *pkt, AVFrame *coded_frame, NvencOutputSurface *tmpoutsurf)
+{
+    NvencContext *ctx = avctx->priv_data;
+    uint32_t *slice_offsets = av_mallocz(ctx->encode_config.encodeCodecConfig.h264Config.sliceModeData * sizeof(*slice_offsets));
+    NV_ENC_LOCK_BITSTREAM lock_params = { 0 };
+    NVENCSTATUS nvStatus;
+    int res = 0;
+
+    lock_params.version = NV_ENC_LOCK_BITSTREAM_VER;
+
+    lock_params.doNotWait = 0;
+    lock_params.outputBitstream = tmpoutsurf->output_surface;
+    lock_params.sliceOffsets = slice_offsets;
+
+    nvStatus = p_nvenc->nvEncLockBitstream(ctx->nvencoder, &lock_params);
+    if (nvStatus != NV_ENC_SUCCESS) {
+        av_log(avctx, AV_LOG_ERROR, "Failed locking bitstream buffer\n");
+        res = AVERROR_EXTERNAL;
+        goto error;
+    }
+
+    if ((res = ff_alloc_packet2(avctx, pkt, lock_params.bitstreamSizeInBytes))) {
+        p_nvenc->nvEncUnlockBitstream(ctx->nvencoder, tmpoutsurf->output_surface);
+        goto error;
+    }
+
+    memcpy(pkt->data, lock_params.bitstreamBufferPtr, lock_params.bitstreamSizeInBytes);
+
+    nvStatus = p_nvenc->nvEncUnlockBitstream(ctx->nvencoder, tmpoutsurf->output_surface);
+    if (nvStatus != NV_ENC_SUCCESS)
+        av_log(avctx, AV_LOG_ERROR, "Failed unlocking bitstream buffer, expect the gates of mordor to open\n");
+
+    switch (lock_params.pictureType) {
+    case NV_ENC_PIC_TYPE_IDR:
+        pkt->flags |= AV_PKT_FLAG_KEY;
+        /* fall through: an IDR picture is also an I picture */
+    case NV_ENC_PIC_TYPE_I:
+        avctx->coded_frame->pict_type = AV_PICTURE_TYPE_I;
+        break;
+    case NV_ENC_PIC_TYPE_P:
+        avctx->coded_frame->pict_type = AV_PICTURE_TYPE_P;
+        break;
+    case NV_ENC_PIC_TYPE_B:
+        avctx->coded_frame->pict_type = AV_PICTURE_TYPE_B;
+        break;
+    case NV_ENC_PIC_TYPE_BI:
+        avctx->coded_frame->pict_type = AV_PICTURE_TYPE_BI;
+        break;
+    default:
+        avctx->coded_frame->pict_type = AV_PICTURE_TYPE_NONE;
+        break;
+    }
+
+    pkt->pts = lock_params.outputTimeStamp;
+    pkt->dts = timestamp_list_get_lowest(&ctx->timestamp_list);
+
+    if (pkt->dts > pkt->pts)
+        pkt->dts = pkt->pts;
+
+    if (ctx->last_dts != AV_NOPTS_VALUE && pkt->dts <= ctx->last_dts)
+        pkt->dts = ctx->last_dts + 1;
+
+    ctx->last_dts = pkt->dts;
+
+    av_free(slice_offsets);
+
+    return res;
+
+error:
+
+    av_free(slice_offsets);
+    timestamp_list_get_lowest(&ctx->timestamp_list);
+
+    return res;
+}
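
To make the dts handling above easier to follow, here is the same clamping logic pulled out into a standalone function. This is only a sketch: `pick_dts()` and `NO_TS` are hypothetical names, with `NO_TS` standing in for AV_NOPTS_VALUE so that the example builds without the FFmpeg headers.

```c
#include <stdint.h>

/* Hypothetical stand-in for AV_NOPTS_VALUE. */
#define NO_TS INT64_MIN

/*
 * Picks the dts for an output packet:
 *  - start from the lowest queued input timestamp,
 *  - never let it exceed the packet's pts,
 *  - force it to be strictly greater than the previous dts.
 * *last_dts is updated with the returned value.
 */
static int64_t pick_dts(int64_t pts, int64_t lowest_queued, int64_t *last_dts)
{
    int64_t dts = lowest_queued;

    if (dts > pts)
        dts = pts;

    if (*last_dts != NO_TS && dts <= *last_dts)
        dts = *last_dts + 1;

    *last_dts = dts;
    return dts;
}
```

Note that the second clamp runs after the first one, so in this sketch the returned dts can end up above pts when the previous dts was already at or above it (for example pts 1 after a previous dts of 1 yields dts 2). Whether that can happen in practice depends on the order in which NVENC returns frames.
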
+
+static int nvenc_encode_frame(AVCodecContext *avctx, AVPacket *pkt,
+    const AVFrame *frame, int *got_packet)
+{
+    NVENCSTATUS nvStatus;
+    NvencContext *ctx = avctx->priv_data;
+    NvencOutputSurface *tmpoutsurf;
+    int i = 0;
+
+    NV_ENC_PIC_PARAMS picParams = { 0 };
+    picParams.version = NV_ENC_PIC_PARAMS_VER;
+
+    if (frame) {
+        NV_ENC_LOCK_INPUT_BUFFER lockBufferParams = { 0 };
+        NvencInputSurface *inSurf = 0;
+
+        for (i = 0; i < ctx->max_surface_count; ++i) {
+            if (!ctx->input_surfaces[i].lockCount) {
+                inSurf = &ctx->input_surfaces[i];
+                break;
+            }
+        }
+
+        av_assert0(inSurf);
+
+        inSurf->lockCount = 1;
+
+        lockBufferParams.version = NV_ENC_LOCK_INPUT_BUFFER_VER;
+        lockBufferParams.inputBuffer = inSurf->input_surface;
+
+        nvStatus = p_nvenc->nvEncLockInputBuffer(ctx->nvencoder, &lockBufferParams);
+        if (nvStatus != NV_ENC_SUCCESS) {
+            av_log(avctx, AV_LOG_ERROR, "Failed locking nvenc input buffer\n");
+            inSurf->lockCount = 0;
+            return AVERROR_EXTERNAL;
+        }
+
+        if (avctx->pix_fmt == AV_PIX_FMT_YUV420P) {
+            uint8_t *buf = lockBufferParams.bufferDataPtr;
+
+            av_image_copy_plane(buf, lockBufferParams.pitch,
+                frame->data[0], frame->linesize[0],
+                avctx->width, avctx->height);
+
+            buf += inSurf->height * lockBufferParams.pitch;
+
+            av_image_copy_plane(buf, lockBufferParams.pitch >> 1,
+                frame->data[2], frame->linesize[2],
+                avctx->width >> 1, avctx->height >> 1);
+
+            buf += (inSurf->height * lockBufferParams.pitch) >> 2;
+
+            av_image_copy_plane(buf, lockBufferParams.pitch >> 1,
+                frame->data[1], frame->linesize[1],
+                avctx->width >> 1, avctx->height >> 1);
+        } else if (avctx->pix_fmt == AV_PIX_FMT_NV12) {
+            uint8_t *buf = lockBufferParams.bufferDataPtr;
+
+            av_image_copy_plane(buf, lockBufferParams.pitch,
+                frame->data[0], frame->linesize[0],
+                avctx->width, avctx->height);
+
+            buf += inSurf->height * lockBufferParams.pitch;
+
+            av_image_copy_plane(buf, lockBufferParams.pitch,
+                frame->data[1], frame->linesize[1],
+                avctx->width, avctx->height >> 1);
+        } else if (avctx->pix_fmt == AV_PIX_FMT_YUV444P) {
+            uint8_t *buf = lockBufferParams.bufferDataPtr;
+
+            av_image_copy_plane(buf, lockBufferParams.pitch,
+                frame->data[0], frame->linesize[0],
+                avctx->width, avctx->height);
+
+            buf += inSurf->height * lockBufferParams.pitch;
+
+            av_image_copy_plane(buf, lockBufferParams.pitch,
+                frame->data[1], frame->linesize[1],
+                avctx->width, avctx->height);
+
+            buf += inSurf->height * lockBufferParams.pitch;
+
+            av_image_copy_plane(buf, lockBufferParams.pitch,
+                frame->data[2], frame->linesize[2],
+                avctx->width, avctx->height);
+        } else {
+            av_log(avctx, AV_LOG_FATAL, "Invalid pixel format!\n");
+            return AVERROR(EINVAL);
+        }
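
The YUV420P branch above writes the planes in YV12 order (Y, then V, then U) into one locked buffer, with the chroma planes using half the luma pitch. The offsets it ends up using can be written down as follows. This is a sketch of the arithmetic only: `struct yv12_layout` and `yv12_offsets()` are hypothetical names, and it assumes the surface height is the 32-aligned value stored in `inSurf->height`.

```c
#include <stddef.h>

/* Byte offsets of the three planes inside one YV12 surface. */
struct yv12_layout {
    size_t y;     /* luma plane, pitch bytes per row */
    size_t v;     /* first chroma plane (V), pitch / 2 bytes per row */
    size_t u;     /* second chroma plane (U), pitch / 2 bytes per row */
    size_t total; /* bytes covered by all three planes */
};

static struct yv12_layout yv12_offsets(size_t pitch, size_t surf_height)
{
    struct yv12_layout l;
    size_t luma = pitch * surf_height;

    l.y     = 0;
    l.v     = luma;               /* V starts right after the luma plane */
    l.u     = luma + (luma >> 2); /* each chroma plane is a quarter of luma */
    l.total = luma + (luma >> 1);
    return l;
}
```

For a pitch of 2048 and a surface height of 1088, the V plane starts at byte 2228224, the U plane at byte 2785280, and the three planes together cover 3342336 bytes.
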
+
+        nvStatus = p_nvenc->nvEncUnlockInputBuffer(ctx->nvencoder, inSurf->input_surface);
+        if (nvStatus != NV_ENC_SUCCESS) {
+            av_log(avctx, AV_LOG_FATAL, "Failed unlocking input buffer!\n");
+            return AVERROR_EXTERNAL;
+        }
+
+        for (i = 0; i < ctx->max_surface_count; ++i)
+            if (!ctx->output_surfaces[i].busy)
+                break;
+
+        if (i == ctx->max_surface_count) {
+            inSurf->lockCount = 0;
+            av_log(avctx, AV_LOG_FATAL, "No free output surface found!\n");
+            return AVERROR_EXTERNAL;
+        }
+
+        ctx->output_surfaces[i].input_surface = inSurf;
+
+        picParams.inputBuffer = inSurf->input_surface;
+        picParams.bufferFmt = inSurf->format;
+        picParams.inputWidth = avctx->width;
+        picParams.inputHeight = avctx->height;
+        picParams.outputBitstream = ctx->output_surfaces[i].output_surface;
+        picParams.completionEvent = 0;
+
+        if (avctx->flags & CODEC_FLAG_INTERLACED_DCT) {
+            if (frame->top_field_first) {
+                picParams.pictureStruct = NV_ENC_PIC_STRUCT_FIELD_TOP_BOTTOM;
+            } else {
+                picParams.pictureStruct = NV_ENC_PIC_STRUCT_FIELD_BOTTOM_TOP;
+            }
+        } else {
+            picParams.pictureStruct = NV_ENC_PIC_STRUCT_FRAME;
+        }
+
+        picParams.encodePicFlags = 0;
+        picParams.inputTimeStamp = frame->pts;
+        picParams.inputDuration = 0;
+        picParams.codecPicParams.h264PicParams.sliceMode = ctx->encode_config.encodeCodecConfig.h264Config.sliceMode;
+        picParams.codecPicParams.h264PicParams.sliceModeData = ctx->encode_config.encodeCodecConfig.h264Config.sliceModeData;
+        memcpy(&picParams.rcParams, &ctx->encode_config.rcParams, sizeof(NV_ENC_RC_PARAMS));
+
+        timestamp_list_insert_sorted(&ctx->timestamp_list, frame->pts);
+    } else {
+        picParams.encodePicFlags = NV_ENC_PIC_FLAG_EOS;
+    }
+
+    nvStatus = p_nvenc->nvEncEncodePicture(ctx->nvencoder, &picParams);
+
+    if (frame && nvStatus == NV_ENC_ERR_NEED_MORE_INPUT) {
+        out_surf_queue_enqueue(&ctx->output_surface_queue, &ctx->output_surfaces[i]);
+        ctx->output_surfaces[i].busy = 1;
+    }
+
+    if (nvStatus != NV_ENC_SUCCESS && nvStatus != NV_ENC_ERR_NEED_MORE_INPUT) {
+        av_log(avctx, AV_LOG_ERROR, "EncodePicture failed!\n");
+        return AVERROR_EXTERNAL;
+    }
+
+    if (nvStatus != NV_ENC_ERR_NEED_MORE_INPUT) {
+        while (ctx->output_surface_queue) {
+            tmpoutsurf = out_surf_queue_dequeue(&ctx->output_surface_queue);
+            out_surf_queue_enqueue(&ctx->output_surface_ready_queue, tmpoutsurf);
+        }
+
+        if (frame) {
+            out_surf_queue_enqueue(&ctx->output_surface_ready_queue, &ctx->output_surfaces[i]);
+            ctx->output_surfaces[i].busy = 1;
+        }
+    }
+
+    if (ctx->output_surface_ready_queue) {
+        tmpoutsurf = out_surf_queue_dequeue(&ctx->output_surface_ready_queue);
+
+        i = process_output_surface(avctx, pkt, avctx->coded_frame, tmpoutsurf);
+
+        if (i)
+            return i;
+
+        tmpoutsurf->busy = 0;
+        av_assert0(tmpoutsurf->input_surface->lockCount);
+        tmpoutsurf->input_surface->lockCount--;
+
+        *got_packet = 1;
+    } else {
+        *got_packet = 0;
+    }
+
+    return 0;
+}
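
Regarding the queue discussion earlier in this thread: since the surface queues are plain FIFOs with a small upper bound, a fixed-size ring buffer would be one alternative to the linked list that needs no allocation and no list walking. A minimal sketch, assuming a capacity of 4 entries is enough (the bound mentioned above was 3); `struct surf_fifo`, `fifo_push()`, `fifo_pop()` and `FIFO_CAP` are hypothetical names, and the items are `void *` so the example compiles without the NVENC types.

```c
#include <stddef.h>

#define FIFO_CAP 4 /* assumed upper bound on queued surfaces */

struct surf_fifo {
    void  *items[FIFO_CAP];
    size_t head;  /* index of the oldest element */
    size_t count; /* number of queued elements */
};

/* Appends item at the back. Returns 0 on success, -1 if the fifo is full. */
static int fifo_push(struct surf_fifo *f, void *item)
{
    if (f->count == FIFO_CAP)
        return -1;

    f->items[(f->head + f->count) % FIFO_CAP] = item;
    f->count++;
    return 0;
}

/* Removes and returns the oldest item, or NULL if the fifo is empty. */
static void *fifo_pop(struct surf_fifo *f)
{
    void *item;

    if (!f->count)
        return NULL;

    item = f->items[f->head];
    f->head = (f->head + 1) % FIFO_CAP;
    f->count--;
    return item;
}
```

A zero-initialized `struct surf_fifo` is an empty queue, so it could live directly inside the codec context. The full case would still need handling by the caller, which the linked list version avoids at the cost of a malloc per entry.
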
+
+static int pix_fmts_nvenc_initialized;
+
+static enum AVPixelFormat pix_fmts_nvenc[] = {
+    AV_PIX_FMT_NV12,
+    AV_PIX_FMT_NONE,
+    AV_PIX_FMT_NONE,
+    AV_PIX_FMT_NONE
+};
+
+static av_cold void nvenc_init_static(AVCodec *codec)
+{
+    NV_ENC_OPEN_ENCODE_SESSION_EX_PARAMS stEncodeSessionParams = { 0 };
+    CUcontext cuctxcur = 0, cuctx = 0;
+    NVENCSTATUS nvStatus;
+    void *nvencoder = 0;
+    GUID encodeGuid = NV_ENC_CODEC_H264_GUID;
+    GUID license = dummy_license;
+    int i = 0, pos = 0;
+    int gotnv12 = 0, got420 = 0, got444 = 0;
+    uint32_t inputFmtCount = 32;
+    NV_ENC_BUFFER_FORMAT inputFmts[32];
+
+    for (i = 0; i < 32; ++i)
+        inputFmts[i] = (NV_ENC_BUFFER_FORMAT)0;
+    i = 0;
+
+    if (pix_fmts_nvenc_initialized) {
+        codec->pix_fmts = pix_fmts_nvenc;
+        return;
+    }
+
+    if (!nvenc_dyload_nvenc(0)) {
+        pix_fmts_nvenc_initialized = 1;
+        return;
+    }
+
+    stEncodeSessionParams.version = NV_ENC_OPEN_ENCODE_SESSION_EX_PARAMS_VER;
+    stEncodeSessionParams.apiVersion = NVENCAPI_VERSION;
+    stEncodeSessionParams.clientKeyPtr = &license;
+
+    cuctx = 0;
+    if (cu_ctx_create(&cuctx, 0, nvenc_devices[nvenc_use_device_id]) != CUDA_SUCCESS) {
+        cuctx = 0;
+        goto error;
+    }
+
+    if (cu_ctx_pop_current(&cuctxcur) != CUDA_SUCCESS)
+        goto error;
+
+    stEncodeSessionParams.device = (void*)cuctx;
+    stEncodeSessionParams.deviceType = NV_ENC_DEVICE_TYPE_CUDA;
+
+    nvStatus = p_nvenc->nvEncOpenEncodeSessionEx(&stEncodeSessionParams, &nvencoder);
+    if (nvStatus != NV_ENC_SUCCESS) {
+        nvencoder = 0;
+        goto error;
+    }
+
+    nvStatus = p_nvenc->nvEncGetInputFormats(nvencoder, encodeGuid, inputFmts, 32, &inputFmtCount);
+    if (nvStatus != NV_ENC_SUCCESS)
+        goto error;
+
+    pos = 0;
+    for (i = 0; i < inputFmtCount && pos < 3; ++i) {
+        if (!gotnv12 && (inputFmts[i] == NV_ENC_BUFFER_FORMAT_NV12_PL
+                || inputFmts[i] == NV_ENC_BUFFER_FORMAT_NV12_TILED16x16
+                || inputFmts[i] == NV_ENC_BUFFER_FORMAT_NV12_TILED64x16)) {
+
+            pix_fmts_nvenc[pos++] = AV_PIX_FMT_NV12;
+            gotnv12 = 1;
+        } else if (!got420 && (inputFmts[i] == NV_ENC_BUFFER_FORMAT_YV12_PL
+                || inputFmts[i] == NV_ENC_BUFFER_FORMAT_YV12_TILED16x16
+                || inputFmts[i] == NV_ENC_BUFFER_FORMAT_YV12_TILED64x16)) {
+
+            pix_fmts_nvenc[pos++] = AV_PIX_FMT_YUV420P;
+            got420 = 1;
+        } else if (!got444 && (inputFmts[i] == NV_ENC_BUFFER_FORMAT_YUV444_PL
+                || inputFmts[i] == NV_ENC_BUFFER_FORMAT_YUV444_TILED16x16
+                || inputFmts[i] == NV_ENC_BUFFER_FORMAT_YUV444_TILED64x16)) {
+
+            pix_fmts_nvenc[pos++] = AV_PIX_FMT_YUV444P;
+            got444 = 1;
+        }
+    }
+
+    pix_fmts_nvenc[pos] = AV_PIX_FMT_NONE;
+
+    pix_fmts_nvenc_initialized = 1;
+    codec->pix_fmts = pix_fmts_nvenc;
+
+    p_nvenc->nvEncDestroyEncoder(nvencoder);
+    cu_ctx_destroy(cuctx);
+
+    nvenc_unload_nvenc(0);
+
+    return;
+
+error:
+
+    if (nvencoder)
+        p_nvenc->nvEncDestroyEncoder(nvencoder);
+
+    if (cuctx)
+        cu_ctx_destroy(cuctx);
+
+    pix_fmts_nvenc_initialized = 1;
+    pix_fmts_nvenc[0] = AV_PIX_FMT_NV12;
+    pix_fmts_nvenc[1] = AV_PIX_FMT_NONE;
+
+    codec->pix_fmts = pix_fmts_nvenc;
+
+    nvenc_unload_nvenc(0);
+}
+
+#define OFFSET(x) offsetof(NvencContext, x)
+#define VE AV_OPT_FLAG_VIDEO_PARAM | AV_OPT_FLAG_ENCODING_PARAM
+static const AVOption options[] = {
+    { "preset", "Set the encoding preset (one of hq, hp, bd, ll, llhq, llhp, default)", OFFSET(preset), AV_OPT_TYPE_STRING, { .str = "hq" }, 0, 0, VE },
+    { "cbr", "Use cbr encoding mode", OFFSET(cbr), AV_OPT_TYPE_INT, { .i64 = 0 }, 0, 1, VE },
+    { "2pass", "Use 2pass cbr encoding mode (low latency mode only)", OFFSET(twopass), AV_OPT_TYPE_INT, { .i64 = -1 }, -1, 1, VE },
+    { "goppattern", "Specifies the GOP pattern as follows: 0: I, 1: IPP, 2: IBP, 3: IBBP", OFFSET(gobpattern), AV_OPT_TYPE_INT, { .i64 = -1 }, -1, 3, VE },
+    { NULL }
+};
+
+static const AVClass nvenc_class = {
+    .class_name = "nvenc",
+    .item_name = av_default_item_name,
+    .option = options,
+    .version = LIBAVUTIL_VERSION_INT,
+};
+
+static const AVCodecDefault nvenc_defaults[] = {
+    { "b", "0" },
+    { "qmin", "-1" },
+    { "qmax", "-1" },
+    { "qdiff", "-1" },
+    { "qblur", "-1" },
+    { "qcomp", "-1" },
+    { NULL },
+};
+
+AVCodec ff_nvenc_encoder = {
+    .name = "nvenc",
+    .long_name = NULL_IF_CONFIG_SMALL("Nvidia NVENC h264 encoder"),
+    .type = AVMEDIA_TYPE_VIDEO,
+    .id = AV_CODEC_ID_H264,
+    .priv_data_size = sizeof(NvencContext),
+    .init = nvenc_encode_init,
+    .encode2 = nvenc_encode_frame,
+    .close = nvenc_encode_close,
+    .capabilities = CODEC_CAP_DELAY,
+    .priv_class = &nvenc_class,
+    .defaults = nvenc_defaults,
+    .init_static_data = nvenc_init_static
+};
