Re: [FFmpeg-devel] have some major changes for nvenc support

2016-01-12 Thread Roger Pack
On 1/8/16, Andrey Turkin  wrote:
> In my opinion this proliferation of various filters which do the same thing
> in different ways is a configuration headache. There are CPU filters: one for
> scaling/format conversion, one for padding, one for cropping, and something
> like 5 different filters for deinterlacing. And now there would be nvresize
> for scaling, and we've got to add CUDA-based nvpad and nvdeint if we want to
> do some transcoding. Maybe add overlaying too - so nvoverlay. Then there is
> OpenCL, which can do everything - so 4 more filters for that. And there is
> quicksync, which I think can do those things, so there would be qsvscale and
> qsvdeint. And there is the D3D11 video processor, which can do those things
> too (simultaneously, in fact), so there's got to be
> d3d11vascaledeintpadcropoverlay. And then we've got a whole bunch of video
> filters which can only do their job on a specific hwaccel platform. Want to
> try a different hwaccel? Rewrite the damn filter string. Want to do something
> generic that can be used across different platforms? Can't be done.
> Maybe it's just wishful thinking, but I've been wondering for some time if
> there can be one "smart" filter per specific task?

Yes, a smart wrapper might be nice.
I'm all for different ways to resize, though. I used to wonder why
there wasn't a filter option like

-filter_complex "convert_to_opengl; opengl_resize=1000x1000;
opengl_rotate=90; convert_from_opengl;"

that type of thing - I think based on similar functionality gstreamer has.
Cheers!


Re: [FFmpeg-devel] have some major changes for nvenc support

2016-01-08 Thread Andrey Turkin
2016-01-08 20:25 GMT+03:00 Michael Niedermayer :

> Also, slightly orthogonal, but if you have 4 filters each written for a
> different hwaccel, you can write a generic filter that passes its
> stuff to the appropriate one.
> If there's not much shareable code between hw-specific filters then this
> might be cleaner than throwing all hw-specific code into one filter
> directly.
>
> Such separate filters could use the existing conditional compilation
> code, could be used directly if the user wants to force a specific
> hw type, ...
>
So we'd have a default filter, and then specific ones just in case anyone
needs them. That is a great idea.
I'm not sure which way is cleaner - to have a proxy filter, or to lump
everything together in one filter and add extra filter definitions to force
a specific hwaccel. It would work either way.

What I'm looking for in the end, though, is a more "intelligent" ffmpeg
wrt hardware acceleration. I'd much prefer to have it automagically use
hwaccel wherever possible, in the optimal way, and not have to describe
every detail.
For example, this thread started because NVidia had a specific scenario and
they had to offload scaling to CUDA - and they had to concoct something
arguably messy to make it happen. There's a very specialized filter with
many outputs, you have to use -filter_complex, you have to map filter
outputs to output files, etc. It would be awesome if they could just do
ffmpeg -i input -s 1920x1080 -c:v nvenc_h264 hd1080.mp4 -s 1280x720 -c:v
nvenc_h264 hd720.mp4 ..., and it'd do the scaling on the GPU without
explicit instructions.

In my opinion (which might be totally wrong), it would take 4 changes to
make that happen:
- make ffmpeg use avfilter for scaling - i.e. connect the scale filter to
the filterchain output (unless it already does)
- add another pixel format for CUDA, add support for it to scale, and add it
as an input format for nvenc_h264
- adjust the pixel format negotiation logic in avfilter to make sure it
would select the CUDA pixel format for scale (maybe just preferring hwaccel
formats over software ones would work? see the sketch below)
- add an interop filter to perform CPU-GPU and GPU-CPU data transfers - i.e.
convert between hwaccel formats and the corresponding software formats;
avfilter would insert it in the appropriate places when negotiating pixel
formats (I think it does something similar with scale). This might be
tricky - e.g. in my example a single interop filter needs to be added and
its output has to be shared between all the scale filters. If, say, there
were 2 GPUs used in encoding, then there would have to be 2 interop filters.
On the plus side, all existing host->device copying code in the encoders
could be thrown out (or rather moved into that filter), as well as the
existing device->host copying code in ffmpeg_*.c. It would also make writing
new hwaccel-enabled filters easier.

Actually there is one more thing to do - filters would somehow have to
share hwaccel contexts with their neighbour filters, as well as with the
filterchain inputs and outputs.
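
To illustrate the negotiation point above, a minimal sketch of what a scale
filter's format list could look like - AV_PIX_FMT_CUDA is hypothetical here
(no such format exists in libavutil today), while ff_make_format_list and
ff_set_common_formats are the existing avfilter helpers:

/* Sketch only: advertise a hypothetical CUDA hw format ahead of the
 * software formats, so that a negotiation policy which prefers
 * hwaccel formats would keep frames on the GPU end-to-end. */
static int query_formats(AVFilterContext *ctx)
{
    static const enum AVPixelFormat pix_fmts[] = {
        AV_PIX_FMT_CUDA,     /* hypothetical hwaccel format, preferred */
        AV_PIX_FMT_NV12,     /* software fallbacks */
        AV_PIX_FMT_YUV420P,
        AV_PIX_FMT_NONE
    };
    return ff_set_common_formats(ctx, ff_make_format_list(pix_fmts));
}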


Re: [FFmpeg-devel] have some major changes for nvenc support

2016-01-08 Thread Michael Niedermayer
On Fri, Jan 08, 2016 at 03:04:26PM +0300, Andrey Turkin wrote:
> In my opinion this proliferation of various filters which do the same thing
> in different ways is a configuration headache. There are CPU filters: one for
> scaling/format conversion, one for padding, one for cropping, and something
> like 5 different filters for deinterlacing. And now there would be nvresize
> for scaling, and we've got to add CUDA-based nvpad and nvdeint if we want to
> do some transcoding. Maybe add overlaying too - so nvoverlay. Then there is
> OpenCL, which can do everything - so 4 more filters for that. And there is
> quicksync, which I think can do those things, so there would be qsvscale and
> qsvdeint. And there is the D3D11 video processor, which can do those things
> too (simultaneously, in fact), so there's got to be
> d3d11vascaledeintpadcropoverlay. And then we've got a whole bunch of video
> filters which can only do their job on a specific hwaccel platform. Want to
> try a different hwaccel? Rewrite the damn filter string. Want to do something
> generic that can be used across different platforms? Can't be done.
> Maybe it's just wishful thinking, but I've been wondering for some time if
> there can be one "smart" filter per specific task? Say, a single
> deinterlace filter which can automatically use whichever hwaccel was used
> on its input (or whichever will be used on its output)? We've already got
> pixel formats which describe a particular hwaccel - can't filters decide
> which code path to use based on that? And it can have a configuration

Also, slightly orthogonal, but if you have 4 filters each written for a
different hwaccel, you can write a generic filter that passes its
stuff to the appropriate one.
If there's not much shareable code between hw-specific filters then this
might be cleaner than throwing all hw-specific code into one filter
directly.

Such separate filters could use the existing conditional compilation
code, could be used directly if the user wants to force a specific
hw type, ...
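
A minimal sketch of such a proxy, with a hypothetical forward_to() helper
(nothing like this exists in libavfilter; only the CONFIG_*_FILTER macro
style follows the existing conditional-compilation convention):

/* Hypothetical "scale_auto" filter init: forward to whichever
 * hw-specific scale filter was compiled in, else fall back to the
 * software scale filter. */
static av_cold int scale_auto_init(AVFilterContext *ctx)
{
#if CONFIG_NVRESIZE_FILTER
    return forward_to(ctx, "nvresize");   /* hypothetical helper */
#elif CONFIG_QSVSCALE_FILTER
    return forward_to(ctx, "qsvscale");   /* hypothetical filter name */
#else
    return forward_to(ctx, "scale");      /* CPU fallback */
#endif
}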

[...]
-- 
Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

I am the wisest man alive, for I know one thing, and that is that I know
nothing. -- Socrates




Re: [FFmpeg-devel] have some major changes for nvenc support

2016-01-08 Thread Andrey Turkin
In my opinion this proliferation of various filters which do the same thing
in different ways is a configuration headache. There are CPU filters: one for
scaling/format conversion, one for padding, one for cropping, and something
like 5 different filters for deinterlacing. And now there would be nvresize
for scaling, and we've got to add CUDA-based nvpad and nvdeint if we want to
do some transcoding. Maybe add overlaying too - so nvoverlay. Then there is
OpenCL, which can do everything - so 4 more filters for that. And there is
quicksync, which I think can do those things, so there would be qsvscale and
qsvdeint. And there is the D3D11 video processor, which can do those things
too (simultaneously, in fact), so there's got to be
d3d11vascaledeintpadcropoverlay. And then we've got a whole bunch of video
filters which can only do their job on a specific hwaccel platform. Want to
try a different hwaccel? Rewrite the damn filter string. Want to do something
generic that can be used across different platforms? Can't be done.
Maybe it's just wishful thinking, but I've been wondering for some time if
there can be one "smart" filter per specific task? Say, a single
deinterlace filter which can automatically use whichever hwaccel was used
on its input (or whichever will be used on its output)? We've already got
pixel formats which describe a particular hwaccel - can't filters decide
which code path to use based on that? And it could have a configuration
option to choose which CPU-based fallback to use (in fact that option could
be used to tweak the GPU-based algorithm on platforms which support it).
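
A rough sketch of what that dispatch could look like inside a filter's
filter_frame callback - the deint_* backends here are hypothetical, only the
pixel-format constants and the callback shape are real:

static int filter_frame(AVFilterLink *inlink, AVFrame *in)
{
    /* Choose a code path from the hwaccel pixel format of the
     * incoming frame; fall back to the CPU implementation. */
    switch (in->format) {
    case AV_PIX_FMT_QSV:
        return deint_qsv(inlink, in);    /* hypothetical backend */
    case AV_PIX_FMT_VAAPI:
        return deint_vaapi(inlink, in);  /* hypothetical backend */
    default:
        return deint_cpu(inlink, in);    /* hypothetical CPU fallback */
    }
}
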
Same goes for encoders - can't there be "just" an h264 encoder? This one's
tough, though - you might have a dxva decoder, cuda filters and an nvenc
encoder, and you probably want to keep them all on the same GPU. Or not -
maybe you do want to decode on qsv and encode on nvenc, or maybe vice versa.
The single-GPU case is probably more common, so it should try to use the
same GPU that was used for decoding and/or filtering, and allow overriding
that with encoder options (like nvenc allows specifying the cuda device to
use).
Interop will be a pain, though (obviously we'd want to avoid device-host
frame transfers). I'm trying to share d3d11va-decoded frames (an NV12
texture) with OpenCL or CUDA right now, and I've been at it for days with no
luck whatsoever. My last resort now is to write a shader to "draw" (in fact
just copy pixels around) the NV12 texture onto another texture in a more
"common" format which can be used by OpenCL... There's got to be an easier
way (other than using cuvid to decode the video), right?
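
For reference, the CUDA side of that interop goes through the driver API's
graphics-interop entry points; a minimal sketch (error handling trimmed, and
whether the register call accepts an NV12 texture at all is exactly the part
that tends to fail in practice):

#include <d3d11.h>
#include <cuda.h>
#include <cudaD3D11.h>

/* Sketch: map one plane of a decoded D3D11 texture into CUDA. */
static CUresult map_d3d11_texture(ID3D11Texture2D *tex, CUarray *out)
{
    CUgraphicsResource res;
    CUresult err;

    err = cuGraphicsD3D11RegisterResource(&res, (ID3D11Resource *)tex,
                                          CU_GRAPHICS_REGISTER_FLAGS_NONE);
    if (err != CUDA_SUCCESS)
        return err;                 /* e.g. unsupported texture format */
    err = cuGraphicsMapResources(1, &res, 0);
    if (err != CUDA_SUCCESS)
        return err;
    /* arrayIndex 0 selects the subresource, mip level 0 */
    return cuGraphicsSubResourceGetMappedArray(out, res, 0, 0);
}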

Regards, Andrey

2016-01-08 11:29 GMT+03:00 Roger Pack :

> On 11/5/15, wm4  wrote:
> > On Thu, 5 Nov 2015 16:23:04 +0800
> > Agatha Hu  wrote:
> >
> >> 2) We use the AVFrame::opaque field to store a customized ffnvinfo
> >> structure to prevent expensive CPU<->GPU transfers. Without it, the
> >> workflow would be: CPU AVFrame input-->copy to GPU-->do CUDA
> >> resizing-->copy to CPU AVFrame-->copy to GPU-->NVENC encoding. And now
> >> it becomes:
> >> CPU AVFrame input-->copy to GPU-->do CUDA resizing-->NVENC encoding.
> >> Our strategy is to check whether AVFrame::opaque is not null AND its
> >> first 128 bytes match some particular GUID. If so, AVFrame::opaque is
> >> a valid ffnvinfo structure and we read the GPU address directly from it
> >> instead of copying data from the AVFrame.
> >
> > Please no, not another hack that makes the hw decoding API situation
> > worse. Do this properly and coordinate with Gwenole Beauchesne, who
> > plans improvements in this direction.
>
> Which part are you referring to? (Though I'll admit putting some stuff
> in libavutil seems a bit suspect.)
> It would be nice to have the nvresize filter available anyway, and it
> looks like it mostly just deals with private context variables.
> Cheers!
> -roger-


Re: [FFmpeg-devel] have some major changes for nvenc support

2016-01-08 Thread Roger Pack
On 11/5/15, wm4  wrote:
> On Thu, 5 Nov 2015 16:23:04 +0800
> Agatha Hu  wrote:
>
>> 2) We use the AVFrame::opaque field to store a customized ffnvinfo
>> structure to prevent expensive CPU<->GPU transfers. Without it, the
>> workflow would be: CPU AVFrame input-->copy to GPU-->do CUDA
>> resizing-->copy to CPU AVFrame-->copy to GPU-->NVENC encoding. And now it
>> becomes:
>> CPU AVFrame input-->copy to GPU-->do CUDA resizing-->NVENC encoding.
>> Our strategy is to check whether AVFrame::opaque is not null AND its
>> first 128 bytes match some particular GUID. If so, AVFrame::opaque is
>> a valid ffnvinfo structure and we read the GPU address directly from it
>> instead of copying data from the AVFrame.
>
> Please no, not another hack that makes the hw decoding API situation
> worse. Do this properly and coordinate with Gwenole Beauchesne, who
> plans improvements in this direction.

Which part are you referring to? (Though I'll admit putting some stuff
in libavutil seems a bit suspect.)
It would be nice to have the nvresize filter available anyway, and it
looks like it mostly just deals with private context variables.
Cheers!
-roger-


Re: [FFmpeg-devel] have some major changes for nvenc support

2015-11-05 Thread Agatha Hu

On 2015/11/5 18:31, wm4 wrote:

On Thu, 5 Nov 2015 16:23:04 +0800
Agatha Hu  wrote:


2) We use the AVFrame::opaque field to store a customized ffnvinfo
structure to prevent expensive CPU<->GPU transfers. Without it, the
workflow would be: CPU AVFrame input-->copy to GPU-->do CUDA
resizing-->copy to CPU AVFrame-->copy to GPU-->NVENC encoding. And now it
becomes:
CPU AVFrame input-->copy to GPU-->do CUDA resizing-->NVENC encoding.
Our strategy is to check whether AVFrame::opaque is not null AND its
first 128 bytes match some particular GUID. If so, AVFrame::opaque is
a valid ffnvinfo structure and we read the GPU address directly from it
instead of copying data from the AVFrame.


Please no, not another hack that makes the hw decoding API situation
worse. Do this properly and coordinate with Gwenole Beauchesne, who
plans improvements in this direction.


Trying to catch up with Gwenole Beauchesne's work - is the related
AVHWAccelFrame available for use now?


Agatha Hu





Re: [FFmpeg-devel] have some major changes for nvenc support

2015-11-05 Thread wm4
On Thu, 5 Nov 2015 16:23:04 +0800
Agatha Hu  wrote:

> 2) We use the AVFrame::opaque field to store a customized ffnvinfo
> structure to prevent expensive CPU<->GPU transfers. Without it, the
> workflow would be: CPU AVFrame input-->copy to GPU-->do CUDA
> resizing-->copy to CPU AVFrame-->copy to GPU-->NVENC encoding. And now it
> becomes:
> CPU AVFrame input-->copy to GPU-->do CUDA resizing-->NVENC encoding.
> Our strategy is to check whether AVFrame::opaque is not null AND its
> first 128 bytes match some particular GUID. If so, AVFrame::opaque is
> a valid ffnvinfo structure and we read the GPU address directly from it
> instead of copying data from the AVFrame.

Please no, not another hack that makes the hw decoding API situation
worse. Do this properly and coordinate with Gwenole Beauchesne, who
plans improvements in this direction.


[FFmpeg-devel] have some major changes for nvenc support

2015-11-05 Thread Agatha Hu

Hi,

Recently Nvidia did some work on improving nvenc performance; it includes
lots of changes, so I am attaching the patch instead of sending it directly.


Here are the explanations:
1) The first main change is adding an nvresize filter (1:N - one input,
multiple outputs) to do hardware resizing, because during our internal
1:N encoding tests we found that swscale becomes the bottleneck. So we use
a CUDA kernel instead.


2) We use the AVFrame::opaque field to store a customized ffnvinfo
structure to prevent expensive CPU<->GPU transfers. Without it, the
workflow would be: CPU AVFrame input-->copy to GPU-->do CUDA
resizing-->copy to CPU AVFrame-->copy to GPU-->NVENC encoding. And now it
becomes:

CPU AVFrame input-->copy to GPU-->do CUDA resizing-->NVENC encoding.
Our strategy is to check whether AVFrame::opaque is not null AND its
first 128 bytes match some particular GUID. If so, AVFrame::opaque is
a valid ffnvinfo structure and we read the GPU address directly from it
instead of copying data from the AVFrame.
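
A minimal sketch of that check - FFNvInfo, its layout and the GUID constant
are illustrative stand-ins, not the actual names from the patch:

#include <string.h>
#include <stdint.h>
#include <cuda.h>
#include "libavutil/frame.h"

typedef struct FFNvInfo {
    uint8_t     guid[128];   /* magic bytes identifying the struct */
    CUdeviceptr gpu_frame;   /* device address of the resized frame */
} FFNvInfo;

static const uint8_t FFNV_MAGIC_GUID[128] = { /* ... */ };

/* Returns the GPU-side info if the frame carries one, NULL otherwise. */
static FFNvInfo *get_nvinfo(const AVFrame *frame)
{
    if (frame->opaque &&
        !memcmp(frame->opaque, FFNV_MAGIC_GUID, sizeof(FFNV_MAGIC_GUID)))
        return frame->opaque;   /* read the GPU address directly */
    return NULL;                /* fall back to copying from the AVFrame */
}
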
The nvresize filter has a readback parameter: if it's set to 0, the resized
result won't be copied back to the CPU, mostly for the case where it's
connected to an NVENC encoder. If it's set to 1, the resized result will
still be copied back to the AVFrame so that it stays compatible with other
components.


3) Because we are using CUDA addresses now, the input buffer becomes CUDA
external memory. We replaced NvEncCreateInputBuffer with
cuMemAllocPitch+NvEncRegisterInputBuffer, and
NvEncLock/UnlockInputBuffer with NvEncMap/UnmapInputBuffer.
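
The allocation half of that change, roughly - cuMemAllocPitch is the real
CUDA driver call, while the registration/mapping steps are paraphrased from
the description above:

#include <cuda.h>

/* Replace NvEncCreateInputBuffer's internal allocation with a pitched
 * CUDA buffer sized for NV12 (luma plane plus half-height interleaved
 * chroma). The buffer is then registered with the encoder and
 * mapped/unmapped per frame instead of Lock/UnlockInputBuffer. */
static CUresult alloc_nv12_frame(int width, int height,
                                 CUdeviceptr *frame, size_t *pitch)
{
    return cuMemAllocPitch(frame, pitch,
                           width,           /* bytes per row (8-bit luma) */
                           height * 3 / 2,  /* total NV12 rows */
                           16);             /* element size, for alignment */
}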


4) And because of using CUDA input, some driver bugs were exposed, e.g.
nvenc generates corrupted chroma plane data if the buffer format is YUV420P.
A bug-fixed driver will be released soon, but considering backwards
compatibility we decided to convert YUV420P to NV12 explicitly with a CUDA
kernel in nvenc.c. Even in the bug-fixed driver there's still a
YUV420P->NV12 conversion kernel; the only difference is that that kernel is
provided along with the driver, whereas here we do it within nvenc.c.
For the same reason, YUV444P support is removed temporarily; there's a bug
for CUDA input. Once the fix is released, we should enable the support
again. We chose to keep backwards support for YUV420P because it's much
more popular than YUV444P.


5) Lastly, we moved most of the CUDA typedefs/functions/helpers into
cudautils.h/c.

A typical use case is:

ffmpeg -y -i $1 $2 $3 -filter_complex \
    nvresize=5:s=hd1080\|hd720\|hd480\|wvga\|cif:readback=0[out0][out1][out2][out3][out4] \
    -map [out0] -an -vcodec nvenc_h264 -preset slow -profile:v main \
        -async 1 -b:v 200M -bufsize 200M -maxrate 200M -refs 1 -bf 2 $1_1080p.mp4 \
    -map [out1] -an -vcodec nvenc_h264 -preset slow -profile:v main \
        -async 1 -b:v 100M -bufsize 100M -maxrate 100M -refs 1 -bf 2 $1_720p.mp4 \
    -map [out2] -an -vcodec nvenc_h264 -preset slow -profile:v main \
        -async 1 -b:v  50M -bufsize  50M -maxrate  50M -refs 1 -bf 2 $1_480p.mp4 \
    -map [out3] -an -vcodec nvenc_h264 -preset slow -profile:v main \
        -async 1 -b:v  25M -bufsize  25M -maxrate  25M -refs 1 -bf 2 $1_wvga.mp4 \
    -map [out4] -an -vcodec nvenc_h264 -preset slow -profile:v main \
        -async 1 -b:v  10M -bufsize  10M -maxrate  10M -refs 1 -bf 2 $1_cif.mp4



Thanks
Agatha Hu
From 4bb843a47cbcef9c0383efb7e573f0f8eadb65d6 Mon Sep 17 00:00:00 2001
From: Ganapathy Kasi 
Date: Wed, 4 Nov 2015 22:22:35 -0800
Subject: [PATCH] combined: cuda resize,yuv420 fix,remove yuv444,add AQ

---
 libavcodec/Makefile   |   2 +-
 libavcodec/nvenc.c| 435 ++-
 libavcodec/nvenc_ptx.c| 240 +++
 libavfilter/Makefile  |   2 +
 libavfilter/allfilters.c  |   1 +
 libavfilter/vf_nvresize.c | 669 ++
 libavfilter/vf_nvresize_ptx.c | 659 +
 libavutil/Makefile|   2 +
 libavutil/cudautils.c | 288 ++
 libavutil/cudautils.h | 216 ++
 10 files changed, 2241 insertions(+), 273 deletions(-)
 create mode 100644 libavcodec/nvenc_ptx.c
 create mode 100644 libavfilter/vf_nvresize.c
 create mode 100644 libavfilter/vf_nvresize_ptx.c
 create mode 100644 libavutil/cudautils.c
 create mode 100644 libavutil/cudautils.h

diff --git a/libavcodec/Makefile b/libavcodec/Makefile
index 67fb72a..45ac476 100644
--- a/libavcodec/Makefile
+++ b/libavcodec/Makefile
@@ -98,7 +98,7 @@ OBJS-$(CONFIG_MPEGVIDEOENC)+= mpegvideo_enc.o mpeg12data.o  \
   motion_est.o ratecontrol.o\
   mpegvideoencdsp.o
 OBJS-$(CONFIG_MSS34DSP)+= mss34dsp.o
-OBJS-$(CONFIG_NVENC)   += nvenc.o
+OBJS-$(CONFIG_NVENC)   += nvenc.o nvenc_ptx.o
 OBJS-$(CONFIG_PIXBLOCKDSP) += pixblockdsp.o
 OBJS-$(CONFIG_QPELDSP) += qpeldsp.o
 OBJS-$(CONFIG_QSV) += qsv.o
diff --git a/libavcodec/nvenc.c b/