Re: [FFmpeg-devel] [PATCH] avformat: Implement subtitle charenc guessing

2014-12-16 Thread Rodger Combs

 On Dec 14, 2014, at 10:06, Nicolas George geo...@nsup.org wrote:
 
 On Tridi 23 Frimaire, year CCXXIII, Rodger Combs wrote:
 I couldn't see a sensible way to do this in lavc, since the detector
 libraries generally require more than one packet to work effectively.
 Looking at that doxy again, I can see how the detection could be done in
 lavf and the conversion in lavc, but I don't really see an advantage there
 other than fewer API changes.
 
 There is no benefit in doing the conversion in lavc for text files, but text
 files processed by lavf are not the only source of subtitles. The conversion
 in lavc must stay there for those cases, and the conversion in lavf must
 work gracefully with it.
 

Hmm, and unless I'm missing something, there's no way to do the detection in 
lavf and the conversion in lavc, so we'll have to have it in both regardless. 
Still, it'd probably make sense to have the actual conversion code be handled 
generically in lavu.

 So, by default it'd just handle encoding, and then additional
 normalization features could be enabled by the consumer? Sounds useful
 indeed.
 
 Something like that. You can have a look at the first draft for the API
 there:
 
 http://ffmpeg.org/pipermail/ffmpeg-devel/2013-August/146979.html
 
 Splitting lines and normalizing LF was enabled by a flag.
 
 The API itself will probably need to be changed to allow pluggable detection
 modules without using more global state.
 

Looks like a good point to work from.

 I like this model in general, but it brings up a few questions that I kind
 of dodged in my patch. For instance, how should lavu determine which
 module's output to prefer if there are conflicting charenc guesses? How
 can the consumer choose between the given guesses?
 
 In my patch, preference is very simplistic and the order is hardcoded. In
 a more modular system, it'd have to be a bit more complex; I can imagine
 some form of scoring system, or even another type of module that ranks
 possible guesses, but that could get very complex very fast. Any ideas for
 this?
 
 In this case, I believe that keeping things simple at the API level is the
 best approach: the detection state is held in a structure, and each detection
 module is called in turn with the same structure and updates it with its
 result.
 
 Then, it is only a matter of specifying what an acceptable update is: only
 change a value if the new value is more certain than the previous one.
 
 As for the exact fields that must be present in the structure, that depends
 on what useful information each relevant library can return.
 

The trouble here is that some detection libraries don't provide a certainty 
parameter, or don't expose it.

 In my patch, the consumer can override the choice of encoding by making
 changes to the AVFormatContext between header reading and retrieving the
 packet; it seems like the best way to do so in your system would be to
 pass a callback.
 
 Can you explain in what situation this kind of overriding would be
 necessary?
 

For instance, if a player (or even ffmpeg.c) tries to play/transcode a subtitle 
file and finds itself with multiple guesses for its encoding, it may want to 
present the user with a UI to have them select (what they think is) the correct 
one from the list, or enter the actual value if all guesses were wrong.
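
One way to picture the override described above is a callback that the consumer hands to the library. This is only a sketch under stated assumptions: `CharencChooser`, `resolve_charenc` and `choose_from_user_input` are hypothetical names, not part of the patch or of any FFmpeg API, and the toy chooser stands in for a real UI.

```c
#include <stddef.h>

/* Hypothetical callback: given the guesses, return the encoding to use,
 * or NULL to accept the library's first choice. The returned string may be
 * one of the guesses or a value the user typed in. */
typedef const char *(*CharencChooser)(void *opaque,
                                      const char *const *guesses,
                                      int nb_guesses);

/* Resolve the encoding: the consumer's answer wins, else the top guess. */
static const char *resolve_charenc(const char *const *guesses, int nb_guesses,
                                   CharencChooser choose, void *opaque)
{
    const char *picked = choose ? choose(opaque, guesses, nb_guesses) : NULL;

    if (picked && picked[0])
        return picked;
    return nb_guesses > 0 ? guesses[0] : NULL;
}

/* Toy chooser standing in for a UI: opaque holds what the user entered. */
static const char *choose_from_user_input(void *opaque,
                                          const char *const *guesses,
                                          int nb_guesses)
{
    (void)guesses;
    (void)nb_guesses;
    return opaque;
}
```

A player would replace the toy chooser with one that shows the list and returns the selection; returning NULL or an empty string keeps the automatic choice.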

 On a bit of a side-note: my system is designed to make every possible
 effort to return a recoded packet, with multiple layers of fallback
 behavior in case the first guess turns out to be incorrect or the source
 file is outright invalid. I wouldn't expect that to be significantly more
 difficult with your design, but I wonder what your opinions on the setup
 are?
 
 For this, I believe it comes down to the individual user. Some users want
 everything to work automagically; others want to be notified when even the
 smallest detail goes wrong. In the end, it should probably come down to an
 option:
 
 ffmpeg -text_encoding certainty_threshold=80:allow_substitute=invalid
 
 for example, to accept a guess only when it has at least 80% certainty and
 to allow invalid input sequences to be replaced by a mask character.
 

Sounds generally sensible, except that the certainty parameter isn't returned 
by all detection libraries.

 So, the text-file-read API would buffer the entire input file and perform
 charenc detection/conversion and/or other normalization, then FFTextReader
 would read from the normalized buffer?
 
 Something like that. Since FFTextReader is internal, there is room to choose
 the exact implementation.
 
 Regards,
 
 -- 
  Nicolas George
 ___
 ffmpeg-devel mailing list
 ffmpeg-devel@ffmpeg.org
 http://ffmpeg.org/mailman/listinfo/ffmpeg-devel



Re: [FFmpeg-devel] [PATCH] avformat: Implement subtitle charenc guessing

2014-12-14 Thread Nicolas George
On Tridi 23 Frimaire, year CCXXIII, Rodger Combs wrote:
 I couldn't see a sensible way to do this in lavc, since the detector
 libraries generally require more than one packet to work effectively.
 Looking at that doxy again, I can see how the detection could be done in
 lavf and the conversion in lavc, but I don't really see an advantage there
 other than fewer API changes.

There is no benefit in doing the conversion in lavc for text files, but text
files processed by lavf are not the only source of subtitles. The conversion
in lavc must stay there for those cases, and the conversion in lavf must
work gracefully with it.

 So, by default it'd just handle encoding, and then additional
 normalization features could be enabled by the consumer? Sounds useful
 indeed.

Something like that. You can have a look at the first draft for the API
there:

http://ffmpeg.org/pipermail/ffmpeg-devel/2013-August/146979.html

Splitting lines and normalizing LF was enabled by a flag.

The API itself will probably need to be changed to allow pluggable detection
modules without using more global state.

 I like this model in general, but it brings up a few questions that I kind
 of dodged in my patch. For instance, how should lavu determine which
 module's output to prefer if there are conflicting charenc guesses? How
 can the consumer choose between the given guesses?

 In my patch, preference is very simplistic and the order is hardcoded. In
 a more modular system, it'd have to be a bit more complex; I can imagine
 some form of scoring system, or even another type of module that ranks
 possible guesses, but that could get very complex very fast. Any ideas for
 this?

In this case, I believe that keeping things simple at the API level is the
best approach: the detection state is held in a structure, and each detection
module is called in turn with the same structure and updates it with its
result.

Then, it is only a matter of specifying what an acceptable update is: only
change a value if the new value is more certain than the previous one.

As for the exact fields that must be present in the structure, that depends
on what useful information each relevant library can return.
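
The update rule described above can be sketched roughly as follows. Everything here is an assumption made for illustration: `TextDetectState`, `TextDetectModule`, `detect_update` and `detect_run` are hypothetical names, the 0..100 certainty scale is borrowed from the option example later in this mail, and the two toy modules only exist to exercise the rule.

```c
#include <stddef.h>

/* Hypothetical detection state shared by all modules; not an FFmpeg type. */
typedef struct TextDetectState {
    const char *charenc;   /* best guess so far, NULL if none */
    int         certainty; /* 0..100, how sure the current guess is */
} TextDetectState;

/* Hypothetical module: a name plus a callback that may update the state. */
typedef struct TextDetectModule {
    const char *name;
    void (*detect)(TextDetectState *st, const unsigned char *buf, size_t len);
} TextDetectModule;

/* The "acceptable update" rule: replace the guess only if strictly more sure. */
static void detect_update(TextDetectState *st, const char *charenc, int certainty)
{
    if (charenc && certainty > st->certainty) {
        st->charenc   = charenc;
        st->certainty = certainty;
    }
}

/* Call every module in turn with the same state. */
static void detect_run(TextDetectState *st, const TextDetectModule *mods,
                       size_t nb_mods, const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < nb_mods; i++)
        mods[i].detect(st, buf, len);
}

/* Toy module: pure 7-bit input is reported as US-ASCII with high certainty. */
static void toy_ascii_detect(TextDetectState *st, const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (buf[i] >= 0x80)
            return;
    detect_update(st, "US-ASCII", 90);
}

/* Toy module: always offers a weak default, e.g. for a library that
 * reports no certainty of its own. */
static void toy_default_detect(TextDetectState *st, const unsigned char *buf, size_t len)
{
    (void)buf;
    (void)len;
    detect_update(st, "ISO-8859-1", 10);
}
```

With this rule the order in which modules run only matters for ties, since a later module can never displace a guess that is at least as certain as its own.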

 In my patch, the consumer can override the choice of encoding by making
 changes to the AVFormatContext between header reading and retrieving the
 packet; it seems like the best way to do so in your system would be to
 pass a callback.

Can you explain in what situation this kind of overriding would be
necessary?

 On a bit of a side-note: my system is designed to make every possible
 effort to return a recoded packet, with multiple layers of fallback
 behavior in case the first guess turns out to be incorrect or the source
 file is outright invalid. I wouldn't expect that to be significantly more
 difficult with your design, but I wonder what your opinions on the setup
 are?

For this, I believe it comes down to the individual user. Some users want
everything to work automagically; others want to be notified when even the
smallest detail goes wrong. In the end, it should probably come down to an
option:

ffmpeg -text_encoding certainty_threshold=80:allow_substitute=invalid

for example, to accept a guess only when it has at least 80% certainty and
to allow invalid input sequences to be replaced by a mask character.
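
For illustration, an option string of that shape could be parsed along these lines. This is a sketch, not existing code: -text_encoding is only the example proposed in this mail, and `TextEncodingOpts` and `parse_text_encoding_opts` are hypothetical names. Unknown keys are rejected here, which is one possible policy among several.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical parsed form of the -text_encoding option; not an FFmpeg type. */
typedef struct TextEncodingOpts {
    int  certainty_threshold;   /* minimum certainty (percent) to accept a guess */
    char allow_substitute[32];  /* which inputs may be replaced, e.g. "invalid" */
} TextEncodingOpts;

/* Parse "key=value:key=value". Returns 0 on success, -1 on malformed input
 * or an unknown key. */
static int parse_text_encoding_opts(const char *str, TextEncodingOpts *o)
{
    char buf[256];
    char *pair;

    if (strlen(str) >= sizeof(buf))
        return -1;
    strcpy(buf, str);

    o->certainty_threshold = 0;
    o->allow_substitute[0] = '\0';

    pair = buf[0] ? buf : NULL;
    while (pair) {
        char *next = strchr(pair, ':');
        char *val;

        if (next)
            *next++ = '\0';          /* terminate this pair, remember the rest */
        val = strchr(pair, '=');
        if (!val)
            return -1;
        *val++ = '\0';               /* split key from value */

        if (!strcmp(pair, "certainty_threshold")) {
            char *end;
            long v = strtol(val, &end, 10);
            if (end == val || *end || v < 0 || v > 100)
                return -1;
            o->certainty_threshold = (int)v;
        } else if (!strcmp(pair, "allow_substitute")) {
            if (strlen(val) >= sizeof(o->allow_substitute))
                return -1;
            strcpy(o->allow_substitute, val);
        } else {
            return -1;
        }
        pair = next;
    }
    return 0;
}
```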

 So, the text-file-read API would buffer the entire input file and perform
 charenc detection/conversion and/or other normalization, then FFTextReader
 would read from the normalized buffer?

Something like that. Since FFTextReader is internal, there is room to choose
the exact implementation.

Regards,

-- 
  Nicolas George




Re: [FFmpeg-devel] [PATCH] avformat: Implement subtitle charenc guessing

2014-12-14 Thread Nicolas George
On Tridi 23 Frimaire, year CCXXIII, Carl Eugen Hoyos wrote:
 Imo, if your distribution contains the libraries 
 but does not install them by default, it is a 
 good indication that we should not auto-detect 
 them.

This is a matter of taste. FFmpeg has that policy, but MPlayer, on the other
hand, has the policy "if it is present on the system, enable it". I believe
most programs adopt the same policy as MPlayer for features that cannot
normally cause problems when unused.

By the way, I would very much like (but not enough to start working on it
right now, sorry) to have an --enable-auto option that enables
auto-detection for all external libraries. Amongst other things, that would
allow MPlayer's configure to call FFmpeg's instead of reimplementing it.

Regards,

-- 
  Nicolas George




Re: [FFmpeg-devel] [PATCH] avformat: Implement subtitle charenc guessing

2014-12-13 Thread Nicolas George
On Duodi 22 Frimaire, year CCXXIII, Rodger Combs wrote:
 This also moves general charenc conversion from avcodec to avformat;
 the version in avcodec is left, but renamed; I'm not sure if that's
 the optimal solution.
 
 The documentation could probably use some improvements, and a few more
 options could be added to ENCA.
 
 This very simply prefers libguess over ENCA, and ENCA over uchardet, but
 will fall back on a less-preferred guess if something decodes wrong, and will
 drop illegal sequences in iconv if all else fails.
 
 It'd be possible to have ffmpeg.c present a UI if multiple guesses are
 returned, and other library consumers could do the same.
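
The fallback order quoted above can be sketched with plain iconv. This is not the code from the patch: `try_recode` and `recode_with_fallback` are hypothetical helpers, the guesses are assumed to arrive already sorted by preference, and the target encoding is assumed to be UTF-8.

```c
#include <iconv.h>
#include <stddef.h>

/* Try to convert in[0..in_len) from `from` to UTF-8. Returns the number of
 * bytes written to out, or -1 if the input does not decode cleanly (or the
 * output buffer is too small). */
static long try_recode(const char *from, const char *in, size_t in_len,
                       char *out, size_t out_size)
{
    iconv_t cd = iconv_open("UTF-8", from);
    char *ip = (char *)in, *op = out;
    size_t il = in_len, ol = out_size;
    size_t ret;

    if (cd == (iconv_t)-1)
        return -1;
    ret = iconv(cd, &ip, &il, &op, &ol);
    iconv_close(cd);
    if (ret == (size_t)-1 || il != 0)
        return -1;
    return (long)(out_size - ol);
}

/* Walk the guesses in preference order; return the index of the first one
 * that decodes, or -1 if none does. */
static int recode_with_fallback(const char *const *guesses, int nb_guesses,
                                const char *in, size_t in_len,
                                char *out, size_t out_size, long *out_len)
{
    for (int i = 0; i < nb_guesses; i++) {
        long n = try_recode(guesses[i], in, in_len, out, out_size);
        if (n >= 0) {
            *out_len = n;
            return i;
        }
    }
    return -1;
}
```

The last resort mentioned in the patch description, dropping illegal sequences, is left out of the sketch. With glibc's iconv it is commonly requested through the //IGNORE suffix on the target encoding, which is not portable to every iconv implementation.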

So, now that I have a decent connection and time, here are some comments:

First, your patch seems to happen after the text demuxers have parsed the
text files. Therefore, this can not work for non-ASCII-compatible encodings,
such as UTF-16. You might say that UTF-16 already works, but its
implementation is bogus and leads to user-visible problems (see trac ticket
#4059). But even if it was not, we would not want two competing detection
layers.
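
To make the ordering problem concrete: in UTF-16 a line feed is the two-byte sequence 0A 00 or 00 0A, so a byte-oriented line parser cannot split such a file correctly, and any check has to look at the raw bytes first. A minimal sketch of the simplest such check, a byte-order-mark test, follows; `bom_charenc` is a hypothetical helper, not an FFmpeg function.

```c
#include <stddef.h>

/* Return the encoding implied by a leading byte-order mark, or NULL if none.
 * Must run on the raw bytes, before any line-based parsing.
 * Limitation: FF FE also starts a UTF-32LE BOM (FF FE 00 00); this sketch
 * does not try to tell the two apart. */
static const char *bom_charenc(const unsigned char *buf, size_t len)
{
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return "UTF-8";
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return "UTF-16LE";
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return "UTF-16BE";
    return NULL;
}
```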

More importantly: the lavc API is ready to handle situations where the
recoding has been done by the demuxer. See the doxy for sub_charenc_mode and
the associated constants. So if you are discarding it or adding competing
fields, you are certainly missing something on the proper use of the API. Of
course, if you think the API is not actually ready, feel free to explain and
discuss your point.

Third point: detection is not something that works well, and people will
frequently find versions of FFmpeg built without their favourite library.
For both these reasons, applications using the library should be able to
provide their own detection mechanism to complement or even replace the ones
hardcoded in FFmpeg. Same goes for conversion, even if it is not as
important.

Fourth and last point: detecting text encoding is not useful only for text
subtitles formats, other features may need it: filter graph files (think of
drawtext options), ffmetadata files, etc.

Here is the API I am considering. I had started to implement it until
bickering and lack of enthusiasm discouraged me.

The work happens in lavu, and is therefore available everywhere, replacing
av_file_map() whenever it is used for text files. It is an API for reading
text files / buffers / streams, taking care of all the gory details. Text
encoding, of course, but also the LF / CRLF mess, possibly splitting lines
at the same time, maybe normalizing spaces, etc.
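
As an illustration of the newline part only, a normalization pass could look like the sketch below. `normalize_newlines` is a hypothetical name, and treating a lone CR as a line break is an added assumption, since the text above only mentions LF and CRLF.

```c
#include <stddef.h>

/* Rewrite CRLF and lone CR as LF in place; returns the new length.
 * Operates on text already converted to an ASCII-compatible encoding. */
static size_t normalize_newlines(char *buf, size_t len)
{
    size_t r = 0, w = 0;

    while (r < len) {
        if (buf[r] == '\r') {
            buf[w++] = '\n';
            r++;
            if (r < len && buf[r] == '\n')
                r++;            /* swallow the LF of a CRLF pair */
        } else {
            buf[w++] = buf[r++];
        }
    }
    return w;
}
```

Because the output is never longer than the input, the pass can run in place on the buffer the reader already holds.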

The text-file-read API is controlled with a context parameter, holding
amongst other things a list of detection modules, and also recoding
modules. Detection modules are just a structure with a callback. FFmpeg
provides built-in modules, such as your proposed libguess, libenca and
libuchardet code, but applications can also create their own modules.

Then it is just a matter of changing the subtitle-specific FFTextReader API
to use the new lavu text-file-read API.

I hope this helps.

Regards,

-- 
  Nicolas George




Re: [FFmpeg-devel] [PATCH] avformat: Implement subtitle charenc guessing

2014-12-13 Thread Carl Eugen Hoyos
Rodger Combs rodger.combs at gmail.com writes:

 +  --disable-libguess   disable libguess [autodetect]
 +  --disable-uchardet   disable universalchardet [autodetect]

I cannot comment on your actual patch, but both 
libraries should not be auto-detected imo.

Imo, if your distribution contains the libraries 
but does not install them by default, it is a 
good indication that we should not auto-detect 
them.

Carl Eugen



Re: [FFmpeg-devel] [PATCH] avformat: Implement subtitle charenc guessing

2014-12-13 Thread Rodger Combs

 On Dec 13, 2014, at 05:34, Nicolas George geo...@nsup.org wrote:
 
 So, now that I have a decent connection and time, here are some comments:
 
 First, your patch seems to happen after the text demuxers have parsed the
 text files. Therefore, this can not work for non-ASCII-compatible encodings,
 such as UTF-16. You might say that UTF-16 already works, but its
 implementation is bogus and leads to user-visible problems (see trac ticket
 #4059). But even if it was not, we would not want two competing detection
 layers.
 

Agreed, a single layer would be preferable.

 More importantly: the lavc API is ready to handle situations where the
 recoding has been done by the demuxer. See the doxy for sub_charenc_mode and
 the associated constants. So if you are discarding it or adding competing
 fields, you are certainly missing something on the proper use of the API. Of
 course, if you think the API is not actually ready, feel free to explain and
 discuss your point.
 

I couldn't see a sensible way to do this in lavc, since the detector libraries 
generally require more than one packet to work effectively. Looking at that 
doxy again, I can see how the detection could be done in lavf and the 
conversion in lavc, but I don't really see an advantage there other than fewer 
API changes.

 Third point: detection is not something that works well, and people will
 frequently find versions of FFmpeg built without their favourite library.
 For both these reasons, applications using the library should be able to
 provide their own detection mechanism to complement or even replace the ones
 hardcoded in FFmpeg. Same goes for conversion, even if it is not as
 important.
 

Yeah, a modular approach would be excellent.

 Fourth and last point: detecting text encoding is not useful only for text
 subtitles formats, other features may need it: filter graph files (think of
 drawtext options), ffmetadata files, etc.
 
 Here is the API I am considering. I had started to implement it until
 bickering and lack of enthusiasm discouraged me.
 
 The work happens in lavu, and is therefore available everywhere, replacing
 av_file_map() whenever it is used for text files. It is an API for reading
 text files / buffers / streams, taking care of all the gory details. Text
 encoding, of course, but also the LF / CRLF mess, possibly splitting lines
 at the same time, maybe normalizing spaces, etc.
 

So, by default it'd just handle encoding, and then additional normalization 
features could be enabled by the consumer? Sounds useful indeed.

 The text-file-read API is controlled with a context parameter, holding
 amongst other things a list of detection modules, and also recoding
 modules. Detection modules are just a structure with a callback. FFmpeg
 provides built-in modules, such as your proposed libguess, libenca and
 libuchardet code, but applications can also create their own modules.
 

I like this model in general, but it brings up a few questions that I kind of 
dodged in my patch. For instance, how should lavu determine which module's 
output to prefer if there are conflicting charenc guesses? How can the consumer 
choose between the given guesses?

In my patch, preference is very simplistic and the order is hardcoded. In a 
more modular system, it'd have to be a bit more complex; I can imagine some 
form of scoring system, or even another type of module that ranks possible 
guesses, but that could get very complex very fast. Any ideas for this?

In my patch, the consumer can override the choice of encoding by making changes 
to the AVFormatContext between header reading and retrieving the packet; it 
seems like the best way to do so in your system would be to pass a callback.

On a bit of a side-note: my system is designed to make every possible effort to 
return a recoded packet, with multiple layers of fallback behavior in case the 
first guess turns out to be incorrect or the source file is outright invalid. I 
wouldn't expect that to be significantly more difficult with your design, but I 
wonder what your opinions on the setup are?

 Then it is just a matter of changing the subtitle-specific FFTextReader API
 to use the new lavu text-file-read API.
 

So, the text-file-read API would buffer the entire input file and perform 
charenc detection/conversion and/or other normalization, then FFTextReader 
would read from the normalized buffer?

 I hope this helps.
 
 Regards,
 
 -- 
  Nicolas George


Re: [FFmpeg-devel] [PATCH] avformat: Implement subtitle charenc guessing

2014-12-13 Thread wm4
On Sat, 13 Dec 2014 12:34:28 +0100
Nicolas George geo...@nsup.org wrote:

 First, your patch seems to happen after the text demuxers have parsed the
 text files. Therefore, this can not work for non-ASCII-compatible encodings,
 such as UTF-16. You might say that UTF-16 already works, but its
 implementation is bogus and leads to user-visible problems (see trac ticket
 #4059). But even if it was not, we would not want two competing detection
 layers.

For someone who complains about bickering and lack of enthusiasm you
sure bicker and discourage a lot.

What about it is bogus? #4059 is a problem with ffmpeg.c and the stuff
in lavc, not the general approach. In fact, the UTF-16 change made
UTF-16 just work with any API user.

Recoding in the demuxer is unacceptable, because it makes it impossible
to change the codepage later or get any kind of user interaction. Doing
it on file opening unnecessarily complicates these things.
Detection can't be done in lavc (and the way lavc does recoding, with
fatal errors on failure, is also plain unacceptable).

I agree with the comment though that ffmpeg isn't going to be linked to
all these fancy detection mechanisms. This is mostly because you have
to enable external libraries explicitly when compiling, so usually they
won't be picked up.


Re: [FFmpeg-devel] [PATCH] avformat: Implement subtitle charenc guessing

2014-12-13 Thread Nicolas George
On Tridi 23 Frimaire, year CCXXIII, wm4 wrote:
 What about it is bogus?

Read the documentation.

 In fact, the UTF-16 change made
 UTF-16 just work with any API user.

And inconsistent values in the context.

 Recoding in the demuxer is unacceptable, because it makes it impossible
 to change the codepage later or get any kind of user interaction.

Broken application design.

Regards,

-- 
  Nicolas George




Re: [FFmpeg-devel] [PATCH] avformat: Implement subtitle charenc guessing

2014-12-13 Thread wm4
On Sat, 13 Dec 2014 16:17:48 +0100
Nicolas George geo...@nsup.org wrote:

 On Tridi 23 Frimaire, year CCXXIII, wm4 wrote:
  What about it is bogus?
 
 Read the documentation.

Broken library design.

  In fact, the UTF-16 change made
  UTF-16 just work with any API user.
 
 And inconsistent values in the context.

What?

  Recoding in the demuxer is unacceptable, because it makes it impossible
  to change the codepage later or get any kind of user interaction.
 
 Broken application design.

Broken reply.


Re: [FFmpeg-devel] [PATCH] avformat: Implement subtitle charenc guessing

2014-12-12 Thread Nicolas George
On Duodi 22 Frimaire, year CCXXIII, Rodger Combs wrote:
 This also moves general charenc conversion from avcodec to avformat;
 the version in avcodec is left, but renamed; I'm not sure if that's
 the optimal solution.
 
 The documentation could probably use some improvements, and a few more
 options could be added to ENCA.
 
 This very simply prefers libguess over ENCA, and ENCA over uchardet, but
 will fall back on a less-preferred guess if something decodes wrong, and will
 drop illegal sequences in iconv if all else fails.
 
 It'd be possible to have ffmpeg.c present a UI if multiple guesses are
 returned, and other library consumers could do the same.

I will not have time to comment soon enough, but I have reservations about
this patch:

- core feature requiring non-trivial external dependencies;

- renaming an existing option;

- tying proper working of lavc decoders to the use of lavf.

Regards,

-- 
  Nicolas George


Re: [FFmpeg-devel] [PATCH] avformat: Implement subtitle charenc guessing

2014-12-12 Thread Lukasz Marek
On 12 December 2014 at 07:05, Rodger Combs rodger.co...@gmail.com wrote:

 diff --git a/configure b/configure
 index e2e3619..a5a9f9b 100755
 --- a/configure
 +++ b/configure
 @@ -199,6 +199,9 @@ External library support:
    --enable-gnutls          enable gnutls, needed for https support
                             if openssl is not used [no]
    --disable-iconv          disable iconv [autodetect]
 +  --disable-libguess       disable libguess [autodetect]
 +  --disable-uchardet       disable universalchardet [autodetect]
 +  --enable-enca            disable enca [no]


enable


    --enable-ladspa          enable LADSPA audio filtering [no]
    --enable-libaacplus      enable AAC+ encoding via libaacplus [no]
    --enable-libass          enable libass subtitles rendering,
 @@ -1342,6 +1345,9 @@ EXTERNAL_LIBRARY_LIST=
  frei0r
  gnutls
  iconv
 +libguess
 +uchardet
 +enca
  ladspa
  libaacplus
  libass
 @@ -4358,6 +4364,7 @@ die_license_disabled gpl libxavs
  die_license_disabled gpl libxvid
  die_license_disabled gpl libzvbi
  die_license_disabled gpl x11grab
 +die_license_disabled gpl enca

  die_license_disabled nonfree libaacplus
  die_license_disabled nonfree libfaac
 @@ -5117,6 +5124,14 @@ enabled vdpau && enabled xlib &&
  # Funny iconv installations are not unusual, so check it after all flags have been set
  disabled iconv || check_func_headers iconv.h iconv || check_lib2 iconv.h iconv -liconv || disable iconv

 +disabled iconv || disabled libguess || disable libguess && {
 +    check_pkg_config libguess libguess.h libguess_determine_encoding && require_pkg_config libguess libguess.h libguess_determine_encoding && enable libguess;
 +}
 +disabled iconv || disabled uchardet || disable uchardet && {
 +    check_pkg_config uchardet uchardet.h uchardet_new && require_pkg_config uchardet uchardet.h uchardet_new && enable uchardet;
 +}
 +enabled enca && check_func_headers enca.h enca_analyse || check_lib2 enca.h enca_analyse -lenca || die "ERROR: enca not found"
 +
  enabled debug && add_cflags -g$debuglevel && add_asflags -g$debuglevel

  # add some useful compiler flags if supported
 diff --git a/libavcodec/options_table.h b/libavcodec/options_table.h
 index 1d5b078..93b3105 100644
 --- a/libavcodec/options_table.h
 +++ b/libavcodec/options_table.h
 @@ -472,7 +472,7 @@ static const AVOption avcodec_options[] = {
  {"ka", "Karaoke", 0, AV_OPT_TYPE_CONST, {.i64 = AV_AUDIO_SERVICE_TYPE_KARAOKE }, INT_MIN, INT_MAX, A|E, "audio_service_type"},
  {"request_sample_fmt", "sample format audio decoders should prefer", OFFSET(request_sample_fmt), AV_OPT_TYPE_SAMPLE_FMT, {.i64=AV_SAMPLE_FMT_NONE}, -1, INT_MAX, A|D, "request_sample_fmt"},
  {"pkt_timebase", NULL, OFFSET(pkt_timebase), AV_OPT_TYPE_RATIONAL, {.dbl = 0 }, 0, INT_MAX, 0},
 -{"sub_charenc", "set input text subtitles character encoding", OFFSET(sub_charenc), AV_OPT_TYPE_STRING, {.str = NULL}, CHAR_MIN, CHAR_MAX, S|D},
 +{"sub_charenc_lavc", "set input text subtitles character encoding", OFFSET(sub_charenc), AV_OPT_TYPE_STRING, {.str = NULL}, CHAR_MIN, CHAR_MAX, S|D},


Hmm, this is an API break. Is this really required?


 +/**
 + * Add a character encoding guess to an AVFormatContext's list
 + *
 + * @param avctx the context to add to
 + * @param enc   the encoding name to add
 + *
 + * A copy is added, so the original string should be free()d if necessary.
 + * If the same encoding name is already present, it isn't added again.
 + * If NULL or an empty string is passed, it's not added.
 + */
 +static void add_charenc(AVFormatContext *avctx, const char *enc)
 +{
 +    char *copy;
 +
 +    if (!enc || !enc[0])
 +        return;
 +
 +    for (unsigned i = 0; i < avctx->nb_sub_charenc_guesses; i++)
 +        if (!strcmp(avctx->sub_charenc_guesses[i], enc))
 +            return;
 +
 +    copy = av_strdup(enc);
 +    if (!copy)
 +        return;
 +
 +    dynarray_add(&avctx->sub_charenc_guesses, &avctx->nb_sub_charenc_guesses,
 +                 copy);


av_dynarray_add_nofree is probably better.
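
For context, av_dynarray_add_nofree() reports failure through its return value and leaves the existing array untouched, whereas av_dynarray_add() frees the array when it cannot grow it. The same shape can be sketched in plain C without any FFmpeg headers; `GuessList` and `guess_list_add` below are hypothetical names used only for this illustration.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the guess list kept in the format context. */
typedef struct GuessList {
    char   **items;
    unsigned nb;
} GuessList;

/* Add a copy of enc unless it is NULL, empty, or already present.
 * Returns 1 if added, 0 if skipped, -1 on allocation failure; on failure
 * the existing list is left untouched and the copy is released. */
static int guess_list_add(GuessList *l, const char *enc)
{
    char *copy, **tmp;
    size_t n;

    if (!enc || !enc[0])
        return 0;
    for (unsigned i = 0; i < l->nb; i++)
        if (!strcmp(l->items[i], enc))
            return 0;

    n = strlen(enc) + 1;
    copy = malloc(n);
    if (!copy)
        return -1;
    memcpy(copy, enc, n);

    tmp = realloc(l->items, (l->nb + 1) * sizeof(*tmp));
    if (!tmp) {
        free(copy);
        return -1;
    }
    l->items = tmp;
    l->items[l->nb++] = copy;
    return 1;
}
```

Returning a status instead of void also lets the caller distinguish "already present" from "out of memory", which the quoted add_charenc() cannot do.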


[FFmpeg-devel] [PATCH] avformat: Implement subtitle charenc guessing

2014-12-11 Thread Rodger Combs
This also moves general charenc conversion from avcodec to avformat;
the version in avcodec is left, but renamed; I'm not sure if that's
the optimal solution.

The documentation could probably use some improvements, and a few more
options could be added to ENCA.

This very simply prefers libguess over ENCA, and ENCA over uchardet, but
will fall back on a less-preferred guess if something decodes wrong, and will
drop illegal sequences in iconv if all else fails.

It'd be possible to have ffmpeg.c present a UI if multiple guesses are
returned, and other library consumers could do the same.
---
 configure   |  15 +++
 libavcodec/options_table.h  |   2 +-
 libavformat/aqtitledec.c|   2 +
 libavformat/assdec.c|   2 +
 libavformat/avformat.h  |  50 +
 libavformat/jacosubdec.c|   2 +
 libavformat/microdvddec.c   |   2 +
 libavformat/mpl2dec.c   |   2 +
 libavformat/mpsubdec.c  |   2 +
 libavformat/options_table.h |   7 ++
 libavformat/pjsdec.c|   2 +
 libavformat/realtextdec.c   |   2 +
 libavformat/samidec.c   |   2 +
 libavformat/srtdec.c|   2 +
 libavformat/stldec.c|   2 +
 libavformat/subtitles.c | 262 +++-
 libavformat/subtitles.h |   1 +
 libavformat/subviewer1dec.c |   2 +
 libavformat/subviewerdec.c  |   2 +
 libavformat/utils.c |   2 +
 libavformat/vplayerdec.c|   2 +
 libavformat/webvttdec.c |   2 +
 22 files changed, 365 insertions(+), 4 deletions(-)

diff --git a/configure b/configure
index e2e3619..a5a9f9b 100755
--- a/configure
+++ b/configure
@@ -199,6 +199,9 @@ External library support:
   --enable-gnutls          enable gnutls, needed for https support
                            if openssl is not used [no]
   --disable-iconv          disable iconv [autodetect]
+  --disable-libguess       disable libguess [autodetect]
+  --disable-uchardet       disable universalchardet [autodetect]
+  --enable-enca            disable enca [no]
   --enable-ladspa          enable LADSPA audio filtering [no]
   --enable-libaacplus      enable AAC+ encoding via libaacplus [no]
   --enable-libass          enable libass subtitles rendering,
@@ -1342,6 +1345,9 @@ EXTERNAL_LIBRARY_LIST=
 frei0r
 gnutls
 iconv
+libguess
+uchardet
+enca
 ladspa
 libaacplus
 libass
@@ -4358,6 +4364,7 @@ die_license_disabled gpl libxavs
 die_license_disabled gpl libxvid
 die_license_disabled gpl libzvbi
 die_license_disabled gpl x11grab
+die_license_disabled gpl enca
 
 die_license_disabled nonfree libaacplus
 die_license_disabled nonfree libfaac
@@ -5117,6 +5124,14 @@ enabled vdpau && enabled xlib &&
 # Funny iconv installations are not unusual, so check it after all flags have been set
 disabled iconv || check_func_headers iconv.h iconv || check_lib2 iconv.h iconv -liconv || disable iconv
 
+disabled iconv || disabled libguess || disable libguess && {
+    check_pkg_config libguess libguess.h libguess_determine_encoding && require_pkg_config libguess libguess.h libguess_determine_encoding && enable libguess;
+}
+disabled iconv || disabled uchardet || disable uchardet && {
+    check_pkg_config uchardet uchardet.h uchardet_new && require_pkg_config uchardet uchardet.h uchardet_new && enable uchardet;
+}
+enabled enca && check_func_headers enca.h enca_analyse || check_lib2 enca.h enca_analyse -lenca || die "ERROR: enca not found"
+
 enabled debug && add_cflags -g$debuglevel && add_asflags -g$debuglevel
 
 # add some useful compiler flags if supported
diff --git a/libavcodec/options_table.h b/libavcodec/options_table.h
index 1d5b078..93b3105 100644
--- a/libavcodec/options_table.h
+++ b/libavcodec/options_table.h
@@ -472,7 +472,7 @@ static const AVOption avcodec_options[] = {
 {"ka", "Karaoke", 0, AV_OPT_TYPE_CONST, {.i64 = AV_AUDIO_SERVICE_TYPE_KARAOKE }, INT_MIN, INT_MAX, A|E, "audio_service_type"},
 {"request_sample_fmt", "sample format audio decoders should prefer", OFFSET(request_sample_fmt), AV_OPT_TYPE_SAMPLE_FMT, {.i64=AV_SAMPLE_FMT_NONE}, -1, INT_MAX, A|D, "request_sample_fmt"},
 {"pkt_timebase", NULL, OFFSET(pkt_timebase), AV_OPT_TYPE_RATIONAL, {.dbl = 0 }, 0, INT_MAX, 0},
-{"sub_charenc", "set input text subtitles character encoding", OFFSET(sub_charenc), AV_OPT_TYPE_STRING, {.str = NULL}, CHAR_MIN, CHAR_MAX, S|D},
+{"sub_charenc_lavc", "set input text subtitles character encoding", OFFSET(sub_charenc), AV_OPT_TYPE_STRING, {.str = NULL}, CHAR_MIN, CHAR_MAX, S|D},
 {"sub_charenc_mode", "set input text subtitles character encoding mode", OFFSET(sub_charenc_mode), AV_OPT_TYPE_FLAGS, {.i64 = FF_SUB_CHARENC_MODE_AUTOMATIC}, -1, INT_MAX, S|D, "sub_charenc_mode"},
 {"do_nothing", NULL, 0, AV_OPT_TYPE_CONST, {.i64 = FF_SUB_CHARENC_MODE_DO_NOTHING}, INT_MIN, INT_MAX, S|D, "sub_charenc_mode"},
 {"auto", NULL, 0, AV_OPT_TYPE_CONST, {.i64 = FF_SUB_CHARENC_MODE_AUTOMATIC}, INT_MIN, INT_MAX, S|D, "sub_charenc_mode"},
diff --git a/libavformat/aqtitledec.c