Re: [FFmpeg-devel] [RFC][PATCH] ffmpeg: add option to transform metadata using iconv

2015-10-10 Thread James Darnley
On 2015-10-09 14:46, Nicolas George wrote:
> Le quartidi 4 vendémiaire, an CCXXIV, James Darnley a écrit :
>> I can.  You should find it attached to this email.  I cleaned it up and
>> put two test cases of data into the file.  You will need Lua and the
>> Lua-iconv module.  If your package manager doesn't have that see here:
>> https://ittner.github.io/lua-iconv/
>>
>> To run it: lua 
> 
> Sorry, I have not had time yet to look into it, and I would like to spend
> the time I have tackling the issue of request_frame() recursiveness. But I
> did not forget. Is your need for the feature urgent?

No, I made my use of it and moved onto other things.




signature.asc
Description: OpenPGP digital signature
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel


Re: [FFmpeg-devel] [RFC][PATCH] ffmpeg: add option to transform metadata using iconv

2015-10-09 Thread Nicolas George
Le quartidi 4 vendémiaire, an CCXXIV, James Darnley a écrit :
> I can.  You should find it attached to this email.  I cleaned it up and
> put two test cases of data into the file.  You will need Lua and the
> Lua-iconv module.  If your package manager doesn't have that see here:
> https://ittner.github.io/lua-iconv/
> 
> To run it: lua 

Sorry, I have not had time yet to look into it, and I would like to spend
the time I have tackling the issue of request_frame() recursiveness. But I
did not forget. Is your need for the feature urgent?

Regards,

-- 
  Nicolas George


signature.asc
Description: Digital signature
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel


Re: [FFmpeg-devel] [RFC][PATCH] ffmpeg: add option to transform metadata using iconv

2015-09-25 Thread James Darnley
On 2015-09-24 20:37, Nicolas George wrote:
> Le tridi 3 vendémiaire, an CCXXIV, James Darnley a écrit :
>> I don't know what to say here.  I know the encodings needed for iconv
>> because I arrived at them by brute force.  I wrote a short Lua script to
>> iterate over a list of encodings supported by my iconv and arrived at
>> this answer.  The command line tool called iconv is too clever for this
>> because it returns an error when it can't convert.  As for ending in
>> GBK, it is what the script told me.
> 
> Could you share the script and enough input to run it and reproduce the
> results?

I can.  You should find it attached to this email.  I cleaned it up and
put two test cases of data into the file.  You will need Lua and the
Lua-iconv module.  If your package manager doesn't have that see here:
https://ittner.github.io/lua-iconv/

To run it: lua 

>> This feature would not work if there was a misinterpretation in the
>> middle.  As you say that would need A->B and C->D where B != C.  Perhaps
>> this is why my solution isn't perfect, because there should be an
>> assumption in the middle.
>>
>> I could rework my code to allow for assumptions in the middle.  My case
>> would then use "CP1252,UTF-8,UTF-8,GBK" as an argument.
> 
> I must say, I do not like your approach very much because it manipulates
> text encoding in the middle of the program. All strings inside the program
> should be in UTF-8.
> 
> I can propose this: add an option "metadata_text_encoding" to
> AVFormatContext. If it is set on a demuxer, the demuxing framework uses it
> to convert from it to UTF-8; and similarly, if it is set on a muxer, the
> muxing framework uses it to convert from UTF-8 to it.
>
> Then we can have a special syntax for it to specify bogus conversions.
> Possibly: -metadata_text_encoding "[CP1252>UTF-8]GBK" to specify that the
> text must first be converted from CP1252 to UTF-8 then considered to be GBK
> (and converted to UTF-8). (Well, I consider the feature evil, so I will
> probably not volunteer to implement it, but I will not oppose as long as it
> can not be triggered too easily by an unsuspecting user.
> 
> What do you think of it?

As for more special syntax, I'm not a fan of it.  Handling this in the
demuxer, somewhere, might be a better idea.

local iconv = require('iconv')

local function canonicalize_list(list)
local tbl = {}

for _,v in ipairs(list) do
local cp = iconv.canonicalize(v)
tbl[cp] = true
end

local ret = {}
for k,_ in pairs(tbl) do
table.insert(ret, k)
end
table.sort(ret)

return ret
end

local function hex_string_to_bytes(str)
local ret = ''
for i in string.gmatch(str, '%x%x') do
ret = ret .. string.char(tonumber(i, 16))
end
return ret
end

-- Moderately slow, ~15sec for 143 encodings.
local function run(encoding_list, mojibake, correct)
for _,a in ipairs(encoding_list) do
for _,b in ipairs(encoding_list) do
for _,c in ipairs(encoding_list) do

local a2b = iconv.new(a, b)
local b2c = iconv.new(b, c)
local str = a2b:iconv(mojibake)
str = b2c:iconv(str)
if string.match(str, correct) then
io.stdout:write(string.format('%s,%s,%s = %s\n', a, b, c, 
str))
end

end
end
end
end

-- Very fast, ~0.1sec for 143 encodings.
local function run_assume_middle_utf8(encoding_list, mojibake, correct)
for _,a in ipairs(encoding_list) do
for _,b in ipairs(encoding_list) do

local a2utf = iconv.new(a, 'UTF-8')
local utf2b = iconv.new('UTF-8', b)
local str = a2utf:iconv(mojibake)
str = utf2b:iconv(str)
if string.match(str, correct) then
io.stdout:write(string.format('%s,UTF-8,%s = %s\n', a, b, str))
end

end
end
end

-- Very slow, many minutes for 143 encodings.
local function run_assume_middle_random(encoding_list, mojibake, correct)
for _,a in ipairs(encoding_list) do
for _,b in ipairs(encoding_list) do
for _,c in ipairs(encoding_list) do
for _,d in ipairs(encoding_list) do

local a2b = iconv.new(a, b)
local c2d = iconv.new(c, d)
local str = a2b:iconv(mojibake)
str = c2d:iconv(str)
if string.match(str, correct) then
io.stdout:write(string.format('%s,%s_%s,%s = %s\n', a, 
b, c, d, str))
end

end
end
end
end
end

-- Main program

local encoding_list = {}
if true or not iconv.list or not iconv.canonicalize then io.stdout:write(
'The iconv module does not support the list or canonicalize functions so '
.. 'this tool will use an internal list of character encodings.\n')

encoding_list = { "ARMSCII-8", "ATARIST", "BIG5", 

Re: [FFmpeg-devel] [RFC][PATCH] ffmpeg: add option to transform metadata using iconv

2015-09-24 Thread Nicolas George
Le duodi 2 vendémiaire, an CCXXIV, James Darnley a écrit :
> It is not supposed to replace any invalid bytes with a "random"
> character. That sounds like it will only make the problem worse with
> that lossy removal of data. This is trying to fix incorrect
> interpretation of bytes.
> 
> This feature is to transform bytes into other bytes which when
> interpreted and displayed the correct text is appears on screen.
> 
> I will detail my exact use case at the bottom.

Indeed, but a feature like that must be designed with all reasonable use
cases in mind, and replacing isolated invalid octets values in input by a
fixed replacement character is a common practice, totally acceptable when
dealing with large amounts of data. I am not saying that it must be enabled
by default, but it needs to be an option.

> What do you mean? You need at least two encodings to be given to perform
> a conversion.  Two things become a list.

Yes, but the code to handle more than two is much more complicated than it
would have been with just two elements, i.e. only one conversion.

> It might not be very good but it is (void*) and NULL if you don't use
> the feature.

Yes, I understood that, but this is fragile code.

> It shouldn't.  This function receives buf2_len as equal to BUF_LEN - 1
> which means that iconv can only advance buf2 to buf2 + BUF_LEN - 1 which
> will let us write 0 into the last byte.

In that case, it should be written very explicitly in the function
documentation, otherwise someone may change the code and break the
assumption.

Also, I notice another flaw in that code: it uses '\0' as string terminator.
Text in non-ASCII encodings can contain '\0'. The end of the text must be
handled differently, with an explicit size argument or maybe a pointer to
the end.

> I won't send another patch for a little while.  I will see how your API
> proposal plays out.
> 
> And now for my tale.
> 
> I wanted ffmpeg to turn the string at [1] into the string at [3].  [1],
> with the exact hex bytes at [2], is artist tag out of a Flac file.  Flac
> files have Vorbis Comment metadata tags.  They are UTF-8 only.  If a
> program puts incorrect data in there how will any other program know how
> to display it?  What's worse is when the data gets converted twice.

Indeed. All modern formats should specify UTF-8 for all strings, there is
absolutely no valid reason to do otherwise nowadays.

> This specific case was to convert CP1252 to UTF-8 to GBK -- that is to
> interpret the input bytes as the CP1252 encoding and convert them to
> UTF-8, then take those bytes and convert them to GBK.  I added the code
> needed to take an argument in the form
> > "CP1252,UTF-8,GBK"
> parse it into separate encodings, open two iconv contexts, and finally
> perform the conversion.

I can not reproduce your conversion, but there is something that bugs me in
your reasoning.

With any sane iconv implementation, converting from A to B then from B to C
will either fail if B is too poor to code the input, or succeed and then
give the exact same result as converting directly from A to C.

As for chaining a conversion from A to B then from C to D with B != C, this
is braindead, and if FFmpeg were to support it at all, it should only be
with a very explicit request from the user to perform a dangerously lossy
conversion.

I do not understand what you were trying to achieve with your
"CP1252,UTF-8,GBK" conversion. First, why finish with GBK instead of sane
UTF-8? And second, if the output is really wanted in GBK, then what is the
purpose of the intermediate UTF-8 step?

IMHO, the only sane way of dealing with text is to handle everything in
UTF-8 internally: demuxers and anything that deals with input should convert
as soon as the encoding is known, muxers and anything that deals with output
should convert as late as possible.

(We do not do that for audio and video because conversions are expensive,
but text conversions are cheap and negligible compared to video and audio
processing.)

Regards,

-- 
  Nicolas George


signature.asc
Description: Digital signature
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel


Re: [FFmpeg-devel] [RFC][PATCH] ffmpeg: add option to transform metadata using iconv

2015-09-24 Thread James Darnley
On 2015-09-24 12:36, Nicolas George wrote:
> Le duodi 2 vendémiaire, an CCXXIV, James Darnley a écrit :
>> It is not supposed to replace any invalid bytes with a "random"
>> character. That sounds like it will only make the problem worse with
>> that lossy removal of data. This is trying to fix incorrect
>> interpretation of bytes.
>>
>> This feature is to transform bytes into other bytes which when
>> interpreted and displayed the correct text is appears on screen.
>>
>> I will detail my exact use case at the bottom.
> 
> Indeed, but a feature like that must be designed with all reasonable use
> cases in mind, and replacing isolated invalid octets values in input by a
> fixed replacement character is a common practice, totally acceptable when
> dealing with large amounts of data. I am not saying that it must be enabled
> by default, but it needs to be an option.

Noted.

As far as I understand the iconv API, it doesn't appear to do this for
you.  So adding this feature would require writing code to handle more
errors returned from the iconv() function.  That means a more
complicated argument handling structure is needed.

I don't mind trying to write this but it would be better to do it behind
the API you propose.  I will help you with it as best I can because I
seem to have involuntarily volunteered myself.

>> What do you mean? You need at least two encodings to be given to perform
>> a conversion.  Two things become a list.
> 
> Yes, but the code to handle more than two is much more complicated than it
> would have been with just two elements, i.e. only one conversion.
> 
>> It might not be very good but it is (void*) and NULL if you don't use
>> the feature.
> 
> Yes, I understood that, but this is fragile code.

Again, noted.  I can't see this being much less fragile though.  Having
thought again, I could reuse the av_dynarray that holds the char
encoding strings.

>> It shouldn't.  This function receives buf2_len as equal to BUF_LEN - 1
>> which means that iconv can only advance buf2 to buf2 + BUF_LEN - 1 which
>> will let us write 0 into the last byte.
> 
> In that case, it should be written very explicitly in the function
> documentation, otherwise someone may change the code and break the
> assumption.

Will add a comment to note this.

> Also, I notice another flaw in that code: it uses '\0' as string terminator.
> Text in non-ASCII encodings can contain '\0'. The end of the text must be
> handled differently, with an explicit size argument or maybe a pointer to
> the end.

Yes.  A user may request a conversion through UCS-2 which would break
because I call strlen() on the resulting string.  I need to return the
output length as it comes from iconv.

>> I won't send another patch for a little while.  I will see how your API
>> proposal plays out.
>>
>> And now for my tale.
>>
>> I wanted ffmpeg to turn the string at [1] into the string at [3].  [1],
>> with the exact hex bytes at [2], is artist tag out of a Flac file.  Flac
>> files have Vorbis Comment metadata tags.  They are UTF-8 only.  If a
>> program puts incorrect data in there how will any other program know how
>> to display it?  What's worse is when the data gets converted twice.
> 
> Indeed. All modern formats should specify UTF-8 for all strings, there is
> absolutely no valid reason to do otherwise nowadays.

Yes I agree.  I would suspect the problem comes from older software.

>> This specific case was to convert CP1252 to UTF-8 to GBK -- that is to
>> interpret the input bytes as the CP1252 encoding and convert them to
>> UTF-8, then take those bytes and convert them to GBK.  I added the code
>> needed to take an argument in the form
>>> "CP1252,UTF-8,GBK"
>> parse it into separate encodings, open two iconv contexts, and finally
>> perform the conversion.
> 
> I can not reproduce your conversion, but there is something that bugs me in
> your reasoning.
> 
> With any sane iconv implementation, converting from A to B then from B to C
> will either fail if B is too poor to code the input, or succeed and then
> give the exact same result as converting directly from A to C.
> 
> As for chaining a conversion from A to B then from C to D with B != C, this
> is braindead, and if FFmpeg were to support it at all, it should only be
> with a very explicit request from the user to perform a dangerously lossy
> conversion.
> 
> I do not understand what you were trying to achieve with your
> "CP1252,UTF-8,GBK" conversion. First, why finish with GBK instead of sane
> UTF-8? And second, if the output is really wanted in GBK, then what is the
> purpose of the intermediate UTF-8 step?
> 
> IMHO, the only sane way of dealing with text is to handle everything in
> UTF-8 internally: demuxers and anything that deals with input should convert
> as soon as the encoding is known, muxers and anything that deals with output
> should convert as late as possible.
> 
> (We do not do that for audio and video because conversions are expensive,
> but text 

Re: [FFmpeg-devel] [RFC][PATCH] ffmpeg: add option to transform metadata using iconv

2015-09-24 Thread Nicolas George
Le tridi 3 vendémiaire, an CCXXIV, James Darnley a écrit :
> As far as I understand the iconv API, it doesn't appear to do this for
> you.  So adding this feature would require writing code to handle more
> errors returned from the iconv() function.  That means a more
> complicated argument handling structure is needed.
> 
> I don't mind trying to write this but it would be better to do it behind
> the API you propose.

Of course. Actually, it is already there in the API, although I am not quite
satisfied because it can not be set as an option.

>   I will help you with it as best I can because I
> seem to have involuntarily volunteered myself.

I need some feedback to know if this kind of API is useful in FFmpeg (other
people are welcome to give advice too!), and to know if the actual API I
propose is suitable for various needs. But as for writing the code, I expect
it to be quite straightforward.

The question where I most need feedback is this: shall I make an API that
allows to convert from any encoding to any encoding, or an API that can
convert from any encoding to UTF-8 and from UTF-8 to any encoding?

There are pros and cons for each case. UTF-8 to/from anything is enough for
the needs of any sane program, and makes the handling of the replacement
character easier (because it can be specified in UTF-8 directly). OTOH,
any-to-any is more generic.

> I don't know what to say here.  I know the encodings needed for iconv
> because I arrived at them by brute force.  I wrote a short Lua script to
> iterate over a list of encodings supported by my iconv and arrived at
> this answer.  The command line tool called iconv is too clever for this
> because it returns an error when it can't convert.  As for ending in
> GBK, it is what the script told me.

Could you share the script and enough input to run it and reproduce the
results?

> This feature would not work if there was a misinterpretation in the
> middle.  As you say that would need A->B and C->D where B != C.  Perhaps
> this is why my solution isn't perfect, because there should be an
> assumption in the middle.
> 
> I could rework my code to allow for assumptions in the middle.  My case
> would then use "CP1252,UTF-8,UTF-8,GBK" as an argument.

I must say, I do not like your approach very much because it manipulates
text encoding in the middle of the program. All strings inside the program
should be in UTF-8.

I can propose this: add an option "metadata_text_encoding" to
AVFormatContext. If it is set on a demuxer, the demuxing framework uses it
to convert from it to UTF-8; and similarly, if it is set on a muxer, the
muxing framework uses it to convert from UTF-8 to it.

Then we can have a special syntax for it to specify bogus conversions.
Possibly: -metadata_text_encoding "[CP1252>UTF-8]GBK" to specify that the
text must first be converted from CP1252 to UTF-8 then considered to be GBK
(and converted to UTF-8). (Well, I consider the feature evil, so I will
probably not volunteer to implement it, but I will not oppose as long as it
can not be triggered too easily by an unsuspecting user.

What do you think of it?

Regards,

-- 
  Nicolas George


signature.asc
Description: Digital signature
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel


Re: [FFmpeg-devel] [RFC][PATCH] ffmpeg: add option to transform metadata using iconv

2015-09-23 Thread James Darnley
On 2015-09-22 20:22, Nicolas George wrote:
> Le primidi 1er vendémiaire, an CCXXIV, James Darnley a écrit :

>> +@item -metadata_iconv_code_page_list @var{code_page_list}
>> +Force the metadata from input files to be converted through the given 
>> codepages
>> +using iconv. This allows the user to correct
>> +@uref{https://en.wikipedia.org/wiki/Mojibake, mojibake} providing they know 
>> the
>> +correct code pages to use.
> 
> I will have some more technical comments later. For now, I would just remark
> that I believe "code page" is not the wording I would suggest: it refers to
> proprietary and obsolete implementations. IMHO, the best way to refer to
> this is "character encoding".

Then I will change all instances of "code page" to "character encoding"
even in the code, except for one in the docs which I might change to
"character encoding (code page)" unless that will be confusing for users.




signature.asc
Description: OpenPGP digital signature
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel


Re: [FFmpeg-devel] [RFC][PATCH] ffmpeg: add option to transform metadata using iconv

2015-09-23 Thread Nicolas George
Le primidi 1er vendémiaire, an CCXXIV, James Darnley a écrit :
> At present it only converts global metadata as that is what I wanted to do.  
> It
> should be possible to extend it so that the conversion can be different for
> different files or streams.
> ---
>  doc/ffmpeg.texi |   6 +++
>  ffmpeg.c|  15 ++
>  ffmpeg.h|   1 +
>  ffmpeg_opt.c| 149 
> +++-
>  4 files changed, 169 insertions(+), 2 deletions(-)
> 

> Then I will change all instances of "code page" to "character encoding"
> even in the code, except for one in the docs which I might change to
> "character encoding (code page)" unless that will be confusing for users.

Having the words "code page" in the docs is good for users that know it by
that name. In that case, I suggest to also include "charset", which is
another incorrect way of calling it (incorrect because en encoding is not
just a set).

See the technical comments interleaved in the code.

More globally, there are several issues with the patch as is that come from
the direct use of iconv:

- it uses an optional feature, with all the code noise and ifdefry that
  implies;

- it has annoying buffer handling;

- it does not let specify the replacement character for invalid encodings.

There is already a place in lavc that uses iconv, with the exact same
drawbacks. There will possibly be other similar places later if someone
decides that vf_drawtext needs to accept legacy encodings for example.

For some time, I have had in mind a generic text encoding conversion utility
in libavutil, that would solve all this once and for all. It was rather
low-priority for me, but since the question is raised now, I can get on with
it.

I will post the API I have in mind in a separate thread.

> diff --git a/doc/ffmpeg.texi b/doc/ffmpeg.texi
> index f4ffc6c..d4c1c23 100644
> --- a/doc/ffmpeg.texi
> +++ b/doc/ffmpeg.texi
> @@ -855,6 +855,12 @@ such streams is attempted.
>  Allow input streams with unknown type to be copied instead of failing if 
> copying
>  such streams is attempted.
>  

> +@item -metadata_iconv_code_page_list @var{code_page_list}

Can you explain the case for having a list here?

> +Force the metadata from input files to be converted through the given 
> codepages
> +using iconv. This allows the user to correct
> +@uref{https://en.wikipedia.org/wiki/Mojibake, mojibake} providing they know 
> the
> +correct code pages to use.
> +
>  @item -map_channel 
> [@var{input_file_id}.@var{stream_specifier}.@var{channel_id}|-1][:@var{output_file_id}.@var{stream_specifier}]
>  Map an audio channel from a given input to an output. If
>  @var{output_file_id}.@var{stream_specifier} is not set, the audio channel 
> will
> diff --git a/ffmpeg.c b/ffmpeg.c
> index e31a2c6..5c04571 100644
> --- a/ffmpeg.c
> +++ b/ffmpeg.c
> @@ -101,6 +101,10 @@
>  #include 
>  #endif
>  
> +#if CONFIG_ICONV
> +#include 
> +#endif
> +
>  #include 
>  
>  #include "ffmpeg.h"
> @@ -564,6 +568,17 @@ static void ffmpeg_cleanup(int ret)
>  fclose(vstats_file);
>  av_freep(_filename);
>  
> +#if CONFIG_ICONV
> +if (metadata_iconv_contexts) {
> +iconv_t *cd = metadata_iconv_contexts;
> +for (i = 0; cd[i]; i++) {
> +iconv_close(cd[i]);
> +cd[i] = NULL;
> +}
> +}
> +av_freep(_iconv_contexts);
> +#endif
> +
>  av_freep(_streams);
>  av_freep(_files);
>  av_freep(_streams);
> diff --git a/ffmpeg.h b/ffmpeg.h
> index 6544e6f..2b8cbd7 100644
> --- a/ffmpeg.h
> +++ b/ffmpeg.h
> @@ -495,6 +495,7 @@ extern intnb_filtergraphs;
>  
>  extern char *vstats_filename;
>  extern char *sdp_filename;

> +extern void *metadata_iconv_contexts;

I am not very comfortable with that structure.

>  
>  extern float audio_drift_threshold;
>  extern float dts_delta_threshold;
> diff --git a/ffmpeg_opt.c b/ffmpeg_opt.c
> index 4edd118..d226f78 100644
> --- a/ffmpeg_opt.c
> +++ b/ffmpeg_opt.c
> @@ -42,6 +42,10 @@
>  #include "libavutil/pixfmt.h"
>  #include "libavutil/time_internal.h"
>  
> +#if CONFIG_ICONV
> +# include 
> +#endif
> +
>  #define DEFAULT_PASS_LOGFILENAME_PREFIX "ffmpeg2pass"
>  
>  #define MATCH_PER_STREAM_OPT(name, type, outvar, fmtctx, st)\
> @@ -84,6 +88,7 @@ const HWAccel hwaccels[] = {
>  
>  char *vstats_filename;
>  char *sdp_filename;
> +void *metadata_iconv_contexts;
>  
>  float audio_drift_threshold = 0.1;
>  float dts_delta_threshold   = 10;
> @@ -120,6 +125,7 @@ static int input_stream_potentially_available = 0;
>  static int ignore_unknown_streams = 0;
>  static int copy_unknown_streams = 0;
>  

> +

Stray.

>  static void uninit_options(OptionsContext *o)
>  {
>  const OptionDef *po = options;
> @@ -196,6 +202,58 @@ static AVDictionary *strip_specifiers(AVDictionary *dict)
>  return ret;
>  }
>  
> +static int opt_metadata_iconv(void *optctx, const char *opt, const char *arg)
> +{
> +#if !CONFIG_ICONV
> +av_log(NULL, AV_LOG_ERROR, "converting 

Re: [FFmpeg-devel] [RFC][PATCH] ffmpeg: add option to transform metadata using iconv

2015-09-23 Thread James Darnley
On 2015-09-23 16:56, Nicolas George wrote:
> Le primidi 1er vendémiaire, an CCXXIV, James Darnley a écrit :
>> At present it only converts global metadata as that is what I wanted to do.  
>> It
>> should be possible to extend it so that the conversion can be different for
>> different files or streams.
>> ---
>>  doc/ffmpeg.texi |   6 +++
>>  ffmpeg.c|  15 ++
>>  ffmpeg.h|   1 +
>>  ffmpeg_opt.c| 149 
>> +++-
>>  4 files changed, 169 insertions(+), 2 deletions(-)
>>
> 
>> Then I will change all instances of "code page" to "character encoding"
>> even in the code, except for one in the docs which I might change to
>> "character encoding (code page)" unless that will be confusing for users.
> 
> Having the words "code page" in the docs is good for users that know it by
> that name. In that case, I suggest to also include "charset", which is
> another incorrect way of calling it (incorrect because en encoding is not
> just a set).
> 
> See the technical comments interleaved in the code.
> 
> More globally, there are several issues with the patch as is that come from
> the direct use of iconv:
> 
> - it uses an optional feature, with all the code noise and ifdefry that
>   implies;
> 
> - it has annoying buffer handling;
> 
> - it does not let specify the replacement character for invalid encodings.

It is not supposed to replace any invalid bytes with a "random"
character. That sounds like it will only make the problem worse with
that lossy removal of data. This is trying to fix incorrect
interpretation of bytes.

This feature is to transform bytes into other bytes which when
interpreted and displayed the correct text is appears on screen.

I will detail my exact use case at the bottom.

> There is already a place in lavc that uses iconv, with the exact same
> drawbacks. There will possibly be other similar places later if someone
> decides that vf_drawtext needs to accept legacy encodings for example.
> 
> For some time, I have had in mind a generic text encoding conversion utility
> in libavutil, that would solve all this once and for all. It was rather
> low-priority for me, but since the question is raised now, I can get on with
> it.
> 
> I will post the API I have in mind in a separate thread.

I will comment on your proposal in that thread.

>> diff --git a/doc/ffmpeg.texi b/doc/ffmpeg.texi
>> index f4ffc6c..d4c1c23 100644
>> --- a/doc/ffmpeg.texi
>> +++ b/doc/ffmpeg.texi
>> @@ -855,6 +855,12 @@ such streams is attempted.
>>  Allow input streams with unknown type to be copied instead of failing if 
>> copying
>>  such streams is attempted.
>>  
> 
>> +@item -metadata_iconv_code_page_list @var{code_page_list}
> 
> Can you explain the case for having a list here?

What do you mean? You need at least two encodings to be given to perform
a conversion.  Two things become a list.  You might find the specific
example below helpful.

>> +Force the metadata from input files to be converted through the given 
>> codepages
>> +using iconv. This allows the user to correct
>> +@uref{https://en.wikipedia.org/wiki/Mojibake, mojibake} providing they know 
>> the
>> +correct code pages to use.
>> +
>>  @item -map_channel 
>> [@var{input_file_id}.@var{stream_specifier}.@var{channel_id}|-1][:@var{output_file_id}.@var{stream_specifier}]
>>  Map an audio channel from a given input to an output. If
>>  @var{output_file_id}.@var{stream_specifier} is not set, the audio channel 
>> will
>> diff --git a/ffmpeg.c b/ffmpeg.c
>> index e31a2c6..5c04571 100644
>> --- a/ffmpeg.c
>> +++ b/ffmpeg.c
>> @@ -101,6 +101,10 @@
>>  #include 
>>  #endif
>>  
>> +#if CONFIG_ICONV
>> +#include 
>> +#endif
>> +
>>  #include 
>>  
>>  #include "ffmpeg.h"
>> @@ -564,6 +568,17 @@ static void ffmpeg_cleanup(int ret)
>>  fclose(vstats_file);
>>  av_freep(_filename);
>>  
>> +#if CONFIG_ICONV
>> +if (metadata_iconv_contexts) {
>> +iconv_t *cd = metadata_iconv_contexts;
>> +for (i = 0; cd[i]; i++) {
>> +iconv_close(cd[i]);
>> +cd[i] = NULL;
>> +}
>> +}
>> +av_freep(_iconv_contexts);
>> +#endif
>> +
>>  av_freep(_streams);
>>  av_freep(_files);
>>  av_freep(_streams);
>> diff --git a/ffmpeg.h b/ffmpeg.h
>> index 6544e6f..2b8cbd7 100644
>> --- a/ffmpeg.h
>> +++ b/ffmpeg.h
>> @@ -495,6 +495,7 @@ extern intnb_filtergraphs;
>>  
>>  extern char *vstats_filename;
>>  extern char *sdp_filename;
> 
>> +extern void *metadata_iconv_contexts;
> 
> I am not very comfortable with that structure.

It might not be very good but it is (void*) and NULL if you don't use
the feature.

>>  extern float audio_drift_threshold;
>>  extern float dts_delta_threshold;
>> diff --git a/ffmpeg_opt.c b/ffmpeg_opt.c
>> index 4edd118..d226f78 100644
>> --- a/ffmpeg_opt.c
>> +++ b/ffmpeg_opt.c
>> @@ -42,6 +42,10 @@
>>  #include "libavutil/pixfmt.h"
>>  #include "libavutil/time_internal.h"
>>  
>> +#if CONFIG_ICONV
>> 

[FFmpeg-devel] [RFC][PATCH] ffmpeg: add option to transform metadata using iconv

2015-09-22 Thread James Darnley
At present it only converts global metadata as that is what I wanted to do.  It
should be possible to extend it so that the conversion can be different for
different files or streams.
---
 doc/ffmpeg.texi |   6 +++
 ffmpeg.c|  15 ++
 ffmpeg.h|   1 +
 ffmpeg_opt.c| 149 +++-
 4 files changed, 169 insertions(+), 2 deletions(-)

diff --git a/doc/ffmpeg.texi b/doc/ffmpeg.texi
index f4ffc6c..d4c1c23 100644
--- a/doc/ffmpeg.texi
+++ b/doc/ffmpeg.texi
@@ -855,6 +855,12 @@ such streams is attempted.
 Allow input streams with unknown type to be copied instead of failing if 
copying
 such streams is attempted.
 
+@item -metadata_iconv_code_page_list @var{code_page_list}
+Force the metadata from input files to be converted through the given codepages
+using iconv. This allows the user to correct
+@uref{https://en.wikipedia.org/wiki/Mojibake, mojibake} providing they know the
+correct code pages to use.
+
 @item -map_channel 
[@var{input_file_id}.@var{stream_specifier}.@var{channel_id}|-1][:@var{output_file_id}.@var{stream_specifier}]
 Map an audio channel from a given input to an output. If
 @var{output_file_id}.@var{stream_specifier} is not set, the audio channel will
diff --git a/ffmpeg.c b/ffmpeg.c
index e31a2c6..5c04571 100644
--- a/ffmpeg.c
+++ b/ffmpeg.c
@@ -101,6 +101,10 @@
 #include 
 #endif
 
+#if CONFIG_ICONV
+#include 
+#endif
+
 #include 
 
 #include "ffmpeg.h"
@@ -564,6 +568,17 @@ static void ffmpeg_cleanup(int ret)
 fclose(vstats_file);
 av_freep(_filename);
 
+#if CONFIG_ICONV
+if (metadata_iconv_contexts) {
+iconv_t *cd = metadata_iconv_contexts;
+for (i = 0; cd[i]; i++) {
+iconv_close(cd[i]);
+cd[i] = NULL;
+}
+}
+av_freep(_iconv_contexts);
+#endif
+
 av_freep(_streams);
 av_freep(_files);
 av_freep(_streams);
diff --git a/ffmpeg.h b/ffmpeg.h
index 6544e6f..2b8cbd7 100644
--- a/ffmpeg.h
+++ b/ffmpeg.h
@@ -495,6 +495,7 @@ extern intnb_filtergraphs;
 
 extern char *vstats_filename;
 extern char *sdp_filename;
+extern void *metadata_iconv_contexts;
 
 extern float audio_drift_threshold;
 extern float dts_delta_threshold;
diff --git a/ffmpeg_opt.c b/ffmpeg_opt.c
index 4edd118..d226f78 100644
--- a/ffmpeg_opt.c
+++ b/ffmpeg_opt.c
@@ -42,6 +42,10 @@
 #include "libavutil/pixfmt.h"
 #include "libavutil/time_internal.h"
 
+#if CONFIG_ICONV
+# include 
+#endif
+
 #define DEFAULT_PASS_LOGFILENAME_PREFIX "ffmpeg2pass"
 
 #define MATCH_PER_STREAM_OPT(name, type, outvar, fmtctx, st)\
@@ -84,6 +88,7 @@ const HWAccel hwaccels[] = {
 
 char *vstats_filename;
 char *sdp_filename;
+void *metadata_iconv_contexts;
 
 float audio_drift_threshold = 0.1;
 float dts_delta_threshold   = 10;
@@ -120,6 +125,7 @@ static int input_stream_potentially_available = 0;
 static int ignore_unknown_streams = 0;
 static int copy_unknown_streams = 0;
 
+
 static void uninit_options(OptionsContext *o)
 {
 const OptionDef *po = options;
@@ -196,6 +202,58 @@ static AVDictionary *strip_specifiers(AVDictionary *dict)
 return ret;
 }
 
+static int opt_metadata_iconv(void *optctx, const char *opt, const char *arg)
+{
+#if !CONFIG_ICONV
+av_log(NULL, AV_LOG_ERROR, "converting metadata through codepages requires 
"
+"ffmpeg to be built with iconv support\n");
+return AVERROR(EINVAL);
+#else
+void *temp;
+char **list = NULL;
+char *code_page_list = av_strdup(arg);
+char *token = av_strtok(code_page_list, ",", (char**));
+int i, token_count = 0;
+iconv_t *cd;
+
+/* TODO: correct handling of memory in error case. */
+
+while (token) {
+av_dynarray_add(, _count, token);
+if (!list)
+return AVERROR(ENOMEM);
+token = av_strtok(NULL, ",", (char**));
+}
+
+if (token_count < 2) {
+av_log(NULL, AV_LOG_ERROR, "too few code pages (%d) in list (%s)\n", 
token_count, code_page_list);
+return AVERROR(EINVAL);
+}
+
+cd = av_mallocz_array(sizeof(iconv_t), token_count);
+if (!cd)
+return AVERROR(ENOMEM);
+
+for (i = 0; i < token_count - 1; i++) {
+av_log(NULL, AV_LOG_DEBUG, "Opening iconv with code pages: %s, %s\n", 
list[i], list[i+1]);
+
+temp = iconv_open(list[i], list[i+1]);
+if (temp == (iconv_t)(-1)) {
+av_log(NULL, AV_LOG_ERROR, "error opening iconv with code pages 
(%s, %s)\n", list[i], list[i+1]);
+return AVERROR(EINVAL);
+/* TODO: check for other errors. */
+}
+cd[i] = temp;
+}
+metadata_iconv_contexts = cd;
+
+av_freep(_page_list);
+av_freep();
+
+return 0;
+#endif /* !CONFIG_ICONV */
+}
+
 static int opt_sameq(void *optctx, const char *opt, const char *arg)
 {
 av_log(NULL, AV_LOG_ERROR, "Option '%s' was removed. "
@@ -454,6 +512,84 @@ static void parse_meta_type(char *arg, char *type, int 
*index, const char **stre
 *type = 'g';
 }
 
+#if 

Re: [FFmpeg-devel] [RFC][PATCH] ffmpeg: add option to transform metadata using iconv

2015-09-22 Thread Nicolas George
Le primidi 1er vendémiaire, an CCXXIV, James Darnley a écrit :
> At present it only converts global metadata as that is what I wanted to do.  
> It
> should be possible to extend it so that the conversion can be different for
> different files or streams.
> ---
>  doc/ffmpeg.texi |   6 +++
>  ffmpeg.c|  15 ++
>  ffmpeg.h|   1 +
>  ffmpeg_opt.c| 149 
> +++-
>  4 files changed, 169 insertions(+), 2 deletions(-)
> 
> diff --git a/doc/ffmpeg.texi b/doc/ffmpeg.texi
> index f4ffc6c..d4c1c23 100644
> --- a/doc/ffmpeg.texi
> +++ b/doc/ffmpeg.texi
> @@ -855,6 +855,12 @@ such streams is attempted.
>  Allow input streams with unknown type to be copied instead of failing if 
> copying
>  such streams is attempted.
>  
> +@item -metadata_iconv_code_page_list @var{code_page_list}
> +Force the metadata from input files to be converted through the given 
> codepages
> +using iconv. This allows the user to correct
> +@uref{https://en.wikipedia.org/wiki/Mojibake, mojibake} providing they know 
> the
> +correct code pages to use.

I will have some more technical comments later. For now, I would just remark
that I believe "code page" is not the wording I would suggest: it refers to
proprietary and obsolete implementations. IMHO, the best way to refer to
this is "character encoding".

Regards,

-- 
  Nicolas George


signature.asc
Description: Digital signature
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel