I have prepared a mini-patch to better explain what I mean by the third
point (in the end I used EXTRACTOR_METATYPE_NONE, which I think is
clearer).

Please find it attached.
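
As a rough sketch of why such a sentinel helps (this is not the attached
patch): the enum below is a hypothetical stand-in for enum
EXTRACTOR_MetaType that reproduces only the values quoted in this thread
plus the proposed -1 member, and first_known() is an illustrative helper,
not a libextractor function.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for enum EXTRACTOR_MetaType (assumption: only the
   values quoted in this thread are reproduced here). */
enum MetaType {
  METATYPE_NONE = -1,     /* proposed sentinel: "not a metatype" */
  METATYPE_MIMETYPE = 1,
  METATYPE_FILENAME = 2,
  METATYPE_UNKNOWN = 45
};

/* Illustrative helper: return the first entry that is not UNKNOWN,
   or METATYPE_NONE when there is no such entry. */
enum MetaType
first_known (const enum MetaType *types, size_t n)
{
  assert (types != NULL || n == 0);
  for (size_t i = 0; i < n; i++)
    if (types[i] != METATYPE_UNKNOWN)
      return types[i];
  return METATYPE_NONE;   /* no separate have_metatype flag needed */
}
```

A caller can then test `first_known (types, n) == METATYPE_NONE` instead of
carrying a separate have_metatype variable next to the result.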

--madmurphy

On Tue, Feb 8, 2022 at 1:38 PM madmurphy <[email protected]> wrote:

> Got it! I agree about your solution for the duplicate mime types.
>
> but until that is done, a key-value pair type would at least be better
> than 'unknown'.
>
> “Unknown” can continue to exist as an identifier for other cases, just not
> the key-value ones :)
>
> Also I forgot to mention a third point:
>
> 3. Add an EXTRACTOR_METATYPE_NO_METATYPE = -1 to enum EXTRACTOR_MetaType
> (more or less like NULL, if it were a pointer). Without an
> EXTRACTOR_METATYPE_NO_METATYPE, a programmer is forced to save the
> have_metatype information in another variable. The fact that it is a
> negative number is not a problem because, as the name suggests, *it is
> not a metatype*.
>
> P.S. Sorry for picking the wrong mailing list!
>
> On Tue, Feb 8, 2022 at 9:57 AM Christian Grothoff <[email protected]>
> wrote:
>
>> Hi madmurphy,
>>
>> The 'correct' place for GNU libextractor discussions would be
>>
>>   https://lists.gnu.org/mailman/listinfo/libextractor
>>
>> Alas, with my libextractor maintainer hat on, I would say this:
>>
>> On 2/7/22 10:01 PM, madmurphy wrote:
>> > Hi again, GNUnet people.
>> >
>> > Is this the right place to discuss libextractor? I have two points.
>> >
>> > #1 I often see something interesting. Key-value pairs are categorized as
>> > |EXTRACTOR_METATYPE_UNKNOWN|:
>> >
>> > unknown: chroma-format=4:2:0
>> > unknown: bit-depth-chroma=8
>> > unknown: colorimetry=bt709
>> > unknown: stream-format=avc
>> > unknown: stream-format=raw
>> > unknown: bit-depth-luma=8
>> > unknown: base-profile=lc
>> > unknown: mpegversion=4
>> > unknown: profile=high
>> > unknown: alignment=au
>> > unknown: parsed=true
>> > unknown: framed=true
>> > unknown: variant=iso
>> > unknown: profile=lc
>> > unknown: level=4.1
>> >
>> > But they are often numerous, and a key-value type is a really
>> > interesting metatype to have (and is not really “unknown”, since the
>> > key is self-explanatory). Would it not make sense to add an
>> > |EXTRACTOR_METATYPE_KEY_VALUE_PAIR| to the list of MetaTypes?
>>
>> We could do that. Sometimes I think it would be better to add new
>> specific LE types for some of the above, but until that is done, a
>> key-value pair type would at least be better than 'unknown'.
>>
>> > ...
>> >
>> >   /* generic attributes */
>> >   EXTRACTOR_METATYPE_UNKNOWN = 45,
>> >   EXTRACTOR_METATYPE_DESCRIPTION = 46,
>> >   EXTRACTOR_METATYPE_COPYRIGHT = 47,
>> >   EXTRACTOR_METATYPE_RIGHTS = 48,
>> >   EXTRACTOR_METATYPE_KEYWORDS = 49,
>> >   EXTRACTOR_METATYPE_ABSTRACT = 50,
>> >   EXTRACTOR_METATYPE_SUMMARY = 51,
>> >   EXTRACTOR_METATYPE_SUBJECT = 52,
>> >   EXTRACTOR_METATYPE_CREATOR = 53,
>> >   EXTRACTOR_METATYPE_FORMAT = 54,
>> >   EXTRACTOR_METATYPE_FORMAT_VERSION = 55,
>> >   *EXTRACTOR_METATYPE_KEY_VALUE_PAIR* = XXX,
>> >
>> > ...
>> >
>> > #2 I often see that files get tagged with multiple mime types according
>> > to libextractor:
>> >
>> > mimetype: video/quicktime
>> > mimetype: video/x-h264
>> > mimetype: audio/mpeg
>> > mimetype: video/mp4
>>
>> That is because different plugins (using different methods/libraries)
>> disagree on the 'correct' mime-type. Ideally, we'd identify which plugin
>> gets it wrong (and why), and unify the mime-types.
>>
>> > But that never reflects the reality, since files should have only one
>> > mime type (or at most, multiple mime types that mean the same thing).
>> > But then I see what happens with file names: there is only one
>> > |EXTRACTOR_METATYPE_GNUNET_ORIGINAL_FILENAME|, but there can be many
>> > |EXTRACTOR_METATYPE_FILENAME|s (in the case of archives, for example):
>> >
>> > EXTRACTOR_METATYPE_FILENAME = 2,
>> > ...
>> > EXTRACTOR_METATYPE_GNUNET_ORIGINAL_FILENAME = 180,
>> >
>> > Would it not make sense to do something similar for mime types? Only one
>> > “original mime type”, and an infinity of secondary mime types…?
>> >
>> > EXTRACTOR_METATYPE_MIMETYPE = 1,
>> > ...
>> > *EXTRACTOR_METATYPE_GNUNET_ORIGINAL_MIMETYPE* = XXX,
>>
>> I guess it depends. If this is for archives where files _inside_ the
>> archive are given mime-types, then a different metatype makes sense
>> (ditto for FILENAME: here we probably could have two types, one for the
>> 'archive' and one for the 'contents'). But if the different mime-types
>> are all about the 'original' file, then we should rather figure out
>> which plugin gets it wrong. As for the "_GNUNET_" in the
>> "_GNUNET_ORIGINAL_FILENAME" there, IIRC this is again different because
>> that is NOT a metatype used by GNU libextractor, but one that GNUnet
>> itself generates and puts with the 'rest' of the metadata.
>>
>> > So, two simple proposals:
>> >
>> >  1. Create |EXTRACTOR_METATYPE_KEY_VALUE_PAIR|
>> >  2. Create |EXTRACTOR_METATYPE_GNUNET_ORIGINAL_MIMETYPE|
>> >
>> > What do you think? Does it make sense?
>>
>> It should definitely not be "GNUNET_ORIGINAL_MIMETYPE", and the real
>> question is what is the origin of the different mime-types. If this is
>> from an archive, maybe we should introduce
>>
>> EXTRACTOR_METATYPE_ARCHIVE_CONTENT_FILENAME
>> EXTRACTOR_METATYPE_ARCHIVE_CONTENT_MIMETYPE
>>
>> and reserve
>>
>> EXTRACTOR_METATYPE_FILENAME
>> EXTRACTOR_METATYPE_MIMETYPE
>>
>> for the top-level file. But AFAIK that won't solve your mime-type issue,
>> which should really be resolved by going over the plugins and finding
>> out why and where they disagree and picking the 'right' answer.
>>
>> My 2 cents
>>
>> Christian
>>
>>

<<attachment: add-extractor_metatype_none.patch.zip>>
