Andreas Meier created TIKA-2574:
-----------------------------------
Summary: Extend PCX detection in tika-mimetypes.xml
Key: TIKA-2574
URL: https://issues.apache.org/jira/browse/TIKA-2574
Project: Tika
Issue Type: Sub-task
Components: detector
Affects Versions: 1.17
Reporter: Andreas Meier
Attachments: IUC10-da-Q.UTF-16LE.without-BOM,
IUC10-da-Q.UTF-32LE.without-BOM, IUC10-da.UTF-16LE.without-BOM,
IUC10-it.UTF-16LE.without-BOM, Test.pcx, Test_without_filehandle
The matcher for PCX should be reworked to avoid false positives on UTF-16LE
and UTF-32LE text files.
I suggest also matching the filler bytes of the header, as described in the original
[pcx specification|https://www.iana.org/assignments/media-types/image/vnd.zbrush.pcx]:
{code:xml}
<mime-type type="image/vnd.zbrush.pcx">
<acronym>PCX</acronym>
<_comment>ZSoft Paintbrush PiCture eXchange</_comment>
<alias type="image/x-pcx"/>
<alias type="image/x-pc-paintbrush"/>
<magic priority="40">
<match value="0x0A" type="string" offset="0">
<!-- Bytes 74 to 128 are blank filler that pads the header to 128 bytes.
All of these bytes are set to 0. -->
<!-- Requiring this filler avoids false positives for
text/plain;charset=UTF-16LE and text/plain;charset=UTF-32LE. -->
<match
value="0x000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000"
type="string" offset="74">
<match value="0x00" type="string" offset="1"/>
<match value="0x02" type="string" offset="1"/>
<match value="0x03" type="string" offset="1"/>
<match value="0x04" type="string" offset="1"/>
<match value="0x05" type="string" offset="1"/>
</match>
</match>
</magic>
<glob pattern="*.pcx"/>
</mime-type>
{code}
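As a rough illustration of why the extra filler match helps, the sketch below re-implements the two checks in plain Java. It is not Tika's detector code: the class {{PcxHeaderCheck}}, its method names, and the sample text are hypothetical, and it assumes the filler runs from offset 74 to the end of the 128-byte header. UTF-16LE text without a BOM that starts with a line feed begins with the bytes 0x0A 0x00, which on their own look like a PCX manufacturer byte followed by version 0.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; this is not how Tika evaluates magic entries.
final class PcxHeaderCheck {
    private static final int HEADER_LENGTH = 128; // PCX header size in bytes
    private static final int FILLER_OFFSET = 74;  // assumed start of the zero filler

    /** Weak check: manufacturer byte 0x0A followed by a known version byte. */
    static boolean matchesManufacturerAndVersion(byte[] data) {
        if (data.length < 2 || data[0] != 0x0A) {
            return false;
        }
        int version = data[1] & 0xFF;
        return version == 0 || version == 2 || version == 3
                || version == 4 || version == 5;
    }

    /** Stricter check: additionally require zero filler up to the end of the header. */
    static boolean matchesWithFiller(byte[] data) {
        if (!matchesManufacturerAndVersion(data) || data.length < HEADER_LENGTH) {
            return false;
        }
        for (int i = FILLER_OFFSET; i < HEADER_LENGTH; i++) {
            if (data[i] != 0) {
                return false;
            }
        }
        return true;
    }

    /** Made-up UTF-16LE sample (no BOM) that starts with a line feed. */
    static byte[] sampleUtf16LeText() {
        StringBuilder sb = new StringBuilder("\n");
        while (sb.length() < 100) {
            sb.append("Hello PCX ");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        byte[] text = sampleUtf16LeText();
        System.out.println(matchesManufacturerAndVersion(text)); // true (false positive)
        System.out.println(matchesWithFiller(text));             // false
    }
}
```

The stricter check rejects the sample because ordinary UTF-16LE text has non-zero bytes between offsets 74 and 127, while a header whose filler is zeroed still passes. A PCX file with non-zero filler bytes would no longer be matched by magic under this scheme and would have to rely on the {{*.pcx}} glob.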
I attached some test files.
[~gagravarr] Can you please check this?
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)