[
https://issues.apache.org/jira/browse/TIKA-2689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17351715#comment-17351715
]
Sameer edited comment on TIKA-2689 at 5/26/21, 11:09 AM:
---------------------------------------------------------
[~amitp7007] thanks, that worked for me as well.
Adding a few more details for anyone who needs a complete solution:
1. Create a file called custom-mimetypes.xml with the following content. Note
the <mime-info> tags wrapped around the XML provided by Amit.
<?xml version="1.0" encoding="UTF-8"?>
<mime-info>
  <mime-type type="application/illustrator">
    <alias type="application/vnd.adobe.illustrator"/>
    <acronym>AI</acronym>
    <_comment>Adobe Illustrator Artwork</_comment>
    <magic priority="50">
      <!-- Normally just %PDF- -->
      <match value="%PDF-" type="string" offset="0"/>
      <!-- Sometimes has a UTF-8 Byte Order Mark first -->
      <match value="\xef\xbb\xbf%PDF-" type="string" offset="0"/>
    </magic>
    <magic priority="20">
      <!-- Low priority match for %PDF-#.# near the start of the file -->
      <!-- Can trigger false positives, so set the priority rather low here -->
      <match value="%PDF-1." type="string" offset="1:512"/>
      <match value="%PDF-2." type="string" offset="1:512"/>
    </magic>
    <glob pattern="*.ai"/>
    <sub-class-of type="application/postscript"/>
  </mime-type>
</mime-info>
2. Put the file on your classpath (the guide linked below places it under
org/apache/tika/mime/). That's it; Tika will pick it up and register the new
matcher.
*A couple of callouts:*
# This is still not an accurate signature for AI files. It only lets files
with the signature above be matched as AI as well.
# You still need to give Tika a filename hint,
{{tika.detect(inputStream, fileName)}}. Only then is the file resolved as AI;
otherwise it matches both PDF and AI and is resolved as PDF.
# You might find
[this|https://tika.apache.org/1.8/parser_guide.html#Add_your_MIME-Type] guide
helpful as well.
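To illustrate why the filename hint in the second callout matters, here is a toy model in plain Java. It does not use or reproduce Tika's code: the class DetectSketch and all of its methods are hypothetical names, and the tie-breaking rule (first registered type wins unless the filename points at another candidate) is an assumption made only to mirror the behaviour described above.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class DetectSketch {
    // Both types share this prefix once the custom AI type reuses the PDF magic.
    static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns every modelled type whose magic matches the start of the data. */
    static List<String> magicCandidates(byte[] head) {
        List<String> out = new ArrayList<>();
        if (startsWith(head, PDF_MAGIC)) {
            out.add("application/pdf");          // assumed to be listed first
            out.add("application/illustrator");  // the custom type
        }
        return out;
    }

    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    /** Type suggested by the file name alone, or null if there is no hint. */
    static String globType(String fileName) {
        if (fileName == null) {
            return null;
        }
        String lower = fileName.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".ai")) {
            return "application/illustrator";
        }
        if (lower.endsWith(".pdf")) {
            return "application/pdf";
        }
        return null;
    }

    /** Uses the name hint to break ties between magic matches. */
    static String detect(byte[] head, String fileName) {
        List<String> candidates = magicCandidates(head);
        String hinted = globType(fileName);
        if (hinted != null && (candidates.isEmpty() || candidates.contains(hinted))) {
            return hinted;
        }
        if (!candidates.isEmpty()) {
            return candidates.get(0);
        }
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] head = "%PDF-1.5\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(head, null));         // no hint: PDF wins
        System.out.println(detect(head, "example.ai")); // hint picks the AI type
    }
}
```

In this model the bytes alone cannot separate the two types, so only the call that passes a name ending in .ai returns application/illustrator.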
Since it has been quite some time since the last comment on this thread, has
there been any progress on matching AI files better?
> *.ai type (Adobe illustrator ) files are not detected correctly.
> ----------------------------------------------------------------
>
> Key: TIKA-2689
> URL: https://issues.apache.org/jira/browse/TIKA-2689
> Project: Tika
> Issue Type: Bug
> Components: core
> Affects Versions: 1.16, 1.17, 1.18
> Reporter: Amit Pandey
> Priority: Major
> Attachments: example.ai
>
>
> There is an inconsistency in detecting *.ai files when using the different
> overloads of the detect method. When I use _detect(String filename)_, it gives
> the correct file type, "*application/illustrator*". If I use _detect(InputStream
> is, String filename)_ or _detect(File fileObj)_, it gives the file type
> "*application/pdf*".
> Here is the sample code I used.
>
> [https://stackoverflow.com/questions/51359351/tika-detect-method-not-giving-same-exact-file-type|http://example.com/]
>