[ 
https://issues.apache.org/jira/browse/TIKA-2443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16138462#comment-16138462
 ] 

Viorica Visan commented on TIKA-2443:
-------------------------------------

We use AutoDetectParser with a tika-config.xml that excludes 
CompositeExternalParser, which has caused us some performance problems in the 
past. That is all. As far as I can see, the only detector used internally is 
the DefaultDetector.

I've added the custom-mimetypes.xml, which after Nick's #1 comment looks like 
this:
  <mime-type type="text/custom-logs">
    <magic priority="50">
      <match value="Date:" type="string" offset="0"/>
      <match value="Level:" type="string" offset="0:1000"/>
    </magic>
  </mime-type>

so that the file is detected as text/custom-logs and in the end parsed by the 
DefaultParser, which avoids the StackOverflowError. 
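
To make the effect of that rule easier to reason about, here is a minimal, Tika-free sketch of what the two match clauses test. It assumes that sibling <match> elements are alternatives (either one is enough) and that nesting one <match> inside another would require both; that reading should be verified against the Tika version in use. The class and method names are hypothetical and are not part of Tika.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration only; this is not Tika's implementation. */
final class MagicRuleSketch {

    /** True if needle starts at some offset in [from, to] (inclusive) within data. */
    static boolean matchesAt(byte[] data, String needle, int from, int to) {
        byte[] n = needle.getBytes(StandardCharsets.US_ASCII);
        for (int start = from; start <= to && start + n.length <= data.length; start++) {
            boolean same = true;
            for (int i = 0; i < n.length; i++) {
                if (data[start + i] != n[i]) {
                    same = false;
                    break;
                }
            }
            if (same) {
                return true;
            }
        }
        return false;
    }

    /** Assumed reading of two sibling matches: either clause is sufficient. */
    static boolean looksLikeCustomLog(byte[] data) {
        return matchesAt(data, "Date:", 0, 0) || matchesAt(data, "Level:", 0, 1000);
    }

    /** Assumed reading of a nested match: both clauses are required. */
    static boolean looksLikeCustomLogStrict(byte[] data) {
        return matchesAt(data, "Date:", 0, 0) && matchesAt(data, "Level:", 0, 1000);
    }

    public static void main(String[] args) {
        byte[] log = "Date: 06/25/2014 15:54:19\nLevel: INFO\n".getBytes(StandardCharsets.US_ASCII);
        byte[] note = "Date:         06/25/2014 15:54:19\nAnd some more text\n"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println("log,  sibling reading: " + looksLikeCustomLog(log));
        System.out.println("log,  nested reading:  " + looksLikeCustomLogStrict(log));
        System.out.println("note, sibling reading: " + looksLikeCustomLog(note));
        System.out.println("note, nested reading:  " + looksLikeCustomLogStrict(note));
    }
}
```

Under the sibling reading, any file starting with "Date:" would still be claimed by text/custom-logs, so the nested form may be the safer choice if only real log files should match.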

But we were thinking that there might be other potential mismatches, so it 
would be good to be able to pass this configuration to Tika from outside our 
application. In an OSGi setup, however, "on the classpath" means inside the 
plugin folder. From our point of view that is not a good place, because these 
plugins get replaced at every release, so the patch would have to be 
maintained each time. That is why we are looking to do it from outside.
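
As a rough sketch of the "from outside" idea, the application could resolve an external file location itself and only then hand the file to Tika. The property name myapp.tika.custom-mimetypes and the class below are hypothetical. Tika's MimeTypesFactory.create(InputStream...) is one candidate entry point for the resulting stream, but how it behaves under an OSGi class loader is not verified here.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

/** Hypothetical helper; the property name below is an assumption, not a Tika setting. */
final class ExternalMimeConfigSketch {

    static final String PROP = "myapp.tika.custom-mimetypes";

    /** Returns the configured path only if it points at a readable regular file. */
    static Optional<Path> resolve(String configured) {
        if (configured == null || configured.trim().isEmpty()) {
            return Optional.empty();
        }
        Path p = Paths.get(configured.trim());
        if (Files.isRegularFile(p) && Files.isReadable(p)) {
            return Optional.of(p);
        }
        return Optional.empty();
    }

    public static void main(String[] args) throws IOException {
        Optional<Path> external = resolve(System.getProperty(PROP));
        if (external.isPresent()) {
            try (InputStream in = Files.newInputStream(external.get())) {
                // The stream would be passed to Tika here, together with the core definitions.
                System.out.println("Would load custom MIME definitions from " + external.get());
            }
        } else {
            System.out.println("No external file configured; using built-in definitions only");
        }
    }
}
```

It could be started with -Dmyapp.tika.custom-mimetypes=/some/path/custom-mimetypes.xml, so the file lives outside the plugin folder and survives a release upgrade.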




> Plain text file identified as rfc822 and which can cause StackOverflowError
> ---------------------------------------------------------------------------
>
>                 Key: TIKA-2443
>                 URL: https://issues.apache.org/jira/browse/TIKA-2443
>             Project: Tika
>          Issue Type: Bug
>          Components: mime
>    Affects Versions: 1.11, 1.16
>            Reporter: Viorica Visan
>
> I have a file called test.txt, containing only:
> Date:         06/25/2014 15:54:19
> And some more text I am writing. This will
> be detected as rfc822
> This file is detected and parsed as message/rfc822. 
> I think the magic rule on "Date: " is too strong and the file should have 
> been detected only as text/plain. It looks to me like the reverse of  
> https://issues.apache.org/jira/browse/TIKA-879 
> We noticed this issue because we have a large log file with many lines 
> containing Date, Log level and Message. It is parsed as message/rfc822 (only 
> because it starts with "Date:") and eventually throws a 
> StackOverflowError. 
> Is there some workaround to make this rule weaker, e.g. through configuration? 
> We use DefaultParser and everything default. We use Tika 1.11, but we also 
> tried Tika 1.16 and saw the same StackOverflowError (which probably again 
> happened because the file was parsed as rfc822).
> The only workaround that I found was to add 
> custom-mimetypes.xml like this
>  <mime-type type="text/plain">
>     <magic priority="70">
>       <match value="Date:" type="string" offset="0"/>
>     </magic>
>   </mime-type>
> Would you recommend some other workaround to make sure the file does not get 
> parsed as rfc822? 
> And I have another question: can this custom-mimetypes.xml be specified from 
> an external location? 
> Many thanks.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
