RE: TIKA-1509 (2.x breaking parser change) - ready for first review!

2018-03-19 Thread Allison, Timothy B.
Y, this is an impressive step forward.  Thank you, Nick!

-----Original Message-----
From: Chris Mattmann [mailto:mattm...@apache.org] 
Sent: Sunday, March 18, 2018 6:00 PM
To: dev@tika.apache.org
Subject: Re: TIKA-1509 (2.x breaking parser change) - ready for first review!

Completely agree, awesome job Nick.

I will definitely try this week as well.

Thank you!

Sincerely,
Chris



On 3/18/18, 2:47 PM, "David Meikle"  wrote:

Nice one Nick!  Will take a look this week.

Cheers,
Dave

On 14 March 2018 at 17:38, Nick Burch  wrote:

> Hi All
>
> As promised, I've finally had a go at implementing my ideas for
> TIKA-1509 / https://wiki.apache.org/tika/CompositeParserDiscussion /
> the breaking 2.x parser change
>
> My work so far is in this github branch, and is ready for review!
> https://github.com/apache/tika/tree/multiple-parsers
>
>
> It seems to work fine for the Fallback case, and for the Supplemental
> case. You can set a policy that controls how clashing metadata is handled,
> currently "first one to set a key wins", "last one to set a key wins",
> "ignore previous parsers", and "keep old and new unique values"
>
> I've also done a proof of concept for the "pick best" case: it runs the
> text parser with a specified set of different charsets, captures the text
> from each, "picks the best" (currently hard coded to the 1st...), then runs
> for real with that one.
>
>
> Key TODOs - Support InputStreamFactory, properly work out what mimetypes
> to claim to support, Tika Config XML friendly helper for the metadata
> clash policy, review ContentHandlerFactory signature and tweak if needed.
>
> Proposed breaking 2.x change - add a second parse method that takes a
> ContentHandlerFactory instead of a ContentHandler, with most parsers that
> receive one just grabbing a single handler from it and using that as before
>
>
> Before I do any more though... Thoughts? Comments? Ideas? Changes? Should
> I stop? Carry on? Modify it? Other?
>
> Nick
>
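
For anyone reviewing the branch, the four metadata clash policies Nick lists can be pictured roughly as below. This is only a sketch under stated assumptions: the enum names, the merge method and the use of plain maps instead of Tika's Metadata class are all hypothetical and are not taken from the multiple-parsers branch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MetadataClashSketch {

    // Hypothetical names for the four behaviours listed in the mail above.
    enum ClashPolicy { FIRST_WINS, LAST_WINS, DISCARD_PREVIOUS, KEEP_ALL_UNIQUE }

    // Merges the metadata from a later parser into what earlier parsers produced.
    // Plain maps stand in for Tika's Metadata class to keep the sketch self-contained.
    static Map<String, List<String>> merge(Map<String, List<String>> previous,
                                           Map<String, List<String>> latest,
                                           ClashPolicy policy) {
        if (policy == ClashPolicy.DISCARD_PREVIOUS) {
            // "ignore previous parsers": only the newest parser's values survive.
            return deepCopy(latest);
        }
        Map<String, List<String>> merged = deepCopy(previous);
        for (Map.Entry<String, List<String>> entry : latest.entrySet()) {
            List<String> existing = merged.get(entry.getKey());
            if (existing == null) {
                // No clash: the key is new, so every policy accepts it.
                merged.put(entry.getKey(), new ArrayList<>(entry.getValue()));
            } else if (policy == ClashPolicy.LAST_WINS) {
                merged.put(entry.getKey(), new ArrayList<>(entry.getValue()));
            } else if (policy == ClashPolicy.KEEP_ALL_UNIQUE) {
                for (String value : entry.getValue()) {
                    if (!existing.contains(value)) {
                        existing.add(value);
                    }
                }
            }
            // FIRST_WINS: a clashing key keeps the values set by the earlier parser.
        }
        return merged;
    }

    // Copies the map and its value lists so callers' inputs are never mutated.
    static Map<String, List<String>> deepCopy(Map<String, List<String>> source) {
        Map<String, List<String>> copy = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : source.entrySet()) {
            copy.put(entry.getKey(), new ArrayList<>(entry.getValue()));
        }
        return copy;
    }

    public static void main(String[] args) {
        Map<String, List<String>> first = new LinkedHashMap<>();
        first.put("title", Arrays.asList("Report"));
        Map<String, List<String>> second = new LinkedHashMap<>();
        second.put("title", Arrays.asList("Report", "Draft"));
        second.put("author", Arrays.asList("Nick"));
        for (ClashPolicy policy : ClashPolicy.values()) {
            System.out.println(policy + " -> " + merge(first, second, policy));
        }
    }
}
```

Keys that only one parser sets are never affected by the policy; it only matters when two parsers set the same key.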






[jira] [Commented] (TIKA-2608) tika matlab parser incorrectly identifies content type of minified javascript file

2018-03-19 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16404900#comment-16404900
 ] 

Nick Burch commented on TIKA-2608:
--

I've just tried with the latest build from master, and it works as expected: 
{{application/javascript}} when given the filename, and {{text/plain}} without 
the filename.

1.18 will be cut from the master branch, so you should either build from a 
github checkout of the master branch, or grab a nightly from our Jenkins server.

> tika matlab parser incorrectly identifies content type of minified javascript 
> file
> --
>
> Key: TIKA-2608
> URL: https://issues.apache.org/jira/browse/TIKA-2608
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.17, 1.18
> Environment: * xwiki 10.1,
>  * Tomcat 8 (8.0.32-1ubuntu1)
>  * Ubuntu 16.04.4 LTS
>  * Oracle Java 1.8.0_161-b12
> Reporter: pdwalker
> Priority: Minor
> Fix For: 2.0.0
>
>
> When Tika "detects" the following file, it returns the wrong content type:
> {{$ curl -I 
> [https://wiki.charltonslaw.com/xwiki/webjars/wiki%3Ait/mxgraph-editor/3.7.2/mxGraphEditor.min.js]}}
>  {{HTTP/1.1 200 OK}}
>  {{Server: nginx/1.10.3 (Ubuntu)}}
>  {{Date: Fri, 16 Mar 2018 10:09:54 GMT}}
>  {{Content-Type: text/x-matlab}}
>  {{  [snip]}}
>  {{X-Frame-Options: SAMEORIGIN}}
> However, the unminified version of the same file returns the correct type:
> {{$ curl -I 
> [https://wiki.charltonslaw.com/xwiki/webjars/wiki%3Ait/mxgraph-editor/3.7.2/mxGraphEditor.js]}}
>  {{HTTP/1.1 200 OK}}
>  {{Server: nginx/1.10.3 (Ubuntu)}}
>  {{Date: Fri, 16 Mar 2018 10:10:25 GMT}}
>  {{Content-Type: application/javascript}}
>  {{  [snip]}}
>  {{X-Frame-Options: SAMEORIGIN}}
> This causes a problem when my xwiki installation is behind an SSL proxy 
> (nginx) and I enable the add_header X-Content-Type-Options nosniff; header.  
> Modern browsers then return the following error:
> {quote}Refused to execute script from 
> '[https://wiki.charltonslaw.com/xwiki/webjars/wiki%3Ait/mxgraph-editor/3.7.2/mxGraphEditor.min.js|https://wiki.proxy.domain/xwiki/webjars/wiki%3Ait/mxgraph-editor/3.7.2/mxGraphEditor.min.js]'
>  because its MIME type ('text/x-matlab') is not executable, and strict MIME 
> type checking is enabled.
> {quote}
> My "solution" is to disable the strict MIME type checking in the SSL proxy, 
> but I don't think that is ideal.  It'd be better if the matlab parser didn't 
> claim random minified js files as its own.
>  
> Note:
> Edit: I marked the problem as being with the matlab parser, but that may be 
> incorrect - I'm not sure exactly what code actually does the detection.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TIKA-2610) Extend HtmlMapper isDiscardElement method with Attributes parameter

2018-03-19 Thread Aleksei Udalov (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16404643#comment-16404643
 ] 

Aleksei Udalov commented on TIKA-2610:
--

I'd be happy to contribute the fix if the team agrees it makes sense.

> Extend HtmlMapper isDiscardElement method with Attributes parameter
> ---
>
> Key: TIKA-2610
> URL: https://issues.apache.org/jira/browse/TIKA-2610
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.17
> Reporter: Aleksei Udalov
> Priority: Major
>
> Currently, to discard HTML elements by attribute value/existence, as in this 
> example from one of our projects:
> {code:html}
> Some content to be ignored by custom search indexer 
> (Tika parser)
> {code}
> it's required to implement a custom handler with logic very similar to what 
> we have in org.apache.tika.parser.html.HtmlHandler. It could instead be done 
> easily by continuing to use HtmlHandler and setting into the ParseContext an 
> instance of HtmlMapper that overrides a (newly added) 
> isDiscardElement(String name, Attributes attributes) method.
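
To make the proposal concrete, here is a minimal sketch of what an attribute-aware discard check could look like. Everything in it is an assumption for illustration: the AttributeAwareMapper interface is a hypothetical stand-in for the proposed HtmlMapper extension (it is not Tika's current interface), and the data-noindex attribute name is made up, since the markup of the example above did not survive in this mail. Only the org.xml.sax classes are real JDK API.

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;

class DiscardByAttributeSketch {

    // Hypothetical stand-in for the proposed HtmlMapper extension; not Tika's actual interface.
    interface AttributeAwareMapper {
        boolean isDiscardElement(String name, Attributes attributes);
    }

    // Discards any element that carries the given marker attribute, whatever its value.
    static class MarkerAttributeMapper implements AttributeAwareMapper {
        private final String markerAttribute;

        MarkerAttributeMapper(String markerAttribute) {
            this.markerAttribute = markerAttribute;
        }

        @Override
        public boolean isDiscardElement(String name, Attributes attributes) {
            // Attributes.getValue(qName) returns null when the attribute is absent.
            return attributes != null && attributes.getValue(markerAttribute) != null;
        }
    }

    // Test helper: builds SAX attributes from alternating name/value strings.
    static Attributes attrs(String... nameValuePairs) {
        AttributesImpl attributes = new AttributesImpl();
        for (int i = 0; i + 1 < nameValuePairs.length; i += 2) {
            attributes.addAttribute("", nameValuePairs[i], nameValuePairs[i],
                    "CDATA", nameValuePairs[i + 1]);
        }
        return attributes;
    }

    public static void main(String[] args) {
        // "data-noindex" is a made-up attribute name used only for this demo.
        AttributeAwareMapper mapper = new MarkerAttributeMapper("data-noindex");
        System.out.println(mapper.isDiscardElement("div", attrs("data-noindex", "true")));
        System.out.println(mapper.isDiscardElement("div", attrs("class", "content")));
    }
}
```

If the method were added to HtmlMapper, such a mapper would presumably be registered the same way custom mappers are today, by setting it into the ParseContext under HtmlMapper.class.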





[jira] [Updated] (TIKA-2610) Extend HtmlMapper isDiscardElement method with Attributes parameter

2018-03-19 Thread Aleksei Udalov (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aleksei Udalov updated TIKA-2610:
-
Description: 
Currently, to discard HTML elements by attribute value/existence, as in this 
example from one of our projects:
{code:html}
Some content to be ignored by custom search indexer 
(Tika parser)
{code}
it's required to implement a custom handler with logic very similar to what we 
have in org.apache.tika.parser.html.HtmlHandler. It could instead be done easily 
by continuing to use HtmlHandler and setting into the ParseContext an instance 
of HtmlMapper that overrides a (newly added) 
isDiscardElement(String name, Attributes attributes) method.

  was:
Currently, to disregard HTML elements by attribute value/existence, as in this 
example from one of our projects:
{code:html}
Some content to be ignored by custom search indexer 
(Tika parser)
{code}
it's required to implement a custom handler with logic very similar to what we 
have in org.apache.tika.parser.html.HtmlHandler. It could instead be done easily 
by continuing to use HtmlHandler and setting into the ParseContext an instance 
of HtmlMapper that overrides a (newly added) 
isDiscardElement(String name, Attributes attributes) method.


> Extend HtmlMapper isDiscardElement method with Attributes parameter
> ---
>
> Key: TIKA-2610
> URL: https://issues.apache.org/jira/browse/TIKA-2610
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.17
> Reporter: Aleksei Udalov
> Priority: Major
>
> Currently, to discard HTML elements by attribute value/existence, as in this 
> example from one of our projects:
> {code:html}
> Some content to be ignored by custom search indexer 
> (Tika parser)
> {code}
> it's required to implement a custom handler with logic very similar to what 
> we have in org.apache.tika.parser.html.HtmlHandler. It could instead be done 
> easily by continuing to use HtmlHandler and setting into the ParseContext an 
> instance of HtmlMapper that overrides a (newly added) 
> isDiscardElement(String name, Attributes attributes) method.





[jira] [Created] (TIKA-2610) Extend HtmlMapper isDiscardElement method with Attributes parameter

2018-03-19 Thread Aleksei Udalov (JIRA)
Aleksei Udalov created TIKA-2610:


Summary: Extend HtmlMapper isDiscardElement method with Attributes parameter
Key: TIKA-2610
URL: https://issues.apache.org/jira/browse/TIKA-2610
Project: Tika
Issue Type: Improvement
Components: parser
Affects Versions: 1.17
Reporter: Aleksei Udalov


Currently, to disregard HTML elements by attribute value/existence, as in this 
example from one of our projects:
{code:html}
Some content to be ignored by custom search indexer 
(Tika parser)
{code}
it's required to implement a custom handler with logic very similar to what we 
have in org.apache.tika.parser.html.HtmlHandler. It could instead be done easily 
by continuing to use HtmlHandler and setting into the ParseContext an instance 
of HtmlMapper that overrides a (newly added) 
isDiscardElement(String name, Attributes attributes) method.


