Re: [DISCUSS] Enable specific ContentHandler for tika-server

2017-10-24 Thread Chris Mattmann
This makes sense to me, +1 Giuseppe!



On 10/24/17, 6:12 PM, "Giuseppe Totaro"  wrote:

Hi folks,

I am developing the proposed solutions within tika-server for enabling
specific ContentHandlers. Basically, I am working to provide the ability to
specify the name of the ContentHandler to use, either on the command line
or via an HTTP header.
In order to complete my work, I would like to get your feedback about the
following aspects:

   1. To create and use the given ContentHandler, should I modify each
   method within the TikaResource class (as well as the other classes
   within org.apache.tika.server.resource) that calls the parse method, so
   that it wraps the ContentHandler currently used? Alternatively, I could
   create a new method (and therefore a new REST API) specifically focused
   on creating a ContentHandler from the list provided by the user. Of
   course, I am totally open to other solutions.

   2. As ContentHandlers often provide different constructors, we would
   need a mechanism to determine via reflection which constructor and
   parameters to use. I think we could load the ContentHandler class by
   calling the static method Class.forName(String className) [0] with the
   fully-qualified name of the given class, and then use the method
   getConstructor(Class... parameterTypes) [1] to select the constructor
   and instantiate the ContentHandler.

   3. If you agree with the above, I think that we can allow users to
   provide the parameters according to RFC 822 [3] so that they can give the
   name of the ContentHandler to be used and its parameters as a
   semicolon-separated list of entries:

= X-Content-Handler:  *[, ]
=  *[; ]
=  = 

   Consistently, I would enable the same syntax when using the command-line
   option:

   java -jar tika-server-X.jar -contentHandler *[,]

I look forward to having your feedback.

Thanks a lot,
Giuseppe

[0]
https://docs.oracle.com/javase/8/docs/api/java/lang/Class.html#forName-java.lang.String-
[1]
https://docs.oracle.com/javase/8/docs/api/java/lang/Class.html#getConstructor-java.lang.Class...-
[3] https://www.w3.org/Protocols/rfc822/

On Fri, Oct 6, 2017 at 3:06 PM, Sergey Beryozkin 
wrote:

> Konstantin, by the way, if you are interested in a good discussion about
> using serialized lambdas, you are welcome to comment on the relevant text
> in the Tika Concerns Beam thread, though maybe Beam knows how to take care
> of the issues you raised...
>
> Thanks, Sergey
>
> On 06/10/17 18:27, Sergey Beryozkin wrote:
>
>> On 06/10/17 18:08, Konstantin Gribov wrote:
>>
>>> My +1 to this idea.
>>>
>>> IMHO, the second option is more flexible. I also like Nick's suggestion
>>> about using a default package for handlers and interpreting a
>>> dot-separated string as an fqcn. Solr does a similar thing and it's very
>>> convenient to use (but they use the prefix `solr.` for their classes in
>>> a predefined package, and anything else is interpreted as an fqcn).
>>>
>>> I'll add that you could allow the user to pass several comma-separated
>>> handlers, to build a content-handler stack if the user wants to.
>>>
>>> I would disagree with Sergey about serialized lambdas for 2 reasons:
>>> - it's useful only for Java clients;
>>> - it could bring very nasty bugs leading to RCE-class vulnerabilities,
>>> so it's very controversial from a security PoV.
>>>
>> Sure. I was not actually suggesting to use them in Tika natively, I only
>> referred to it as the alternative mentioned in the context of the Beam
>> integration work
>>
>> Sergey
>>
>>>
>>> On Thu, Sep 28, 2017 at 11:35 PM Giuseppe Totaro 
>>> wrote:
>>>
>>> Hi folks,

 If I am not wrong, you currently cannot configure a specific
 ContentHandler while using tika-server. I mean that you can configure your
 own parser [0], but you cannot control which ContentHandler the parser
 leverages to extract text and metadata (e.g., you cannot use
 PhoneExtractingContentHandler, StandardsExtractingContentHandler, etc.).
 If that is correct, it would be nice to enable the use of specific
 ContentHandlers within tika-server, and I would like to discuss how to
 solve this issue generally.

 I propose two solutions:

 1. augment the TikaConfig class so that a specific ContentHandler can be
 used in tika-config.xml;
 2. determine the ContentHandler to use for parsing through HTTP
 

Re: [DISCUSS] Enable specific ContentHandler for tika-server

2017-10-24 Thread Giuseppe Totaro
Hi folks,

I am developing the proposed solutions within tika-server for enabling
specific ContentHandlers. Basically, I am working to provide the ability to
specify the name of the ContentHandler to use, either on the command line
or via an HTTP header.
In order to complete my work, I would like to get your feedback about the
following aspects:

   1. To create and use the given ContentHandler, should I modify each
   method within the TikaResource class (as well as the other classes
   within org.apache.tika.server.resource) that calls the parse method, so
   that it wraps the ContentHandler currently used? Alternatively, I could
   create a new method (and therefore a new REST API) specifically focused
   on creating a ContentHandler from the list provided by the user. Of
   course, I am totally open to other solutions.

   2. As ContentHandlers often provide different constructors, we would
   need a mechanism to determine via reflection which constructor and
   parameters to use. I think we could load the ContentHandler class by
   calling the static method Class.forName(String className) [0] with the
   fully-qualified name of the given class, and then use the method
   getConstructor(Class... parameterTypes) [1] to select the constructor
   and instantiate the ContentHandler.

   3. If you agree with the above, I think that we can allow users to
   provide the parameters according to RFC 822 [3] so that they can give the
   name of the ContentHandler to be used and its parameters as a
   semicolon-separated list of entries:

= X-Content-Handler:  *[, ]
=  *[; ]
=  = 

   Consistently, I would enable the same syntax when using the command-line
   option:

   java -jar tika-server-X.jar -contentHandler *[,]
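
The reflection step in point 2 could look roughly like the sketch below. It
uses only JDK classes so that it runs without Tika on the classpath.
HandlerFactory, create and WrappingHandler are hypothetical names chosen for
illustration, not existing tika-server code, and preferring a constructor
that takes a ContentHandler is an assumption about how wrapping handlers
would be built.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.helpers.DefaultHandler;

final class HandlerFactory {

    /** Hypothetical decorator standing in for a handler that wraps another one. */
    public static class WrappingHandler extends DefaultHandler {
        public final ContentHandler delegate;

        public WrappingHandler(ContentHandler delegate) {
            this.delegate = delegate;
        }
    }

    /**
     * Loads the named class and instantiates it. A public constructor taking a
     * ContentHandler is tried first (assumption: this is how handlers get
     * stacked); otherwise the public no-argument constructor is used.
     */
    static ContentHandler create(String className, ContentHandler inner)
            throws ReflectiveOperationException {
        Class<?> raw = Class.forName(className);
        if (!ContentHandler.class.isAssignableFrom(raw)) {
            throw new IllegalArgumentException(className + " is not a ContentHandler");
        }
        Class<? extends ContentHandler> type = raw.asSubclass(ContentHandler.class);
        try {
            return type.getConstructor(ContentHandler.class).newInstance(inner);
        } catch (NoSuchMethodException e) {
            // No wrapping constructor: fall back to the no-argument one.
            return type.getConstructor().newInstance();
        }
    }

    public static void main(String[] args) throws Exception {
        ContentHandler base = create("org.xml.sax.helpers.DefaultHandler", null);
        ContentHandler stacked = create(WrappingHandler.class.getName(), base);
        System.out.println(stacked.getClass().getName());
    }
}
```

Because the class name would come from an HTTP header or the command line,
a real implementation would probably want to restrict it to a known set of
handler classes before calling Class.forName, given the security concerns
raised earlier in this thread.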

I look forward to having your feedback.

Thanks a lot,
Giuseppe

[0]
https://docs.oracle.com/javase/8/docs/api/java/lang/Class.html#forName-java.lang.String-
[1]
https://docs.oracle.com/javase/8/docs/api/java/lang/Class.html#getConstructor-java.lang.Class...-
[3] https://www.w3.org/Protocols/rfc822/

On Fri, Oct 6, 2017 at 3:06 PM, Sergey Beryozkin 
wrote:

> Konstantin, by the way, if you are interested in a good discussion about
> using serialized lambdas, you are welcome to comment on the relevant text
> in the Tika Concerns Beam thread, though maybe Beam knows how to take care
> of the issues you raised...
>
> Thanks, Sergey
>
> On 06/10/17 18:27, Sergey Beryozkin wrote:
>
>> On 06/10/17 18:08, Konstantin Gribov wrote:
>>
>>> My +1 to this idea.
>>>
>>> IMHO, the second option is more flexible. I also like Nick's suggestion
>>> about using a default package for handlers and interpreting a
>>> dot-separated string as an fqcn. Solr does a similar thing and it's very
>>> convenient to use (but they use the prefix `solr.` for their classes in
>>> a predefined package, and anything else is interpreted as an fqcn).
>>>
>>> I'll add that you could allow the user to pass several comma-separated
>>> handlers, to build a content-handler stack if the user wants to.
>>>
>>> I would disagree with Sergey about serialized lambdas for 2 reasons:
>>> - it's useful only for Java clients;
>>> - it could bring very nasty bugs leading to RCE-class vulnerabilities,
>>> so it's very controversial from a security PoV.
>>>
>> Sure. I was not actually suggesting to use them in Tika natively, I only
>> referred to it as the alternative mentioned in the context of the Beam
>> integration work
>>
>> Sergey
>>
>>>
>>> On Thu, Sep 28, 2017 at 11:35 PM Giuseppe Totaro 
>>> wrote:
>>>
>>> Hi folks,

 If I am not wrong, you currently cannot configure a specific
 ContentHandler while using tika-server. I mean that you can configure your
 own parser [0], but you cannot control which ContentHandler the parser
 leverages to extract text and metadata (e.g., you cannot use
 PhoneExtractingContentHandler, StandardsExtractingContentHandler, etc.).
 If that is correct, it would be nice to enable the use of specific
 ContentHandlers within tika-server, and I would like to discuss how to
 solve this issue generally.

 I propose two solutions:

 1. augment the TikaConfig class so that a specific ContentHandler can be
 used in tika-config.xml;
 2. determine the ContentHandler to use for parsing through HTTP headers,
 for example:
 curl -T filename.pdf http://localhost:9998/meta --header "X-Content-Handler: PhoneExtractingContentHandler"
 This would also affect the TikaResource.java class.

 I look forward to having your feedback. I strongly believe that every user
 who wants to use Tika as a service through tika-server and needs to
 extract content and metadata such as phone numbers, standard references,
 etc. would be very happy.

 Thanks a lot,
 Giuseppe


>
> --
> Sergey Beryozkin
>
> Talend 

Re: Tika 2 parsers

2017-10-24 Thread Sergey Beryozkin

I did try the modules in the earlier version of the CXF demo,

see the right panel,

https://github.com/apache/cxf/commit/c2ccecb23ba23497c95be89f9b37f38c69faba7a#diff-b5ed531ebf92978dcbcf1ac6cc6331c0

They should be available in the snapshot repo

Cheers, Sergey
On 24/10/17 19:45, Allison, Timothy B. wrote:

We'll switch master over to the 2.0 layout after our next release, which should 
happen shortly after the release of PDFBox 2.0.8...roughly in the next week for 
PDFBox, next month for Tika.

We have abandoned keeping the current 2.x up to date, and I was hoping there 
would at least be a build here: 
https://builds.apache.org/view/T/view/Tika/job/tika-2.x/, but there isn't a 
clean build there.

So, unfortunately, for now, your best bet is to build it yourself from source.  
Sorry.



-----Original Message-----
From: Gethin James [mailto:gja...@nuxeo.com]
Sent: Tuesday, October 24, 2017 12:19 PM
To: dev@tika.apache.org
Subject: Tika 2 parsers

Hi, I am interested in trying the more modular approach of using the Tika 2 
parsers.  Are the Tika 2 artifacts available in a maven repo somewhere?  Is 
there any documentation on how to use them or how they differ from Tika 1?

Thanks,
Gethin.



[jira] [Resolved] (TIKA-1788) message/rfc822 parser doesn't identify attachment filenames from Content-Disposition header

2017-10-24 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-1788.
---
   Resolution: Fixed
Fix Version/s: 1.17

Thank you, AarjavP!

> message/rfc822 parser doesn't identify attachment filenames from 
> Content-Disposition header
> ---
>
> Key: TIKA-1788
> URL: https://issues.apache.org/jira/browse/TIKA-1788
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.11
> Reporter: Sergey Tsalkov
> Assignee: Tim Allison
> Fix For: 1.17
>
> Attachments: grep_content_disposition.zip
>
>
> rfc822 email files can contain attachments as subparts, and they'll
> generally specify the filename of the attachment in a manner like
> this:
> Content-Disposition: attachment;
> filename*=utf-8''image001.jpg
> Tika doesn't seem to be grabbing that information at all!
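
The filename*=utf-8''image001.jpg form quoted above is an RFC 2231 extended
parameter value: a charset, an optional language tag, and percent-encoded
bytes, separated by single quotes. A minimal decoding sketch follows.
ExtValueDecoder is a hypothetical helper written for illustration, not the
code Tika or its mail-parsing dependency uses, and it ignores RFC 2231
parameter continuations (filename*0*, filename*1*, ...).

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;

final class ExtValueDecoder {

    /** Decodes charset'language'percent-encoded-text into a Java string. */
    static String decode(String extValue) {
        int first = extValue.indexOf('\'');
        int second = extValue.indexOf('\'', first + 1);
        if (first < 0 || second < 0) {
            throw new IllegalArgumentException("Not an extended value: " + extValue);
        }
        // The language tag between the two quotes is not needed for decoding.
        Charset charset = Charset.forName(extValue.substring(0, first));
        String encoded = extValue.substring(second + 1);

        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        for (int i = 0; i < encoded.length(); i++) {
            char c = encoded.charAt(i);
            if (c == '%' && i + 2 < encoded.length()) {
                // %XX is one raw byte in the declared charset.
                bytes.write(Integer.parseInt(encoded.substring(i + 1, i + 3), 16));
                i += 2;
            } else {
                // Unescaped characters are expected to be plain ASCII.
                bytes.write(c);
            }
        }
        return new String(bytes.toByteArray(), charset);
    }

    public static void main(String[] args) {
        System.out.println(decode("utf-8''image001.jpg"));
        System.out.println(decode("UTF-8''%e2%82%ac%20rates.txt"));
    }
}
```

A hand-written percent decoder is used here rather than
java.net.URLDecoder, because URLDecoder also turns '+' into a space, which
is not part of this encoding.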



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (TIKA-2478) MBOX import includes redundant copies of the text

2017-10-24 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217509#comment-16217509
 ] 

Tim Allison edited comment on TIKA-2478 at 10/24/17 7:30 PM:
-

First patch.  This incorporates the test file from TIKA-2471 and [~kkrugler]'s 
test files.  Thank you!

While this change will make the behavior equivalent to the OutlookParser and 
how it handles multiple bodies, it will be a pretty big breaking change.

Given the complexity of this patch, and the breaking change-ness of it, I'm 
tempted to hold off until Tika 2.0.

Any and all feedback is welcomed.  Thank you!


was (Author: talli...@mitre.org):
First patch.  This incorporates the test file from TIKA-2471 and [~kkrugler]'s 
test files.  Thakn you!

While this change will make the behavior equivalent to the OutlookParser and 
how it handles multiple bodies, it will be a pretty big breaking change.

Given the complexity of this patch, and the breaking change-ness of it, I'm 
tempted to hold off until Tika 2.0.

Any and all feedback is welcomed.  Thank you!

> MBOX import includes redundant copies of the text
> -
>
> Key: TIKA-2478
> URL: https://issues.apache.org/jira/browse/TIKA-2478
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.16
> Reporter: Robert Letzler
> Assignee: Tim Allison
> Priority: Minor
> Attachments: TIKA-2478.patch, UET6KCXR5FYIEJYKUCK2AKF3FLXTRNAT.eml,
> mixed-simple, mixed-with-pdf-inline
>
>
> MBOX messages often get parsed into four documents:
> a. The mbox file (outer container): "/"
> b. The actual email: "/embedded-1"
> c. The utf-8 text content of the email: "/embedded-1/embedded-2"
> d. The utf-8 html content of the email: "/embedded-1/embedded-3"
> Entries c and d are redundant and distracting.  The MSG parser parses the
> first non-null email body and then skips the rest.  Please modify the MBOX
> parser so that it does not create separate "attached" documents for the
> html body and the text body.
> The attachment to https://issues.apache.org/jira/browse/TIKA-2471 is an
> example of input sufficient to generate this behavior.
> Thanks!





[jira] [Updated] (TIKA-2478) MBOX import includes redundant copies of the text

2017-10-24 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2478:
--
Attachment: TIKA-2478.patch

First patch.  This incorporates the test file from TIKA-2471 and [~kkrugler]'s 
test files.  Thakn you!

While this change will make the behavior equivalent to the OutlookParser and 
how it handles multiple bodies, it will be a pretty big breaking change.

Given the complexity of this patch, and the breaking change-ness of it, I'm 
tempted to hold off until Tika 2.0.

Any and all feedback is welcomed.  Thank you!

> MBOX import includes redundant copies of the text
> -
>
> Key: TIKA-2478
> URL: https://issues.apache.org/jira/browse/TIKA-2478
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.16
> Reporter: Robert Letzler
> Assignee: Tim Allison
> Priority: Minor
> Attachments: TIKA-2478.patch, UET6KCXR5FYIEJYKUCK2AKF3FLXTRNAT.eml,
> mixed-simple, mixed-with-pdf-inline
>
>
> MBOX messages often get parsed into four documents:
> a. The mbox file (outer container): "/"
> b. The actual email: "/embedded-1"
> c. The utf-8 text content of the email: "/embedded-1/embedded-2"
> d. The utf-8 html content of the email: "/embedded-1/embedded-3"
> Entries c and d are redundant and distracting.  The MSG parser parses the
> first non-null email body and then skips the rest.  Please modify the MBOX
> parser so that it does not create separate "attached" documents for the
> html body and the text body.
> The attachment to https://issues.apache.org/jira/browse/TIKA-2471 is an
> example of input sufficient to generate this behavior.
> Thanks!





RE: Tika 2 parsers

2017-10-24 Thread Allison, Timothy B.
We'll switch master over to the 2.0 layout after our next release, which should 
happen shortly after the release of PDFBox 2.0.8...roughly in the next week for 
PDFBox, next month for Tika.

We have abandoned keeping the current 2.x up to date, and I was hoping there 
would at least be a build here: 
https://builds.apache.org/view/T/view/Tika/job/tika-2.x/, but there isn't a 
clean build there.

So, unfortunately, for now, your best bet is to build it yourself from source.  
Sorry.



-----Original Message-----
From: Gethin James [mailto:gja...@nuxeo.com] 
Sent: Tuesday, October 24, 2017 12:19 PM
To: dev@tika.apache.org
Subject: Tika 2 parsers

Hi, I am interested in trying the more modular approach of using the Tika 2 
parsers.  Are the Tika 2 artifacts available in a maven repo somewhere?  Is 
there any documentation on how to use them or how they differ from Tika 1?

Thanks,
Gethin.


Tika 2 parsers

2017-10-24 Thread Gethin James
Hi, I am interested in trying the more modular approach of using the Tika 2
parsers.  Are the Tika 2 artifacts available in a maven repo somewhere?  Is
there any documentation on how to use them or how they differ from Tika 1?

Thanks,
Gethin.