[jira] [Commented] (TIKA-3560) Tika 2.0 (and 2.1) parses doc with less fidelity than using 1.24.1

2021-10-06 Thread Josh Burchard (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17425171#comment-17425171
 ] 

Josh Burchard commented on TIKA-3560:
-

Thank you, Tim. The wiki looks good so far and I appreciate you creating it.  
I'll let you know if there are any issues I see moving forward.

> Tika 2.0 (and 2.1) parses doc with less fidelity than using 1.24.1
> --
>
> Key: TIKA-3560
> URL: https://issues.apache.org/jira/browse/TIKA-3560
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.0.0, 2.1.0
> Environment: Windows 10
>Reporter: Josh Burchard
>Priority: Major
> Attachments: Capture.jpg
>
>
> I'm parsing an old .doc file and I'm sending my request to the /rmeta/text 
> endpoint. I see that some metadata fields that were returned to me from Tika 
> 1.24.1 are no longer returned in 2.0 and above.
> I diffed the output between 1.24.1, 2.0 and 2.1.  Please see the screenshot 
> I've attached. 
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (TIKA-3560) Tika 2.0 (and 2.1) parses doc with less fidelity than using 1.24.1

2021-10-04 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17424178#comment-17424178
 ] 

Tim Allison commented on TIKA-3560:
---

I updated the metadata section in our wiki page "migrating to tika 2.x" today.  
I looked into subject, and it looks like we were putting "keywords" into 
subject in 1.x as well as into keywords.  We've kept that behavior in 2.x.  I'm 
not sure why there's an array in 2.x but not in 1.x.  Those should be the same. 

In 2.1.1-SNAPSHOT, I added empty checks for subject, keywords, title and other 
keys in the MSOffice parsers.  They used to allow an empty string for string 
based metadata values. 
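
The empty check described above amounts to skipping a value before it is stored. A minimal sketch in Python follows; it is not Tika's actual Java code. The helper name is hypothetical, and treating whitespace-only strings as empty is an assumption.

```python
def set_if_not_blank(metadata: dict, key: str, value) -> bool:
    """Store value under key only when it is a non-blank string.

    Returns True if the value was stored, False if it was skipped.
    """
    if value is None or not value.strip():
        return False
    metadata[key] = value
    return True


# Illustrative values only.
meta = {}
stored_empty = set_if_not_blank(meta, "dc:title", "")
stored_real = set_if_not_blank(meta, "dc:subject", "budget")
print(meta)
```

With a check like this, a key with an empty value is simply absent from the output, so consumers should test for presence rather than compare against an empty string.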



[jira] [Resolved] (TIKA-3560) Tika 2.0 (and 2.1) parses doc with less fidelity than using 1.24.1

2021-10-04 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-3560.
---
Resolution: Fixed

Please reopen if there are any surprises and/or if there's anything I can do on 
our wiki to improve the documentation in migrating to 2.x.



[jira] [Commented] (TIKA-3560) Tika 2.0 (and 2.1) parses doc with less fidelity than using 1.24.1

2021-09-30 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17422697#comment-17422697
 ] 

Tim Allison commented on TIKA-3560:
---

I'm sorry for my delay.  I should have a chance to look at dc:subject.  I 
definitely should document the major changes.  Where should I do that?  
https://cwiki.apache.org/confluence/display/TIKA/Migrating+to+Tika+2.0.0 ?



[jira] [Commented] (TIKA-3560) Tika 2.0 (and 2.1) parses doc with less fidelity than using 1.24.1

2021-09-23 Thread Josh Burchard (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17419473#comment-17419473
 ] 

Josh Burchard commented on TIKA-3560:
-

Thank you for all the comments, Tim.   Author is one that we were using in our 
application and that's what first got my attention.   OK, so it's now _only_ 
dc:creator.  That's fine.  I guess I just have a bunch of code adjustments to 
make on our end as the consumer. ;)    Just two more questions:
 # Is there any cross-reference doc that was used during the task to slim down 
these duplicated attributes?


 # dc:subject looks like it's now an array where in 1.24.1 it was just a simple 
string. Is that intentional?
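
For consumers that need to handle output from both versions, a value that may arrive as either a string or an array can be normalized before use. A minimal Python sketch; the helper name and the sample values are hypothetical:

```python
def as_list(value) -> list:
    """Normalize a metadata value that may be absent, a string, or a list."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)


# Made-up values standing in for a 1.x string and a 2.x array.
subject_1x = as_list("budget")
subject_2x = as_list(["budget", "planning"])
print(subject_1x, subject_2x)
```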



Feel free to close this as a non-issue.   Thanks again.

 



[jira] [Comment Edited] (TIKA-3560) Tika 2.0 (and 2.1) parses doc with less fidelity than using 1.24.1

2021-09-23 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17419338#comment-17419338
 ] 

Tim Allison edited comment on TIKA-3560 at 9/23/21, 5:16 PM:
-

As background, before my time on the project, IIUC, we used 
file-format-specific keys; some formats may have had "author", others 
"creator", etc.

Then we had a massive contribution from Ray Gauss II which normalized 
everything as much as possible to DublinCore (e.g. dcterms:created).  To enable 
backward compatibility at that time, we left in the old keys and added the new 
"standard" keys; so at that time we had duplicate/triplicate keys for the same 
information.

In 2.x, we tried to remove the old duplicate/triplicate keys and use only the 
"standard" keys.





[jira] [Comment Edited] (TIKA-3560) Tika 2.0 (and 2.1) parses doc with less fidelity than using 1.24.1

2021-09-23 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17419334#comment-17419334
 ] 

Tim Allison edited comment on TIKA-3560 at 9/23/21, 5:15 PM:
-

As I look at the image a bit more, there are other cases where we've removed 
the duplicate or even triplicate keys for the same information.

The application name used to appear under both {{Application-Name}} and 
{{extended-properties:Application}}.  We've slimmed down in favor of 
{{extended-properties:Application}}.

{{Edit-Time}} is now {{extended-properties:TotalTime}}

{{Creation-Date}}, {{meta:creation-date}} and {{dcterms:created}} have been 
reduced to {{dcterms:created}}

{{Last-Save-Date}} is now {{dcterms:modified}}

Which concerning keys do not exist at all in 2.x? -{{Template}}-, 
-{{Revision-Number}}-...  Anything else?

Sorry, Template is there: {{extended-properties:Template}}.  Revision number is 
there too: {{cp:revision}}.

So, are there any keys in 1.x that do not have a value in 2.x?
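
The renames listed in this comment can be captured as a lookup table on the consumer side. The sketch below is in Python and covers only the keys named above; the function name and the sample metadata are hypothetical.

```python
# 1.x key -> 2.x key, as listed in this comment.
LEGACY_TO_2X = {
    "Application-Name": "extended-properties:Application",
    "Edit-Time": "extended-properties:TotalTime",
    "Creation-Date": "dcterms:created",
    "meta:creation-date": "dcterms:created",
    "Last-Save-Date": "dcterms:modified",
    "Template": "extended-properties:Template",
    "Revision-Number": "cp:revision",
}


def lookup(metadata: dict, legacy_key: str):
    """Return the value for a 1.x key, falling back to its 2.x replacement."""
    if legacy_key in metadata:
        return metadata[legacy_key]
    return metadata.get(LEGACY_TO_2X.get(legacy_key, legacy_key))


# Made-up 2.x style metadata for illustration.
meta_2x = {"dcterms:created": "2001-01-01T00:00:00Z", "cp:revision": "3"}
created = lookup(meta_2x, "Creation-Date")
revision = lookup(meta_2x, "Revision-Number")
print(created, revision)
```

Checking the legacy key first lets the same consumer code read output from either 1.x or 2.x.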





[jira] [Commented] (TIKA-3560) Tika 2.0 (and 2.1) parses doc with less fidelity than using 1.24.1

2021-09-23 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17419338#comment-17419338
 ] 

Tim Allison commented on TIKA-3560:
---

As background, before my time on the project, IIUC, we used 
file-format-specific keys; some formats may have had "author", others 
"creator", etc.

Then we had a massive contribution from Ray Gauss II which normalized 
everything as much as possible to DublinCore (e.g. dcterms:created).  To enable 
backward compatibility at that time, we left in the old keys and added the new 
"standard" keys.  In 2.x, we tried to remove the old duplicate/triplicate keys.



[jira] [Comment Edited] (TIKA-3560) Tika 2.0 (and 2.1) parses doc with less fidelity than using 1.24.1

2021-09-23 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17419334#comment-17419334
 ] 

Tim Allison edited comment on TIKA-3560 at 9/23/21, 5:11 PM:
-

As I look at the image a bit more, there are other cases where we've removed 
the duplicate or even triplicate keys for the same information.

The application name used to appear under both {{Application-Name}} and 
{{extended-properties:Application}}.  We've slimmed down in favor of 
{{extended-properties:Application}}.

{{Edit-Time}} is now {{extended-properties:TotalTime}}

{{Creation-Date}}, {{meta:creation-date}} and {{dcterms:created}} have been 
reduced to {{dcterms:created}}

{{Last-Save-Date}} is now {{dcterms:modified}}

Which concerning keys do not exist at all in 2.x? {{Template}}, 
{{Revision-Number}}...  Anything else?





[jira] [Commented] (TIKA-3560) Tika 2.0 (and 2.1) parses doc with less fidelity than using 1.24.1

2021-09-23 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17419334#comment-17419334
 ] 

Tim Allison commented on TIKA-3560:
---

As I look at the image a bit more, there are other cases where we've removed 
the duplicate keys.

The application name used to appear under both {{Application-Name}} and 
{{extended-properties:Application}}.  We've slimmed down in favor of 
{{extended-properties:Application}}.

{{Edit-Time}} is now {{extended-properties:TotalTime}}

{{Creation-Date}}, {{meta:creation-date}} and {{dcterms:created}} have been 
reduced to {{dcterms:created}}

{{Last-Save-Date}} is now {{dcterms:modified}}

Which concerning keys do not exist at all in 2.x? {{Template}}, 
{{Revision-Number}}...  Anything else?



[jira] [Commented] (TIKA-3560) Tika 2.0 (and 2.1) parses doc with less fidelity than using 1.24.1

2021-09-23 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17419327#comment-17419327
 ] 

Tim Allison commented on TIKA-3560:
---

Not a problem.  We probably have some example files within our unit test sets 
or possibly our regression test sets.



[jira] [Commented] (TIKA-3560) Tika 2.0 (and 2.1) parses doc with less fidelity than using 1.24.1

2021-09-23 Thread Josh Burchard (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17419297#comment-17419297
 ] 

Josh Burchard commented on TIKA-3560:
-

It looks like it contains some confidential info as well as some PII, so I'd 
better not upload this particular one.  That's disappointing.  I'll see if I 
have another file that prompts similar output when parsed.



[jira] [Commented] (TIKA-3560) Tika 2.0 (and 2.1) parses doc with less fidelity than using 1.24.1

2021-09-23 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17419293#comment-17419293
 ] 

Tim Allison commented on TIKA-3560:
---

K.  If you can't share it publicly, but can share it with me privately, let me 
know.



[jira] [Commented] (TIKA-3560) Tika 2.0 (and 2.1) parses doc with less fidelity than using 1.24.1

2021-09-23 Thread Josh Burchard (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17419290#comment-17419290
 ] 

Josh Burchard commented on TIKA-3560:
-

It's a pretty old file that's used in a test suite that I inherited, AND it's 
in Japanese so I'll need to translate it and check that it's ok to upload. 



[jira] [Commented] (TIKA-3560) Tika 2.0 (and 2.1) parses doc with less fidelity than using 1.24.1

2021-09-22 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17418847#comment-17418847
 ] 

Tim Allison commented on TIKA-3560:
---

We streamlined the double-entry key names for "created" and "modified" and a 
few others in 2.x.  There are several in there, though, that are more 
perplexing.  Any chance you can share the file?



[jira] [Updated] (TIKA-3560) Tika 2.0 (and 2.1) parses doc with less fidelity than using 1.24.1

2021-09-22 Thread Josh Burchard (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Burchard updated TIKA-3560:

Attachment: Capture.jpg



[jira] [Created] (TIKA-3560) Tika 2.0 (and 2.1) parses doc with less fidelity than using 1.24.1

2021-09-22 Thread Josh Burchard (Jira)
Josh Burchard created TIKA-3560:
---

 Summary: Tika 2.0 (and 2.1) parses doc with less fidelity than 
using 1.24.1
 Key: TIKA-3560
 URL: https://issues.apache.org/jira/browse/TIKA-3560
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 2.1.0, 2.0.0
 Environment: Windows 10
Reporter: Josh Burchard


I'm parsing an old .doc file and I'm sending my request to the /rmeta/text 
endpoint. I see that some metadata fields that were returned to me from Tika 
1.24.1 are no longer returned in 2.0 and above.

I diffed the output between 1.24.1, 2.0 and 2.1.  Please see the screenshot 
I've attached. 
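
A comparison like the one described can be sketched in Python. This assumes each version's /rmeta/text response has been saved as JSON text, that is, a list of metadata objects with the first one describing the container document. The function name and the sample values are illustrative and are not taken from the attached screenshot.

```python
import json


def diff_metadata_keys(old_json: str, new_json: str) -> dict:
    """Compare the metadata keys of the first object in two rmeta responses."""
    old_meta = json.loads(old_json)[0]
    new_meta = json.loads(new_json)[0]
    old_keys, new_keys = set(old_meta), set(new_meta)
    return {
        "removed": sorted(old_keys - new_keys),
        "added": sorted(new_keys - old_keys),
        "common": sorted(old_keys & new_keys),
    }


# Made-up sample responses for illustration only.
old = json.dumps([{"Author": "A", "dc:creator": "A", "Last-Save-Date": "2001-01-01"}])
new = json.dumps([{"dc:creator": "A", "dcterms:modified": "2001-01-01"}])
result = diff_metadata_keys(old, new)
print(result["removed"])
```

A key that shows up under "removed" is not necessarily lost information; it may have been renamed, so the "added" list is worth reading alongside it.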

 





[jira] [Updated] (TIKA-2071) Tika 2.0 - DefaultParser and CompositeParser does not filter excludedParsers from dynamic ServiceLoader Parsers

2021-07-28 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-2071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2071:
--
Fix Version/s: (was: 2.0.1)
   (was: 2.0.0-BETA)

> Tika 2.0 - DefaultParser and CompositeParser does not filter excludedParsers 
> from dynamic ServiceLoader Parsers
> ---
>
> Key: TIKA-2071
> URL: https://issues.apache.org/jira/browse/TIKA-2071
> Project: Tika
>  Issue Type: Bug
>Reporter: Bob Paulin
>Assignee: Bob Paulin
>Priority: Major
>
> The DefaultParser and CompositeParser do not filter dynamic services using 
> the excludedParser List.  The exclude list should be applied here as well.
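
The requested behavior, applying the same exclude list to dynamically discovered parsers as to statically configured ones, can be illustrated with a small Python sketch. This is not Tika's Java implementation: the class names are hypothetical stand-ins, and matching subclasses of an excluded type is an assumption.

```python
class Parser:
    """Hypothetical stand-in for a parser base type."""


class PdfParser(Parser):
    pass


class HtmlParser(Parser):
    pass


def filter_excluded(parsers, excluded_types):
    """Drop every parser that is an instance of one of the excluded types."""
    excluded = tuple(excluded_types)
    return [p for p in parsers if not isinstance(p, excluded)]


static_parsers = [PdfParser()]
dynamic_parsers = [HtmlParser(), PdfParser()]  # e.g. discovered at runtime
# Filter the combined list so dynamic parsers cannot bypass the exclusion.
active = filter_excluded(static_parsers + dynamic_parsers, [PdfParser])
print([type(p).__name__ for p in active])
```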





[jira] [Updated] (TIKA-2071) Tika 2.0 - DefaultParser and CompositeParser does not filter excludedParsers from dynamic ServiceLoader Parsers

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-2071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2071:
--
Fix Version/s: (was: 2.0.0)
   2.0.0-BETA

> Tika 2.0 - DefaultParser and CompositeParser does not filter excludedParsers 
> from dynamic ServiceLoader Parsers
> ---
>
> Key: TIKA-2071
> URL: https://issues.apache.org/jira/browse/TIKA-2071
> Project: Tika
>  Issue Type: Bug
>Reporter: Bob Paulin
>Assignee: Bob Paulin
>Priority: Major
> Fix For: 2.0.0-BETA, 2.0.1
>
>
> The DefaultParser and CompositeParser do not filter dynamic services using 
> the excludedParser List.  The exclude list should be applied here as well.





[jira] [Updated] (TIKA-2071) Tika 2.0 - DefaultParser and CompositeParser does not filter excludedParsers from dynamic ServiceLoader Parsers

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-2071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2071:
--
Fix Version/s: (was: 2.0.0)
   2.0.1



[jira] [Commented] (TIKA-3343) Move Tika's legacy lang id to its own submodule for Tika 2.0

2021-04-10 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17318489#comment-17318489
 ] 

Hudson commented on TIKA-3343:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #191 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/191/])
TIKA-3343 -- move Tika's legacy lang detector to its own submodule in 
tika-langdetect -- git add lang models (tallison: 
[https://github.com/apache/tika/commit/bd6cbb56b9ee65bb1ef72a23cad5961f02223a9a])
* (add) tika-langdetect/tika-langdetect-tika/src/main/resources/org/apache/tika/langdetect/tika/eo.ngp
* (add) tika-langdetect/tika-langdetect-tika/src/main/resources/org/apache/tika/langdetect/tika/it.ngp
* (add) tika-langdetect/tika-langdetect-tika/src/main/resources/org/apache/tika/langdetect/tika/be.ngp
* (add) tika-langdetect/tika-langdetect-tika/src/main/resources/org/apache/tika/langdetect/tika/el.ngp
* (add) tika-langdetect/tika-langdetect-tika/src/main/resources/org/apache/tika/langdetect/tika/nl.ngp
* (add) tika-langdetect/tika-langdetect-tika/src/main/resources/org/apache/tika/langdetect/tika/no.ngp
* (add) tika-langdetect/tika-langdetect-tika/src/main/resources/org/apache/tika/langdetect/tika/de.ngp
* (add) tika-langdetect/tika-langdetect-tika/src/main/resources/org/apache/tika/langdetect/tika/ru.ngp
* (add) tika-langdetect/tika-langdetect-tika/src/main/resources/org/apache/tika/langdetect/tika/fi.ngp
* (add) tika-langdetect/tika-langdetect-tika/src/main/resources/org/apache/tika/langdetect/tika/es.ngp
* (add) tika-langdetect/tika-langdetect-tika/src/main/resources/org/apache/tika/langdetect/tika/en.ngp
* (add) tika-langdetect/tika-langdetect-tika/src/main/resources/org/apache/tika/langdetect/tika/pt.ngp
* (add) tika-langdetect/tika-langdetect-tika/src/main/resources/org/apache/tika/langdetect/tika/th.ngp
* (add) tika-langdetect/tika-langdetect-tika/src/main/resources/org/apache/tika/langdetect/tika/ro.ngp
* (add) tika-langdetect/tika-langdetect-tika/src/main/resources/org/apache/tika/langdetect/tika/fa.ngp
* (add) tika-langdetect/tika-langdetect-tika/src/main/resources/org/apache/tika/langdetect/tika/hu.ngp
* (add) tika-langdetect/tika-langdetect-tika/src/main/resources/org/apache/tika/langdetect/tika/pl.ngp
* (add) tika-langdetect/tika-langdetect-tika/src/main/resources/org/apache/tika/langdetect/tika/ca.ngp
* (add) tika-langdetect/tika-langdetect-tika/src/main/resources/org/apache/tika/langdetect/tika/da.ngp
* (add) tika-langdetect/tika-langdetect-tika/src/main/resources/org/apache/tika/langdetect/tika/tika.language.properties
* (add) tika-langdetect/tika-langdetect-tika/src/main/resources/org/apache/tika/langdetect/tika/fr.ngp
* (add) tika-langdetect/tika-langdetect-tika/src/main/resources/org/apache/tika/langdetect/tika/is.ngp
* (add) tika-langdetect/tika-langdetect-tika/src/main/resources/org/apache/tika/langdetect/tika/sk.ngp
* (add) tika-langdetect/tika-langdetect-tika/src/main/resources/org/apache/tika/langdetect/tika/uk.ngp
* (add) tika-langdetect/tika-langdetect-tika/src/main/resources/org/apache/tika/langdetect/tika/sv.ngp
* (add) tika-langdetect/tika-langdetect-tika/src/main/resources/org/apache/tika/langdetect/tika/et.ngp
* (add) tika-langdetect/tika-langdetect-tika/src/main/resources/org/apache/tika/langdetect/tika/sl.ngp
* (add) tika-langdetect/tika-langdetect-tika/src/main/resources/org/apache/tika/langdetect/tika/lt.ngp
* (add) tika-langdetect/tika-langdetect-tika/src/main/resources/org/apache/tika/langdetect/tika/gl.ngp


> Move Tika's legacy lang id to its own submodule for Tika 2.0
> 
>
> Key: TIKA-3343
> URL: https://issues.apache.org/jira/browse/TIKA-3343
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
> Fix For: 2.0.0
>
>
> In the back of my mind, this was an agreed upon change for 2.x. I can't find 
> documentation, tho, so I'm opening this issue to discuss.  
> My memory is that we agreed that we should outsource language id to other 
> tools and remove our own lang ider for 2.x.  If my memory is wrong, or if 
> there's a good reason to keep our language detection algorithm and data, 
> let's discuss.





[jira] [Commented] (TIKA-3343) Move Tika's legacy lang id to its own submodule for Tika 2.0

2021-04-09 Thread Hudson (Jira)
/apache/tika/language/eo.ngp
* (add) 
tika-langdetect/tika-langdetect-tika/src/test/resources/org/apache/tika/langdetect/tika/da.test
* (delete) tika-core/src/main/resources/org/apache/tika/language/pl.ngp
* (delete) tika-core/src/test/resources/org/apache/tika/language/fr.test
* (delete) 
tika-core/src/test/resources/org/apache/tika/language/langbuilder/welsh_corpus.txt
* (delete) tika-core/src/main/resources/org/apache/tika/language/el.ngp
* (add) tika-langdetect/tika-langdetect-tika/pom.xml
* (delete) tika-core/src/main/resources/org/apache/tika/language/it.ngp
* (add) 
tika-langdetect/tika-langdetect-tika/src/main/java/org/apache/tika/langdetect/tika/ProfilingWriter.java
* (delete) 
tika-core/src/test/java/org/apache/tika/language/LanguageProfilerBuilderTest.java
* (delete) tika-core/src/test/resources/org/apache/tika/language/de.test
* (add) 
tika-langdetect/tika-langdetect-tika/src/test/java/org/apache/tika/langdetect/tika/ProfilingHandler.java
* (delete) tika-core/src/main/resources/org/apache/tika/language/ro.ngp
* (add) 
tika-langdetect/tika-langdetect-tika/src/test/resources/org/apache/tika/langdetect/tika/pt.test
* (delete) tika-core/src/main/java/org/apache/tika/language/ProfilingWriter.java
* (add) 
tika-langdetect/tika-langdetect-tika/src/test/java/org/apache/tika/langdetect/tika/LanguageProfilerBuilderTest.java
* (delete) tika-core/src/main/resources/org/apache/tika/language/sk.ngp
* (delete) 
tika-core/src/main/resources/org/apache/tika/language/tika.language.properties
* (add) 
tika-langdetect/tika-langdetect-tika/src/test/resources/org/apache/tika/langdetect/tika/fi.test
* (add) 
tika-langdetect/tika-langdetect-tika/src/test/resources/org/apache/tika/langdetect/tika/fr.test
* (delete) tika-core/src/main/resources/org/apache/tika/language/es.ngp
* (add) 
tika-langdetect/tika-langdetect-tika/src/test/resources/org/apache/tika/langdetect/tika/es.test
* (delete) tika-core/src/main/resources/org/apache/tika/language/fi.ngp
* (delete) 
tika-core/src/main/java/org/apache/tika/language/LanguageIdentifier.java
* (add) 
tika-langdetect/tika-langdetect-tika/src/main/java/org/apache/tika/langdetect/tika/LanguageIdentifier.java
* (delete) tika-core/src/test/resources/org/apache/tika/language/fi.test
* (delete) tika-core/src/test/resources/org/apache/tika/language/sv.test
* (delete) tika-core/src/test/resources/org/apache/tika/language/es.test


> Move Tika's legacy lang id to its own submodule for Tika 2.0
> 
>
> Key: TIKA-3343
> URL: https://issues.apache.org/jira/browse/TIKA-3343
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
> Fix For: 2.0.0
>
>
> In the back of my mind, this was an agreed upon change for 2.x. I can't find 
> documentation, tho, so I'm opening this issue to discuss.  
> My memory is that we agreed that we should outsource language id to other 
> tools and remove our own lang ider for 2.x.  If my memory is wrong, or if 
> there's a good reason to keep our language detection algorithm and data, 
> let's discuss.





[jira] [Updated] (TIKA-3343) Move Tika's legacy lang id to its own submodule for Tika 2.0

2021-04-09 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-3343:
--
Summary: Move Tika's legacy lang id to its own submodule for Tika 2.0  
(was: Move Tika's legacy lang id to its own module)

> Move Tika's legacy lang id to its own submodule for Tika 2.0
> 
>
> Key: TIKA-3343
> URL: https://issues.apache.org/jira/browse/TIKA-3343
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> In the back of my mind, this was an agreed upon change for 2.x. I can't find 
> documentation, tho, so I'm opening this issue to discuss.  
> My memory is that we agreed that we should outsource language id to other 
> tools and remove our own lang ider for 2.x.  If my memory is wrong, or if 
> there's a good reason to keep our language detection algorithm and data, 
> let's discuss.





[jira] [Resolved] (TIKA-3343) Move Tika's legacy lang id to its own submodule for Tika 2.0

2021-04-09 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-3343.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

> Move Tika's legacy lang id to its own submodule for Tika 2.0
> 
>
> Key: TIKA-3343
> URL: https://issues.apache.org/jira/browse/TIKA-3343
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
> Fix For: 2.0.0
>
>
> In the back of my mind, this was an agreed upon change for 2.x. I can't find 
> documentation, tho, so I'm opening this issue to discuss.  
> My memory is that we agreed that we should outsource language id to other 
> tools and remove our own lang ider for 2.x.  If my memory is wrong, or if 
> there's a good reason to keep our language detection algorithm and data, 
> let's discuss.





Re: OSGi support in Tika 2.0

2020-08-31 Thread Tim Allison
Bob,

Thank you for taking the lead on this discussion!

tl;dr -- I somewhat prefer tighter modularization at the risk of duplicate
dependencies, too.  The simplicity of higher level bundles might make sense
if we do a slight refactoring of the tika-parsers module.


After a week away, I'm thinking it might make sense to refactor the
tika-parsers module a bit to make explicit some of my underlying design
choices.  This basic design is somewhat already in main, but it is hidden.

Based on user feedback over the years, it feels like there are three
categories of parsers.

1) tika-parsers-module
  * pure Java ... no native libs
  * no parsers that are network dependent/rely on rest clients
  * "heavy" dependencies should be justified by the utility for the
"general user" -- this is admittedly and regrettably qualitative/hand-wavy,
but I want to allow POI's ooxml-schemas but disallow other large
dependencies for more niche formats
  * no ML, entity extraction or "recognizers"
  * packaging: separate modules as we now have, but packaged and shipped as
a single jar, and used as default by tika-app and tika-server

  EXCEPTION: OCR. Justification: this has become such a basic expectation
for many users and is tightly coupled in the PDFParser.  Further, the
current "dependency" requires a user to install tesseract, meaning that the
user has to choose to add this dependency.

2) tika-parsers-extended-module
  * native libs are allowed
  * parsers that are network dependent/rely on rest clients are allowed
  * no ML, entity extraction or recognizers
  * may have dependencies on parsers in tika-parsers-module
  * packaging: separate modules as we now have, packaged and released per
module (e.g. there will be a sqlite-parser jar which includes our parser
_and_ the native xerial.org's dependency); users will need to add their
chosen parsers to their classpath if they're using tika-app or tika-server

3) tika-parsers-advanced-module
  * enormous dependencies, native libs and rest clients are allowed
  * ML, entity extraction, recognizers are allowed
  * may have dependencies on parsers in tika-parsers-module
  * packaging: separate jars per sub module.  These jars will not be part
of the release.
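
The three categories above can be read as a small set of rules.  A minimal
sketch in Java, assuming hypothetical names (Trait, ParserTier and
ParserTierRules do not exist in Tika) and leaving out both the OCR exception
and the qualitative "heavy dependency" judgment:

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical types for illustration only; none of them are part of Tika.
enum Trait { NATIVE_LIBS, NETWORK_CLIENT, MACHINE_LEARNING, ENORMOUS_DEPENDENCIES }

enum ParserTier { STANDARD, EXTENDED, ADVANCED }

final class ParserTierRules {
    private ParserTierRules() {}

    /** Picks the lowest category whose rules allow every trait of a parser. */
    static ParserTier classify(Set<Trait> traits) {
        // ML/recognizers and enormous dependencies are only allowed in category 3.
        if (traits.contains(Trait.MACHINE_LEARNING)
                || traits.contains(Trait.ENORMOUS_DEPENDENCIES)) {
            return ParserTier.ADVANCED;
        }
        // Native libs and network/REST clients are allowed from category 2 upward.
        if (traits.contains(Trait.NATIVE_LIBS)
                || traits.contains(Trait.NETWORK_CLIENT)) {
            return ParserTier.EXTENDED;
        }
        // Pure Java, no network, no ML: category 1.
        return ParserTier.STANDARD;
    }

    public static void main(String[] args) {
        System.out.println(classify(EnumSet.noneOf(Trait.class)));
        System.out.println(classify(EnumSet.of(Trait.NATIVE_LIBS)));
        System.out.println(classify(EnumSet.of(Trait.MACHINE_LEARNING)));
    }
}
```

The point of the sketch is only that a parser lands in the lowest category
that permits everything it needs; the real decision would still involve the
hand-wavy parts described above.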

I'll work on this in a separate branch today so that we can look at it
together.  I think it is important to get this consolidated before we make
OSGi decisions.

Note: I do not mean to hijack the OSGi discussion!  And, I'm sorry for not
realizing this earlier/including it in the refactoring a week ago, but here
we are. :D

Thank you, Bob and all, again!

Cheers,

  Tim




On Fri, Aug 28, 2020 at 10:46 AM Yegor Kozlov  wrote:

> Hi Bob,
>
> I'd say decomposition into smaller bundles is the way to go. In my
> experience, OSGi bundles with too many dependencies are fragile and hard to
> maintain. In the worst case, a regression in a maven-bundle-plugin
> configuration would break a parser bundle instead of breaking all of them
> in the uber-jar.
>
> Static linking of dependencies should be fine; however, it can increase
> the total size of the Tika distro because different parser bundles may
> embed the same transitive dependencies like Apache Commons, etc.  The big
> pro is that static linking will make the bundles self-contained.
> The alternative is to make dependencies optional, but in this case clients
> will have to solve the puzzle of adding them into their OSGi containers.
> It's doable, but will kill acceptance.
>
>
>  Regards,
>  Yegor
>
> On Thu, Aug 27, 2020 at 5:24 AM Bob Paulin  wrote:
>
> > Hi,
> >
> > I wanted to discuss OSGi support in Tika 2.0.  My current thought is to
> > start with the minimum support which is to add bundle packaging to each of
> > the modules [1].  This will make the bundles usable in OSGi but will leave
> > users on their own for putting the right dependencies together for usage.
> > From there we either stop or we can choose from a few different options:
> > 1) Tika Bundle
> >
> >  This is an all-encompassing uber jar with all the parsers and
> > dependencies we can legally get away with shipping with an Apache license.
> >
> > Pros
> >
> > Low bar to entry for novice OSGi users
> >
> > Already exists in Tika 1.x
> >
> > Cons
> >
> > Difficult to maintain (very complicated maven-bundle-plugin config).  This
> > has broken in several releases, leaving it unusable.
> >
> >
> > 2) Tika module convenience bundles
> >
> > This was part of the early 2.0 POC branch where each module had its own
> > tika-bundle with just its dependencies statically included.
> >
> > Pros
> >
> > Less sophisticated maven-bundle-plugin configuration
> >
> > Low bar for novice OSGi users
> >
> > Cons
> >
> > More sub-modules to maintain.
> >
> >
> > There are of course other options but I think it's important to decide if
> > either, neither, or both of these options should be considered for the
> > initial 2.0 release.
> >
> >
> > - Bob
> >
> >
> > [1]  https://github.com/apache/tika/pull/344
> >
> >
> >
>


Re: OSGi support in Tika 2.0

2020-08-28 Thread Yegor Kozlov
Hi Bob,

I'd say decomposition into smaller bundles is the way to go. In my
experience, OSGi bundles with too many dependencies are fragile and hard to
maintain. In the worst case, a regression in a maven-bundle-plugin
configuration would break a parser bundle instead of breaking all of them
in the uber-jar.

Static linking of dependencies should be fine; however, it can increase
the total size of the Tika distro because different parser bundles may
embed the same transitive dependencies like Apache Commons, etc.  The big
pro is that static linking will make the bundles self-contained.
The alternative is to make dependencies optional, but in this case clients
will have to solve the puzzle of adding them into their OSGi containers.
It's doable, but will kill acceptance.
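
The size trade-off described above is simple arithmetic.  As a rough sketch
in Java, with made-up bundle names and made-up sizes (the numbers are
placeholders, not measurements of any Tika artifact), embedding counts a
shared dependency once per bundle while a shared setup counts it once overall:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative only: bundle names, dependency names and sizes are invented.
final class BundleSizeSketch {
    private BundleSizeSketch() {}

    /** Total size when every bundle statically embeds its own dependencies. */
    static long embeddedTotal(Map<String, List<String>> bundleDeps,
                              Map<String, Long> depSizes) {
        long total = 0;
        for (List<String> deps : bundleDeps.values()) {
            for (String dep : deps) {
                total += depSizes.get(dep); // duplicates are counted every time
            }
        }
        return total;
    }

    /** Total size when each distinct dependency is installed once and shared. */
    static long sharedTotal(Map<String, List<String>> bundleDeps,
                            Map<String, Long> depSizes) {
        Set<String> distinct = new HashSet<>();
        for (List<String> deps : bundleDeps.values()) {
            distinct.addAll(deps);
        }
        long total = 0;
        for (String dep : distinct) {
            total += depSizes.get(dep);
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Long> sizesKb = new HashMap<>();
        sizesKb.put("commons-io", 300L);
        sizesKb.put("commons-compress", 1000L);
        sizesKb.put("poi", 3000L);

        Map<String, List<String>> bundles = new HashMap<>();
        bundles.put("office-bundle",
                Arrays.asList("poi", "commons-io", "commons-compress"));
        bundles.put("archive-bundle",
                Arrays.asList("commons-io", "commons-compress"));

        System.out.println("embedded: " + embeddedTotal(bundles, sizesKb) + " KB");
        System.out.println("shared:   " + sharedTotal(bundles, sizesKb) + " KB");
    }
}
```

The gap between the two totals is the cost of self-contained bundles; the
shared layout is smaller but pushes the dependency puzzle onto the user.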


 Regards,
 Yegor

On Thu, Aug 27, 2020 at 5:24 AM Bob Paulin  wrote:

> Hi,
>
> I wanted to discuss OSGi support in Tika 2.0.  My current thought is to
> start with the minimum support which is to add bundle packaging to each of
> the modules [1].  This will make the bundles usable in OSGi but will leave
> users on their own for putting the right dependencies together for usage.
> From there we either stop or we can choose from a few different options:
> 1) Tika Bundle
>
>  This is an all encompassing uber jar with all the parsers and
> dependencies we can legally get away with shipping with an Apache license.
>
> Pros
>
> Low bar to entry for novice OSGi users
>
> Already exists in Tika 1.x
>
> Cons
>
> Difficult to maintain (very complicated maven-bundle-plugin config).  This
> has broken in several releases leaving it unusable.
>
>
> 2) Tika module convenience bundles
>
> This was part of the early 2.0 POC branch where each module had its own
> tika-bundle with just its dependencies statically included.
>
> Pros
>
> Less sophisticated maven-bundle-plugin configuration
>
> Low bar for novice OSGi users
>
> Cons
>
> More sub-modules to maintain.
>
>
> There are of course other options but I think it's important to decide if
> either, neither, or both of these options should be considered for the
> initial 2.0 release.
>
>
> - Bob
>
>
> [1]  https://github.com/apache/tika/pull/344
>
>
>


OSGi support in Tika 2.0

2020-08-26 Thread Bob Paulin
Hi,

I wanted to discuss OSGi support in Tika 2.0.  My current thought is to
start with the minimum support which is to add bundle packaging to each
of the modules [1].  This will make the bundles usable in OSGi but will
leave users on their own for putting the right dependencies together for
usage.  From there we either stop or we can choose from a few different
options:

1) Tika Bundle

 This is an all encompassing uber jar with all the parsers and
dependencies we can legally get away with shipping with an Apache license.

Pros

Low bar to entry for novice OSGi users

Already exists in Tika 1.x

Cons

Difficult to maintain (very complicated maven-bundle-plugin config). 
This has broken in several releases leaving it unusable.


2) Tika module convenience bundles

This was part of the early 2.0 POC branch where each module had its own
tika-bundle with just its dependencies statically included.

Pros

Less sophisticated maven-bundle-plugin configuration

Low bar for novice OSGi users

Cons

More sub-modules to maintain.


There are of course other options but I think it's important to decide
if either, neither, or both of these options should be considered for
the initial 2.0 release.


- Bob


[1]  https://github.com/apache/tika/pull/344






Re: [EXTERNAL] Tika 2.0 modularization

2020-08-19 Thread Sergey Beryozkin
Hi Tim

It looks good. Perfect.
Do you plan to have tika-parsers reuse the new modules as its dependencies?

Cheers, Sergey

On Tue, Aug 18, 2020 at 3:41 PM Tim Allison  wrote:

> If anyone has any time, please take a look here:
> https://github.com/apache/tika/tree/branch_2x/tika-parser-modules
>
> Does this basically look ok?
>
> I've put the integration tests in
> https://github.com/apache/tika/tree/branch_2x/tika-parser-integration-tests
> ... that doesn't build yet.
>
> I've flipped Bob's design so that the integration tests pull test files
> from the individual parser modules via test-jar.
>
> On Fri, Aug 14, 2020 at 3:30 PM Bob Paulin  wrote:
>
> > +1 excited about this.
> >
> > - Bob
> > On 8/14/2020 11:29 AM, Sergey Beryozkin wrote:
> >
> > +1 
> >
> > Cheers Sergey
> >
> > On Fri 14 Aug 2020, 18:26 Chris Mattmann,  <
> mattm...@apache.org> wrote:
> >
> >
> > Haha  I’m down and supportive!
> >
> >
> >
> > Time’s TIME FOR 2.x 
> >
> >
> >
> >
> >
> >
> >
> > From: Tim Allison  
> > Reply-To: "dev@tika.apache.org"  <
> dev@tika.apache.org> , "Allison, Tim (US
> > 174B-Affiliate)"  <
> timothy.b.alli...@jpl.nasa.gov>
> > Date: Friday, August 14, 2020 at 6:06 AM
> > To: " " 
> 
> > Subject: [EXTERNAL] Tika 2.0 modularization
> >
> >
> >
> > All,
> >
> >   I _think_ I might have some time to start working on integrating Bob's
> >
> > work on the current main branch.  I'll have to ignore most of the
> incoming
> >
> > issues for a bit...unlike the last 4 years...this time I mean it. :)
> >
> >   Let me know if there are any objections to heading down this path now.
> >
> >
> >
> >Cheers,
> >
> >
> >
> >   Tim
> >
> >
> >
> >
> >
> >
>


Re: [EXTERNAL] Tika 2.0 modularization

2020-08-18 Thread Bob Paulin
Hey Tim,

Just started taking a look.  The test-jar approach could work but I
recall I ran into some issues with getting access to some of the test
files inside the test-jars for some of the junits.  For many tests this
was simple but for some I think it would require larger functional
changes to the code that I was not comfortable proposing at the time.

Makes sense to try this path again and see if you can get further than I
did.
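
One plausible source of that friction (an assumption; the thread doesn't spell
it out) is that a file packed inside a test-jar is reachable as a classpath
stream but not as a plain file on disk.  A minimal Java sketch of the usual
workaround, with a hypothetical helper name, copies the resource to a temp
file for tests that need a Path:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not taken from the Tika code base.
final class TestResourceSketch {
    private TestResourceSketch() {}

    /** Copies a classpath resource to a temp file and returns that file. */
    static Path copyToTempFile(Class<?> anchor, String resourceName)
            throws IOException {
        // Works whether the resource sits in a directory or inside a jar.
        try (InputStream in = anchor.getResourceAsStream(resourceName)) {
            if (in == null) {
                throw new IOException("resource not on classpath: " + resourceName);
            }
            Path tmp = Files.createTempFile("tika-test-", ".bin");
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            return tmp;
        }
    }

    public static void main(String[] args) throws IOException {
        // Uses a class file from the JDK as a stand-in for a real test document.
        Path copy = copyToTempFile(String.class, "String.class");
        System.out.println(copy + " (" + Files.size(copy) + " bytes)");
        Files.delete(copy);
    }
}
```

Tests that already read from an InputStream would not need the copy at all.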

- Bob

On 8/18/2020 9:40 AM, Tim Allison wrote:
> If anyone has any time, please take a look here:
> https://github.com/apache/tika/tree/branch_2x/tika-parser-modules
>
> Does this basically look ok?
>
> I've put the integration tests in
> https://github.com/apache/tika/tree/branch_2x/tika-parser-integration-tests
> ... that doesn't build yet.
>
> I've flipped Bob's design so that the integration tests pull test files
> from the individual parser modules via test-jar.
>
> On Fri, Aug 14, 2020 at 3:30 PM Bob Paulin  wrote:
>
>> +1 excited about this.
>>
>> - Bob
>> On 8/14/2020 11:29 AM, Sergey Beryozkin wrote:
>>
>> +1 
>>
>> Cheers Sergey
>>
>> On Fri 14 Aug 2020, 18:26 Chris Mattmann,  
>>  wrote:
>>
>>
>> Haha  I’m down and supportive!
>>
>>
>>
>> Time’s TIME FOR 2.x 
>>
>>
>>
>>
>>
>>
>>
>> From: Tim Allison  
>> Reply-To: "dev@tika.apache.org"   
>> , "Allison, Tim (US
>> 174B-Affiliate)"  
>> 
>> Date: Friday, August 14, 2020 at 6:06 AM
>> To: " "  
>> 
>> Subject: [EXTERNAL] Tika 2.0 modularization
>>
>>
>>
>> All,
>>
>>   I _think_ I might have some time to start working on integrating Bob's
>>
>> work on the current main branch.  I'll have to ignore most of the incoming
>>
>> issues for a bit...unlike the last 4 years...this time I mean it. :)
>>
>>   Let me know if there are any objections to heading down this path now.
>>
>>
>>
>>Cheers,
>>
>>
>>
>>   Tim
>>
>>
>>
>>
>>
>>




Re: [EXTERNAL] Tika 2.0 modularization

2020-08-18 Thread Oleg Tikhonov
Hi Tim,
looks awesome.
Somehow I did not find a couple of parsers, probably it is because of
on-going work ...
In addition, I was thinking about "getting rid of" Maven. If we are going
to make Tika more modern, maybe Gradle can do the trick?
Do we plan to adopt new Java "goodies" like lambdas, the foreign-memory access
API, records ...

WDYT?
BR,
Oleg




On Tue, Aug 18, 2020 at 5:41 PM Tim Allison  wrote:

> If anyone has any time, please take a look here:
> https://github.com/apache/tika/tree/branch_2x/tika-parser-modules
>
> Does this basically look ok?
>
> I've put the integration tests in
> https://github.com/apache/tika/tree/branch_2x/tika-parser-integration-tests
> ... that doesn't build yet.
>
> I've flipped Bob's design so that the integration tests pull test files
> from the individual parser modules via test-jar.
>
> On Fri, Aug 14, 2020 at 3:30 PM Bob Paulin  wrote:
>
> > +1 excited about this.
> >
> > - Bob
> > On 8/14/2020 11:29 AM, Sergey Beryozkin wrote:
> >
> > +1 
> >
> > Cheers Sergey
> >
> > On Fri 14 Aug 2020, 18:26 Chris Mattmann,  <
> mattm...@apache.org> wrote:
> >
> >
> > Haha  I’m down and supportive!
> >
> >
> >
> > Time’s TIME FOR 2.x 
> >
> >
> >
> >
> >
> >
> >
> > From: Tim Allison  
> > Reply-To: "dev@tika.apache.org"  <
> dev@tika.apache.org> , "Allison, Tim (US
> > 174B-Affiliate)"  <
> timothy.b.alli...@jpl.nasa.gov>
> > Date: Friday, August 14, 2020 at 6:06 AM
> > To: " " 
> 
> > Subject: [EXTERNAL] Tika 2.0 modularization
> >
> >
> >
> > All,
> >
> >   I _think_ I might have some time to start working on integrating Bob's
> >
> > work on the current main branch.  I'll have to ignore most of the
> incoming
> >
> > issues for a bit...unlike the last 4 years...this time I mean it. :)
> >
> >   Let me know if there are any objections to heading down this path now.
> >
> >
> >
> >Cheers,
> >
> >
> >
> >   Tim
> >
> >
> >
> >
> >
> >
>


Re: [EXTERNAL] Tika 2.0 modularization

2020-08-18 Thread Ken Krugler
Hi Tim,

I looked at the HTML module, and seems logical/straightforward.

Thanks for pushing on this.

— Ken

> On Aug 18, 2020, at 7:40 AM, Tim Allison  wrote:
> 
> If anyone has any time, please take a look here:
> https://github.com/apache/tika/tree/branch_2x/tika-parser-modules
> 
> Does this basically look ok?
> 
> I've put the integration tests in
> https://github.com/apache/tika/tree/branch_2x/tika-parser-integration-tests
> ... that doesn't build yet.
> 
> I've flipped Bob's design so that the integration tests pull test files
> from the individual parser modules via test-jar.
> 
> On Fri, Aug 14, 2020 at 3:30 PM Bob Paulin  wrote:
> 
>> +1 excited about this.
>> 
>> - Bob
>> On 8/14/2020 11:29 AM, Sergey Beryozkin wrote:
>> 
>> +1 
>> 
>> Cheers Sergey
>> 
>> On Fri 14 Aug 2020, 18:26 Chris Mattmann,  
>>  wrote:
>> 
>> 
>> Haha  I’m down and supportive!
>> 
>> 
>> 
>> Time’s TIME FOR 2.x 
>> 
>> From: Tim Allison  
>> Reply-To: "dev@tika.apache.org"   
>> , "Allison, Tim (US
>> 174B-Affiliate)"  
>> 
>> Date: Friday, August 14, 2020 at 6:06 AM
>> To: " "  
>> 
>> Subject: [EXTERNAL] Tika 2.0 modularization
>> 
>> 
>> 
>> All,
>> 
>>  I _think_ I might have some time to start working on integrating Bob's
>> 
>> work on the current main branch.  I'll have to ignore most of the incoming
>> 
>> issues for a bit...unlike the last 4 years...this time I mean it. :)
>> 
>>  Let me know if there are any objections to heading down this path now.
>> 
>> 
>> 
>>   Cheers,
>> 
>> 
>> 
>>  Tim

--
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr



Re: [EXTERNAL] Tika 2.0 modularization

2020-08-18 Thread Tim Allison
Thank you!

>Somehow I did not find a couple of parsers, probably it is because of
on-going work ...

Yep.  Exactly.  I didn't want to put in the work in this direction if there
were any showstoppers.

>If we are going to make Tika more modern, maybe Gradle can do the trick?
My Gradle isn't as strong as my Maven, but if you or anyone else wants to
translate, I'd be good with that.  Let me do the Maven modularization
first?  How much effort would this be?

>Do we plan to adopt new Java "goodies" like lambdas, the foreign-memory access
API, records
Elasticsearch is already at 11, and the next version of Solr requires 11.
I'm happy keeping Tika at 1.8 or moving to 11.  I think 14 is a bit too
cutting edge for Tika 2.0.0...maybe 3.0.0?

Any thoughts on what we do with Jigsaw?  Should we shoot the moon and move
to 11 and Jigsaw, go with multi-version jars or just go with what we have
and make modest changes so that we are not hostile to folks using Jigsaw?



On Tue, Aug 18, 2020 at 11:38 AM Oleg Tikhonov  wrote:

> Hi Tim,
> looks awesome.
> Somehow I did not find a couple of parsers, probably it is because of
> on-going work ...
> In addition, I was thinking about "getting rid of" Maven. If we are going
> to make Tika more modern, maybe Gradle can do the trick?
> Do we plan to adopt new Java "goodies" like lambdas, the foreign-memory access
> API, records ...
>
> WDYT?
> BR,
> Oleg
>
>
>
>
> On Tue, Aug 18, 2020 at 5:41 PM Tim Allison  wrote:
>
>> If anyone has any time, please take a look here:
>> https://github.com/apache/tika/tree/branch_2x/tika-parser-modules
>>
>> Does this basically look ok?
>>
>> I've put the integration tests in
>>
>> https://github.com/apache/tika/tree/branch_2x/tika-parser-integration-tests
>> ... that doesn't build yet.
>>
>> I've flipped Bob's design so that the integration tests pull test files
>> from the individual parser modules via test-jar.
>>
>> On Fri, Aug 14, 2020 at 3:30 PM Bob Paulin  wrote:
>>
>> > +1 excited about this.
>> >
>> > - Bob
>> > On 8/14/2020 11:29 AM, Sergey Beryozkin wrote:
>> >
>> > +1 
>> >
>> > Cheers Sergey
>> >
>> > On Fri 14 Aug 2020, 18:26 Chris Mattmann,  <
>> mattm...@apache.org> wrote:
>> >
>> >
>> > Haha  I’m down and supportive!
>> >
>> >
>> >
>> > Time’s TIME FOR 2.x 
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> > From: Tim Allison  
>> > Reply-To: "dev@tika.apache.org"  <
>> dev@tika.apache.org> , "Allison, Tim (US
>> > 174B-Affiliate)"  <
>> timothy.b.alli...@jpl.nasa.gov>
>> > Date: Friday, August 14, 2020 at 6:06 AM
>> > To: " " 
>> 
>> > Subject: [EXTERNAL] Tika 2.0 modularization
>> >
>> >
>> >
>> > All,
>> >
>> >   I _think_ I might have some time to start working on integrating Bob's
>> >
>> > work on the current main branch.  I'll have to ignore most of the
>> incoming
>> >
>> > issues for a bit...unlike the last 4 years...this time I mean it. :)
>> >
>> >   Let me know if there are any objections to heading down this path now.
>> >
>> >
>> >
>> >Cheers,
>> >
>> >
>> >
>> >   Tim
>> >
>> >
>> >
>> >
>> >
>> >
>>
>


Re: [EXTERNAL] Tika 2.0 modularization

2020-08-18 Thread Tim Allison
If anyone has any time, please take a look here:
https://github.com/apache/tika/tree/branch_2x/tika-parser-modules

Does this basically look ok?

I've put the integration tests in
https://github.com/apache/tika/tree/branch_2x/tika-parser-integration-tests
... that doesn't build yet.

I've flipped Bob's design so that the integration tests pull test files
from the individual parser modules via test-jar.

On Fri, Aug 14, 2020 at 3:30 PM Bob Paulin  wrote:

> +1 excited about this.
>
> - Bob
> On 8/14/2020 11:29 AM, Sergey Beryozkin wrote:
>
> +1 
>
> Cheers Sergey
>
> On Fri 14 Aug 2020, 18:26 Chris Mattmann,  
>  wrote:
>
>
> Haha  I’m down and supportive!
>
>
>
> Time’s TIME FOR 2.x 
>
>
>
>
>
>
>
> From: Tim Allison  
> Reply-To: "dev@tika.apache.org"   
> , "Allison, Tim (US
> 174B-Affiliate)"  
> 
> Date: Friday, August 14, 2020 at 6:06 AM
> To: " "  
> 
> Subject: [EXTERNAL] Tika 2.0 modularization
>
>
>
> All,
>
>   I _think_ I might have some time to start working on integrating Bob's
>
> work on the current main branch.  I'll have to ignore most of the incoming
>
> issues for a bit...unlike the last 4 years...this time I mean it. :)
>
>   Let me know if there are any objections to heading down this path now.
>
>
>
>Cheers,
>
>
>
>   Tim
>
>
>
>
>
>


Re: [EXTERNAL] Tika 2.0 modularization

2020-08-14 Thread Bob Paulin
+1 excited about this.

- Bob

On 8/14/2020 11:29 AM, Sergey Beryozkin wrote:
> +1 
>
> Cheers Sergey
>
> On Fri 14 Aug 2020, 18:26 Chris Mattmann,  wrote:
>
>> Haha  I’m down and supportive!
>>
>>
>>
>> Time’s TIME FOR 2.x 
>>
>>
>>
>>
>>
>>
>>
>> From: Tim Allison 
>> Reply-To: "dev@tika.apache.org" , "Allison, Tim (US
>> 174B-Affiliate)" 
>> Date: Friday, August 14, 2020 at 6:06 AM
>> To: "" 
>> Subject: [EXTERNAL] Tika 2.0 modularization
>>
>>
>>
>> All,
>>
>>   I _think_ I might have some time to start working on integrating Bob's
>>
>> work on the current main branch.  I'll have to ignore most of the incoming
>>
>> issues for a bit...unlike the last 4 years...this time I mean it. :)
>>
>>   Let me know if there are any objections to heading down this path now.
>>
>>
>>
>>Cheers,
>>
>>
>>
>>   Tim
>>
>>
>>
>>




Re: [EXTERNAL] Tika 2.0 modularization

2020-08-14 Thread Sergey Beryozkin
+1 

Cheers Sergey

On Fri 14 Aug 2020, 18:26 Chris Mattmann,  wrote:

> Haha  I’m down and supportive!
>
>
>
> Time’s TIME FOR 2.x 
>
>
>
>
>
>
>
> From: Tim Allison 
> Reply-To: "dev@tika.apache.org" , "Allison, Tim (US
> 174B-Affiliate)" 
> Date: Friday, August 14, 2020 at 6:06 AM
> To: "" 
> Subject: [EXTERNAL] Tika 2.0 modularization
>
>
>
> All,
>
>   I _think_ I might have some time to start working on integrating Bob's
>
> work on the current main branch.  I'll have to ignore most of the incoming
>
> issues for a bit...unlike the last 4 years...this time I mean it. :)
>
>   Let me know if there are any objections to heading down this path now.
>
>
>
>Cheers,
>
>
>
>   Tim
>
>
>
>


Re: [EXTERNAL] Tika 2.0 modularization

2020-08-14 Thread Chris Mattmann
Haha  I’m down and supportive!

 

Time’s TIME FOR 2.x 

 

 

 

From: Tim Allison 
Reply-To: "dev@tika.apache.org" , "Allison, Tim (US 
174B-Affiliate)" 
Date: Friday, August 14, 2020 at 6:06 AM
To: "" 
Subject: [EXTERNAL] Tika 2.0 modularization

 

All,

  I _think_ I might have some time to start working on integrating Bob's

work on the current main branch.  I'll have to ignore most of the incoming

issues for a bit...unlike the last 4 years...this time I mean it. :)

  Let me know if there are any objections to heading down this path now.

 

   Cheers,

 

  Tim

 



Tika 2.0 modularization

2020-08-14 Thread Tim Allison
All,
  I _think_ I might have some time to start working on integrating Bob's
work on the current main branch.  I'll have to ignore most of the incoming
issues for a bit...unlike the last 4 years...this time I mean it. :)
  Let me know if there are any objections to heading down this path now.

   Cheers,

  Tim


[jira] [Commented] (TIKA-2795) Error starting Tika 2.0 server with -spawnChild on Ubuntu

2019-05-02 Thread Mario Bisonti (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831443#comment-16831443
 ] 

Mario Bisonti commented on TIKA-2795:
-

Hallo Tim.

I use :
java -Dlog4j.configuration=file:/opt/tika/log4j.xml -jar 
/opt/tika/tika-server-1.20.jar 
-JDlog4j.configuration=file:/opt/tika/log4j_child.xml --host=hostname 
-spawnChild -taskTimeoutMillis 100

tika-server-1.20.jar is a snapshot build that I downloaded in December 2018, and it
works fine.

 

Thanks a lot

 

Mario

> Error starting Tika 2.0 server with -spawnChild on Ubuntu
> -
>
> Key: TIKA-2795
> URL: https://issues.apache.org/jira/browse/TIKA-2795
> Project: Tika
>  Issue Type: Bug
>  Components: server
>Affects Versions: 2.0
>Reporter: Mario Bisonti
>Assignee: Tim Allison
>Priority: Major
> Fix For: 2.0.0, 1.20
>
>
> Hallo.
> I tried to download Tika server 2.0.0 from here:
> [https://builds.apache.org/job/Tika-trunk/lastStableBuild/org.apache.tika$tika-server/]
>  
> I tried to start on my Ubuntu server but with the -spawnChild, it doesn't 
> work.
>  
> sudo java -jar /opt/tika/tika-server-2.0.0-20181203.205013-340.jar -spawnChild
> Dec 05, 2018 2:22:32 PM org.apache.tika.config.InitializableProblemHandler$3 
> handleInitializableProblem
> WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
> See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
> for optional dependencies.
> Dec 05, 2018 2:22:32 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
> WARNING: org.xerial's sqlite-jdbc is not loaded.
> Please provide the jar on your classpath to parse sqlite files.
> See tika-parsers/pom.xml for the correct version.
> INFO Starting Apache Tika 2.0.0-SNAPSHOT server
> INFO server watch dog is starting up
> Dec 05, 2018 2:22:32 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
> WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
> See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
> for optional dependencies.
> Dec 05, 2018 2:22:32 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
> WARNING: org.xerial's sqlite-jdbc is not loaded.
> Please provide the jar on your classpath to parse sqlite files.
> See tika-parsers/pom.xml for the correct version.
> INFO Starting Apache Tika 2.0.0-SNAPSHOT server
> java.nio.file.NoSuchFileException: /tmp/tika-server-child-process-mmap-2180120677326747096
>  at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
>  at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
>  at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116)
>  at java.base/sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:55)
>  at java.base/sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:145)
>  at java.base/sun.nio.fs.LinuxFileSystemProvider.readAttributes(LinuxFileSystemProvider.java:99)
>  at java.base/java.nio.file.Files.readAttributes(Files.java:1755)
>  at java.base/java.nio.file.Files.size(Files.java:2372)
>  at org.apache.tika.server.TikaServerWatchDog$ChildProcess.<init>(TikaServerWatchDog.java:234)
>  at org.apache.tika.server.TikaServerWatchDog$ChildProcess.<init>(TikaServerWatchDog.java:210)
>  at org.apache.tika.server.TikaServerWatchDog.execute(TikaServerWatchDog.java:66)
>  at org.apache.tika.server.TikaServerCli.execute(TikaServerCli.java:146)
>  at org.apache.tika.server.TikaServerCli.main(TikaServerCli.java:127)
> ERROR Can't start:
> java.nio.file.NoSuchFileException: /tmp/tika-server-child-process-mmap-2180120677326747096
>  at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
>  at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
>  at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116)
>  at java.base/sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:55)
>  at java.base/sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:145)
>  at java.base/sun.nio.fs.LinuxFileSystemProvider.readAttributes(LinuxFileSystemProvider.java:99)
>  at java.base/java.nio.file.Files.readAttributes(Files.java:1755)
>  at java.base/java.nio.file.Files.size(Files.java:2372)
>  at

[jira] [Commented] (TIKA-2795) Error starting Tika 2.0 server with -spawnChild on Ubuntu

2019-04-30 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16830424#comment-16830424
 ] 

Tim Allison commented on TIKA-2795:
---

[~bisontim], I wanted to follow up to make sure that you're still good to go 
with {{-spawnChild}}.  If I need to fix anything, I'd like to do it before the 
upcoming release of 1.21.  Again, many thanks for reporting this problem.

> Error starting Tika 2.0 server with -spawnChild on Ubuntu
> -
>
> Key: TIKA-2795
> URL: https://issues.apache.org/jira/browse/TIKA-2795
> Project: Tika
>  Issue Type: Bug
>  Components: server
>Affects Versions: 2.0
>Reporter: Mario Bisonti
>Assignee: Tim Allison
>Priority: Major
> Fix For: 2.0.0, 1.20
>
>
> Hello.
> I tried to download Tika server 2.0.0 from here:
> [https://builds.apache.org/job/Tika-trunk/lastStableBuild/org.apache.tika$tika-server/]
>  
> I tried to start on my Ubuntu server but with the -spawnChild, it doesn't 
> work.
>  
> sudo java -jar /opt/tika/tika-server-2.0.0-20181203.205013-340.jar -spawnChild
> Dec 05, 2018 2:22:32 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
> WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
> See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
> for optional dependencies.
> Dec 05, 2018 2:22:32 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
> WARNING: org.xerial's sqlite-jdbc is not loaded.
> Please provide the jar on your classpath to parse sqlite files.
> See tika-parsers/pom.xml for the correct version.
> INFO Starting Apache Tika 2.0.0-SNAPSHOT server
> INFO server watch dog is starting up
> Dec 05, 2018 2:22:32 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
> WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
> See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
> for optional dependencies.
> Dec 05, 2018 2:22:32 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
> WARNING: org.xerial's sqlite-jdbc is not loaded.
> Please provide the jar on your classpath to parse sqlite files.
> See tika-parsers/pom.xml for the correct version.
> INFO Starting Apache Tika 2.0.0-SNAPSHOT server
> java.nio.file.NoSuchFileException: /tmp/tika-server-child-process-mmap-2180120677326747096
>  at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
>  at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
>  at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116)
>  at java.base/sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:55)
>  at java.base/sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:145)
>  at java.base/sun.nio.fs.LinuxFileSystemProvider.readAttributes(LinuxFileSystemProvider.java:99)
>  at java.base/java.nio.file.Files.readAttributes(Files.java:1755)
>  at java.base/java.nio.file.Files.size(Files.java:2372)
>  at org.apache.tika.server.TikaServerWatchDog$ChildProcess.<init>(TikaServerWatchDog.java:234)
>  at org.apache.tika.server.TikaServerWatchDog$ChildProcess.<init>(TikaServerWatchDog.java:210)
>  at org.apache.tika.server.TikaServerWatchDog.execute(TikaServerWatchDog.java:66)
>  at org.apache.tika.server.TikaServerCli.execute(TikaServerCli.java:146)
>  at org.apache.tika.server.TikaServerCli.main(TikaServerCli.java:127)
> ERROR Can't start:
> java.nio.file.NoSuchFileException: /tmp/tika-server-child-process-mmap-2180120677326747096
>  at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
>  at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
>  at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116)
>  at java.base/sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:55)
>  at java.base/sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:145)
>  at java.base/sun.nio.fs.LinuxFileSystemProvider.readAttributes(LinuxFileSystemProvider.java:99)
>  at java.base/java.nio.file.Files.readAttributes(Files.java:1755)
>  at java.base/java.nio.file.Files.size(Files.java:2372)
>  at org.apache.tika.server.TikaServerWatchDog$ChildProcess.<init>(TikaServerWatchDog.java:234)
>  at org.apache.tika.server.TikaServerWatchDog$Ch

[jira] [Resolved] (TIKA-2795) Error starting Tika 2.0 server with -spawnChild on Ubuntu

2018-12-10 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2795.
---
   Resolution: Fixed
 Assignee: Tim Allison
Fix Version/s: 1.20
   2.0.0

[~bisontim], many thanks for finding this.  Please let me know what else you 
find!


[jira] [Commented] (TIKA-2795) Error starting Tika 2.0 server with -spawnChild on Ubuntu

2018-12-10 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16715770#comment-16715770
 ] 

Hudson commented on TIKA-2795:
--

SUCCESS: Integrated in Jenkins build Tika-trunk #1606 (See 
[https://builds.apache.org/job/Tika-trunk/1606/])
TIKA-2795 -- swapped memorymapped buffer for traditional open, write (tallison: 
[https://github.com/apache/tika/commit/e921e69c1a484de3036ed4e5a8654b046f54ceea])
* (edit) 
tika-server/src/main/java/org/apache/tika/server/ServerStatusWatcher.java
* (edit) 
tika-server/src/test/java/org/apache/tika/server/TikaServerIntegrationTest.java
* (edit) 
tika-server/src/main/java/org/apache/tika/server/TikaServerWatchDog.java



[jira] [Commented] (TIKA-2795) Error starting Tika 2.0 server with -spawnChild on Ubuntu

2018-12-10 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16715696#comment-16715696
 ] 

Hudson commented on TIKA-2795:
--

UNSTABLE: Integrated in Jenkins build tika-2.x-windows #360 (See 
[https://builds.apache.org/job/tika-2.x-windows/360/])
TIKA-2795 -- swapped memorymapped buffer for traditional open, write (tallison: 
rev e921e69c1a484de3036ed4e5a8654b046f54ceea)
* (edit) 
tika-server/src/main/java/org/apache/tika/server/TikaServerWatchDog.java
* (edit) 
tika-server/src/main/java/org/apache/tika/server/ServerStatusWatcher.java
* (edit) 
tika-server/src/test/java/org/apache/tika/server/TikaServerIntegrationTest.java



[jira] [Commented] (TIKA-2795) Error starting Tika 2.0 server with -spawnChild on Ubuntu

2018-12-10 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16715688#comment-16715688
 ] 

Hudson commented on TIKA-2795:
--

SUCCESS: Integrated in Jenkins build tika-branch-1x #139 (See 
[https://builds.apache.org/job/tika-branch-1x/139/])
TIKA-2795 -- swapped memorymapped buffer for traditional open, write (tallison: 
[https://github.com/apache/tika/commit/8475ddb2bf3aafdbb5aadef77f6888e7e4c4b810])
* (edit) 
tika-server/src/test/java/org/apache/tika/server/TikaServerIntegrationTest.java
* (edit) 
tika-server/src/main/java/org/apache/tika/server/ServerStatusWatcher.java
* (edit) 
tika-server/src/main/java/org/apache/tika/server/TikaServerWatchDog.java



[jira] [Comment Edited] (TIKA-2795) Error starting Tika 2.0 server with -spawnChild on Ubuntu

2018-12-10 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16715588#comment-16715588
 ] 

Tim Allison edited comment on TIKA-2795 at 12/10/18 9:38 PM:
-

{{DELETE_ON_CLOSE}} is a bad idea on a file used by different processes.  
There's no guarantee that the file won't be deleted before close.  On Linux, 
the process that opened the file with {{DELETE_ON_CLOSE}} has not yet closed 
it and can still see and write to it, but at the same time the other process 
can't find the file.  The behavior is different on Windows.
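
A minimal sketch of the hazard described above, assuming the option in 
question is Java NIO's StandardOpenOption.DELETE_ON_CLOSE.  The class and 
method names (DeleteOnCloseSketch, probe) are hypothetical and are not taken 
from the Tika code base.  It checks, from within one process, whether the 
path can still be found while the channel is open and after it is closed; a 
lookup by path from a second process should behave much the same way.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

class DeleteOnCloseSketch {

    /**
     * Opens (and creates) the file with DELETE_ON_CLOSE, writes one byte,
     * and returns {visibleWhileOpen, visibleAfterClose}.
     */
    static boolean[] probe(Path file) {
        boolean visibleWhileOpen = false;
        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.DELETE_ON_CLOSE)) {
            // The channel itself stays usable for the process that opened it.
            channel.write(ByteBuffer.wrap(new byte[] {1}));
            // Platform dependent: the path may already be gone at this point.
            visibleWhileOpen = Files.exists(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The channel has been closed here, so the file should be deleted.
        return new boolean[] {visibleWhileOpen, Files.exists(file)};
    }

    public static void main(String[] args) {
        // Hypothetical file name, used only for this demonstration.
        Path file = Paths.get(System.getProperty("java.io.tmpdir"),
                "delete-on-close-sketch-" + System.nanoTime());
        boolean[] result = probe(file);
        System.out.println("visible while open:  " + result[0]);
        System.out.println("visible after close: " + result[1]);
    }
}
```

The first value is the interesting one: it is expected to vary by platform, 
which is why a second process cannot rely on finding the file by path.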

I tried a number of options.  When two processes shared a memory-mapped tmp 
file _and_ I had to allow either or both processes to be killed, I found it 
very hard to guarantee that the tmp file was deleted and that nothing 
seriously bad happened.  Different platforms handle the details in different 
ways.

For now, I've chosen the simplest option: the writer opens the file, waits 
for trylock, writes the status and closes the file.  The reader does the 
same.  This appears to avoid synchronization issues within a process (if more 
than one thread calls close()) and across processes.  I'm sure that we can 
improve the efficiency of this at some point, but for now it shouldn't matter.
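
The open, trylock, write, close sequence described above could be sketched 
roughly as follows.  This is an illustration under assumptions, not the code 
from the commit: the class StatusFileSketch, its method names, the retry 
settings and the choice of a single long as the status value are all 
hypothetical.  Only FileChannel.open, FileChannel.tryLock and the related 
java.nio calls are real JDK APIs.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

class StatusFileSketch {
    // Hypothetical retry settings for waiting on the lock.
    static final int MAX_TRIES = 100;
    static final long SLEEP_MS = 20;

    /** Polls tryLock until the lock is granted or the retries run out. */
    static FileLock waitForLock(FileChannel channel, boolean shared) throws IOException {
        for (int i = 0; i < MAX_TRIES; i++) {
            try {
                FileLock lock = channel.tryLock(0L, Long.MAX_VALUE, shared);
                if (lock != null) {
                    return lock; // null means another process holds the lock
                }
            } catch (OverlappingFileLockException e) {
                // Another thread in this JVM holds the lock; retry below.
            }
            try {
                Thread.sleep(SLEEP_MS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IOException("interrupted while waiting for lock", e);
            }
        }
        throw new IOException("could not lock status file");
    }

    /** Writer: open, wait for the lock, write the status, close. */
    static void writeStatus(Path file, long status) {
        try (FileChannel channel = FileChannel.open(file,
                     StandardOpenOption.CREATE, StandardOpenOption.WRITE);
             FileLock lock = waitForLock(channel, false)) {
            channel.truncate(0); // truncate only once the lock is held
            ByteBuffer buffer = ByteBuffer.allocate(Long.BYTES);
            buffer.putLong(status);
            buffer.flip();
            while (buffer.hasRemaining()) {
                channel.write(buffer);
            }
            channel.force(true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reader: open, wait for the lock, read the status, close. */
    static long readStatus(Path file) {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ);
             FileLock lock = waitForLock(channel, true)) {
            ByteBuffer buffer = ByteBuffer.allocate(Long.BYTES);
            while (buffer.hasRemaining() && channel.read(buffer) != -1) {
                // keep reading until the buffer is full or the file ends
            }
            if (buffer.hasRemaining()) {
                throw new IOException("status file is incomplete");
            }
            buffer.flip();
            return buffer.getLong();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical file name, used only for this demonstration.
        Path file = Paths.get(System.getProperty("java.io.tmpdir"),
                "status-file-sketch-" + System.nanoTime());
        writeStatus(file, 42L);
        System.out.println("status read back: " + readStatus(file));
        file.toFile().delete();
    }
}
```

The sketch takes an exclusive lock for the writer and a shared lock for the 
reader; taking an exclusive lock on both sides would also fit the 
description.  Nothing is held open between calls, so killing either process 
leaves no mapped buffer or pending delete behind.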


was (Author: talli...@mitre.org):
{{DELETE_ON_CLOSE}} is a bad idea on a file used by different processes.  I 
tried a number of options, and when two processes shared a memory-mapped tmp 
file, I found it very hard to guarantee that the tmp file was deleted and 
that nothing seriously bad happened.  Different platforms handle the details 
in different ways.

For now, I've chosen the simplest option, which is the writer opens the file, 
waits for trylock, writes the status and closes the file.  The reader does the 
same.  This appears to avoid synchronization issues within a process (if more 
than one thread calls close()) and across processes.


[jira] [Commented] (TIKA-2795) Error starting Tika 2.0 server with -spawnChild on Ubuntu

2018-12-10 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16715588#comment-16715588
 ] 

Tim Allison commented on TIKA-2795:
---

{{DELETE_ON_CLOSE}} is a bad idea on a file used by different processes.  I 
tried a number of options, and when two processes shared a memory-mapped tmp 
file, I found it very hard to guarantee that the tmp file was deleted and 
that nothing seriously bad happened.  Different platforms handle the details 
in different ways.

For now, I've chosen the simplest option, which is the writer opens the file, 
waits for trylock, writes the status and closes the file.  The reader does the 
same.  This appears to avoid synchronization issues within a process (if more 
than one thread calls close()) and across processes.

> Error starting Tika 2.0 server with -spawnChild on Ubuntu
> -
>
> Key: TIKA-2795
> URL: https://issues.apache.org/jira/browse/TIKA-2795
> Project: Tika
>  Issue Type: Bug
>  Components: server
>Affects Versions: 2.0
>Reporter: Mario Bisonti
>Priority: Major
>
> Hallo.
> I triend to download Tika server 2.0.0 from here:
> [https://builds.apache.org/job/Tika-trunk/lastStableBuild/org.apache.tika$tika-server/]
>  
> I tried to start on my Ubuntu server but with the -spawnChild, it doesn't 
> work.
>  
> sudo java -jar /opt/tika/tika-server-2.0.0-20181203.205013-340.jar -spawnChild
> Dec 05, 2018 2:22:32 PM org.apache.tika.config.InitializableProblemHandler$3 
> handleInitializableProblem
> WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
> See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
> for optional dependencies.
> Dec 05, 2018 2:22:32 PM org.apache.tika.config.InitializableProblemHandler$3 
> handleInitializableProblem
> WARNING: org.xerial's sqlite-jdbc is not loaded.
> Please provide the jar on your classpath to parse sqlite files.
> See tika-parsers/pom.xml for the correct version.
> INFO Starting Apache Tika 2.0.0-SNAPSHOT server
> INFO server watch dog is starting up
> Dec 05, 2018 2:22:32 PM org.apache.tika.config.InitializableProblemHandler$3 
> handleInitializableProblem
> WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
> See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
> for optional dependencies.
> Dec 05, 2018 2:22:32 PM org.apache.tika.config.InitializableProblemHandler$3 
> handleInitializableProblem
> WARNING: org.xerial's sqlite-jdbc is not loaded.
> Please provide the jar on your classpath to parse sqlite files.
> See tika-parsers/pom.xml for the correct version.
> INFO Starting Apache Tika 2.0.0-SNAPSHOT server
> java.nio.file.NoSuchFileException: 
> /tmp/tika-server-child-process-mmap-2180120677326747096
>  at 
> java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
>  at 
> java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
>  at 
> java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116)
>  at 
> java.base/sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:55)
>  at 
> java.base/sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:145)
>  at 
> java.base/sun.nio.fs.LinuxFileSystemProvider.readAttributes(LinuxFileSystemProvider.java:99)
>  at java.base/java.nio.file.Files.readAttributes(Files.java:1755)
>  at java.base/java.nio.file.Files.size(Files.java:2372)
>  at 
> org.apache.tika.server.TikaServerWatchDog$ChildProcess.<init>(TikaServerWatchDog.java:234)
>  at 
> org.apache.tika.server.TikaServerWatchDog$ChildProcess.<init>(TikaServerWatchDog.java:210)
>  at 
> org.apache.tika.server.TikaServerWatchDog.execute(TikaServerWatchDog.java:66)
>  at org.apache.tika.server.TikaServerCli.execute(TikaServerCli.java:146)
>  at org.apache.tika.server.TikaServerCli.main(TikaServerCli.java:127)
> ERROR Can't start:
> java.nio.file.NoSuchFileException: 
> /tmp/tika-server-child-process-mmap-2180120677326747096
> at 
> java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
>  at 
> java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
>  at 
> java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116)
>  at 
> java.base/sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:55)
>  at 
> java.base/sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:145)
>  at 
> java.base/sun.nio.fs.LinuxFileSystem

[jira] [Commented] (TIKA-2795) Error starting Tika 2.0 server with -spawnChild on Ubuntu

2018-12-07 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16713155#comment-16713155
 ] 

Hudson commented on TIKA-2795:
--

UNSTABLE: Integrated in Jenkins build tika-branch-1x #138 (See 
[https://builds.apache.org/job/tika-branch-1x/138/])
TIKA-2795 -- catch IOException if child deletes shared file (tallison: 
[https://github.com/apache/tika/commit/582a1d441ea8240e06a552c4cfa315439ea47a45])
* (edit) 
tika-server/src/main/java/org/apache/tika/server/TikaServerWatchDog.java
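
The commit message above describes catching an IOException when the child
process deletes the shared file. A minimal sketch of that idea follows. It is
not the actual TikaServerWatchDog change, and the class and method names
(SharedFileProbe, sizeOrMissing) are hypothetical. It only shows how a
Files.size call like the one in the reported stack trace can be guarded so
that a vanished file does not abort startup.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration; not code from the Tika repository.
final class SharedFileProbe {

    /** Returns the file size in bytes, or -1 if the file is gone or unreadable. */
    static long sizeOrMissing(Path sharedFile) {
        try {
            return Files.size(sharedFile);
        } catch (IOException e) {
            // NoSuchFileException is a subclass of IOException, so a file that
            // another process deleted in the meantime ends up here as well.
            return -1L;
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("shared-file-probe", ".tmp");
        System.out.println(sizeOrMissing(tmp)); // size of the freshly created file
        Files.delete(tmp);
        System.out.println(sizeOrMissing(tmp)); // file is gone; no exception escapes
    }
}
```

Returning a sentinel keeps the sketch short; a caller could equally retry or
log the condition, depending on how the shared file is meant to be used.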



[jira] [Updated] (TIKA-2795) Error starting Tika 2.0 server with -spawnChild on Ubuntu

2018-12-05 Thread Mario Bisonti (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mario Bisonti updated TIKA-2795:

Description: 
Hello.

I tried to download Tika server 2.0.0 from here:

[https://builds.apache.org/job/Tika-trunk/lastStableBuild/org.apache.tika$tika-server/]
 

I tried to start it on my Ubuntu server, but with the -spawnChild option it doesn't work.

 

sudo java -jar /opt/tika/tika-server-2.0.0-20181203.205013-340.jar -spawnChild
Dec 05, 2018 2:22:32 PM org.apache.tika.config.InitializableProblemHandler$3 
handleInitializableProblem
WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

Dec 05, 2018 2:22:32 PM org.apache.tika.config.InitializableProblemHandler$3 
handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.
INFO Starting Apache Tika 2.0.0-SNAPSHOT server
INFO server watch dog is starting up
Dec 05, 2018 2:22:32 PM org.apache.tika.config.InitializableProblemHandler$3 
handleInitializableProblem
WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

Dec 05, 2018 2:22:32 PM org.apache.tika.config.InitializableProblemHandler$3 
handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.
INFO Starting Apache Tika 2.0.0-SNAPSHOT server
java.nio.file.NoSuchFileException: 
/tmp/tika-server-child-process-mmap-2180120677326747096
 at 
java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
 at 
java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
 at 
java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116)
 at 
java.base/sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:55)
 at 
java.base/sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:145)
 at 
java.base/sun.nio.fs.LinuxFileSystemProvider.readAttributes(LinuxFileSystemProvider.java:99)
 at java.base/java.nio.file.Files.readAttributes(Files.java:1755)
 at java.base/java.nio.file.Files.size(Files.java:2372)
 at 
org.apache.tika.server.TikaServerWatchDog$ChildProcess.<init>(TikaServerWatchDog.java:234)
 at 
org.apache.tika.server.TikaServerWatchDog$ChildProcess.<init>(TikaServerWatchDog.java:210)
 at 
org.apache.tika.server.TikaServerWatchDog.execute(TikaServerWatchDog.java:66)
 at org.apache.tika.server.TikaServerCli.execute(TikaServerCli.java:146)
 at org.apache.tika.server.TikaServerCli.main(TikaServerCli.java:127)
ERROR Can't start:

java.nio.file.NoSuchFileException: 
/tmp/tika-server-child-process-mmap-2180120677326747096

at 
java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
 at 
java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
 at 
java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116)
 at 
java.base/sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:55)
 at 
java.base/sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:145)
 at 
java.base/sun.nio.fs.LinuxFileSystemProvider.readAttributes(LinuxFileSystemProvider.java:99)
 at java.base/java.nio.file.Files.readAttributes(Files.java:1755)
 at java.base/java.nio.file.Files.size(Files.java:2372)
 at 
org.apache.tika.server.TikaServerWatchDog$ChildProcess.<init>(TikaServerWatchDog.java:234)
 at 
org.apache.tika.server.TikaServerWatchDog$ChildProcess.<init>(TikaServerWatchDog.java:210)
 at 
org.apache.tika.server.TikaServerWatchDog.execute(TikaServerWatchDog.java:66)
 at org.apache.tika.server.TikaServerCli.execute(TikaServerCli.java:146)
 at org.apache.tika.server.TikaServerCli.main(TikaServerCli.java:127)
administrator@sengvivv02:/opt/tika$

 

 

On a Windows machine, by contrast, it starts correctly.

Thanks a lot

Mario


[jira] [Created] (TIKA-2795) Error starting Tika 2.0 server with -spawnChild on Ubuntu

2018-12-05 Thread Mario Bisonti (JIRA)
Mario Bisonti created TIKA-2795:
---

 Summary: Error starting Tika 2.0 server with -spawnChild on Ubuntu
 Key: TIKA-2795
 URL: https://issues.apache.org/jira/browse/TIKA-2795
 Project: Tika
  Issue Type: Bug
  Components: server
Affects Versions: 2.0
Reporter: Mario Bisonti






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Build with Java 10, but target 8 in Tika 2.0?

2018-06-20 Thread Bob Paulin
I'd also be a bit concerned about ONLY compiling with Java 10.  There are
some changes to how resources are accessed across module boundaries that
could break existing functionality if folks decide to RUN on Java 9 or
later using the module system.  I covered some of these in my 2016
ApacheCon talk [1].  I've got a few of the needed code changes, based on
the old 2.0 branch [2], but there may be more.  That said, it might be
good to start with just the Automatic-Module-Name entry in the manifest
[3], and then add module-info.java once we've developed some tests that
run Tika as a module.  Thoughts on this approach?


[1]
https://www.slideshare.net/rpaulin1/clipboards/tika-java-9-resource-loading

[2] https://github.com/bobpaulin/tika/tree/2.x_java9

[3] http://branchandbound.net/blog/java/2017/12/automatic-module-name/
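
For illustration, the Automatic-Module-Name entry mentioned above is an
ordinary main attribute in a jar's MANIFEST.MF. The sketch below assumes a
made-up module name (org.example.parsers) and hypothetical class and method
names; it parses an in-memory manifest with java.util.jar.Manifest and reads
the entry back.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.jar.Manifest;

// Hypothetical demo class; the module name used in main is an assumption.
final class AutomaticModuleNameDemo {

    /** Returns the Automatic-Module-Name main attribute, or null if absent. */
    static String moduleNameOf(String manifestText) {
        try {
            Manifest manifest = new Manifest(
                    new ByteArrayInputStream(manifestText.getBytes(StandardCharsets.UTF_8)));
            return manifest.getMainAttributes().getValue("Automatic-Module-Name");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Each manifest line, including the last one, must end with a newline.
        String manifestText =
                "Manifest-Version: 1.0\n"
              + "Automatic-Module-Name: org.example.parsers\n";
        System.out.println(moduleNameOf(manifestText));
    }
}
```

Because the entry lives only in the manifest, a jar carrying it still loads
unchanged on the classpath of a Java 8 runtime.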


On 6/19/2018 4:10 PM, Tim Allison wrote:
>  Doh...sorry, right.
>
> Ken,
> Tongue in cheek answer: so that I become a little less stupid about
> modern java...see above. :)
>   Real answer: _if_ we can pull it off, given that we plan to
> modularize our parsers anyways, it would be nice to use the language
> support in java >= 9 for actual modularity. I know we have to fix some
> split packages and possibly rename some of our packages.
> I _might_ find some time soon to focus on merging Bob’s awesome 2.0
> work into master, and I thought it would be a good time to try it.
>
> Nick,
>This is good to know.  Thank you!
>
> Cheers,
>   Tim
>
> On Tue, Jun 19, 2018 at 4:59 PM Nick Burch  wrote:
>
>> On 19/06/18 20:46, Tim Allison wrote:
>>> What would you think of requiring Java 10 to build Tika 2.0 but still
>>> setting 8 as the target?  This would allow us to bake modularity in now.
>>> Given that I haven't actually tried modularizing/jigsawizing Tika yet,
>> this
>>> could be a complete disaster, of course. :)
>> I'm not sure how well it'd work given that most of our dependencies
>> aren't java module-ized?
>>
>> David North (from POI) has done quite a bit on java modules for existing
>> codebases, and hit some snags, and IIRC commons have had problems too. I
>> don't mind either way though!
>>
>> Nick
>>






Re: Build with Java 10, but target 8 in Tika 2.0?

2018-06-19 Thread Nick Burch

On 19/06/18 20:46, Tim Allison wrote:

> What would you think of requiring Java 10 to build Tika 2.0 but still
> setting 8 as the target?  This would allow us to bake modularity in now.
> Given that I haven't actually tried modularizing/jigsawizing Tika yet, this
> could be a complete disaster, of course. :)


I'm not sure how well it'd work given that most of our dependencies 
aren't java module-ized?


David North (from POI) has done quite a bit on java modules for existing 
codebases, and hit some snags, and IIRC commons have had problems too. I 
don't mind either way though!


Nick


Re: Build with Java 10, but target 8 in Tika 2.0?

2018-06-19 Thread Uwe Schindler
Don't set "target" to 8; use the "release" flag instead! This ensures that the 
code is compiled against the Java 8 method signatures. For the module info, 
just add a separate compilation source with release 9 or 10 and jar them together.
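
A minimal sketch of why the method signatures matter, using a well-known JDK
example rather than Tika code (the class and method names are hypothetical).
Since Java 9, ByteBuffer overrides flip() with a ByteBuffer return type. Code
built on a newer JDK with only source/target set to 8 links a plain
buf.flip() against that override and can fail with NoSuchMethodError on a
Java 8 runtime. Compiling with the release flag set to 8 avoids this; the
cast below is the manual workaround.

```java
import java.nio.Buffer;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration.
final class ReleaseFlagDemo {

    /** Writes the text into a buffer, flips it, and reads it back. */
    static String roundTrip(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(bytes.length);
        buf.put(bytes);
        // Casting to Buffer pins the call to Buffer.flip(), which exists on
        // Java 8 as well. With the release flag set to 8 the compiler resolves
        // a plain buf.flip() the same way, so the cast becomes unnecessary.
        ((Buffer) buf).flip();
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return new String(out, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("tika"));
    }
}
```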

Uwe

On June 19, 2018 7:46:45 PM UTC, Tim Allison wrote:
>All,
>  What would you think of requiring Java 10 to build Tika 2.0 but still
>setting 8 as the target?  This would allow us to bake modularity in
>now.
>Given that I haven't actually tried modularizing/jigsawizing Tika yet,
>this
>could be a complete disaster, of course. :)
>
> Cheers,
>
>  Tim

--
Uwe Schindler
Achterdiek 19, 28357 Bremen
https://www.thetaphi.de

Re: Build with Java 10, but target 8 in Tika 2.0?

2018-06-19 Thread Ken Krugler
Hi Tim,

What’s the issue with needing Java 10 for the build?

And yes, I think I can install it, but I’m still on 1.8 :)

— Ken

> On Jun 19, 2018, at 12:46 PM, Tim Allison  wrote:
> 
> All,
>  What would you think of requiring Java 10 to build Tika 2.0 but still
> setting 8 as the target?  This would allow us to bake modularity in now.
> Given that I haven't actually tried modularizing/jigsawizing Tika yet, this
> could be a complete disaster, of course. :)
> 
> Cheers,
> 
>  Tim

--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
Custom big data solutions & training
Flink, Solr, Hadoop, Cascading & Cassandra



Build with Java 10, but target 8 in Tika 2.0?

2018-06-19 Thread Tim Allison
All,
  What would you think of requiring Java 10 to build Tika 2.0 but still
setting 8 as the target?  This would allow us to bake modularity in now.
Given that I haven't actually tried modularizing/jigsawizing Tika yet, this
could be a complete disaster, of course. :)

 Cheers,

  Tim


[jira] [Closed] (TIKA-2083) Tika 2.0 - Audit master branch against 2.x branch

2018-02-05 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison closed TIKA-2083.
-
Resolution: Fixed

The current plan is to use the 2.x branch as a model and to redo 
[~bobpaulin]'s awesome work on {{master}}.  This will be more work, but we will 
not risk losing anything done on master.

> Tika 2.0 - Audit master branch against 2.x branch
> -
>
> Key: TIKA-2083
> URL: https://issues.apache.org/jira/browse/TIKA-2083
> Project: Tika
>  Issue Type: Sub-task
>Affects Versions: 2.0
>Reporter: Bob Paulin
>Assignee: Bob Paulin
>Priority: Blocker
> Fix For: 2.0
>
>
> At this point Tika has been doing parallel development on master and the 2.x 
> branch for about 9 months.  We should audit commit logs for that time to make 
> a best effort to identify any commits that may not have been applied in 2.x.  
> This task should be done prior to the 2.0 release.





[jira] [Updated] (TIKA-1983) Tika 2.0 - remove tika-app's legacy server

2018-02-05 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1983:
--
Issue Type: Sub-task  (was: Task)
Parent: TIKA-2085

> Tika 2.0 - remove tika-app's legacy server 
> ---
>
> Key: TIKA-1983
> URL: https://issues.apache.org/jira/browse/TIKA-1983
> Project: Tika
>  Issue Type: Sub-task
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Trivial
> Fix For: 2.0.0
>
>
> In the Tika 2.0 road map 
> [discussion|https://wiki.apache.org/tika/Tika2_0RoadMap], we decided to 
> remove tika-app's legacy server.  Users should migrate to the tika-server 
> package.





[jira] [Commented] (TIKA-1983) Tika 2.0 - remove tika-app's legacy server

2018-02-05 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16352513#comment-16352513
 ] 

Hudson commented on TIKA-1983:
--

SUCCESS: Integrated in Jenkins build Tika-trunk #1430 (See 
[https://builds.apache.org/job/Tika-trunk/1430/])
TIKA-1983 -- remove deprecated server from tika-app in Tika 2.0 (tallison: 
[https://github.com/apache/tika/commit/5244cde44e18f2e4565c2b0e29e8083575429084])
* (edit) tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java


> Tika 2.0 - remove tika-app's legacy server 
> ---
>
> Key: TIKA-1983
> URL: https://issues.apache.org/jira/browse/TIKA-1983
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Trivial
> Fix For: 2.0.0
>
>
> In the Tika 2.0 road map 
> [discussion|https://wiki.apache.org/tika/Tika2_0RoadMap], we decided to 
> remove tika-app's legacy server.  Users should migrate to the tika-server 
> package.





[jira] [Resolved] (TIKA-1983) Tika 2.0 - remove tika-app's legacy server

2018-02-05 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-1983.
---
   Resolution: Fixed
 Assignee: Tim Allison
Fix Version/s: 2.0.0

Fixed on {{master}}.

> Tika 2.0 - remove tika-app's legacy server 
> ---
>
> Key: TIKA-1983
> URL: https://issues.apache.org/jira/browse/TIKA-1983
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Trivial
> Fix For: 2.0.0
>
>
> In the Tika 2.0 road map 
> [discussion|https://wiki.apache.org/tika/Tika2_0RoadMap], we decided to 
> remove tika-app's legacy server.  Users should migrate to the tika-server 
> package.





[jira] [Reopened] (TIKA-1983) Tika 2.0 - remove tika-app's legacy server

2018-02-05 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison reopened TIKA-1983:
---

This was done on the initial 2.x branch.  We need to redo it on master.

> Tika 2.0 - remove tika-app's legacy server 
> ---
>
> Key: TIKA-1983
> URL: https://issues.apache.org/jira/browse/TIKA-1983
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Trivial
>
> In the Tika 2.0 road map 
> [discussion|https://wiki.apache.org/tika/Tika2_0RoadMap], we decided to 
> remove tika-app's legacy server.  Users should migrate to the tika-server 
> package.





[jira] [Resolved] (TIKA-1974) Tika 2.0 - remove deprecated metadata properties

2018-01-31 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-1974.
---
   Resolution: Fixed
Fix Version/s: 2.0

If anyone has feedback on this, we can reopen this...or open a separate ticket.

> Tika 2.0 - remove deprecated metadata properties
> 
>
> Key: TIKA-1974
> URL: https://issues.apache.org/jira/browse/TIKA-1974
> Project: Tika
>  Issue Type: Sub-task
>Reporter: Tim Allison
>Priority: Blocker
> Fix For: 2.0
>
>
> We have quite a few metadata properties that are deprecated.  We should 
> remove them for 2.0.





[jira] [Commented] (TIKA-1974) Tika 2.0 - remove deprecated metadata properties

2018-01-26 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16341673#comment-16341673
 ] 

Hudson commented on TIKA-1974:
--

SUCCESS: Integrated in Jenkins build Tika-trunk #1426 (See 
[https://builds.apache.org/job/Tika-trunk/1426/])
TIKA-1974 -- remove deprecated metadata properties/keys for Tika 2.0 (tallison: 
[https://github.com/apache/tika/commit/10a8eec119c7a77be76000b30aaffb96a552cc44])
* (edit) tika-core/src/main/java/org/apache/tika/detect/NameDetector.java
* (edit) tika-xmp/src/test/java/org/apache/tika/xmp/XMPMetadataTest.java
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/pkg/ZipParserTest.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/rtf/TextExtractor.java
* (edit) tika-app/src/test/java/org/apache/tika/cli/TikaCLITest.java
* (edit) 
tika-core/src/main/java/org/apache/tika/metadata/TikaCoreProperties.java
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/image/xmp/JempboxExtractorTest.java
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/microsoft/OutlookParserTest.java
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/image/ImageMetadataExtractor.java
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/jpeg/JpegParserTest.java
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
* (edit) 
tika-core/src/test/java/org/apache/tika/mime/ProbabilisticMimeDetectionTestWithTika.java
* (edit) tika-core/src/main/java/org/apache/tika/Tika.java
* (edit) tika-core/src/main/java/org/apache/tika/io/FilenameUtils.java
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/MetadataExtractor.java
* (edit) tika-core/src/test/java/org/apache/tika/mime/MimeTypesReaderTest.java
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/odf/ODFParserTest.java
* (edit) tika-core/src/main/java/org/apache/tika/metadata/DublinCore.java
* (edit) tika-core/src/test/java/org/apache/tika/detect/NameDetectorTest.java
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/rtf/RTFEmbObjHandler.java
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/SXWPFExtractorTest.java
* (edit) 
tika-batch/src/main/java/org/apache/tika/batch/fs/FSDocumentSelector.java
* (edit) tika-core/src/main/java/org/apache/tika/metadata/RTFMetadata.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlHandler.java
* (edit) 
tika-parsers/src/test/java/org/apache/tika/detect/TestContainerAwareDetector.java
* (edit) tika-example/src/main/java/org/apache/tika/example/ParsingExample.java
* (edit) tika-core/src/main/java/org/apache/tika/mime/MimeTypes.java
* (edit) 
tika-core/src/test/java/org/apache/tika/mime/ProbabilisticMimeDetectionTest.java
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/BouncyCastleDigestingParserTest.java
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ExcelParserTest.java
* (edit) 
tika-server/src/main/java/org/apache/tika/server/resource/DetectorResource.java
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/odf/OpenDocumentParser.java
* (edit) tika-core/src/test/java/org/apache/tika/mime/MimeDetectionTest.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/xml/DcXMLParser.java
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ProjectParserTest.java
* (edit) tika-core/src/main/java/org/apache/tika/metadata/Message.java
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/pkg/CompressorParser.java
* (edit) tika-batch/src/main/java/org/apache/tika/batch/fs/FSFileResource.java
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/iptc/IptcAnpaParser.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/mbox/MboxParser.java
* (edit) tika-eval/src/main/java/org/apache/tika/eval/io/ExtractReader.java
* (edit) tika-bundle/src/test/java/org/apache/tika/bundle/BundleIT.java
* (delete) tika-core/src/main/java/org/apache/tika/metadata/MSOffice.java
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/odf/OpenDocumentMetaParser.java
* (edit) 
tika-core/src/main/java/org/apache/tika/extractor/ParsingEmbeddedDocumentExtractor.java
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/image/xmp/JempboxExtractor.java
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/mp3/Mp3ParserTest.java
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/AutoDetectParserTest.java
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/executable/ExecutableParserTest.java
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/pkg/PackageParser.java
* (edit) 
tika-server/src/main/java/org/apache/tika/server/resource/TikaResource.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/hdf/HDFParser.java
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/rtf

[jira] [Comment Edited] (TIKA-1974) Tika 2.0 - remove deprecated metadata properties

2018-01-26 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16341416#comment-16341416
 ] 

Tim Allison edited comment on TIKA-1974 at 1/26/18 6:53 PM:


I just pushed the [first draft of 
this|https://github.com/apache/tika/commit/e6e3b8817053e981f3843f1d3b7055b4ae30ed73].
  Please take a look and let me know if I botched anything.  [~rgauss]...if you 
have any time, I'd very much appreciate your feedback!


was (Author: talli...@mitre.org):
I just pushed the first draft of this.  Please take a look and let me know if I 
botched anything.  [~rgauss]...if you have any time, I'd very much appreciate 
your feedback!

> Tika 2.0 - remove deprecated metadata properties
> 
>
> Key: TIKA-1974
> URL: https://issues.apache.org/jira/browse/TIKA-1974
> Project: Tika
>  Issue Type: Sub-task
>Reporter: Tim Allison
>Priority: Blocker
>
> We have quite a few metadata properties that are deprecated.  We should 
> remove them for 2.0.





[jira] [Updated] (TIKA-1974) Tika 2.0 - remove deprecated metadata properties

2018-01-26 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1974:
--
Priority: Blocker  (was: Major)

> Tika 2.0 - remove deprecated metadata properties
> 
>
> Key: TIKA-1974
> URL: https://issues.apache.org/jira/browse/TIKA-1974
> Project: Tika
>  Issue Type: Sub-task
>Reporter: Tim Allison
>Priority: Blocker
>
> We have quite a few metadata properties that are deprecated.  We should 
> remove them for 2.0.





[jira] [Commented] (TIKA-1974) Tika 2.0 - remove deprecated metadata properties

2018-01-26 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16341416#comment-16341416
 ] 

Tim Allison commented on TIKA-1974:
---

I just pushed the first draft of this.  Please take a look and let me know if I 
botched anything.  [~rgauss]...if you have any time, I'd very much appreciate 
your feedback!

> Tika 2.0 - remove deprecated metadata properties
> 
>
> Key: TIKA-1974
> URL: https://issues.apache.org/jira/browse/TIKA-1974
> Project: Tika
>  Issue Type: Sub-task
>Reporter: Tim Allison
>Priority: Major
>
> We have quite a few metadata properties that are deprecated.  We should 
> remove them for 2.0.





[jira] [Commented] (TIKA-1974) Tika 2.0 - remove deprecated metadata properties

2018-01-24 Thread Tim Allison (JIRATEST)

[ 
https://issues-test.apache.org/jira/browse/TIKA-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16265599#comment-16265599
 ] 

Tim Allison commented on TIKA-1974:
---

Another question: we're currently making quite a few metadata properties 
available in {{Metadata}} via {{implements}}:
{noformat}
public class Metadata implements CreativeCommons, Geographic, HttpHeaders,
Message, ClimateForcast, TIFF, TikaMimeKeys,...
{noformat}
Do we still want to do this?  If so, do we want to add TikaCoreProperties?
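
As a minimal sketch of the pattern in question: when a class implements an
interface that only declares constants, those constants become reachable
through the class itself. The names below (DemoGeographic, DemoHttpHeaders,
DemoMetadata) and the key strings are hypothetical stand-ins, not Tika's real
definitions.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a metadata container that inherits key constants.
final class DemoMetadata implements DemoGeographic, DemoHttpHeaders {
    private final Map<String, String> values = new HashMap<>();

    void set(String key, String value) {
        values.put(key, value);
    }

    String get(String key) {
        return values.get(key);
    }

    public static void main(String[] args) {
        DemoMetadata metadata = new DemoMetadata();
        // The constant is declared in DemoGeographic but is also visible
        // as DemoMetadata.ALTITUDE because of the implements clause.
        metadata.set(DemoMetadata.ALTITUDE, "120");
        System.out.println(metadata.get(DemoGeographic.ALTITUDE));
    }
}

// Constant-only interfaces; the key strings are made up for this sketch.
interface DemoGeographic {
    String ALTITUDE = "demo:altitude";
}

interface DemoHttpHeaders {
    String CONTENT_TYPE = "Content-Type";
}
```

The convenience comes at a cost: every constant of every implemented
interface becomes part of the class's public surface, and two interfaces
declaring the same constant name make the unqualified reference ambiguous.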

> Tika 2.0 - remove deprecated metadata properties
> 
>
> Key: TIKA-1974
> URL: https://issues-test.apache.org/jira/browse/TIKA-1974
> Project: Tika
>  Issue Type: Sub-task
>Reporter: Tim Allison
>Priority: Major
>
> We have quite a few metadata properties that are deprecated.  We should 
> remove them for 2.0.



--
This message was sent by Atlassian JIRA
(v7.6.0#76001)


[jira] [Comment Edited] (TIKA-1974) Tika 2.0 - remove deprecated metadata properties

2018-01-24 Thread Tim Allison (JIRATEST)

[ 
https://issues-test.apache.org/jira/browse/TIKA-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16265598#comment-16265598
 ] 

Tim Allison edited comment on TIKA-1974 at 1/24/18 2:46 PM:


All,

  I'm picking up work on this again. Any recommendations for the question 
above?  Based on this:
{noformat}
Property ALTITUDE = Geographic.ALTITUDE;{noformat}
I'm guessing that we should go with, e.g.:
{noformat}
Property DESCRIPTION = DublinCore.DESCRIPTION;{noformat}
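
A small sketch of what such an alias amounts to, assuming simplified stand-in
types (DemoProperty, DemoDublinCore and DemoCoreProperties are hypothetical,
not Tika classes): the aliasing constant refers to the very same object, so
code using either name addresses the same metadata key.

```java
// Hypothetical demo; none of these types come from Tika.
final class PropertyAliasDemo {

    /** True when the alias and the original constant are the same object. */
    static boolean isSameKey() {
        return DemoCoreProperties.DESCRIPTION == DemoDublinCore.DESCRIPTION;
    }

    public static void main(String[] args) {
        System.out.println(isSameKey());
        System.out.println(DemoCoreProperties.DESCRIPTION.getName());
    }
}

// Simplified stand-in for a typed metadata key.
final class DemoProperty {
    private final String name;

    DemoProperty(String name) {
        this.name = name;
    }

    String getName() {
        return name;
    }
}

interface DemoDublinCore {
    DemoProperty DESCRIPTION = new DemoProperty("dc:description");
}

interface DemoCoreProperties {
    // Alias only: no second key is created.
    DemoProperty DESCRIPTION = DemoDublinCore.DESCRIPTION;
}
```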


was (Author: talli...@mitre.org):
All,

  I'm picking up work on this again, any recs for the question above?

> Tika 2.0 - remove deprecated metadata properties
> 
>
> Key: TIKA-1974
> URL: https://issues-test.apache.org/jira/browse/TIKA-1974
> Project: Tika
>  Issue Type: Sub-task
>Reporter: Tim Allison
>Priority: Major
>
> We have quite a few metadata properties that are deprecated.  We should 
> remove them for 2.0.





[jira] [Commented] (TIKA-1974) Tika 2.0 - remove deprecated metadata properties

2018-01-24 Thread Tim Allison (JIRATEST)

[ 
https://issues-test.apache.org/jira/browse/TIKA-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16265598#comment-16265598
 ] 

Tim Allison commented on TIKA-1974:
---

All,

  I'm picking up work on this again, any recs for the question above?

> Tika 2.0 - remove deprecated metadata properties
> 
>
> Key: TIKA-1974
> URL: https://issues-test.apache.org/jira/browse/TIKA-1974
> Project: Tika
>  Issue Type: Sub-task
>Reporter: Tim Allison
>Priority: Major
>
> We have quite a few metadata properties that are deprecated.  We should 
> remove them for 2.0.



--
This message was sent by Atlassian JIRA
(v7.6.0#76001)


RE: steps for Tika 2.0

2017-12-13 Thread Allison, Timothy B.
> I have a 4 week old branch that I've started applying changes to that I could 
> push up called tika-2.0-demo-update that might provide a head start for you.  
+1 

Also, if you have the stomach and time to redo your work, please take the lead. 
 

tf is such a massive dependency that we should break it into its own module, 
IMHO.

I just updated the version in master to 2.0.0-SNAPSHOT. 

Onward!

-Original Message-
From: Bob Paulin [mailto:b...@bobpaulin.com] 
Sent: Wednesday, December 13, 2017 9:52 AM
To: dev@tika.apache.org
Subject: Re: steps for Tika 2.0

Hey Tim,

Happy to help with this effort.  I have a 4 week old branch that I've started 
applying changes to that I could push up called tika-2.0-demo-update that might 
provide a head start for you.  I think we do have to make some decisions on 
where the captioning, recognition, and sentiment packages go.  There was quite 
a bit of work done integrating all the cool new tensorflow stuff.  My initial 
thought was tika-parser-advanced-module but we could even consider breaking the 
tensorflow work into its own.  Excited to see this work start in master!

- Bob


On 12/13/2017 7:51 AM, Allison, Timothy B. wrote:
> All,
>
> I just created branch_1x, where we can put bug fixes and anything else we 
> want to go into 1.17.1 or 1.18.  Unless there are objections, I’m going to 
> start making some radical changes to master to prep for 2.0.0-BETA over the 
> next few weeks/months.  These changes are all based on Bob Paulin’s amazing 
> 2.x branch work.
>
> So, rather than having to remember to make updates to 2.x, we’ll now 
> have to remember to make updates to branch_1x.  Hopefully, this will 
> get us to 2.0.0 sooner. 
>
> Just keep working on master and treating it as master. 
>
> Let me know if I mis-remembered our earlier conversations about steps to take 
> for 2.0.0 and/or if you have any other recommendations.  Onward!
>
> Thank you!
>
>  Cheers,
>
> Tim
>
>




Re: steps for Tika 2.0

2017-12-13 Thread Bob Paulin
Hey Tim,

Happy to help with this effort.  I have a 4 week old branch that I've
started applying changes to that I could push up called
tika-2.0-demo-update that might provide a head start for you.  I think
we do have to make some decisions on where the captioning, recognition,
and sentiment packages go.  There was quite a bit of work done
integrating all the cool new tensorflow stuff.  My initial thought was
tika-parser-advanced-module but we could even consider breaking the
tensorflow work into its own.  Excited to see this work start in master!

- Bob


On 12/13/2017 7:51 AM, Allison, Timothy B. wrote:
> All,
>
> I just created branch_1x, where we can put bug fixes and anything else we 
> want to go into 1.17.1 or 1.18.  Unless there are objections, I’m going to 
> start making some radical changes to master to prep for 2.0.0-BETA over the 
> next few weeks/months.  These changes are all based on Bob Paulin’s amazing 
> 2.x branch work.
>
> So, rather than having to remember to make updates to 2.x, we’ll now have to 
> remember to make updates to branch_1x.  Hopefully, this will get us to 2.0.0 
> sooner. 
>
> Just keep working on master and treating it as master. 
>
> Let me know if I mis-remembered our earlier conversations about steps to take 
> for 2.0.0 and/or if you have any other recommendations.  Onward!
>
> Thank you!
>
>  Cheers,
>
> Tim
>
>






steps for Tika 2.0

2017-12-13 Thread Allison, Timothy B.
All,

I just created branch_1x, where we can put bug fixes and anything else we want 
to go into 1.17.1 or 1.18.  Unless there are objections, I’m going to start 
making some radical changes to master to prep for 2.0.0-BETA over the next few 
weeks/months.  These changes are all based on Bob Paulin’s amazing 2.x branch 
work.

So, rather than having to remember to make updates to 2.x, we’ll now have to 
remember to make updates to branch_1x.  Hopefully, this will get us to 2.0.0 
sooner. 

Just keep working on master and treating it as master. 

Let me know if I mis-remembered our earlier conversations about steps to take 
for 2.0.0 and/or if you have any other recommendations.  Onward!

Thank you!

 Cheers,

Tim




Re: Tika 2.0?

2017-09-12 Thread Chris Mattmann
B it is, proceed (



On 9/12/17, 5:10 AM, "Allison, Timothy B." <talli...@mitre.org> wrote:

I'd strongly advocate for 2.  I _think_ the hard work was laying out the 
general structure and adding the ProxyParser workaround.  Copying and 
pasting/reworking into that structure will be: 

A) far less dangerous than 1 
And
B) we'll have a cleaner history.

On A), I know that we didn't add some major components including: 
configurability of parsers, completely cleaned up logging, numerous bug fixes 
and even entire modules (tika-dl).

On B), there were a few times where I "caught a parser up" in 2.0 not by 
individual commits based on the original history but based on a copy/paste from 
the contemporaneous master.  This obliterated the history of some commits on 
the 2.0 branch and would force us to look back at master.

-Original Message-
From: Bob Paulin [mailto:b...@bobpaulin.com] 
Sent: Monday, September 11, 2017 9:48 PM
To: dev@tika.apache.org
Subject: Re: Tika 2.0?

Just so it's clear, are we going to:

1) Rename the 2.0 branch over to master

or

2) Re-apply the changes on master. 

I recall Chris' preference was 1 which would be quicker.  However there are 
very likely missed patches.  2 will be more time consuming but it would be more 
likely to include all the most recent code.  I'm open to either.  Not sure how 
far out of date 2.0 branch is so I defer to Tim on the risk of going with #1.


- Bob


On 9/11/2017 5:15 PM, Chris Mattmann wrote:
> +1000
>
>
>
> On 9/11/17, 12:03 PM, "Allison, Timothy B." <talli...@mitre.org> wrote:
>
> Y, well, I didn't say _which_ September...
> 
> Given my limited availability to work on this in Sept and POI's decision
> to move to Java 1.8, I propose releasing Tika 1.17 after the release of POI
> 3.17 and PDFBox 2.0.8.  This would be the last version of Tika at the Java
> 1.7 level, and then we bump the Java requirement to 1.8, switch master to the
> 2.0 layout and create a 1.x maintenance branch (with Java 1.8) for quick
> critical bug fixes/security vulnerabilities until we release 2.0.
> 
> What do you all think?
> 
>  
> -Original Message-
> From: Allison, Timothy B. [mailto:talli...@mitre.org] 
> Sent: Monday, August 28, 2017 9:33 AM
> To: dev@tika.apache.org
> Subject: Tika 2.0?
> 
> All,
> 
>   We're getting some increasing deltas btwn the 2.0 and trunk branches.
> Many of these are my fault; I gave up making updates to 2.0 around April/May,
> I think.
> 
>   What would people think of punting on some of the desired goals of 2.0
> (e.g. chaining parsers, more structured but still simple metadata) and
> releasing 2.0 soonish...say 2.0-BETA end of September?
> 
>   We've been able to make some major improvements to Tika without
> breaking backwards compatibility.  We _might_ be able to do that with the
> outstanding issues for 2.0 when someone has time.
> 
>   We could also do the upgrade to jdk 8 with Tika 2.0.
> 
>   If this sounds reasonable, I propose creating a 1.x branch from trunk
> for 1.x maintenance and then reworking trunk to the 2.x structure that Bob
> Paulin so elegantly worked out.  I figure we can either copy/paste from trunk
> to the current 2.x (and _hope_ we get all the updates) or use Bob's 2.0 as a
> model for restructuring trunk.  At this point, I'd prefer the second option.
> The key here is to switch "trunk" to 2.0 so that we all have the mindset that
> 2.0 is what we're focused on.
> 
>The main benefit of this proposal is that we'd have a more modular
> Tika soon.
> 
>What do you think?
> 
>  Best,
> 
>Tim
> 
>
>
>







RE: Tika 2.0?

2017-09-12 Thread Allison, Timothy B.
I'd strongly advocate for 2.  I _think_ the hard work was laying out the 
general structure and adding the ProxyParser workaround.  Copying and 
pasting/reworking into that structure will be: 

A) far less dangerous than 1 
And
B) we'll have a cleaner history.

On A), I know that we didn't add some major components including: 
configurability of parsers, completely cleaned up logging, numerous bug fixes 
and even entire modules (tika-dl).

On B), there were a few times where I "caught a parser up" in 2.0 not by 
individual commits based on the original history but based on a copy/paste from 
the contemporaneous master.  This obliterated the history of some commits on 
the 2.0 branch and would force us to look back at master.

-Original Message-
From: Bob Paulin [mailto:b...@bobpaulin.com] 
Sent: Monday, September 11, 2017 9:48 PM
To: dev@tika.apache.org
Subject: Re: Tika 2.0?

Just so it's clear, are we going to:

1) Rename the 2.0 branch over to master

or

2) Re-apply the changes on master. 

I recall Chris' preference was 1 which would be quicker.  However there are very 
likely missed patches.  2 will be more time consuming but it would be more 
likely to include all the most recent code.  I'm open to either.  Not sure how 
far out of date 2.0 branch is so I defer to Tim on the risk of going with #1.


- Bob


On 9/11/2017 5:15 PM, Chris Mattmann wrote:
> +1000
>
>
>
> On 9/11/17, 12:03 PM, "Allison, Timothy B." <talli...@mitre.org> wrote:
>
> Y, well, I didn't say _which_ September...
> 
> Given my limited availability to work on this in Sept and POI's decision 
> to move to Java 1.8, I propose releasing Tika 1.17 after the release of POI 
> 3.17 and PDFBox 2.0.8.  This would be the last version of Tika at the Java 
> 1.7 level, and then we bump the Java requirement to 1.8, switch master to the 
> 2.0 layout and create a 1.x maintenance branch (with Java 1.8) for quick 
> critical bug fixes/security vulnerabilities until we release 2.0.
> 
> What do you all think?
> 
>  
> -Original Message-
> From: Allison, Timothy B. [mailto:talli...@mitre.org] 
>     Sent: Monday, August 28, 2017 9:33 AM
> To: dev@tika.apache.org
> Subject: Tika 2.0?
> 
> All,
> 
>   We're getting some increasing deltas btwn the 2.0 and trunk branches.  
> Many of these are my fault; I gave up making updates to 2.0 around April/May, 
> I think.
> 
>   What would people think of punting on some of the desired goals of 2.0 
> (e.g. chaining parsers, more structured but still simple metadata) and 
> releasing 2.0 soonish...say 2.0-BETA end of September?
> 
>   We've been able to make some major improvements to Tika without 
> breaking backwards compatibility.  We _might_ be able to do that with the 
> outstanding issues for 2.0 when someone has time.
> 
>   We could also do the upgrade to jdk 8 with Tika 2.0.
> 
>   If this sounds reasonable, I propose creating a 1.x branch from trunk 
> for 1.x maintenance and then reworking trunk to the 2.x structure that Bob 
> Paulin so elegantly worked out.  I figure we can either copy/paste from trunk 
> to the current 2.x (and _hope_ we get all the updates) or use Bob's 2.0 as a 
> model for restructuring trunk.  At this point, I'd prefer the second option.  
> The key here is to switch "trunk" to 2.0 so that we all have the mindset that 
> 2.0 is what we're focused on.
> 
>The main benefit of this proposal is that we'd have a more modular 
> Tika soon.
> 
>What do you think?
> 
>  Best,
> 
>Tim
> 
>
>
>




Re: Tika 2.0?

2017-09-11 Thread Bob Paulin
Just so it's clear, are we going to:

1) Rename the 2.0 branch over to master

or

2) Re-apply the changes on master. 

I recall Chris' preference was 1 which would be quicker.  However there
are very likely missed patches.  2 will be more time consuming but it
would be more likely to include all the most recent code.  I'm open to
either.  Not sure how far out of date 2.0 branch is so I defer to Tim on
the risk of going with #1.


- Bob


On 9/11/2017 5:15 PM, Chris Mattmann wrote:
> +1000
>
>
>
> On 9/11/17, 12:03 PM, "Allison, Timothy B." <talli...@mitre.org> wrote:
>
> Y, well, I didn't say _which_ September...
> 
> Given my limited availability to work on this in Sept and POI's decision 
> to move to Java 1.8, I propose releasing Tika 1.17 after the release of POI 
> 3.17 and PDFBox 2.0.8.  This would be the last version of Tika at the Java 
> 1.7 level, and then we bump the Java requirement to 1.8, switch master to the 
> 2.0 layout and create a 1.x maintenance branch (with Java 1.8) for quick 
> critical bug fixes/security vulnerabilities until we release 2.0.
> 
> What do you all think?
> 
>  
> -Original Message-
> From: Allison, Timothy B. [mailto:talli...@mitre.org] 
> Sent: Monday, August 28, 2017 9:33 AM
> To: dev@tika.apache.org
> Subject: Tika 2.0?
> 
> All,
> 
>   We're getting some increasing deltas btwn the 2.0 and trunk branches.  
> Many of these are my fault; I gave up making updates to 2.0 around April/May, 
> I think.
> 
>   What would people think of punting on some of the desired goals of 2.0 
> (e.g. chaining parsers, more structured but still simple metadata) and 
> releasing 2.0 soonish...say 2.0-BETA end of September?
> 
>   We've been able to make some major improvements to Tika without 
> breaking backwards compatibility.  We _might_ be able to do that with the 
> outstanding issues for 2.0 when someone has time.
> 
>   We could also do the upgrade to jdk 8 with Tika 2.0.
> 
>   If this sounds reasonable, I propose creating a 1.x branch from trunk 
> for 1.x maintenance and then reworking trunk to the 2.x structure that Bob 
> Paulin so elegantly worked out.  I figure we can either copy/paste from trunk 
> to the current 2.x (and _hope_ we get all the updates) or use Bob's 2.0 as a 
> model for restructuring trunk.  At this point, I'd prefer the second option.  
> The key here is to switch "trunk" to 2.0 so that we all have the mindset that 
> 2.0 is what we're focused on.
> 
>The main benefit of this proposal is that we'd have a more modular 
> Tika soon.
> 
>What do you think?
> 
>  Best,
> 
>Tim
> 
>
>
>






Re: Tika 2.0?

2017-09-11 Thread Chris Mattmann
+1000



On 9/11/17, 12:03 PM, "Allison, Timothy B." <talli...@mitre.org> wrote:

Y, well, I didn't say _which_ September...

Given my limited availability to work on this in Sept and POI's decision to 
move to Java 1.8, I propose releasing Tika 1.17 after the release of POI 3.17 
and PDFBox 2.0.8.  This would be the last version of Tika at the Java 1.7 
level, and then we bump the Java requirement to 1.8, switch master to the 2.0 
layout and create a 1.x maintenance branch (with Java 1.8) for quick critical 
bug fixes/security vulnerabilities until we release 2.0.

What do you all think?

 
-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org] 
Sent: Monday, August 28, 2017 9:33 AM
To: dev@tika.apache.org
    Subject: Tika 2.0?

All,

  We're getting some increasing deltas btwn the 2.0 and trunk branches.  
Many of these are my fault; I gave up making updates to 2.0 around April/May, I 
think.

  What would people think of punting on some of the desired goals of 2.0 
(e.g. chaining parsers, more structured but still simple metadata) and 
releasing 2.0 soonish...say 2.0-BETA end of September?

  We've been able to make some major improvements to Tika without breaking 
backwards compatibility.  We _might_ be able to do that with the outstanding 
issues for 2.0 when someone has time.

  We could also do the upgrade to jdk 8 with Tika 2.0.

  If this sounds reasonable, I propose creating a 1.x branch from trunk for 
1.x maintenance and then reworking trunk to the 2.x structure that Bob Paulin 
so elegantly worked out.  I figure we can either copy/paste from trunk to the 
current 2.x (and _hope_ we get all the updates) or use Bob's 2.0 as a model for 
restructuring trunk.  At this point, I'd prefer the second option.  The key 
here is to switch "trunk" to 2.0 so that we all have the mindset that 2.0 is 
what we're focused on.

   The main benefit of this proposal is that we'd have a more modular Tika 
soon.

   What do you think?

 Best,

   Tim





RE: Tika 2.0?

2017-09-11 Thread Allison, Timothy B.
Y, well, I didn't say _which_ September...

Given my limited availability to work on this in Sept and POI's decision to 
move to Java 1.8, I propose releasing Tika 1.17 after the release of POI 3.17 
and PDFBox 2.0.8.  This would be the last version of Tika at the Java 1.7 
level, and then we bump the Java requirement to 1.8, switch master to the 2.0 
layout and create a 1.x maintenance branch (with Java 1.8) for quick critical 
bug fixes/security vulnerabilities until we release 2.0.

What do you all think?

 
-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org] 
Sent: Monday, August 28, 2017 9:33 AM
To: dev@tika.apache.org
Subject: Tika 2.0?

All,

  We're getting some increasing deltas btwn the 2.0 and trunk branches.  Many 
of these are my fault; I gave up making updates to 2.0 around April/May, I 
think.

  What would people think of punting on some of the desired goals of 2.0 (e.g. 
chaining parsers, more structured but still simple metadata) and releasing 2.0 
soonish...say 2.0-BETA end of September?

  We've been able to make some major improvements to Tika without breaking 
backwards compatibility.  We _might_ be able to do that with the outstanding 
issues for 2.0 when someone has time.

  We could also do the upgrade to jdk 8 with Tika 2.0.

  If this sounds reasonable, I propose creating a 1.x branch from trunk for 1.x 
maintenance and then reworking trunk to the 2.x structure that Bob Paulin so 
elegantly worked out.  I figure we can either copy/paste from trunk to the 
current 2.x (and _hope_ we get all the updates) or use Bob's 2.0 as a model for 
restructuring trunk.  At this point, I'd prefer the second option.  The key 
here is to switch "trunk" to 2.0 so that we all have the mindset that 2.0 is 
what we're focused on.

   The main benefit of this proposal is that we'd have a more modular Tika soon.

   What do you think?

 Best,

   Tim


Re: Tika 2.0?

2017-08-29 Thread Mattmann, Chris A (3010)
I am cool to finally get on the 2.0 kool aid and execute the plan as described 
by Tim
below for our next release.

+1.

Cheers,
Chris


++
Chris Mattmann, Ph.D.
Principal Data Scientist, Engineering Administrative Office (3010)
Manager, NSF & Open Source Projects Formulation and Development Offices (8212)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 180-503E, Mailstop: 180-503
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++
 

On 8/28/17, 8:12 AM, "Konstantin Gribov" <gros...@gmail.com> wrote:

Tim,

+1 to restructuring master into the 2.x shape. If we can at least migrate
the modularization patches and dependency changes and move to Java 8, it
certainly will be a good step forward and a big reduction of technical debt.

On Mon, 28 Aug 2017, 16:52 Bob Paulin <b...@bobpaulin.com> wrote:

> Tim,
>
> +1 You've done an admirable job of dual maintenance but it sounds like
> it became a heavy tax on development.  Releasing would allow us to get
> back to "trunk" based development again.  Then we could focus on porting
> any missed patches and start looking for any regressions.  I also like
> the idea of picking up Java 8 as many other projects are starting to do
> this.
>
> - Bob
>
>
>
> On 8/28/2017 8:32 AM, Allison, Timothy B. wrote:
> > All,
> >
> >   We're getting some increasing deltas btwn the 2.0 and trunk branches.
> Many of these are my fault; I gave up making updates to 2.0 around
> April/May, I think.
> >
> >   What would people think of punting on some of the desired goals of 2.0
> (e.g. chaining parsers, more structured but still simple metadata) and
> releasing 2.0 soonish...say 2.0-BETA end of September?
> >
> >   We've been able to make some major improvements to Tika without
> breaking backwards compatibility.  We _might_ be able to do that with the
> outstanding issues for 2.0 when someone has time.
> >
> >   We could also do the upgrade to jdk 8 with Tika 2.0.
> >
> >   If this sounds reasonable, I propose creating a 1.x branch from trunk
> for 1.x maintenance and then reworking trunk to the 2.x structure that Bob
> Paulin so elegantly worked out.  I figure we can either copy/paste from
> trunk to the current 2.x (and _hope_ we get all the updates) or use Bob's
> 2.0 as a model for restructuring trunk.  At this point, I'd prefer the
> second option.  The key here is to switch "trunk" to 2.0 so that we all
> have the mindset that 2.0 is what we're focused on.
> >
> >The main benefit of this proposal is that we'd have a more modular
> Tika soon.
> >
> >What do you think?
> >
> >  Best,
> >
> >Tim
> >
>
>
> --

Best regards,
Konstantin Gribov




Re: Tika 2.0?

2017-08-28 Thread Konstantin Gribov
Tim,

+1 to restructuring master into the 2.x shape. If we can at least migrate
the modularization patches and dependency changes and move to Java 8, it
certainly will be a good step forward and a big reduction of technical debt.

On Mon, 28 Aug 2017, 16:52 Bob Paulin <b...@bobpaulin.com> wrote:

> Tim,
>
> +1 You've done an admirable job of dual maintenance but it sounds like
> it became a heavy tax on development.  Releasing would allow us to get
> back to "trunk" based development again.  Then we could focus on porting
> any missed patches and start looking for any regressions.  I also like
> the idea of picking up Java 8 as many other projects are starting to do
> this.
>
> - Bob
>
>
>
> On 8/28/2017 8:32 AM, Allison, Timothy B. wrote:
> > All,
> >
> >   We're getting some increasing deltas btwn the 2.0 and trunk branches.
> Many of these are my fault; I gave up making updates to 2.0 around
> April/May, I think.
> >
> >   What would people think of punting on some of the desired goals of 2.0
> (e.g. chaining parsers, more structured but still simple metadata) and
> releasing 2.0 soonish...say 2.0-BETA end of September?
> >
> >   We've been able to make some major improvements to Tika without
> breaking backwards compatibility.  We _might_ be able to do that with the
> outstanding issues for 2.0 when someone has time.
> >
> >   We could also do the upgrade to jdk 8 with Tika 2.0.
> >
> >   If this sounds reasonable, I propose creating a 1.x branch from trunk
> for 1.x maintenance and then reworking trunk to the 2.x structure that Bob
> Paulin so elegantly worked out.  I figure we can either copy/paste from
> trunk to the current 2.x (and _hope_ we get all the updates) or use Bob's
> 2.0 as a model for restructuring trunk.  At this point, I'd prefer the
> second option.  The key here is to switch "trunk" to 2.0 so that we all
> have the mindset that 2.0 is what we're focused on.
> >
> >The main benefit of this proposal is that we'd have a more modular
> Tika soon.
> >
> >What do you think?
> >
> >  Best,
> >
> >Tim
> >
>
>
> --

Best regards,
Konstantin Gribov


Re: Tika 2.0?

2017-08-28 Thread Bob Paulin
Tim,

+1 You've done an admirable job of dual maintenance but it sounds like
it became a heavy tax on development.  Releasing would allow us to get
back to "trunk" based development again.  Then we could focus on porting
any missed patches and start looking for any regressions.  I also like
the idea of picking up Java 8 as many other projects are starting to do
this.

- Bob



On 8/28/2017 8:32 AM, Allison, Timothy B. wrote:
> All,
>
>   We're getting some increasing deltas btwn the 2.0 and trunk branches.  Many 
> of these are my fault; I gave up making updates to 2.0 around April/May, I 
> think.
>
>   What would people think of punting on some of the desired goals of 2.0 
> (e.g. chaining parsers, more structured but still simple metadata) and 
> releasing 2.0 soonish...say 2.0-BETA end of September?
>
>   We've been able to make some major improvements to Tika without breaking 
> backwards compatibility.  We _might_ be able to do that with the outstanding 
> issues for 2.0 when someone has time.
>
>   We could also do the upgrade to jdk 8 with Tika 2.0.
>
>   If this sounds reasonable, I propose creating a 1.x branch from trunk for 
> 1.x maintenance and then reworking trunk to the 2.x structure that Bob Paulin 
> so elegantly worked out.  I figure we can either copy/paste from trunk to the 
> current 2.x (and _hope_ we get all the updates) or use Bob's 2.0 as a model 
> for restructuring trunk.  At this point, I'd prefer the second option.  The 
> key here is to switch "trunk" to 2.0 so that we all have the mindset that 2.0 
> is what we're focused on.
>
>The main benefit of this proposal is that we'd have a more modular Tika 
> soon.
>
>What do you think?
>
>  Best,
>
>Tim
>






Re: Tika 2.0?

2017-08-28 Thread Sergey Beryozkin

Hi Tim

Having a new major 2.0 master is a good idea IMHO. It will take time to 
make it final but it's better to finally make it 'mainstream' and start 
having new ideas realized or finalized...


Sergey
On 28/08/17 14:32, Allison, Timothy B. wrote:

All,

   We're getting some increasing deltas btwn the 2.0 and trunk branches.  Many 
of these are my fault; I gave up making updates to 2.0 around April/May, I 
think.

   What would people think of punting on some of the desired goals of 2.0 (e.g. 
chaining parsers, more structured but still simple metadata) and releasing 2.0 
soonish...say 2.0-BETA end of September?

   We've been able to make some major improvements to Tika without breaking 
backwards compatibility.  We _might_ be able to do that with the outstanding 
issues for 2.0 when someone has time.

   We could also do the upgrade to jdk 8 with Tika 2.0.

   If this sounds reasonable, I propose creating a 1.x branch from trunk for 1.x 
maintenance and then reworking trunk to the 2.x structure that Bob Paulin so elegantly 
worked out.  I figure we can either copy/paste from trunk to the current 2.x (and _hope_ 
we get all the updates) or use Bob's 2.0 as a model for restructuring trunk.  At this 
point, I'd prefer the second option.  The key here is to switch "trunk" to 2.0 
so that we all have the mindset that 2.0 is what we're focused on.

The main benefit of this proposal is that we'd have a more modular Tika 
soon.

What do you think?

  Best,

Tim



Tika 2.0?

2017-08-28 Thread Allison, Timothy B.
All,

  We're getting some increasing deltas btwn the 2.0 and trunk branches.  Many 
of these are my fault; I gave up making updates to 2.0 around April/May, I 
think.

  What would people think of punting on some of the desired goals of 2.0 (e.g. 
chaining parsers, more structured but still simple metadata) and releasing 2.0 
soonish...say 2.0-BETA end of September?

  We've been able to make some major improvements to Tika without breaking 
backwards compatibility.  We _might_ be able to do that with the outstanding 
issues for 2.0 when someone has time.

  We could also do the upgrade to jdk 8 with Tika 2.0.

  If this sounds reasonable, I propose creating a 1.x branch from trunk for 1.x 
maintenance and then reworking trunk to the 2.x structure that Bob Paulin so 
elegantly worked out.  I figure we can either copy/paste from trunk to the 
current 2.x (and _hope_ we get all the updates) or use Bob's 2.0 as a model for 
restructuring trunk.  At this point, I'd prefer the second option.  The key 
here is to switch "trunk" to 2.0 so that we all have the mindset that 2.0 is 
what we're focused on.

   The main benefit of this proposal is that we'd have a more modular Tika soon.

   What do you think?

 Best,

   Tim


[jira] [Updated] (TIKA-2096) Tika 2.0 -- Supply AutoDetectParser for embedded documents if user forgets to pass it in via ParseContext

2016-11-10 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2096:
--
Issue Type: Improvement  (was: Sub-task)
Parent: (was: TIKA-2085)

> Tika 2.0 -- Supply AutoDetectParser for embedded documents if user forgets to 
> pass it in via ParseContext
> -
>
> Key: TIKA-2096
> URL: https://issues.apache.org/jira/browse/TIKA-2096
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>
> Currently, if users don't specify a Parser.class or an 
> EmbeddedDocumentExtractor in the ParseContext, then embedded documents will 
> not be parsed.  I propose that we add an AutoDetectParser automatically if a 
> Parser or EmbeddedDocumentExtractor is not included in the ParseContext.
> If a user doesn't want to parse embedded objects, s/he could pass in an 
> EmptyParser for the Parser.class.
> In short, let's make the default be "parse everything", and the user has to 
> figure out how to parse only the container document if that's the desired 
> behavior.
> This is a breaking change.  I propose adding it to 2.0 only.
> We were bitten by this on tika-server (TIKA-1584).  Solr (SOLR-7189) has been 
> bitten by this. [Kite|https://github.com/kite-sdk/kite/issues/397] is still 
> suffering from this.
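
The default proposed in the description can be sketched roughly as below. This is a simplified, self-contained model and not Tika's implementation: `SketchContext`, `SketchParser`, `AUTO_DETECT`, `EMPTY` and `handleEmbedded` are hypothetical stand-ins for ParseContext, Parser, AutoDetectParser, EmptyParser and the embedded-document handling, and the string results are illustrative only.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative model only; none of these types are Tika's real classes. */
final class EmbeddedParserDefaultSketch {

    /** Hypothetical stand-in for a parser; returns a label for what it did. */
    interface SketchParser {
        String parse(String embeddedName);
    }

    /** Hypothetical class-keyed context, modeled on the idea of a ParseContext. */
    static final class SketchContext {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key) {
            // Class.cast returns null when nothing was registered for the key.
            return key.cast(entries.get(key));
        }
    }

    /** Stand-in for the "parse everything" default. */
    static final SketchParser AUTO_DETECT = name -> "parsed " + name;

    /** Stand-in for the explicit opt-out that ignores embedded documents. */
    static final SketchParser EMPTY = name -> "skipped " + name;

    /** Proposed behavior: fall back to the default when the caller set nothing. */
    static String handleEmbedded(SketchContext context, String embeddedName) {
        SketchParser parser = context.get(SketchParser.class);
        if (parser == null) {
            parser = AUTO_DETECT;
        }
        return parser.parse(embeddedName);
    }

    public static void main(String[] args) {
        SketchContext context = new SketchContext();
        // Nothing registered: the embedded document is still parsed.
        System.out.println(handleEmbedded(context, "attachment.pdf"));

        // Explicit opt-out: only the container would be parsed.
        context.set(SketchParser.class, EMPTY);
        System.out.println(handleEmbedded(context, "attachment.pdf"));
    }
}
```

In the sketch, callers who want container-only parsing have to opt out explicitly, which mirrors the "parse everything by default" behavior the description asks for.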



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-2096) Tika 2.0 -- Supply AutoDetectParser for embedded documents if user forgets to pass it in via ParseContext

2016-11-09 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15652854#comment-15652854
 ] 

Tim Allison commented on TIKA-2096:
---

We may want to accelerate this and put it into Tika 1.15.  I just found that 
the MailContentHandler was supplying an AutoDetectParser, but the others 
aren't.  On TIKA-2159, I removed this from the MailContentHandler.  Any 
objections, if we add this to all parsers now?

> Tika 2.0 -- Supply AutoDetectParser for embedded documents if user forgets to 
> pass it in via ParseContext
> -
>
> Key: TIKA-2096
> URL: https://issues.apache.org/jira/browse/TIKA-2096
> Project: Tika
>  Issue Type: Sub-task
>Reporter: Tim Allison
>
> Currently, if users don't specify a Parser.class or an 
> EmbeddedDocumentExtractor in the ParseContext, then embedded documents will 
> not be parsed.  I propose that we add an AutoDetectParser automatically if a 
> Parser or EmbeddedDocumentExtractor is not included in the ParseContext.
> If a user doesn't want to parse embedded objects, s/he could pass in an 
> EmptyParser for the Parser.class.
> In short, let's make the default be "parse everything", and the user has to 
> figure out how to parse only the container document if that's the desired 
> behavior.
> This is a breaking change.  I propose adding it to 2.0 only.
> We were bitten by this on tika-server (TIKA-1584).  Solr (SOLR-7189) has been 
> bitten by this. [Kite|https://github.com/kite-sdk/kite/issues/397] is still 
> suffering from this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1974) Tika 2.0 - remove deprecated metadata properties

2016-09-26 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15523845#comment-15523845
 ] 

Tim Allison commented on TIKA-1974:
---

I'm starting to work on this a bit.  For metadata items that map directly to 
Dublin Core, do we want to have copies of them in TikaCoreProperties, e.g.:

{noformat}
/**
 * @see DublinCore#FORMAT
 */
public static final Property FORMAT = DublinCore.FORMAT;

/**
 * @see DublinCore#IDENTIFIER
 */
public static final Property IDENTIFIER = DublinCore.IDENTIFIER;
{noformat}

Or, should we delete these in TikaCoreProperties and just use DC?

> Tika 2.0 - remove deprecated metadata properties
> 
>
> Key: TIKA-1974
> URL: https://issues.apache.org/jira/browse/TIKA-1974
> Project: Tika
>  Issue Type: Sub-task
>Reporter: Tim Allison
>
> We have quite a few metadata properties that are deprecated.  We should 
> remove them for 2.0.





[jira] [Updated] (TIKA-2096) Tika 2.0 -- Supply AutoDetectParser for embedded documents if user forgets to pass it in via ParseContext

2016-09-26 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2096:
--
Issue Type: Sub-task  (was: Improvement)
Parent: TIKA-2085

> Tika 2.0 -- Supply AutoDetectParser for embedded documents if user forgets to 
> pass it in via ParseContext
> -
>
> Key: TIKA-2096
> URL: https://issues.apache.org/jira/browse/TIKA-2096
> Project: Tika
>  Issue Type: Sub-task
>Reporter: Tim Allison
>
> Currently, if users don't specify a Parser.class or an 
> EmbeddedDocumentExtractor in the ParseContext, then embedded documents will 
> not be parsed.  I propose that we add an AutoDetectParser automatically if a 
> Parser or EmbeddedDocumentExtractor is not included in the ParseContext.
> If a user doesn't want to parse embedded objects, s/he could pass in an 
> EmptyParser for the Parser.class.
> In short, let's make the default be "parse everything", and the user has to 
> figure out how to parse only the container document if that's the desired 
> behavior.
> This is a breaking change.  I propose adding it to 2.0 only.
> We were bitten by this on tika-server (TIKA-1584).  Solr (SOLR-7189) has been 
> bitten by this. [Kite|https://github.com/kite-sdk/kite/issues/397] is still 
> suffering from this.





[jira] [Created] (TIKA-2096) Tika 2.0 -- Supply AutoDetectParser for embedded documents if user forgets to pass it in via ParseContext

2016-09-26 Thread Tim Allison (JIRA)
Tim Allison created TIKA-2096:
-

 Summary: Tika 2.0 -- Supply AutoDetectParser for embedded 
documents if user forgets to pass it in via ParseContext
 Key: TIKA-2096
 URL: https://issues.apache.org/jira/browse/TIKA-2096
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison


Currently, if users don't specify a Parser.class or an 
EmbeddedDocumentExtractor in the ParseContext, then embedded documents will not 
be parsed.  I propose that we add an AutoDetectParser automatically if a Parser 
or EmbeddedDocumentExtractor is not included in the ParseContext.

If a user doesn't want to parse embedded objects, s/he could pass in an 
EmptyParser for the Parser.class.

In short, let's make the default be "parse everything", and the user has to 
figure out how to parse only the container document if that's the desired 
behavior.

This is a breaking change.  I propose adding it to 2.0 only.

We were bitten by this on tika-server (TIKA-1584).  Solr (SOLR-7189) has been 
bitten by this. [Kite|https://github.com/kite-sdk/kite/issues/397] is still 
suffering from this.
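
The proposed default can be pictured with a small, JDK-only model. This is a sketch under stated assumptions, not Tika code: `MiniParser`, `MiniContext`, and `EmbeddedDefaults` are hypothetical stand-ins for a parser, a type-keyed parse context, and the lookup rule described above. It only shows the idea that a missing entry falls back to auto-detection, while an explicitly registered empty parser opts out of parsing embedded documents.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a parser; not Tika's Parser interface. */
interface MiniParser {
    String parse(String input);
}

/** Hypothetical, simplified type-keyed context (an assumption made for this sketch). */
final class MiniContext {
    private final Map<Class<?>, Object> entries = new HashMap<>();

    <T> void set(Class<T> key, T value) {
        entries.put(key, value);
    }

    /** Returns the registered value for the key, or the given default if none was set. */
    <T> T get(Class<T> key, T defaultValue) {
        Object value = entries.get(key);
        return value == null ? defaultValue : key.cast(value);
    }
}

final class EmbeddedDefaults {
    /** Stand-in for an auto-detecting parser that handles embedded documents. */
    static final MiniParser AUTO_DETECT = input -> "parsed:" + input;

    /** Stand-in for an empty parser that deliberately extracts nothing. */
    static final MiniParser EMPTY = input -> "";

    /** Proposed rule: use auto-detection unless the caller registered a parser. */
    static MiniParser embeddedParserFor(MiniContext context) {
        return context.get(MiniParser.class, AUTO_DETECT);
    }
}
```

In this model, a caller who wants only the container document would call `context.set(MiniParser.class, EmbeddedDefaults.EMPTY)` before parsing; everyone else gets "parse everything" without extra setup.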





Re: Plans for the first Tika 2.0 release

2016-09-21 Thread Mattmann, Chris A (3980)
NLP/NER is as high a priority to me as the OCR stuff... we have a whole meta 
framework for doing NER/NLP with NERRecogniser and really cool Tensorflow and 
other stuff.
Hoping 2.0 can help solve this! ☺

++
Chris Mattmann, Ph.D.
Chief Architect, Instrument Software and Science Data Systems Section (398)
Manager, Open Source Projects Formulation and Development Office (8212)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++
 


On 9/21/16, 7:40 AM, "Nick Burch"  wrote:

On Mon, 19 Sep 2016, Bob Paulin wrote:
> I think it's a good thing to discuss.  I know there are other features 
> that are targeted for 2.0.  Do we have a general sense of where those 
> features are at?

I think the big one we need to crack is allowing multiple parsers to run 
against a file. OCR is probably the most critical of these from the 
modularisation perspective, with all those nasty interlinkings between the 
parsers to allow the manual delegation. If we can crack the problem of 
multiple parsers, those proxy issues should go away (or at least get 
better!)

As a bonus, it ought to also improve things for error cases (fallback 
parsers etc), but for your needs, the simplification for "ocr + image 
metadata" is likely your biggest win!

(I think it might also let us tidy up some of the enhancement parsers too, 
like how the NLP stuff fits into the parsing framework)

Nick





Re: Plans for the first Tika 2.0 release

2016-09-21 Thread Nick Burch

On Mon, 19 Sep 2016, Bob Paulin wrote:
> I think it's a good thing to discuss.  I know there are other features 
> that are targeted for 2.0.  Do we have a general sense of where those 
> features are at?


I think the big one we need to crack is allowing multiple parsers to run 
against a file. OCR is probably the most critical of these from the 
modularisation perspective, with all those nasty interlinkings between the 
parsers to allow the manual delegation. If we can crack the problem of 
multiple parsers, those proxy issues should go away (or at least get 
better!)


As a bonus, it ought to also improve things for error cases (fallback 
parsers etc), but for your needs, the simplification for "ocr + image 
metadata" is likely your biggest win!


(I think it might also let us tidy up some of the enhancement parsers too, 
like how the NLP stuff fits into the parsing framework)


Nick


Re: Plans for the first Tika 2.0 release

2016-09-19 Thread Bob Paulin

I think that could work!  I've also created a custom filter that might help:

https://issues.apache.org/jira/browse/TIKA-2083?filter=12338448

Logic is as follows:

project = TIKA AND affectedVersion = 2.0 AND priority >= Blocker AND 
status != Closed AND status != Fixed



- Bob


On 9/19/2016 1:40 PM, Allison, Timothy B. wrote:

>> Should we create a tika-2_0-blocker label to differentiate from regular 
>> "blockers"?
>
> How about a single master issue: TIKA-2085.
>
> What else do we need to add?




RE: Plans for the first Tika 2.0 release

2016-09-19 Thread Allison, Timothy B.
> Should we create a tika-2_0-blocker label to differentiate from regular 
> "blockers"?

How about a single master issue: TIKA-2085.

What else do we need to add?


[jira] [Updated] (TIKA-1974) Tika 2.0 - remove deprecated metadata properties

2016-09-19 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1974:
--
Issue Type: Sub-task  (was: Task)
Parent: TIKA-2085

> Tika 2.0 - remove deprecated metadata properties
> 
>
> Key: TIKA-1974
> URL: https://issues.apache.org/jira/browse/TIKA-1974
> Project: Tika
>  Issue Type: Sub-task
>Reporter: Tim Allison
>
> We have quite a few metadata properties that are deprecated.  We should 
> remove them for 2.0.





[jira] [Updated] (TIKA-2083) Tika 2.0 - Audit master branch against 2.x branch

2016-09-19 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2083:
--
Issue Type: Sub-task  (was: Task)
Parent: TIKA-2085

> Tika 2.0 - Audit master branch against 2.x branch
> -
>
> Key: TIKA-2083
> URL: https://issues.apache.org/jira/browse/TIKA-2083
> Project: Tika
>  Issue Type: Sub-task
>Affects Versions: 2.0
>Reporter: Bob Paulin
>Assignee: Bob Paulin
>Priority: Blocker
> Fix For: 2.0
>
>
> At this point Tika has been doing parallel development on master and the 2.x 
> branch for about 9 months.  We should audit commit logs for that time to make 
> a best effort to identify any commits that may not have been applied in 2.x.  
> This task should be done prior to the 2.0 release.





[jira] [Created] (TIKA-2085) Tika 2.0 -- Overarching task list for what we need to do before 2.0

2016-09-19 Thread Tim Allison (JIRA)
Tim Allison created TIKA-2085:
-

 Summary: Tika 2.0 -- Overarching task list for what we need to do 
before 2.0
 Key: TIKA-2085
 URL: https://issues.apache.org/jira/browse/TIKA-2085
 Project: Tika
  Issue Type: Task
Reporter: Tim Allison


Let's use this issue to track issues that absolutely, positively have to be 
completed before we release Tika 2.0.





RE: Plans for the first Tika 2.0 release

2016-09-19 Thread Allison, Timothy B.
>> 1) Implement various strategies for chaining multiple parsers against 
>> individual files.  Much of this has been implemented, but what's holding us 
>> up on this one (I think?) is a resettable outputstream.
>I think we need a JIRA for this.  Is there any existing design ideas on how 
>this would be achieved?
Opened TIKA-2084 as subtask of TIKA-1509

> 2) Rich metadata (TIKA-1607)
This is great.  I think we need to ensure we have JIRAs for all the features we 
consider blockers and label them as such.  This looks like there's a lot of 
good discussion.  It also references TIKA-1903 so is that also a Tika 2.0 
blocker?
TIKA-1903 is not a blocker on 2.0, and may be obviated by TIKA-1607.

>> 1) Get rid of old metadata tags in favor of "new" Dublin core
>Need JIRA?
Sorry, opened a good while ago: TIKA-1974

> If we can't get a date we should at least try to eliminate the ???. I think 
> we need to close down the feature set.
Y, completely agree.

Should we create a tika-2_0-blocker label to differentiate from regular 
"blockers"?


[jira] [Created] (TIKA-2083) Tika 2.0 - Audit master branch against 2.x branch

2016-09-19 Thread Bob Paulin (JIRA)
Bob Paulin created TIKA-2083:


 Summary: Tika 2.0 - Audit master branch against 2.x branch
 Key: TIKA-2083
 URL: https://issues.apache.org/jira/browse/TIKA-2083
 Project: Tika
  Issue Type: Task
Affects Versions: 2.0
Reporter: Bob Paulin
Assignee: Bob Paulin
Priority: Blocker
 Fix For: 2.0


At this point Tika has been doing parallel development on master and the 2.x 
branch for about 9 months.  We should audit commit logs for that time to make 
a best effort to identify any commits that may not have been applied in 2.x.  
This task should be done prior to the 2.0 release.





Re: Plans for the first Tika 2.0 release

2016-09-19 Thread Bob Paulin

Thanks Tim!  Replies in line.

- Bob
On 9/19/2016 12:33 PM, Allison, Timothy B. wrote:

> Bob,
>    As always, thank you for driving 2.0!
>
>> My concern is we have been dual maintaining 2 branches for about 9 months.  I 
>> think the longer we do this the more risk there is that we miss something.
>
> Agreed.  I think we're already missing a few things.

Yikes, is there a way we can audit what we might have missed? Perhaps we 
need a JIRA to do an audit of the commits in master and make a best effort 
to identify what might have been missed?  I can create the JIRA for this.

>> Would it make sense to at least put a date out there for a feature cut off?
>
> I'd be hesitant to do this.  To my mind, the key is the actual features and 
> devs who have time to implement them.

OK, this is a start to understanding what the blocking features are. The key 
will be creating concrete JIRAs for them and identifying where we are at.

> For me, the blocking new features are:
>
> 1) Implement various strategies for chaining multiple parsers against 
> individual files.  Much of this has been implemented, but what's holding us up 
> on this one (I think?) is a resettable outputstream.

I think we need a JIRA for this.  Are there any existing design ideas on 
how this would be achieved?

> 2) Rich metadata (TIKA-1607)

This is great.  I think we need to ensure we have JIRAs for all the 
features we consider blockers and label them as such.  This looks like 
there's a lot of good discussion.  It also references TIKA-1903 so is 
that also a Tika 2.0 blocker?

> The blocking tasks:
> 1) Get rid of old metadata tags in favor of "new" Dublin core

Need JIRA?

> 2) ???

If we can't get a date we should at least try to eliminate the ???. I 
think we need to close down the feature set.

> I'm full up on other stuff at the moment, perhaps after we get 1.14 out, I can 
> turn to 2.0-specific development.
>
> What else do we have to do? Anyone else have some time?

Yes please, it would be great to see if there are people that want to own 
work on the above features.  Once we have JIRAs we can post to the 
Apache Help Wanted page as well.

Thanks!

> Cheers,
>
> Tim

-Original Message-
From: Bob Paulin [mailto:b...@bobpaulin.com]
Sent: Monday, September 19, 2016 10:32 AM
To: dev@tika.apache.org
Subject: Re: Plans for the first Tika 2.0 release

Hi,

I think it's a good thing to discuss.  I know there are other features that are 
targeted for 2.0.  Do we have a general sense of where those features are at?  
My concern is we have been dual maintaining 2 branches for about 9 months.  I 
think the longer we do this the more risk there is that we miss something.  
Would it make sense to at least put a date
out there for a feature cut off?   There's always 3.0 if things are not
close to being ready.


- Bob






RE: Plans for the first Tika 2.0 release

2016-09-19 Thread Allison, Timothy B.
Bob,
  As always, thank you for driving 2.0!

> My concern is we have been dual maintaining 2 branches for about 9 months.  I 
> think the longer we do this the more risk there is that we miss something.  

Agreed.  I think we're already missing a few things.

> Would it make sense to at least put a date out there for a feature cut off?

I'd be hesitant to do this.  To my mind, the key is the actual features and 
devs who have time to implement them.

For me, the blocking new features are:

1) Implement various strategies for chaining multiple parsers against 
individual files.  Much of this has been implemented, but what's holding us up 
on this one (I think?) is a resettable outputstream.

2) Rich metadata (TIKA-1607)
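
To illustrate why chaining parsers runs into the output problem mentioned in item 1, here is a minimal JDK-only sketch. It is not Tika's design: `SketchParser` and `FallbackChain` are hypothetical names, and buffering each attempt in memory is a technique swapped in for this example rather than the "resettable outputstream" itself. The point is only that a parser which fails halfway must not leave partial output behind before the next parser is tried.

```java
import java.util.List;

/** Hypothetical parser abstraction: appends extracted text to the sink or throws on failure. */
interface SketchParser {
    void parse(String document, StringBuilder sink) throws Exception;
}

final class FallbackChain {
    /**
     * Tries each parser in order. Every attempt writes into a fresh scratch buffer,
     * so partial output from a failed parser is discarded instead of leaking into
     * the final result.
     */
    static String parseWithFallback(String document, List<SketchParser> parsers) {
        Exception lastFailure = null;
        for (SketchParser parser : parsers) {
            StringBuilder scratch = new StringBuilder();
            try {
                parser.parse(document, scratch);
                return scratch.toString(); // first successful parser wins
            } catch (Exception e) {
                lastFailure = e; // drop the scratch buffer and try the next parser
            }
        }
        throw new IllegalStateException("all parsers failed", lastFailure);
    }
}
```

Buffering everything in memory is the simplest way to get "reset" semantics, but it may not scale to large documents, which is presumably part of why a dedicated design was requested.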

The blocking tasks:
1) Get rid of old metadata tags in favor of "new" Dublin core
2) ???

I'm full up on other stuff at the moment, perhaps after we get 1.14 out, I can 
turn to 2.0-specific development.

What else do we have to do? Anyone else have some time?

Cheers,

   Tim

-Original Message-
From: Bob Paulin [mailto:b...@bobpaulin.com] 
Sent: Monday, September 19, 2016 10:32 AM
To: dev@tika.apache.org
Subject: Re: Plans for the first Tika 2.0 release

Hi,

I think it's a good thing to discuss.  I know there are other features that are 
targeted for 2.0.  Do we have a general sense of where those features are at?  
My concern is we have been dual maintaining 2 branches for about 9 months.  I 
think the longer we do this the more risk there is that we miss something.  
Would it make sense to at least put a date 
out there for a feature cut off?   There's always 3.0 if things are not 
close to being ready.


- Bob




Re: Plans for the first Tika 2.0 release

2016-09-19 Thread Bob Paulin

Hi,

I think it's a good thing to discuss.  I know there are other features 
that are targeted for 2.0.  Do we have a general sense of where those 
features are at?  My concern is we have been dual maintaining 2 branches 
for about 9 months.  I think the longer we do this the more risk there 
is that we miss something.  Would it make sense to at least put a date 
out there for a feature cut off?   There's always 3.0 if things are not 
close to being ready.



- Bob


On 9/19/2016 4:32 AM, Sergey Beryozkin wrote:

Hi All

Back in May I updated one of our CXF demos on the master 3.2 branch to 
depend on Tika 2.0 SNAPSHOT to verify the new module system works well.
It is feasible that CXF 3.2.0 may be released by the end of the year 
or early next year.
As far as Tika 2.0 dependencies are concerned it will be easy for me 
to update the demo to temporarily depend on Tika 1.13 or 1.14. But if 
Tika 2.0 is released by the time CXF 3.2 is about to be released then 
I'll be happy to keep 2.0 deps.

Are there any plans to get Tika 2.0 out in the next few months ?

Cheers, Sergey








Plans for the first Tika 2.0 release

2016-09-19 Thread Sergey Beryozkin

Hi All

Back in May I updated one of our CXF demos on the master 3.2 branch to 
depend on Tika 2.0 SNAPSHOT to verify the new module system works well.
It is feasible that CXF 3.2.0 may be released by the end of the year or 
early next year.
As far as Tika 2.0 dependencies are concerned it will be easy for me to 
update the demo to temporarily depend on Tika 1.13 or 1.14. But if Tika 
2.0 is released by the time CXF 3.2 is about to be released then I'll be 
happy to keep 2.0 deps.

Are there any plans to get Tika 2.0 out in the next few months ?

Cheers, Sergey





Re: PDF with embedded attachments and Tika 2.0 modularity

2016-09-16 Thread Sergey Beryozkin
Sergey, your point is well taken.

Y, you'd need most parsers, but you can _probably_ live without
advanced or scientific (sorry, Chris!).

I'd be hesitant to change the structure much.  We should definitely
document this well, though!

-Original Message-
From: Sergey Beryozkin [mailto:sberyoz...@gmail.com]
Sent: Thursday, September 15, 2016 12:15 PM
To: dev@tika.apache.org
Subject: PDF with embedded attachments and Tika 2.0 modularity

Hi All

As Tim educated me, PDF (and indeed other formats) may have all sorts
of embedded attachments.

In my demo I've been working with Tika 2.0-SNAPSHOT which offers a
nice option for users to pick up only individual parsers. So I've
added PDFParser & OpenDocumentParser and tika-core to the project
dependencies and all works very nice when I submit to the demo a
simple PDF.

But if I were to write the code which can handle the embedded
attachments really well then I think I'll probably need to revert to
depending on all of tika-parsers - otherwise how would I know which
additional parser modules I should add ? If this reasoning is right
then one can only use individual modules in the production if it is
well-known the files to be processed will have no unexpected formats
embedded in them...

I've been wondering - would it make sense, for Tika 2.0, add few more
'helper' modules for most used formats, which would offer less than
tika-parsers but more than individual modules, for example:

this is what 2.x already has:


tika-parser-modules/
tika-parser-pdf-module
(individual parser modules for the most used ones)

tika-parsers
(all of the parsers)

and now add:

tika-parser-pdf-module-all
(or similarly named)

this

tika-parser-pdf-module-all

will depend on tika-parser-pdf-module plus the parsers which will be
needed to process various PDF attachments ? This list of the extra
deps will be based on the accumulated knowledge. Similarly for few
other most used formats

tika-parser-pdf-module-all will be a 'compromise', it will pull more
modules than tika-parser-pdf-module but significantly less than
tika-parsers


Cheers, Sergey












--
Sergey Beryozkin

Talend Community Coders
http://coders.talend.com/


Re: PDF with embedded attachments and Tika 2.0 modularity

2016-09-16 Thread Bob Paulin

Hi Sergey,


On 9/15/2016 3:33 PM, Sergey Beryozkin wrote:

> Hi Bob, Tim, All,
> On 15/09/16 18:06, Bob Paulin wrote:
>
>> Hi Sergey,
>>
>> I definitely get the challenges.  In fact recently we merged the PDF
>> module into the Multimedia module due to the tight coupling around the
>> TesseractOCR[1] [2].  We could look into separating the PDF parser out
>> again but I'm a bit short on a simple way to do it with TesseractOCR in
>> play.  Like Tim I'm hesitant to change structure but we definitely need
>> to address how we handle embedded parsers.  I've done some work with the
>> ParserProxy class to remove some of the hard dependencies between
>> parsers.  With that we only pull in the parsers available on the class
>> path.  There's an example in the JackcessExtractor class in the office
>> module.
>>
>> What is the motivation behind excluding the other parsers in your
>> usecase?  Smaller footprint?  Incompatibility?  Performance?
>>
>> Depending on the driver there may be other ways to get you to a
>> similar place.
>>
>> Smaller footprint
>
> This is the one, it is not a big deal to have all of tika-parsers 
> included in my demo, but I've been curious how the smaller footprint 
> can indeed be achieved in Tika 2.x given it already does the best 
> effort at supporting more modular Tika applications...

Totally makes sense.  I think you'll end up getting most of what you 
need by just pulling in the tika-parser-multimedia-module.  It's already 
got all the image parsers for embedded images and TesseractOCR so you 
can take your demo as far as reading all the images and converting some 
of the images to text if you have Tesseract installed.

>> You could just include the modules you need and any embedded parsers
>> from other modules could be added via a ParserProxy.  This might not
>> remove all the parsers you don't need but might be a good start.
>
> I haven't heard of ParserProxy yet, sorry :-). As a Tika user I'm just 
> learning. How would one use ParserProxy to minimize the dependencies ?
>
> Just found
> https://issues.apache.org/jira/browse/TIKA-1904

Sorry I took you for a Tika veteran based on your concerns for embedded 
parsers!  The ParserProxy is new in 2.x and you would actually not need to 
worry about it for coding your demo or a client application.  It's more 
for the framework, to allow the modules to compile without parsers from 
other modules on the classpath.  It pulls them in via reflection at 
runtime or, if they are not present, falls back to a no-op.

>> The most trimmed down way is what you've provided below in your example
>> creating a tika-parser-pdf-module-all.  I'm concerned about the number
>> of combinations we might end up creating.
>
> Sure, if such an option would ever be considered then I'd imagine 
> there would have to be a limit set. Ex, 5 most widely used formats 
> which may have embedded attachments would have an extra module support 
> (core parser like PDF parser plus the support parsers for the embedded 
> attachments).

I agree that a limit would be needed.  Would it make sense to hold off on 
including them in Tika for now and see if some popular combinations 
emerge?  Your demo is a great first step to get some feedback; I think 
we need more in order to ensure we're making the correct combinations.

> But I'm OK with selecting the individual parser modules that may be 
> needed to have a nearly complete PDF parsing coverage, as long as I 
> know which modules I have to select :-)

Yes, let's start with the multimedia module.  I think you'll get quite a 
bit of cool things within that.  Tim, do you know of any other modules 
that would make sense?

>> Incompatibility
>>
>> You might want to look at the tika-parser-bundle projects since putting
>> the modules in an OSGi container will allow you to isolate the 
>> classloaders.
>>
>> Performance
>>
>> A combination of the above or you might look to include a
>> tika-config.xml and just exclude the parsers you don't want. That
>> should prevent them from being a part of your pipeline.
>>
>> Other ideas on this?  I think it's an important thing to discuss.
>
> Many thanks, Sergey


Thank you for the feedback!


- Bob

[1] http://markmail.org/message/e4ncuid7zrvlitp5

[2] https://issues.apache.org/jira/browse/TIKA-2059


On 9/15/2016 11:20 AM, Allison, Timothy B. wrote:

Sergey, your point is well taken.

Y, you'd need most parsers, but you can _probably_ live without
advanced or scientific (sorry, Chris!).

I'd be hesitant to change the structure much.  We should definitely
document this well, though!

-Original Message-
From: Sergey Beryozkin [mailto:sberyoz...@gmail.com]
Sent: Thursday, September 15, 2016 12:15 PM
To: dev@tika.apache.org
Subject: PDF with embedded attachments and Tika 2.0 modularity

Hi All

As Tim educated me, PDF (and indeed other formats) may have all sorts
of embedded attachments.

In my demo I've been working with Tika 2.0-SNAPSHOT which offers a
nice option for users to pick up only individual parsers. So I've
added PDFParser & OpenDocumentParser and tika-core to the project
dependencies and all works very nice when I su

Re: PDF with embedded attachments and Tika 2.0 modularity

2016-09-15 Thread Bob Paulin

Hi Sergey,

I definitely get the challenges.  In fact recently we merged the PDF 
module into the Multimedia module due to the tight coupling around the 
TesseractOCR[1] [2].  We could look into separating the PDF parser out 
again but I'm a bit short on a simple way to do it with TesseractOCR in 
play.  Like Tim I'm hesitant to change structure but we definitely need 
to address how we handle embedded parsers.  I've done some work with the 
ParserProxy class to remove some of the hard dependencies between 
parsers.  With that we only pull in the parsers available on the class 
path.  There's an example in the JackcessExtractor class in the office module.
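
The general pattern behind such a proxy (look a class up by name at runtime and degrade gracefully when it is absent) can be sketched with the JDK alone. This is not the ParserProxy source: `OptionalParserLoader` and `UpperCaseParser` are hypothetical names, and `Function<String, String>` stands in for a parser interface purely for illustration.

```java
import java.util.function.Function;

final class OptionalParserLoader {
    /** No-op fallback used when the requested class is not available. */
    static final Function<String, String> NO_OP = input -> "";

    /**
     * Loads an implementation by class name via reflection. Returns NO_OP if the
     * class is missing, cannot be instantiated, or is not a Function at all.
     */
    @SuppressWarnings("unchecked")
    static Function<String, String> load(String className) {
        try {
            Class<?> type = Class.forName(className);
            Object instance = type.getDeclaredConstructor().newInstance();
            if (!(instance instanceof Function)) {
                return NO_OP;
            }
            return (Function<String, String>) instance;
        } catch (ReflectiveOperationException e) {
            // Class not on the classpath, no usable constructor, etc.
            return NO_OP;
        }
    }
}

/** Example implementation that happens to be available at runtime. */
final class UpperCaseParser implements Function<String, String> {
    public UpperCaseParser() {
    }

    @Override
    public String apply(String input) {
        return input.toUpperCase();
    }
}
```

Because the caller never references the optional class at compile time, the calling module can build without it, which matches the motivation given above for removing hard dependencies between parsers.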


What is the motivation behind excluding the other parsers in your 
usecase?  Smaller footprint?  Incompatibility?  Performance?


Depending on the driver there may be other ways to get you to a 
similar place.


Smaller footprint

You could just include the modules you need and any embedded parsers 
from other modules could be added via a ParserProxy.  This might not 
remove all the parsers you don't need but might be a good start.  The 
most trimmed down way is what you've provided below in your example 
creating a tika-parser-pdf-module-all.  I'm concerned about the number 
of combinations we might end up creating.


Incompatibility

You might want to look at the tika-parser-bundle projects since putting 
the modules in an OSGi container will allow you to isolate the classloaders.


Performance

A combination of the above or you might look to include a 
tika-config.xml and just exclude the parsers you don't want.  That 
should prevent them from being a part of your pipeline.


Other ideas on this?  I think it's an important thing to discuss.


- Bob

[1] http://markmail.org/message/e4ncuid7zrvlitp5

[2] https://issues.apache.org/jira/browse/TIKA-2059


On 9/15/2016 11:20 AM, Allison, Timothy B. wrote:

Sergey, your point is well taken.

Y, you'd need most parsers, but you can _probably_ live without advanced or 
scientific (sorry, Chris!).

I'd be hesitant to change the structure much.  We should definitely document 
this well, though!

-Original Message-
From: Sergey Beryozkin [mailto:sberyoz...@gmail.com]
Sent: Thursday, September 15, 2016 12:15 PM
To: dev@tika.apache.org
Subject: PDF with embedded attachments and Tika 2.0 modularity

Hi All

As Tim educated me, PDF (and indeed other formats) may have all sorts of 
embedded attachments.

In my demo I've been working with Tika 2.0-SNAPSHOT which offers a nice option for 
users to pick up only individual parsers. So I've added PDFParser & 
OpenDocumentParser and tika-core to the project dependencies and all works very 
nice when I submit to the demo a simple PDF.

But if I were to write the code which can handle the embedded attachments 
really well then I think I'll probably need to revert to depending on all of 
tika-parsers - otherwise how would I know which additional parser modules I 
should add ? If this reasoning is right then one can only use individual 
modules in the production if it is well-known the files to be processed will 
have no unexpected formats embedded in them...

I've been wondering - would it make sense, for Tika 2.0, add few more 'helper' 
modules for most used formats, which would offer less than tika-parsers but 
more than individual modules, for example:

this is what 2.x already has:


tika-parser-modules/
tika-parser-pdf-module
(individual parser modules for the most used ones)

tika-parsers
(all of the parsers)

and now add:

tika-parser-pdf-module-all
(or similarly named)

this

tika-parser-pdf-module-all

will depend on tika-parser-pdf-module plus the parsers which will be needed to 
process various PDF attachments ? This list of the extra deps will be based on 
the accumulated knowledge. Similarly for few other most used formats

tika-parser-pdf-module-all will be a 'compromise', it will pull more modules 
than tika-parser-pdf-module but significantly less than tika-parsers


Cheers, Sergey





