[jira] [Commented] (TIKA-3560) Tika 2.0 (and 2.1) parses doc with less fidelity than using 1.24.1
[ https://issues.apache.org/jira/browse/TIKA-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17425171#comment-17425171 ]

Josh Burchard commented on TIKA-3560:
-------------------------------------

Thank you, Tim. The wiki looks good so far and I appreciate you creating it. I'll let you know if I see any issues moving forward.

> Tika 2.0 (and 2.1) parses doc with less fidelity than using 1.24.1
> ------------------------------------------------------------------
>
>                 Key: TIKA-3560
>                 URL: https://issues.apache.org/jira/browse/TIKA-3560
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 2.0.0, 2.1.0
>         Environment: Windows 10
>            Reporter: Josh Burchard
>            Priority: Major
>         Attachments: Capture.jpg
>
>
> I'm parsing an old .doc file and I'm sending my request to the /rmeta/text
> endpoint. I see that some metadata fields that were returned to me from Tika
> 1.24.1 are no longer returned in 2.0 and above.
> I diffed the output between 1.24.1, 2.0 and 2.1. Please see the screenshot
> I've attached.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Commented] (TIKA-3560) Tika 2.0 (and 2.1) parses doc with less fidelity than using 1.24.1
[ https://issues.apache.org/jira/browse/TIKA-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17424178#comment-17424178 ]

Tim Allison commented on TIKA-3560:
-----------------------------------

I updated the metadata section in our wiki page "migrating to tika 2.x" today.

I looked into subject, and it looks like we were putting "keywords" into subject in 1.x as well as into keywords. We've kept that behavior in 2.x. I'm not sure why there's an array in 2.x but not in 1.x. Those should be the same.

In 2.1.1-SNAPSHOT, I added empty checks for subject, keywords, title and other keys in the MSOffice parsers. Those parsers used to allow an empty string for string-based metadata values.
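For consumers still on 2.0 or 2.1, the effect of the empty checks mentioned above can be approximated on the client side. The following is only a sketch in Python, not the code in Tika's MSOffice parsers; the function name `drop_empty_values` and the sample record in the usage note are hypothetical.

```python
def drop_empty_values(metadata):
    """Return a copy of a metadata dict without empty or whitespace-only strings.

    A value may be a single string or a list of strings. Empty strings are
    removed from lists, and a key left with no values is dropped entirely.
    Non-string values are passed through unchanged.
    """
    def is_blank(item):
        return isinstance(item, str) and not item.strip()

    cleaned = {}
    for key, value in metadata.items():
        if isinstance(value, list):
            kept = [item for item in value if not is_blank(item)]
            if kept:
                cleaned[key] = kept
        elif value is not None and not is_blank(value):
            cleaned[key] = value
    return cleaned
```

For example, a made-up record such as `{"dc:title": "", "dc:subject": ["", "budget"]}` would come back with only `dc:subject`, holding the single value `"budget"`.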
[jira] [Resolved] (TIKA-3560) Tika 2.0 (and 2.1) parses doc with less fidelity than using 1.24.1
[ https://issues.apache.org/jira/browse/TIKA-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison resolved TIKA-3560.
-------------------------------
    Resolution: Fixed

Please reopen if there are any surprises and/or if there's anything I can do on our wiki to improve the documentation in migrating to 2.x.
[jira] [Commented] (TIKA-3560) Tika 2.0 (and 2.1) parses doc with less fidelity than using 1.24.1
[ https://issues.apache.org/jira/browse/TIKA-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17422697#comment-17422697 ]

Tim Allison commented on TIKA-3560:
-----------------------------------

I'm sorry for my delay. I should have a chance to look at dc:subject.

I definitely should document the major changes. Where should I do that? https://cwiki.apache.org/confluence/display/TIKA/Migrating+to+Tika+2.0.0 ?
[jira] [Commented] (TIKA-3560) Tika 2.0 (and 2.1) parses doc with less fidelity than using 1.24.1
[ https://issues.apache.org/jira/browse/TIKA-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17419473#comment-17419473 ]

Josh Burchard commented on TIKA-3560:
-------------------------------------

Thank you for all the comments, Tim.

Author is one that we were using in our application and that's what first got my attention. OK, so it's now _only_ dc:creator. That's fine. I guess I just have a bunch of code adjustments to make on our end as the consumer. ;)

Just two more questions:
# Is there any cross-reference doc that was used during the task to slim down these duplicated attributes?
# dc:subject looks like it's now an array where in 1.24.1 it was just a simple string. Is that intentional?

Feel free to close this as a non-issue. Thanks again.
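The two consumer-side adjustments raised in this comment (reading the author from dc:creator, and coping with dc:subject arriving as either a string or an array) could be handled along these lines. This is a sketch in Python under the assumption that a metadata record has already been decoded from JSON into a dict; the helper names `as_list` and `first_value` are hypothetical and not part of Tika.

```python
def as_list(value):
    """Normalize a metadata value to a list.

    A missing value becomes an empty list, a single value becomes a
    one-element list, and a list is copied as-is.
    """
    if value is None:
        return []
    if isinstance(value, list):
        return list(value)
    return [value]


def first_value(record, *keys):
    """Return the first value found under any of the given keys, else None.

    Keys are tried in order, so the preferred key goes first and any
    older name can follow as a fallback.
    """
    for key in keys:
        values = as_list(record.get(key))
        if values:
            return values[0]
    return None
```

A consumer that has to support both 1.24.1 and 2.x output might call `first_value(record, "dc:creator", "Author")` and always treat `as_list(record.get("dc:subject"))` as a list, whichever server version produced the record.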
[jira] [Comment Edited] (TIKA-3560) Tika 2.0 (and 2.1) parses doc with less fidelity than using 1.24.1
[ https://issues.apache.org/jira/browse/TIKA-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17419338#comment-17419338 ]

Tim Allison edited comment on TIKA-3560 at 9/23/21, 5:16 PM:
-------------------------------------------------------------

As background, before my time on the project, IIUC, we used file-format-specific keys; some formats may have had "author", others "creator", etc. Then we had a massive contribution from Ray Gauss II which normalized everything as much as possible to Dublin Core (e.g. dcterms:created). To enable backward compatibility at that time, we left in the old keys and added the new "standard" keys; so at that time we had duplicate/triplicate keys for the same information. In 2.x, we tried to remove the old duplicate/triplicate keys and use only the "standard" keys.

was (Author: talli...@mitre.org):
As background, before my time on the project, IIUC, we used file-format-specific keys; some formats may have had "author", others "creator", etc. Then we had a massive contribution from Ray Gauss II which normalized everything as much as possible to Dublin Core (e.g. dcterms:created). To enable backward compatibility at that time, we left in the old keys and added the new "standard" keys. In 2.x, we tried to remove the old duplicate/triplicate keys.
[jira] [Comment Edited] (TIKA-3560) Tika 2.0 (and 2.1) parses doc with less fidelity than using 1.24.1
[ https://issues.apache.org/jira/browse/TIKA-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17419334#comment-17419334 ]

Tim Allison edited comment on TIKA-3560 at 9/23/21, 5:15 PM:
-------------------------------------------------------------

As I look at the image a bit more, there are other cases where we've removed the duplicate or even triplicate keys for the same information.

{{Application-Name}} used to have {{Application-Name}} and {{extended-properties:Application}}. We've slimmed down in favor of {{extended-properties:Application}}.

{{Edit-Time}} is now {{extended-properties:TotalTime}}

{{Creation-Date}}, {{meta:creation-date}} and {{dcterms:created}} have been reduced to {{dcterms:created}}

{{Last-Save-Date}} is now {{dcterms:modified}}

Which are the concerning keys that do not at all exist in 2.x? -{{Template}}-, -{{Revision-Number}}-... Anything else?

Sorry, Template is there: {{extended-properties:Template}}. Revision number is there too: {{cp:revision}}.

So, are there any keys in 1.x that do not have a value in 2.x?

was (Author: talli...@mitre.org):
As I look at the image a bit more, there are other cases where we've removed the duplicate or even triplicate keys for the same information.

{{Application-Name}} used to have {{Application-Name}} and {{extended-properties:Application}}. We've slimmed down in favor of {{extended-properties:Application}}.

{{Edit-Time}} is now {{extended-properties:TotalTime}}

{{Creation-Date}}, {{meta:creation-date}} and {{dcterms:created}} have been reduced to {{dcterms:created}}

{{Last-Save-Date}} is now {{dcterms:modified}}

Which are the concerning keys that do not at all exist in 2.x? {{Template}}, {{Revision-Number}}... Anything else?
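The key changes listed in this comment can be collected into a small lookup table for consumers that still ask for the 1.x names. The mapping below contains only the pairs named in the comment; it is a sketch in Python, not an official or complete cross-reference, and the names `LEGACY_TO_2X` and `translate_key` are hypothetical.

```python
# 1.x key -> 2.x key, limited to the pairs named in the comment above.
LEGACY_TO_2X = {
    "Application-Name": "extended-properties:Application",
    "Edit-Time": "extended-properties:TotalTime",
    "Creation-Date": "dcterms:created",
    "meta:creation-date": "dcterms:created",
    "Last-Save-Date": "dcterms:modified",
    "Template": "extended-properties:Template",
    "Revision-Number": "cp:revision",
}


def translate_key(key):
    """Return the 2.x name for a 1.x key, or the key itself if it is not listed."""
    return LEGACY_TO_2X.get(key, key)


def lookup(record, legacy_key):
    """Read a value from a 2.x metadata record using a 1.x key name.

    Returns None when neither the translated key nor the original key is present.
    """
    translated = translate_key(legacy_key)
    if translated in record:
        return record[translated]
    return record.get(legacy_key)
```

Keeping the table as plain data makes it easy to extend once the migration wiki page lists further renamed keys.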
[jira] [Commented] (TIKA-3560) Tika 2.0 (and 2.1) parses doc with less fidelity than using 1.24.1
[ https://issues.apache.org/jira/browse/TIKA-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17419338#comment-17419338 ]

Tim Allison commented on TIKA-3560:
-----------------------------------

As background, before my time on the project, IIUC, we used file-format-specific keys; some formats may have had "author", others "creator", etc. Then we had a massive contribution from Ray Gauss II which normalized everything as much as possible to Dublin Core (e.g. dcterms:created). To enable backward compatibility at that time, we left in the old keys and added the new "standard" keys. In 2.x, we tried to remove the old duplicate/triplicate keys.
[jira] [Comment Edited] (TIKA-3560) Tika 2.0 (and 2.1) parses doc with less fidelity than using 1.24.1
[ https://issues.apache.org/jira/browse/TIKA-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17419334#comment-17419334 ]

Tim Allison edited comment on TIKA-3560 at 9/23/21, 5:11 PM:
-------------------------------------------------------------

As I look at the image a bit more, there are other cases where we've removed the duplicate or even triplicate keys for the same information.

{{Application-Name}} used to have {{Application-Name}} and {{extended-properties:Application}}. We've slimmed down in favor of {{extended-properties:Application}}.

{{Edit-Time}} is now {{extended-properties:TotalTime}}

{{Creation-Date}}, {{meta:creation-date}} and {{dcterms:created}} have been reduced to {{dcterms:created}}

{{Last-Save-Date}} is now {{dcterms:modified}}

Which are the concerning keys that do not at all exist in 2.x? {{Template}}, {{Revision-Number}}... Anything else?

was (Author: talli...@mitre.org):
As I look at the image a bit more, there are other cases where we've removed the duplicate keys.

{{Application-Name}} used to have {{Application-Name}} and {{extended-properties:Application}}. We've slimmed down in favor of {{extended-properties:Application}}.

{{Edit-Time}} is now {{extended-properties:TotalTime}}

{{Creation-Date}}, {{meta:creation-date}} and {{dcterms:created}} have been reduced to {{dcterms:created}}

{{Last-Save-Date}} is now {{dcterms:modified}}

Which are the concerning keys that do not at all exist in 2.x? {{Template}}, {{Revision-Number}}... Anything else?
[jira] [Commented] (TIKA-3560) Tika 2.0 (and 2.1) parses doc with less fidelity than using 1.24.1
[ https://issues.apache.org/jira/browse/TIKA-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17419334#comment-17419334 ]

Tim Allison commented on TIKA-3560:
-----------------------------------

As I look at the image a bit more, there are other cases where we've removed the duplicate keys.

{{Application-Name}} used to have {{Application-Name}} and {{extended-properties:Application}}. We've slimmed down in favor of {{extended-properties:Application}}.

{{Edit-Time}} is now {{extended-properties:TotalTime}}

{{Creation-Date}}, {{meta:creation-date}} and {{dcterms:created}} have been reduced to {{dcterms:created}}

{{Last-Save-Date}} is now {{dcterms:modified}}

Which are the concerning keys that do not at all exist in 2.x? {{Template}}, {{Revision-Number}}... Anything else?
[jira] [Commented] (TIKA-3560) Tika 2.0 (and 2.1) parses doc with less fidelity than using 1.24.1
[ https://issues.apache.org/jira/browse/TIKA-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17419327#comment-17419327 ]

Tim Allison commented on TIKA-3560:
-----------------------------------

Not a problem. We probably have some example files within our unit test sets or possibly regression.
[jira] [Commented] (TIKA-3560) Tika 2.0 (and 2.1) parses doc with less fidelity than using 1.24.1
[ https://issues.apache.org/jira/browse/TIKA-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17419297#comment-17419297 ]

Josh Burchard commented on TIKA-3560:
-------------------------------------

It looks like it contains some confidential info as well as some PII, so I'd better not upload this particular one. That's disappointing. I'll see if I have another file that prompts similar output when parsed.
[jira] [Commented] (TIKA-3560) Tika 2.0 (and 2.1) parses doc with less fidelity than using 1.24.1
[ https://issues.apache.org/jira/browse/TIKA-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17419293#comment-17419293 ]

Tim Allison commented on TIKA-3560:
-----------------------------------

K. If you can't share it publicly, but can share it with me privately, let me know.
[jira] [Commented] (TIKA-3560) Tika 2.0 (and 2.1) parses doc with less fidelity than using 1.24.1
[ https://issues.apache.org/jira/browse/TIKA-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17419290#comment-17419290 ]

Josh Burchard commented on TIKA-3560:
-------------------------------------

It's a pretty old file that's used in a test suite that I inherited, AND it's in Japanese so I'll need to translate it and check that it's ok to upload.
[jira] [Commented] (TIKA-3560) Tika 2.0 (and 2.1) parses doc with less fidelity than using 1.24.1
[ https://issues.apache.org/jira/browse/TIKA-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17418847#comment-17418847 ]

Tim Allison commented on TIKA-3560:
-----------------------------------

We streamlined the double-entry key names for "created" and "modified" and a few others in 2.x. There are several in there, though, that are more perplexing. Any chance you can share the file?
[jira] [Updated] (TIKA-3560) Tika 2.0 (and 2.1) parses doc with less fidelity than using 1.24.1
[ https://issues.apache.org/jira/browse/TIKA-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Burchard updated TIKA-3560:
--------------------------------
    Attachment: Capture.jpg
[jira] [Created] (TIKA-3560) Tika 2.0 (and 2.1) parses doc with less fidelity than using 1.24.1
Josh Burchard created TIKA-3560:
-----------------------------------

             Summary: Tika 2.0 (and 2.1) parses doc with less fidelity than using 1.24.1
                 Key: TIKA-3560
                 URL: https://issues.apache.org/jira/browse/TIKA-3560
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 2.1.0, 2.0.0
         Environment: Windows 10
            Reporter: Josh Burchard


I'm parsing an old .doc file and I'm sending my request to the /rmeta/text endpoint. I see that some metadata fields that were returned to me from Tika 1.24.1 are no longer returned in 2.0 and above.

I diffed the output between 1.24.1, 2.0 and 2.1. Please see the screenshot I've attached.
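The comparison described in this report can be reproduced without a screenshot by fetching the /rmeta/text output from each server version and diffing the metadata keys. The sketch below uses only the Python standard library. It assumes that each tika-server instance is reachable at a base URL you supply and that the endpoint returns a JSON list of metadata objects with the container document first; the function names and any ports or file names used with them are hypothetical.

```python
import json
import urllib.request


def fetch_rmeta_text(server_url, path):
    """PUT a file to a tika-server /rmeta/text endpoint and return the decoded JSON.

    server_url is the base URL of one server instance (an assumption of this
    sketch); path is the local file to parse.
    """
    with open(path, "rb") as handle:
        payload = handle.read()
    request = urllib.request.Request(
        server_url.rstrip("/") + "/rmeta/text",
        data=payload,
        method="PUT",
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def missing_keys(old_record, new_record):
    """Return the sorted keys present in old_record but absent from new_record."""
    return sorted(set(old_record) - set(new_record))
```

With a 1.24.1 server and a 2.x server running side by side, `missing_keys(old[0], new[0])` on the first record of each response lists the keys that disappeared, which is the same information the attached diff conveys.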
[jira] [Updated] (TIKA-2071) Tika 2.0 - DefaultParser and CompositeParser does not filter excludedParsers from dynamic ServiceLoader Parsers
[ https://issues.apache.org/jira/browse/TIKA-2071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-2071:
------------------------------
    Fix Version/s:     (was: 2.0.1)
                       (was: 2.0.0-BETA)

> Tika 2.0 - DefaultParser and CompositeParser does not filter excludedParsers from dynamic ServiceLoader Parsers
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-2071
>                 URL: https://issues.apache.org/jira/browse/TIKA-2071
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Bob Paulin
>            Assignee: Bob Paulin
>            Priority: Major
>
> The DefaultParser and CompositeParser do not filter dynamic services using
> the excludedParser List. The exclude list should be applied here as well.
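The behavior requested in TIKA-2071 amounts to applying one exclude list to every parser, however it was discovered. The sketch below illustrates that idea in Python with stand-in classes; it is not Tika's DefaultParser or CompositeParser code, and every name in it (`filter_excluded`, `combined_parsers`, the sample parser classes) is hypothetical.

```python
def filter_excluded(parsers, excluded_types):
    """Return the parsers that are not instances of any excluded type."""
    excluded = tuple(excluded_types)
    return [parser for parser in parsers if not isinstance(parser, excluded)]


def combined_parsers(static_parsers, dynamic_parsers, excluded_types):
    """Merge statically and dynamically discovered parsers, then filter once.

    Filtering after the merge means the exclude list covers the dynamically
    loaded parsers as well as the static ones.
    """
    return filter_excluded(list(static_parsers) + list(dynamic_parsers), excluded_types)


# Stand-in parser types, used only to exercise the functions above.
class PdfParser:
    pass


class OfficeParser:
    pass


class HtmlParser:
    pass
```

Filtering at a single point after the merge avoids having one code path for static parsers and another, easily forgotten, for dynamic ones.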
[jira] [Updated] (TIKA-2071) Tika 2.0 - DefaultParser and CompositeParser does not filter excludedParsers from dynamic ServiceLoader Parsers
[ https://issues.apache.org/jira/browse/TIKA-2071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-2071:
------------------------------
    Fix Version/s:     (was: 2.0.0)
                   2.0.0-BETA
[jira] [Updated] (TIKA-2071) Tika 2.0 - DefaultParser and CompositeParser does not filter excludedParsers from dynamic ServiceLoader Parsers
[ https://issues.apache.org/jira/browse/TIKA-2071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-2071:
------------------------------
    Fix Version/s:     (was: 2.0.0)
                   2.0.1
[jira] [Commented] (TIKA-3343) Move Tika's legacy lang id to its own submodule for Tika 2.0
[ https://issues.apache.org/jira/browse/TIKA-3343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17318489#comment-17318489 ]

Hudson commented on TIKA-3343:
------------------------------

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #191 (See [https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/191/])
TIKA-3343 -- move Tika's legacy lang detector to its own submodule in tika-langdetect -- git add lang models (tallison: [https://github.com/apache/tika/commit/bd6cbb56b9ee65bb1ef72a23cad5961f02223a9a])
* (add) tika-langdetect/tika-langdetect-tika/src/main/resources/org/apache/tika/langdetect/tika/eo.ngp
* (add) tika-langdetect/tika-langdetect-tika/src/main/resources/org/apache/tika/langdetect/tika/it.ngp
* (add) tika-langdetect/tika-langdetect-tika/src/main/resources/org/apache/tika/langdetect/tika/be.ngp
* (add) tika-langdetect/tika-langdetect-tika/src/main/resources/org/apache/tika/langdetect/tika/el.ngp
* (add) tika-langdetect/tika-langdetect-tika/src/main/resources/org/apache/tika/langdetect/tika/nl.ngp
* (add) tika-langdetect/tika-langdetect-tika/src/main/resources/org/apache/tika/langdetect/tika/no.ngp
* (add) tika-langdetect/tika-langdetect-tika/src/main/resources/org/apache/tika/langdetect/tika/de.ngp
* (add) tika-langdetect/tika-langdetect-tika/src/main/resources/org/apache/tika/langdetect/tika/ru.ngp
* (add) tika-langdetect/tika-langdetect-tika/src/main/resources/org/apache/tika/langdetect/tika/fi.ngp
* (add) tika-langdetect/tika-langdetect-tika/src/main/resources/org/apache/tika/langdetect/tika/es.ngp
* (add) tika-langdetect/tika-langdetect-tika/src/main/resources/org/apache/tika/langdetect/tika/en.ngp
* (add) tika-langdetect/tika-langdetect-tika/src/main/resources/org/apache/tika/langdetect/tika/pt.ngp
* (add) tika-langdetect/tika-langdetect-tika/src/main/resources/org/apache/tika/langdetect/tika/th.ngp
* (add) tika-langdetect/tika-langdetect-tika/src/main/resources/org/apache/tika/langdetect/tika/ro.ngp
* (add) tika-langdetect/tika-langdetect-tika/src/main/resources/org/apache/tika/langdetect/tika/fa.ngp
* (add) tika-langdetect/tika-langdetect-tika/src/main/resources/org/apache/tika/langdetect/tika/hu.ngp
* (add) tika-langdetect/tika-langdetect-tika/src/main/resources/org/apache/tika/langdetect/tika/pl.ngp
* (add) tika-langdetect/tika-langdetect-tika/src/main/resources/org/apache/tika/langdetect/tika/ca.ngp
* (add) tika-langdetect/tika-langdetect-tika/src/main/resources/org/apache/tika/langdetect/tika/da.ngp
* (add) tika-langdetect/tika-langdetect-tika/src/main/resources/org/apache/tika/langdetect/tika/tika.language.properties
* (add) tika-langdetect/tika-langdetect-tika/src/main/resources/org/apache/tika/langdetect/tika/fr.ngp
* (add) tika-langdetect/tika-langdetect-tika/src/main/resources/org/apache/tika/langdetect/tika/is.ngp
* (add) tika-langdetect/tika-langdetect-tika/src/main/resources/org/apache/tika/langdetect/tika/sk.ngp
* (add) tika-langdetect/tika-langdetect-tika/src/main/resources/org/apache/tika/langdetect/tika/uk.ngp
* (add) tika-langdetect/tika-langdetect-tika/src/main/resources/org/apache/tika/langdetect/tika/sv.ngp
* (add) tika-langdetect/tika-langdetect-tika/src/main/resources/org/apache/tika/langdetect/tika/et.ngp
* (add) tika-langdetect/tika-langdetect-tika/src/main/resources/org/apache/tika/langdetect/tika/sl.ngp
* (add) tika-langdetect/tika-langdetect-tika/src/main/resources/org/apache/tika/langdetect/tika/lt.ngp
* (add) tika-langdetect/tika-langdetect-tika/src/main/resources/org/apache/tika/langdetect/tika/gl.ngp

> Move Tika's legacy lang id to its own submodule for Tika 2.0
> ------------------------------------------------------------
>
>                 Key: TIKA-3343
>                 URL: https://issues.apache.org/jira/browse/TIKA-3343
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>             Fix For: 2.0.0
>
>
> In the back of my mind, this was an agreed upon change for 2.x. I can't find
> documentation, tho, so I'm opening this issue to discuss.
> My memory is that we agreed that we should outsource language id to other
> tools and remove our own lang ider for 2.x. If my memory is wrong, or if
> there's a good reason to keep our language detection algorithm and data,
> let's discuss.
[jira] [Commented] (TIKA-3343) Move Tika's legacy lang id to its own submodule for Tika 2.0
/apache/tika/language/eo.ngp
* (add) tika-langdetect/tika-langdetect-tika/src/test/resources/org/apache/tika/langdetect/tika/da.test
* (delete) tika-core/src/main/resources/org/apache/tika/language/pl.ngp
* (delete) tika-core/src/test/resources/org/apache/tika/language/fr.test
* (delete) tika-core/src/test/resources/org/apache/tika/language/langbuilder/welsh_corpus.txt
* (delete) tika-core/src/main/resources/org/apache/tika/language/el.ngp
* (add) tika-langdetect/tika-langdetect-tika/pom.xml
* (delete) tika-core/src/main/resources/org/apache/tika/language/it.ngp
* (add) tika-langdetect/tika-langdetect-tika/src/main/java/org/apache/tika/langdetect/tika/ProfilingWriter.java
* (delete) tika-core/src/test/java/org/apache/tika/language/LanguageProfilerBuilderTest.java
* (delete) tika-core/src/test/resources/org/apache/tika/language/de.test
* (add) tika-langdetect/tika-langdetect-tika/src/test/java/org/apache/tika/langdetect/tika/ProfilingHandler.java
* (delete) tika-core/src/main/resources/org/apache/tika/language/ro.ngp
* (add) tika-langdetect/tika-langdetect-tika/src/test/resources/org/apache/tika/langdetect/tika/pt.test
* (delete) tika-core/src/main/java/org/apache/tika/language/ProfilingWriter.java
* (add) tika-langdetect/tika-langdetect-tika/src/test/java/org/apache/tika/langdetect/tika/LanguageProfilerBuilderTest.java
* (delete) tika-core/src/main/resources/org/apache/tika/language/sk.ngp
* (delete) tika-core/src/main/resources/org/apache/tika/language/tika.language.properties
* (add) tika-langdetect/tika-langdetect-tika/src/test/resources/org/apache/tika/langdetect/tika/fi.test
* (add) tika-langdetect/tika-langdetect-tika/src/test/resources/org/apache/tika/langdetect/tika/fr.test
* (delete) tika-core/src/main/resources/org/apache/tika/language/es.ngp
* (add) tika-langdetect/tika-langdetect-tika/src/test/resources/org/apache/tika/langdetect/tika/es.test
* (delete) tika-core/src/main/resources/org/apache/tika/language/fi.ngp
* (delete) tika-core/src/main/java/org/apache/tika/language/LanguageIdentifier.java
* (add) tika-langdetect/tika-langdetect-tika/src/main/java/org/apache/tika/langdetect/tika/LanguageIdentifier.java
* (delete) tika-core/src/test/resources/org/apache/tika/language/fi.test
* (delete) tika-core/src/test/resources/org/apache/tika/language/sv.test
* (delete) tika-core/src/test/resources/org/apache/tika/language/es.test
[jira] [Updated] (TIKA-3343) Move Tika's legacy lang id to its own submodule for Tika 2.0
[ https://issues.apache.org/jira/browse/TIKA-3343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-3343: -- Summary: Move Tika's legacy lang id to its own submodule for Tika 2.0 (was: Move Tika's legacy lang id to its own module) > Move Tika's legacy lang id to its own submodule for Tika 2.0 > > > Key: TIKA-3343 > URL: https://issues.apache.org/jira/browse/TIKA-3343 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > > In the back of my mind, this was an agreed upon change for 2.x. I can't find > documentation, tho, so I'm opening this issue to discuss. > My memory is that we agreed that we should outsource language id to other > tools and remove our own lang ider for 2.x. If my memory is wrong, or if > there's a good reason to keep our language detection algorithm and data, > let's discuss. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (TIKA-3343) Move Tika's legacy lang id to its own submodule for Tika 2.0
[ https://issues.apache.org/jira/browse/TIKA-3343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-3343. --- Fix Version/s: 2.0.0 Resolution: Fixed > Move Tika's legacy lang id to its own submodule for Tika 2.0 > > > Key: TIKA-3343 > URL: https://issues.apache.org/jira/browse/TIKA-3343 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > Fix For: 2.0.0 > > > In the back of my mind, this was an agreed upon change for 2.x. I can't find > documentation, tho, so I'm opening this issue to discuss. > My memory is that we agreed that we should outsource language id to other > tools and remove our own lang ider for 2.x. If my memory is wrong, or if > there's a good reason to keep our language detection algorithm and data, > let's discuss. -- This message was sent by Atlassian Jira (v8.3.4#803005)
Re: OSGi support in Tika 2.0
Bob,

Thank you for taking the lead on this discussion!

tl;dr -- I somewhat prefer tighter modularization at the risk of duplicate dependencies, too. The simplicity of higher level bundles might make sense if we do a slight refactoring of the tika-parsers module.

After a week away, I'm thinking it might make sense to refactor the tika-parsers module a bit to make explicit some of my underlying design choices. This basic design is somewhat already in main, but it is hidden.

Based on user feedback over the years, it feels like there are three categories of parsers.

1) tika-parsers-module
* pure Java ... no native libs
* no parsers that are network dependent/rely on rest clients
* "heavy" dependencies should be justified by the utility for the "general user" -- this is admittedly and regrettably qualitative/hand-wavy, but I want to allow POI's ooxml-schemas but disallow other large dependencies for more niche formats
* no ML, entity extraction or "recognizers"
* packaging: separate modules as we now have, but packaged and shipped as a single jar, and used as default by tika-app and tika-server

EXCEPTION: OCR. Justification: this has become such a basic expectation for many users and is tightly coupled in the PDFParser. Further, the current "dependency" requires a user to install tesseract, meaning that the user has to choose to add this dependency.

2) tika-parsers-extended-module
* native libs are allowed
* parsers that are network dependent/rely on rest clients are allowed
* no ML, entity extraction or recognizers
* may have dependencies on parsers in tika-parsers-module
* packaging: separate modules as we now have, packaged and released per module (e.g. there will be a sqlite-parser jar which includes our parser _and_ xerial.org's native dependency); users will need to add their chosen parsers to their classpath if they're using tika-app or tika-server

3) tika-parsers-advanced-module
* enormous dependencies, native libs and rest clients are allowed
* ML, entity extraction, recognizers are allowed
* may have dependencies on parsers in tika-parsers-module
* packaging: separate jars per sub module. These jars will not be part of the release.

I'll work on this in a separate branch today so that we can look at it together. I think it is important to get this consolidated before we make OSGi decisions.

Note: I do not mean to hijack the OSGi discussion! And, I'm sorry for not realizing this earlier/including it in the refactoring a week ago, but here we are. :D

Thank you, Bob and all, again!

Cheers,

Tim

On Fri, Aug 28, 2020 at 10:46 AM Yegor Kozlov wrote:

> Hi Bob,
>
> I'd say decomposition into smaller bundles is the way to go. In my
> experience, OSGi bundles with too many dependencies are fragile and hard to
> maintain. In the worst case, a regression in a maven-bundle-plugin
> configuration would break a parser bundle instead of breaking all of them
> in the uber-jar.
>
> Static linking of dependencies should be fine; however, it can increase
> the total size of the Tika distro because different parser bundles may
> embed the same transitive dependencies like Apache-Commons, etc. The huge
> pro is that static linking will make the bundles self-contained.
> The alternative is to make dependencies optional, but in this case clients
> will have to solve the puzzle of adding them into their OSGi containers.
> It's doable, but will kill acceptance.
>
> Regards,
> Yegor
>
> On Thu, Aug 27, 2020 at 5:24 AM Bob Paulin wrote:
>
> > Hi,
> >
> > I wanted to discuss OSGi support in Tika 2.0. My current thought is to
> > start with the minimum support which is to add bundle packaging to each of
> > the modules [1]. This will make the bundles usable in OSGi but will leave
> > users on their own for putting the right dependencies together for usage.
> > From there we either stop or we can choose from a few different options:
> >
> > 1) Tika Bundle
> >
> > This is an all-encompassing uber jar with all the parsers and
> > dependencies we can legally get away with shipping with an Apache license.
> >
> > Pros
> >
> > Low bar to entry for novice OSGi users
> >
> > Already exists in Tika 1.x
> >
> > Cons
> >
> > Difficult to maintain (very complicated maven-bundle-plugin config). This
> > has broken in several releases leaving it unusable.
> >
> > 2) Tika module convenience bundles
> >
> > This was part of the early 2.0 POC branch where each module had its own
> > tika-bundle with just its dependencies statically included.
> >
> > Pros
> >
> > Less sophisticated maven-bundle-plugin configuration
> >
> > Low bar for novice OSGi users
> >
> > Cons
> >
> > More sub-modules to maintain.
> >
> > There are of course other options but I think it's important to decide if
> > either, neither, or both of these options should be considered for the
> > initial 2.0 release.
> >
> > - Bob
> >
> > [1] https://github.com/apache/tika/pull/344
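For anyone who finds the three parser categories in Tim's message easier to read as code, here is a rough sketch of the rules as stated. Everything in it (`Trait`, `ParserCategory`, `strictestFor`) is a hypothetical name made up for illustration and is not part of Tika; the OCR exception and the qualitative "heavy dependency" rule are deliberately not modeled.

```java
import java.util.Collections;
import java.util.EnumSet;
import java.util.Set;

/** Properties of a parser that the proposal uses to sort it into a category (illustrative). */
enum Trait { NATIVE_LIB, NETWORK_CLIENT, MACHINE_LEARNING }

/** Hypothetical model of the three proposed categories; not a Tika API. */
enum ParserCategory {
    // 1) tika-parsers-module: pure Java, no network, no ML; shipped as a single jar.
    STANDARD(EnumSet.noneOf(Trait.class), true),
    // 2) tika-parsers-extended-module: native libs and rest clients allowed; released per module.
    EXTENDED(EnumSet.of(Trait.NATIVE_LIB, Trait.NETWORK_CLIENT), true),
    // 3) tika-parsers-advanced-module: everything allowed; jars are not part of the release.
    ADVANCED(EnumSet.allOf(Trait.class), false);

    private final Set<Trait> allowed;
    private final boolean partOfRelease;

    ParserCategory(Set<Trait> allowed, boolean partOfRelease) {
        this.allowed = Collections.unmodifiableSet(allowed);
        this.partOfRelease = partOfRelease;
    }

    boolean isPartOfRelease() {
        return partOfRelease;
    }

    /** Returns the first (most restrictive) category that permits all of the given traits. */
    static ParserCategory strictestFor(Set<Trait> traits) {
        for (ParserCategory category : values()) {
            if (category.allowed.containsAll(traits)) {
                return category;
            }
        }
        // ADVANCED allows every trait, so the loop always returns.
        throw new IllegalStateException("no category accepts " + traits);
    }
}
```

Under these assumptions, a parser that needs a native library (such as the sqlite example in the message) would land in `EXTENDED`, and anything that uses ML would land in `ADVANCED`.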
Re: OSGi support in Tika 2.0
Hi Bob,

I'd say decomposition into smaller bundles is the way to go. In my experience, OSGi bundles with too many dependencies are fragile and hard to maintain. In the worst case, a regression in a maven-bundle-plugin configuration would break a parser bundle instead of breaking all of them in the uber-jar.

Static linking of dependencies should be fine; however, it can increase the total size of the Tika distro because different parser bundles may embed the same transitive dependencies like Apache-Commons, etc. The huge pro is that static linking will make the bundles self-contained. The alternative is to make dependencies optional, but in this case clients will have to solve the puzzle of adding them into their OSGi containers. It's doable, but will kill acceptance.

Regards,
Yegor

On Thu, Aug 27, 2020 at 5:24 AM Bob Paulin wrote:

> Hi,
>
> I wanted to discuss OSGi support in Tika 2.0. My current thought is to
> start with the minimum support which is to add bundle packaging to each of
> the modules [1]. This will make the bundles usable in OSGi but will leave
> users on their own for putting the right dependencies together for usage.
> From there we either stop or we can choose from a few different options:
>
> 1) Tika Bundle
>
> This is an all-encompassing uber jar with all the parsers and
> dependencies we can legally get away with shipping with an Apache license.
>
> Pros
>
> Low bar to entry for novice OSGi users
>
> Already exists in Tika 1.x
>
> Cons
>
> Difficult to maintain (very complicated maven-bundle-plugin config). This
> has broken in several releases leaving it unusable.
>
> 2) Tika module convenience bundles
>
> This was part of the early 2.0 POC branch where each module had its own
> tika-bundle with just its dependencies statically included.
>
> Pros
>
> Less sophisticated maven-bundle-plugin configuration
>
> Low bar for novice OSGi users
>
> Cons
>
> More sub-modules to maintain.
>
> There are of course other options but I think it's important to decide if
> either, neither, or both of these options should be considered for the
> initial 2.0 release.
>
> - Bob
>
> [1] https://github.com/apache/tika/pull/344
OSGi support in Tika 2.0
Hi,

I wanted to discuss OSGi support in Tika 2.0. My current thought is to start with the minimum support which is to add bundle packaging to each of the modules [1]. This will make the bundles usable in OSGi but will leave users on their own for putting the right dependencies together for usage. From there we either stop or we can choose from a few different options:

1) Tika Bundle

This is an all-encompassing uber jar with all the parsers and dependencies we can legally get away with shipping with an Apache license.

Pros
* Low bar to entry for novice OSGi users
* Already exists in Tika 1.x

Cons
* Difficult to maintain (very complicated maven-bundle-plugin config). This has broken in several releases leaving it unusable.

2) Tika module convenience bundles

This was part of the early 2.0 POC branch where each module had its own tika-bundle with just its dependencies statically included.

Pros
* Less sophisticated maven-bundle-plugin configuration
* Low bar for novice OSGi users

Cons
* More sub-modules to maintain.

There are of course other options but I think it's important to decide if either, neither, or both of these options should be considered for the initial 2.0 release.

- Bob

[1] https://github.com/apache/tika/pull/344
Re: [EXTERNAL] Tika 2.0 modularization
Hi Tim,

It looks good. Perfect. Do you plan to have tika-parsers reuse the new modules as its dependencies?

Cheers, Sergey

On Tue, Aug 18, 2020 at 3:41 PM Tim Allison wrote:

> If anyone has any time, please take a look here:
> https://github.com/apache/tika/tree/branch_2x/tika-parser-modules
>
> Does this basically look ok?
>
> I've put the integration tests in
> https://github.com/apache/tika/tree/branch_2x/tika-parser-integration-tests
> ... that doesn't build yet.
>
> I've flipped Bob's design so that the integration tests pull test files
> from the individual parser modules via test-jar.
>
> On Fri, Aug 14, 2020 at 3:30 PM Bob Paulin wrote:
>
> > +1 excited about this.
> >
> > - Bob
> >
> > On 8/14/2020 11:29 AM, Sergey Beryozkin wrote:
> >
> > +1
> >
> > Cheers Sergey
> >
> > On Fri 14 Aug 2020, 18:26 Chris Mattmann, <mattm...@apache.org> wrote:
> >
> > Haha I’m down and supportive!
> >
> > Time’s TIME FOR 2.x
> >
> > From: Tim Allison
> > Reply-To: "dev@tika.apache.org" <dev@tika.apache.org>, "Allison, Tim (US 174B-Affiliate)" <timothy.b.alli...@jpl.nasa.gov>
> > Date: Friday, August 14, 2020 at 6:06 AM
> > To: " "
> > Subject: [EXTERNAL] Tika 2.0 modularization
> >
> > All,
> >
> > I _think_ I might have some time to start working on integrating Bob's
> > work on the current main branch. I'll have to ignore most of the incoming
> > issues for a bit...unlike the last 4 years...this time I mean it. :)
> >
> > Let me know if there are any objections to heading down this path now.
> >
> > Cheers,
> >
> > Tim
Re: [EXTERNAL] Tika 2.0 modularization
Hey Tim,

Just started taking a look. The test-jar approach could work, but I recall I ran into some issues with getting access to some of the test files inside the test-jars for some of the junits. For many tests this was simple, but for some I think it would require larger functional changes to the code that I was not comfortable proposing at the time. Makes sense to try this path again and see if you can get further than I did.

- Bob

On 8/18/2020 9:40 AM, Tim Allison wrote:

> If anyone has any time, please take a look here:
> https://github.com/apache/tika/tree/branch_2x/tika-parser-modules
>
> Does this basically look ok?
>
> I've put the integration tests in
> https://github.com/apache/tika/tree/branch_2x/tika-parser-integration-tests
> ... that doesn't build yet.
>
> I've flipped Bob's design so that the integration tests pull test files
> from the individual parser modules via test-jar.
>
> On Fri, Aug 14, 2020 at 3:30 PM Bob Paulin wrote:
>
>> +1 excited about this.
>>
>> - Bob
>>
>> On 8/14/2020 11:29 AM, Sergey Beryozkin wrote:
>>
>> +1
>>
>> Cheers Sergey
>>
>> On Fri 14 Aug 2020, 18:26 Chris Mattmann, wrote:
>>
>> Haha I’m down and supportive!
>>
>> Time’s TIME FOR 2.x
>>
>> From: Tim Allison
>> Reply-To: "dev@tika.apache.org", "Allison, Tim (US 174B-Affiliate)"
>> Date: Friday, August 14, 2020 at 6:06 AM
>> To: " "
>> Subject: [EXTERNAL] Tika 2.0 modularization
>>
>> All,
>>
>> I _think_ I might have some time to start working on integrating Bob's
>> work on the current main branch. I'll have to ignore most of the incoming
>> issues for a bit...unlike the last 4 years...this time I mean it. :)
>>
>> Let me know if there are any objections to heading down this path now.
>>
>> Cheers,
>>
>> Tim
Re: [EXTERNAL] Tika 2.0 modularization
Hi Tim,

looks awesome. Somehow I did not find a couple of parsers, probably it is because of on-going work ...

In addition, I was thinking about "getting rid of" maven. If we are going to make Tika more modern, maybe gradle can do the trick? Do we plan to add new Java "goodies" like lambdas, foreign-memory access API, records ...

WDYT?

BR,
Oleg

On Tue, Aug 18, 2020 at 5:41 PM Tim Allison wrote:

> If anyone has any time, please take a look here:
> https://github.com/apache/tika/tree/branch_2x/tika-parser-modules
>
> Does this basically look ok?
>
> I've put the integration tests in
> https://github.com/apache/tika/tree/branch_2x/tika-parser-integration-tests
> ... that doesn't build yet.
>
> I've flipped Bob's design so that the integration tests pull test files
> from the individual parser modules via test-jar.
>
> On Fri, Aug 14, 2020 at 3:30 PM Bob Paulin wrote:
>
> > +1 excited about this.
> >
> > - Bob
> >
> > On 8/14/2020 11:29 AM, Sergey Beryozkin wrote:
> >
> > +1
> >
> > Cheers Sergey
> >
> > On Fri 14 Aug 2020, 18:26 Chris Mattmann, <mattm...@apache.org> wrote:
> >
> > Haha I’m down and supportive!
> >
> > Time’s TIME FOR 2.x
> >
> > From: Tim Allison
> > Reply-To: "dev@tika.apache.org" <dev@tika.apache.org>, "Allison, Tim (US 174B-Affiliate)" <timothy.b.alli...@jpl.nasa.gov>
> > Date: Friday, August 14, 2020 at 6:06 AM
> > To: " "
> > Subject: [EXTERNAL] Tika 2.0 modularization
> >
> > All,
> >
> > I _think_ I might have some time to start working on integrating Bob's
> > work on the current main branch. I'll have to ignore most of the incoming
> > issues for a bit...unlike the last 4 years...this time I mean it. :)
> >
> > Let me know if there are any objections to heading down this path now.
> >
> > Cheers,
> >
> > Tim
Re: [EXTERNAL] Tika 2.0 modularization
Hi Tim, I looked at the HTML module, and seems logical/straightforward. Thanks for pushing on this. — Ken > On Aug 18, 2020, at 7:40 AM, Tim Allison wrote: > > If anyone has any time, please take a look here: > https://github.com/apache/tika/tree/branch_2x/tika-parser-modules > > Does this basically look ok? > > I've put the integration tests in > https://github.com/apache/tika/tree/branch_2x/tika-parser-integration-tests > ... that doesn't build yet. > > I've flipped Bob's design so that the integration tests pull test files > from the individual parser modules via test-jar. > > On Fri, Aug 14, 2020 at 3:30 PM Bob Paulin wrote: > >> +1 excited about this. >> >> - Bob >> On 8/14/2020 11:29 AM, Sergey Beryozkin wrote: >> >> +1 >> >> Cheers Sergey >> >> On Fri 14 Aug 2020, 18:26 Chris Mattmann, >> wrote: >> >> >> Haha I’m down and supportive! >> >> >> >> Time’s TIME FOR 2.x >> >> From: Tim Allison >> Reply-To: "dev@tika.apache.org" >> , "Allison, Tim (US >> 174B-Affiliate)" >> >> Date: Friday, August 14, 2020 at 6:06 AM >> To: " " >> >> Subject: [EXTERNAL] Tika 2.0 modularization >> >> >> >> All, >> >> I _think_ I might have some time to start working on integrating Bob's >> >> work on the current main branch. I'll have to ignore most of the incoming >> >> issues for a bit...unlike the last 4 years...this time I mean it. :) >> >> Let me know if there are any objections to heading down this path now. >> >> >> >> Cheers, >> >> >> >> Tim -- Ken Krugler http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr
Re: [EXTERNAL] Tika 2.0 modularization
Thank you!

> Somehow I did not find a couple of parsers, probably it is because of on-going work ...

Yep. Exactly. I didn't want to put in the work in this direction if there were any showstoppers.

> If we are going to make Tika more modern, maybe gradle can do the trick?

My gradle isn't as strong as maven, but if you or anyone else wants to translate, I'd be good with that. Let me do the maven modularization first? How much effort would this be?

> Do we plan to add new Java "goodies" like lambdas, foreign-memory access API, records

Elasticsearch is already at 11, and the next version of Solr requires 11. I'm happy keeping Tika at 1.8 or moving to 11. I think 14 is a bit too cutting edge for Tika 2.0.0...maybe 3.0.0?

Any thoughts on what we do with Jigsaw? Should we shoot the moon and move to 11 and jigsaw, go with multi-version jars, or just go with what we have and make modest changes so that we are not hostile to folks using jigsaw?

On Tue, Aug 18, 2020 at 11:38 AM Oleg Tikhonov wrote:

> Hi Tim,
> looks awesome.
> Somehow I did not find a couple of parsers, probably it is because of
> on-going work ...
> In addition, I was thinking about "getting rid of" maven. If we are going
> to make Tika more modern, maybe gradle can do the trick?
> Do we plan to add new Java "goodies" like lambdas, foreign-memory access
> API, records ...
>
> WDYT?
> BR,
> Oleg
>
> On Tue, Aug 18, 2020 at 5:41 PM Tim Allison wrote:
>
>> If anyone has any time, please take a look here:
>> https://github.com/apache/tika/tree/branch_2x/tika-parser-modules
>>
>> Does this basically look ok?
>>
>> I've put the integration tests in
>> https://github.com/apache/tika/tree/branch_2x/tika-parser-integration-tests
>> ... that doesn't build yet.
>>
>> I've flipped Bob's design so that the integration tests pull test files
>> from the individual parser modules via test-jar.
>>
>> On Fri, Aug 14, 2020 at 3:30 PM Bob Paulin wrote:
>>
>> > +1 excited about this.
>> >
>> > - Bob
>> >
>> > On 8/14/2020 11:29 AM, Sergey Beryozkin wrote:
>> >
>> > +1
>> >
>> > Cheers Sergey
>> >
>> > On Fri 14 Aug 2020, 18:26 Chris Mattmann, <mattm...@apache.org> wrote:
>> >
>> > Haha I’m down and supportive!
>> >
>> > Time’s TIME FOR 2.x
>> >
>> > From: Tim Allison
>> > Reply-To: "dev@tika.apache.org" <dev@tika.apache.org>, "Allison, Tim (US 174B-Affiliate)" <timothy.b.alli...@jpl.nasa.gov>
>> > Date: Friday, August 14, 2020 at 6:06 AM
>> > To: " "
>> > Subject: [EXTERNAL] Tika 2.0 modularization
>> >
>> > All,
>> >
>> > I _think_ I might have some time to start working on integrating Bob's
>> > work on the current main branch. I'll have to ignore most of the incoming
>> > issues for a bit...unlike the last 4 years...this time I mean it. :)
>> >
>> > Let me know if there are any objections to heading down this path now.
>> >
>> > Cheers,
>> >
>> > Tim
Re: [EXTERNAL] Tika 2.0 modularization
If anyone has any time, please take a look here: https://github.com/apache/tika/tree/branch_2x/tika-parser-modules Does this basically look ok? I've put the integration tests in https://github.com/apache/tika/tree/branch_2x/tika-parser-integration-tests ... that doesn't build yet. I've flipped Bob's design so that the integration tests pull test files from the individual parser modules via test-jar. On Fri, Aug 14, 2020 at 3:30 PM Bob Paulin wrote: > +1 excited about this. > > - Bob > On 8/14/2020 11:29 AM, Sergey Beryozkin wrote: > > +1 > > Cheers Sergey > > On Fri 14 Aug 2020, 18:26 Chris Mattmann, > wrote: > > > Haha I’m down and supportive! > > > > Time’s TIME FOR 2.x > > > > > > > > From: Tim Allison > Reply-To: "dev@tika.apache.org" > , "Allison, Tim (US > 174B-Affiliate)" > > Date: Friday, August 14, 2020 at 6:06 AM > To: " " > > Subject: [EXTERNAL] Tika 2.0 modularization > > > > All, > > I _think_ I might have some time to start working on integrating Bob's > > work on the current main branch. I'll have to ignore most of the incoming > > issues for a bit...unlike the last 4 years...this time I mean it. :) > > Let me know if there are any objections to heading down this path now. > > > >Cheers, > > > > Tim > > > > > >
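A note on what "pull test files via test-jar" tends to mean for the test code itself: once the test files live inside another module's test-jar, they are classpath resources rather than files on disk. The sketch below shows one common way to read them. The class name `TestResources` and both method names are made up for illustration, and this is an assumption about how such tests could be written, not a description of what is on the branch.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper for reading test files that may live inside a test-jar. */
final class TestResources {
    private TestResources() {
    }

    /** Opens a classpath resource; works whether it sits in a directory or inside a jar. */
    static InputStream open(String name) throws IOException {
        InputStream in = TestResources.class.getClassLoader().getResourceAsStream(name);
        if (in == null) {
            throw new FileNotFoundException("not on classpath: " + name);
        }
        return in;
    }

    /** Copies a classpath resource to a temp file, for code under test that requires a Path. */
    static Path copyToTemp(String name) throws IOException {
        Path tmp = Files.createTempFile("test-resource-", ".bin");
        tmp.toFile().deleteOnExit();
        try (InputStream in = open(name)) {
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        }
        return tmp;
    }
}
```

Tests that only need an `InputStream` can use `open` directly; tests that pass a file path to the parser would need something like `copyToTemp`, which is the kind of change that touches more than the test setup.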
Re: [EXTERNAL] Tika 2.0 modularization
+1 excited about this.

- Bob

On 8/14/2020 11:29 AM, Sergey Beryozkin wrote:

> +1
>
> Cheers Sergey
>
> On Fri 14 Aug 2020, 18:26 Chris Mattmann, wrote:
>
>> Haha I’m down and supportive!
>>
>> Time’s TIME FOR 2.x
>>
>> From: Tim Allison
>> Reply-To: "dev@tika.apache.org", "Allison, Tim (US 174B-Affiliate)"
>> Date: Friday, August 14, 2020 at 6:06 AM
>> To: ""
>> Subject: [EXTERNAL] Tika 2.0 modularization
>>
>> All,
>>
>> I _think_ I might have some time to start working on integrating Bob's
>> work on the current main branch. I'll have to ignore most of the incoming
>> issues for a bit...unlike the last 4 years...this time I mean it. :)
>>
>> Let me know if there are any objections to heading down this path now.
>>
>> Cheers,
>>
>> Tim
Re: [EXTERNAL] Tika 2.0 modularization
+1 Cheers Sergey On Fri 14 Aug 2020, 18:26 Chris Mattmann, wrote: > Haha I’m down and supportive! > > > > Time’s TIME FOR 2.x > > > > > > > > From: Tim Allison > Reply-To: "dev@tika.apache.org" , "Allison, Tim (US > 174B-Affiliate)" > Date: Friday, August 14, 2020 at 6:06 AM > To: "" > Subject: [EXTERNAL] Tika 2.0 modularization > > > > All, > > I _think_ I might have some time to start working on integrating Bob's > > work on the current main branch. I'll have to ignore most of the incoming > > issues for a bit...unlike the last 4 years...this time I mean it. :) > > Let me know if there are any objections to heading down this path now. > > > >Cheers, > > > > Tim > > > >
Re: [EXTERNAL] Tika 2.0 modularization
Haha I’m down and supportive! Time’s TIME FOR 2.x From: Tim Allison Reply-To: "dev@tika.apache.org" , "Allison, Tim (US 174B-Affiliate)" Date: Friday, August 14, 2020 at 6:06 AM To: "" Subject: [EXTERNAL] Tika 2.0 modularization All, I _think_ I might have some time to start working on integrating Bob's work on the current main branch. I'll have to ignore most of the incoming issues for a bit...unlike the last 4 years...this time I mean it. :) Let me know if there are any objections to heading down this path now. Cheers, Tim
Tika 2.0 modularization
All, I _think_ I might have some time to start working on integrating Bob's work on the current main branch. I'll have to ignore most of the incoming issues for a bit...unlike the last 4 years...this time I mean it. :) Let me know if there are any objections to heading down this path now. Cheers, Tim
[jira] [Commented] (TIKA-2795) Error starting Tika 2.0 server with -spawnChild on Ubuntu
[ https://issues.apache.org/jira/browse/TIKA-2795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831443#comment-16831443 ] Mario Bisonti commented on TIKA-2795: - Hallo Tim. I use : java -Dlog4j.configuration=file:/opt/tika/log4j.xml -jar /opt/tika/tika-server-1.20.jar -JDlog4j.configuration=file:/opt/tika/log4j_child.xml --host=hostname -spawnChild -taskTimeoutMillis 100 tika-server-1.20.jar is a downloaded snapshot version in december 2018 and it works fine Thanks a lot Mario > Error starting Tika 2.0 server with -spawnChild on Ubuntu > - > > Key: TIKA-2795 > URL: https://issues.apache.org/jira/browse/TIKA-2795 > Project: Tika > Issue Type: Bug > Components: server >Affects Versions: 2.0 >Reporter: Mario Bisonti >Assignee: Tim Allison >Priority: Major > Fix For: 2.0.0, 1.20 > > > Hallo. > I triend to download Tika server 2.0.0 from here: > [https://builds.apache.org/job/Tika-trunk/lastStableBuild/org.apache.tika$tika-server/] > > I tried to start on my Ubuntu server but with the -spawnChild, it doesn't > work. > > sudo java -jar /opt/tika/tika-server-2.0.0-20181203.205013-340.jar -spawnChild > Dec 05, 2018 2:22:32 PM org.apache.tika.config.InitializableProblemHandler$3 > handleInitializableProblem > WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed. > See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io > for optional dependencies. > Dec 05, 2018 2:22:32 PM org.apache.tika.config.InitializableProblemHandler$3 > handleInitializableProblem > WARNING: org.xerial's sqlite-jdbc is not loaded. > Please provide the jar on your classpath to parse sqlite files. > See tika-parsers/pom.xml for the correct version. > INFO Starting Apache Tika 2.0.0-SNAPSHOT server > INFO server watch dog is starting up > Dec 05, 2018 2:22:32 PM org.apache.tika.config.InitializableProblemHandler$3 > handleInitializableProblem > WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed. 
> See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io > for optional dependencies. > Dec 05, 2018 2:22:32 PM org.apache.tika.config.InitializableProblemHandler$3 > handleInitializableProblem > WARNING: org.xerial's sqlite-jdbc is not loaded. > Please provide the jar on your classpath to parse sqlite files. > See tika-parsers/pom.xml for the correct version. > INFO Starting Apache Tika 2.0.0-SNAPSHOT server > java.nio.file.NoSuchFileException: > /tmp/tika-server-child-process-mmap-2180120677326747096 > at > java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92) > at > java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111) > at > java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116) > at > java.base/sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:55) > at > java.base/sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:145) > at > java.base/sun.nio.fs.LinuxFileSystemProvider.readAttributes(LinuxFileSystemProvider.java:99) > at java.base/java.nio.file.Files.readAttributes(Files.java:1755) > at java.base/java.nio.file.Files.size(Files.java:2372) > at > org.apache.tika.server.TikaServerWatchDog$ChildProcess.(TikaServerWatchDog.java:234) > at > org.apache.tika.server.TikaServerWatchDog$ChildProcess.(TikaServerWatchDog.java:210) > at > org.apache.tika.server.TikaServerWatchDog.execute(TikaServerWatchDog.java:66) > at org.apache.tika.server.TikaServerCli.execute(TikaServerCli.java:146) > at org.apache.tika.server.TikaServerCli.main(TikaServerCli.java:127) > ERROR Can't start: > java.nio.file.NoSuchFileException: > /tmp/tika-server-child-process-mmap-2180120677326747096 > at > java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92) > at > java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111) > at > java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116) 
> at > java.base/sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:55) > at > java.base/sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:145) > at > java.base/sun.nio.fs.LinuxFileSystemProvider.readAttributes(LinuxFileSystemProvider.java:99) > at java.base/java.nio.file.Files.readAttributes(Files.java:1755) > at java.base/java.nio.file.Files.size(Files.java:2372) > at &g
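For readers following the stack trace: the frames show `java.nio.file.Files.size` being called on a `/tmp/tika-server-child-process-mmap-...` path that does not exist, which is what raises `NoSuchFileException`. The sketch below only reproduces that JDK behaviour in isolation. `MissingFileDemo` and `sizeOrMinusOne` are hypothetical names, and this is not the change that was made in Tika to resolve the issue.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

/** Illustrative only: shows how Files.size behaves when the file is missing. */
final class MissingFileDemo {
    private MissingFileDemo() {
    }

    /**
     * Returns the size of the file in bytes, or -1 if the file does not exist.
     * Files.size throws NoSuchFileException for a missing path, as in the trace above.
     */
    static long sizeOrMinusOne(Path path) throws IOException {
        try {
            return Files.size(path);
        } catch (NoSuchFileException e) {
            return -1L;
        }
    }
}
```

A caller that expects another process to create the file could poll with a helper like this instead of failing on the first attempt; whether that is appropriate depends on who is responsible for creating the file, which the trace alone does not show.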
[jira] [Commented] (TIKA-2795) Error starting Tika 2.0 server with -spawnChild on Ubuntu
[ https://issues.apache.org/jira/browse/TIKA-2795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16830424#comment-16830424 ]

Tim Allison commented on TIKA-2795:
-----------------------------------

[~bisontim], I wanted to follow up to make sure that you're still good to go with {{-spawnChild}}. If I need to fix anything, I'd like to do it before the upcoming release of 1.21. Again, many thanks for reporting this problem.

> Error starting Tika 2.0 server with -spawnChild on Ubuntu
> ---------------------------------------------------------
>
> Key: TIKA-2795
> URL: https://issues.apache.org/jira/browse/TIKA-2795
> Project: Tika
> Issue Type: Bug
> Components: server
> Affects Versions: 2.0
> Reporter: Mario Bisonti
> Assignee: Tim Allison
> Priority: Major
> Fix For: 2.0.0, 1.20
>
>
> Hello.
> I tried to download Tika server 2.0.0 from here:
> [https://builds.apache.org/job/Tika-trunk/lastStableBuild/org.apache.tika$tika-server/]
>
> I tried to start it on my Ubuntu server, but with -spawnChild it doesn't work.
>
> sudo java -jar /opt/tika/tika-server-2.0.0-20181203.205013-340.jar -spawnChild
> Dec 05, 2018 2:22:32 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
> WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
> See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
> for optional dependencies.
> Dec 05, 2018 2:22:32 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
> WARNING: org.xerial's sqlite-jdbc is not loaded.
> Please provide the jar on your classpath to parse sqlite files.
> See tika-parsers/pom.xml for the correct version.
> INFO Starting Apache Tika 2.0.0-SNAPSHOT server
> INFO server watch dog is starting up
> Dec 05, 2018 2:22:32 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
> WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
> See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
> for optional dependencies.
> Dec 05, 2018 2:22:32 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
> WARNING: org.xerial's sqlite-jdbc is not loaded.
> Please provide the jar on your classpath to parse sqlite files.
> See tika-parsers/pom.xml for the correct version.
> INFO Starting Apache Tika 2.0.0-SNAPSHOT server
> java.nio.file.NoSuchFileException: /tmp/tika-server-child-process-mmap-2180120677326747096
> at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
> at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
> at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116)
> at java.base/sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:55)
> at java.base/sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:145)
> at java.base/sun.nio.fs.LinuxFileSystemProvider.readAttributes(LinuxFileSystemProvider.java:99)
> at java.base/java.nio.file.Files.readAttributes(Files.java:1755)
> at java.base/java.nio.file.Files.size(Files.java:2372)
> at org.apache.tika.server.TikaServerWatchDog$ChildProcess.<init>(TikaServerWatchDog.java:234)
> at org.apache.tika.server.TikaServerWatchDog$ChildProcess.<init>(TikaServerWatchDog.java:210)
> at org.apache.tika.server.TikaServerWatchDog.execute(TikaServerWatchDog.java:66)
> at org.apache.tika.server.TikaServerCli.execute(TikaServerCli.java:146)
> at org.apache.tika.server.TikaServerCli.main(TikaServerCli.java:127)
> ERROR Can't start:
> java.nio.file.NoSuchFileException: /tmp/tika-server-child-process-mmap-2180120677326747096
> at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
> at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
> at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116)
> at java.base/sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:55)
> at java.base/sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:145)
> at java.base/sun.nio.fs.LinuxFileSystemProvider.readAttributes(LinuxFileSystemProvider.java:99)
> at java.base/java.nio.file.Files.readAttributes(Files.java:1755)
> at java.base/java.nio.file.Files.size(Files.java:2372)
> at org.apache.tika.server.TikaServerWatchDog$ChildProcess.<init>(TikaServerWatchDog.java:234)
> at org.apache.tika.server.TikaServerWatchDog$Ch
[jira] [Resolved] (TIKA-2795) Error starting Tika 2.0 server with -spawnChild on Ubuntu
[ https://issues.apache.org/jira/browse/TIKA-2795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison resolved TIKA-2795.
------------------------------
Resolution: Fixed
Assignee: Tim Allison
Fix Version/s: 1.20, 2.0.0

[~bisontim], many thanks for finding this. Please let me know what else you find!
[jira] [Commented] (TIKA-2795) Error starting Tika 2.0 server with -spawnChild on Ubuntu
[ https://issues.apache.org/jira/browse/TIKA-2795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16715770#comment-16715770 ]

Hudson commented on TIKA-2795:
------------------------------

SUCCESS: Integrated in Jenkins build Tika-trunk #1606 (See [https://builds.apache.org/job/Tika-trunk/1606/])
TIKA-2795 -- swapped memorymapped buffer for traditional open, write (tallison: [https://github.com/apache/tika/commit/e921e69c1a484de3036ed4e5a8654b046f54ceea])
* (edit) tika-server/src/main/java/org/apache/tika/server/ServerStatusWatcher.java
* (edit) tika-server/src/test/java/org/apache/tika/server/TikaServerIntegrationTest.java
* (edit) tika-server/src/main/java/org/apache/tika/server/TikaServerWatchDog.java
[jira] [Commented] (TIKA-2795) Error starting Tika 2.0 server with -spawnChild on Ubuntu
[ https://issues.apache.org/jira/browse/TIKA-2795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16715696#comment-16715696 ]

Hudson commented on TIKA-2795:
------------------------------

UNSTABLE: Integrated in Jenkins build tika-2.x-windows #360 (See [https://builds.apache.org/job/tika-2.x-windows/360/])
TIKA-2795 -- swapped memorymapped buffer for traditional open, write (tallison: rev e921e69c1a484de3036ed4e5a8654b046f54ceea)
* (edit) tika-server/src/main/java/org/apache/tika/server/TikaServerWatchDog.java
* (edit) tika-server/src/main/java/org/apache/tika/server/ServerStatusWatcher.java
* (edit) tika-server/src/test/java/org/apache/tika/server/TikaServerIntegrationTest.java
[jira] [Commented] (TIKA-2795) Error starting Tika 2.0 server with -spawnChild on Ubuntu
[ https://issues.apache.org/jira/browse/TIKA-2795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16715688#comment-16715688 ]

Hudson commented on TIKA-2795:
------------------------------

SUCCESS: Integrated in Jenkins build tika-branch-1x #139 (See [https://builds.apache.org/job/tika-branch-1x/139/])
TIKA-2795 -- swapped memorymapped buffer for traditional open, write (tallison: [https://github.com/apache/tika/commit/8475ddb2bf3aafdbb5aadef77f6888e7e4c4b810])
* (edit) tika-server/src/test/java/org/apache/tika/server/TikaServerIntegrationTest.java
* (edit) tika-server/src/main/java/org/apache/tika/server/ServerStatusWatcher.java
* (edit) tika-server/src/main/java/org/apache/tika/server/TikaServerWatchDog.java
[jira] [Comment Edited] (TIKA-2795) Error starting Tika 2.0 server with -spawnChild on Ubuntu
[ https://issues.apache.org/jira/browse/TIKA-2795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16715588#comment-16715588 ]

Tim Allison edited comment on TIKA-2795 at 12/10/18 9:38 PM:
-------------------------------------------------------------

{{DELETE_ON_CLOSE}} is a bad idea on a file used by different processes. There's no guarantee that the file won't be deleted before close. On Linux, the file has not yet been closed in the delete_on_close process, and that process can still see and write to the file, but at the same time the other process can't find the file. The behavior is different on Windows.

I tried a number of options, and I found it very hard to guarantee that the tmp file was deleted and that nothing seriously bad happened when two processes shared a memory-mapped tmp file _and_ I had to allow either or both processes to be killed. Different platforms handle the details in different ways.

For now, I've chosen the simplest option: the writer opens the file, waits for trylock, writes the status and closes the file. The reader does the same. This appears to avoid synchronization issues within a process (if more than one thread calls close()) and across processes. I'm sure that we can improve the efficiency of this at some point, but it just shouldn't matter.

was (Author: talli...@mitre.org):
{{DELETE_ON_CLOSE}} is a bad idea on a file used by different processes.

I tried a number of options, and I found it very hard to guarantee that the tmp file was deleted and that nothing seriously bad happened when two processes shared a memory-mapped tmp file. Different platforms handle the details in different ways.

For now, I've chosen the simplest option: the writer opens the file, waits for trylock, writes the status and closes the file. The reader does the same. This appears to avoid synchronization issues within a process (if more than one thread calls close()) and across processes.
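The open, wait for trylock, write, close sequence described in the comment above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual Tika code: the class name StatusFile, the single 8-byte status value, and the 10 ms polling interval are all hypothetical. The methods are synchronized because FileChannel.tryLock() throws OverlappingFileLockException if another thread in the same JVM already holds an overlapping lock.

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper for illustration only; not Tika's ServerStatusWatcher.
final class StatusFile {

    // Poll tryLock() until it succeeds; it returns null while another process holds the lock.
    private static FileLock waitForLock(FileChannel channel, boolean shared)
            throws IOException, InterruptedException {
        FileLock lock = channel.tryLock(0L, Long.MAX_VALUE, shared);
        while (lock == null) {
            Thread.sleep(10); // assumed polling interval
            lock = channel.tryLock(0L, Long.MAX_VALUE, shared);
        }
        return lock;
    }

    // Writer: open, take an exclusive lock, write the status, then close (which releases the lock).
    static synchronized void write(Path path, long status)
            throws IOException, InterruptedException {
        try (FileChannel channel = FileChannel.open(path,
                     StandardOpenOption.CREATE, StandardOpenOption.WRITE);
             FileLock lock = waitForLock(channel, false)) {
            ByteBuffer buf = ByteBuffer.allocate(Long.BYTES);
            buf.putLong(status);
            buf.flip();
            long pos = 0;
            while (buf.hasRemaining()) {
                pos += channel.write(buf, pos);
            }
            channel.force(true);
        }
    }

    // Reader: open, take a shared lock, read the status, then close.
    static synchronized long read(Path path) throws IOException, InterruptedException {
        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ);
             FileLock lock = waitForLock(channel, true)) {
            ByteBuffer buf = ByteBuffer.allocate(Long.BYTES);
            long pos = 0;
            while (buf.hasRemaining()) {
                int n = channel.read(buf, pos);
                if (n < 0) {
                    throw new EOFException("status file is shorter than expected");
                }
                pos += n;
            }
            buf.flip();
            return buf.getLong();
        }
    }

    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("status-sketch", ".bin");
        try {
            write(tmp, 42L);
            System.out.println("status read back: " + read(tmp));
        } finally {
            Files.deleteIfExists(tmp); // explicit cleanup instead of DELETE_ON_CLOSE
        }
    }
}
```

Neither side keeps the file open between operations, so no process depends on a handle that another process may have invalidated, at the cost of an open and close per status update.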
[jira] [Commented] (TIKA-2795) Error starting Tika 2.0 server with -spawnChild on Ubuntu
[ https://issues.apache.org/jira/browse/TIKA-2795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16715588#comment-16715588 ]

Tim Allison commented on TIKA-2795:
-----------------------------------

{{DELETE_ON_CLOSE}} is a bad idea on a file used by different processes.

I tried a number of options, and I found it very hard to guarantee that the tmp file was deleted and that nothing seriously bad happened when two processes shared a memory-mapped tmp file. Different platforms handle the details in different ways.

For now, I've chosen the simplest option: the writer opens the file, waits for trylock, writes the status and closes the file. The reader does the same. This appears to avoid synchronization issues within a process (if more than one thread calls close()) and across processes.
[jira] [Commented] (TIKA-2795) Error starting Tika 2.0 server with -spawnChild on Ubuntu
[ https://issues.apache.org/jira/browse/TIKA-2795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16713155#comment-16713155 ]

Hudson commented on TIKA-2795:
------------------------------

UNSTABLE: Integrated in Jenkins build tika-branch-1x #138 (See [https://builds.apache.org/job/tika-branch-1x/138/])
TIKA-2795 -- catch IOException if child deletes shared file (tallison: [https://github.com/apache/tika/commit/582a1d441ea8240e06a552c4cfa315439ea47a45])
* (edit) tika-server/src/main/java/org/apache/tika/server/TikaServerWatchDog.java
[jira] [Updated] (TIKA-2795) Error starting Tika 2.0 server with -spawnChild on Ubuntu
[ https://issues.apache.org/jira/browse/TIKA-2795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mario Bisonti updated TIKA-2795:
Description:
Hello.
I tried to download Tika server 2.0.0 from here:
[https://builds.apache.org/job/Tika-trunk/lastStableBuild/org.apache.tika$tika-server/]
I tried to start it on my Ubuntu server, but with -spawnChild it doesn't work.

sudo java -jar /opt/tika/tika-server-2.0.0-20181203.205013-340.jar -spawnChild
Dec 05, 2018 2:22:32 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io for optional dependencies.
Dec 05, 2018 2:22:32 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.
INFO Starting Apache Tika 2.0.0-SNAPSHOT server
INFO server watch dog is starting up
Dec 05, 2018 2:22:32 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io for optional dependencies.
Dec 05, 2018 2:22:32 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.
INFO Starting Apache Tika 2.0.0-SNAPSHOT server
java.nio.file.NoSuchFileException: /tmp/tika-server-child-process-mmap-2180120677326747096
    at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
    at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
    at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116)
    at java.base/sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:55)
    at java.base/sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:145)
    at java.base/sun.nio.fs.LinuxFileSystemProvider.readAttributes(LinuxFileSystemProvider.java:99)
    at java.base/java.nio.file.Files.readAttributes(Files.java:1755)
    at java.base/java.nio.file.Files.size(Files.java:2372)
    at org.apache.tika.server.TikaServerWatchDog$ChildProcess.<init>(TikaServerWatchDog.java:234)
    at org.apache.tika.server.TikaServerWatchDog$ChildProcess.<init>(TikaServerWatchDog.java:210)
    at org.apache.tika.server.TikaServerWatchDog.execute(TikaServerWatchDog.java:66)
    at org.apache.tika.server.TikaServerCli.execute(TikaServerCli.java:146)
    at org.apache.tika.server.TikaServerCli.main(TikaServerCli.java:127)
ERROR Can't start:
java.nio.file.NoSuchFileException: /tmp/tika-server-child-process-mmap-2180120677326747096
    at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
    at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
    at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116)
    at java.base/sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:55)
    at java.base/sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:145)
    at java.base/sun.nio.fs.LinuxFileSystemProvider.readAttributes(LinuxFileSystemProvider.java:99)
    at java.base/java.nio.file.Files.readAttributes(Files.java:1755)
    at java.base/java.nio.file.Files.size(Files.java:2372)
    at org.apache.tika.server.TikaServerWatchDog$ChildProcess.<init>(TikaServerWatchDog.java:234)
    at org.apache.tika.server.TikaServerWatchDog$ChildProcess.<init>(TikaServerWatchDog.java:210)
    at org.apache.tika.server.TikaServerWatchDog.execute(TikaServerWatchDog.java:66)
    at org.apache.tika.server.TikaServerCli.execute(TikaServerCli.java:146)
    at org.apache.tika.server.TikaServerCli.main(TikaServerCli.java:127)
administrator@sengvivv02:/opt/tika$

On a Windows machine, by contrast, it starts correctly.

Thanks a lot
Mario

> Error starting Tika 2.0 server with -spawnChild on Ubuntu
> ---------------------------------------------------------
>
> Key: TIKA-2795
> URL: https://issues.apache.org/jira/browse/TIKA-2795
> Project: Tika
> Issue Type: Bug
> Components: server
> Affects Versions: 2.0
> Reporter: Mario Bisonti
> Priority: Major
>
> Hello.
> I tried to download Tika server 2.0.0 from here:
> [https://builds.apache.org/job/Tika-trunk/lastStableBuild/org.apache.tika$tika-server/]
>
> I tried to start it on my Ubuntu server, but with -spawnChild it doesn't work.
>
> sudo java -jar /opt/tika/tika-server-2.0.0-2
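The trace above bottoms out in `Files.size` raising `NoSuchFileException` for a path under `/tmp`. What follows is a minimal standalone sketch of that failure mode only. It is not Tika's code, the class `MissingFileSizeDemo` and helper `sizeOrMinusOne` are hypothetical names, and it says nothing about why the file was missing on the reporter's machine.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

// Hypothetical demo class, not part of Tika.
final class MissingFileSizeDemo {

    /** Returns the size in bytes, or -1 if nothing exists at the path. */
    static long sizeOrMinusOne(Path path) {
        try {
            return Files.size(path);
        } catch (NoSuchFileException e) {
            // Same exception type as in the reported stack trace.
            return -1L;
        } catch (IOException e) {
            // Any other I/O failure is rethrown unchecked.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("mmap-demo-", ".bin");
        System.out.println(sizeOrMinusOne(tmp)); // freshly created, empty file
        Files.delete(tmp);
        System.out.println(sizeOrMinusOne(tmp)); // file is gone, helper reports -1
    }
}
```

Whether a caller should tolerate a missing file or treat it as fatal depends on who is expected to create the file; the sketch only makes the distinction visible.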
[jira] [Created] (TIKA-2795) Error starting Tika 2.0 server with -spawnChild on Ubuntu
Mario Bisonti created TIKA-2795: --- Summary: Error starting Tika 2.0 server with -spawnChild on Ubuntu Key: TIKA-2795 URL: https://issues.apache.org/jira/browse/TIKA-2795 Project: Tika Issue Type: Bug Components: server Affects Versions: 2.0 Reporter: Mario Bisonti -- This message was sent by Atlassian JIRA (v7.6.3#76005)
Re: Build with Java 10, but target 8 in Tika 2.0?
I'd also be a bit concerned with ONLY compiling with Java 10. There are some changes to how resources are accessed across module boundaries that could break some existing functionality if folks decided to RUN with > Java 9 using the module system. I covered some of these in my 2016 ApacheCon talk [1]. I've got a few of the code changes needed, based on the old 2.0 branch [2], but there may be more.

So that said, it might be good to just start with the Automatic-Module-Name entry in the manifest [3], then proceed to add the module-info.java when we've developed some tests that run Tika as a module. Thoughts on this approach?

[1] https://www.slideshare.net/rpaulin1/clipboards/tika-java-9-resource-loading
[2] https://github.com/bobpaulin/tika/tree/2.x_java9
[3] http://branchandbound.net/blog/java/2017/12/automatic-module-name/

On 6/19/2018 4:10 PM, Tim Allison wrote:
> Doh...sorry, right.
>
> Ken,
> Tongue in cheek answer: so that I become a little less stupid about modern java...see above. :)
> Real answer: _if_ we can pull it off, given that we plan to modularize our parsers anyways, it would be nice to use the language support in java >= 9 for actual modularity. I know we have to fix some split packages and possibly rename some of our packages.
> I _might_ find some time soon to focus on merging Bob’s awesome 2.0 work into master, and I thought it would be a good time to try it.
>
> Nick,
> This is good to know. Thank you!
>
> Cheers,
> Tim
>
> On Tue, Jun 19, 2018 at 4:59 PM Nick Burch wrote:
>
>> On 19/06/18 20:46, Tim Allison wrote:
>>> What would you think of requiring Java 10 to build Tika 2.0 but still setting 8 as the target? This would allow us to bake modularity in now. Given that I haven't actually tried modularizing/jigsawizing Tika yet, this could be a complete disaster, of course. :)
>> I'm not sure how well it'd work given that most of our dependencies aren't java module-ized?
>>
>> David North (from POI) has done quite a bit on java modules for existing codebases, and hit some snags, and IIRC commons have had problems too. I don't mind either way though!
>>
>> Nick
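For readers unfamiliar with the manifest entry suggested in this reply, here is a small sketch of what an `Automatic-Module-Name` attribute looks like when written and read with the JDK's `java.util.jar` classes. The class `AutomaticModuleNameDemo` and the module name `org.apache.tika.core` are assumptions for illustration; the thread does not say which names Tika would use, and a real build would normally set the entry through the build tool rather than by hand.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.jar.Attributes;
import java.util.jar.Manifest;

// Hypothetical demo class; the module name used below is an assumption.
final class AutomaticModuleNameDemo {

    static final Attributes.Name AUTOMATIC_MODULE_NAME =
            new Attributes.Name("Automatic-Module-Name");

    /** Serializes a manifest whose main section carries the given module name. */
    static byte[] manifestWithModuleName(String moduleName) {
        Manifest manifest = new Manifest();
        Attributes main = manifest.getMainAttributes();
        // Manifest-Version must be set, otherwise the main section is not written.
        main.put(Attributes.Name.MANIFEST_VERSION, "1.0");
        main.put(AUTOMATIC_MODULE_NAME, moduleName);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            manifest.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Parses manifest bytes and returns the Automatic-Module-Name, or null if absent. */
    static String readModuleName(byte[] manifestBytes) {
        try {
            Manifest manifest = new Manifest(new ByteArrayInputStream(manifestBytes));
            return manifest.getMainAttributes().getValue(AUTOMATIC_MODULE_NAME);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = manifestWithModuleName("org.apache.tika.core");
        System.out.println(readModuleName(bytes));
    }
}
```

The appeal of this first step, as the reply describes it, is that it only adds metadata to the jar and defers `module-info.java` until there are tests that run the code as a module.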
Re: Build with Java 10, but target 8 in Tika 2.0?
On 19/06/18 20:46, Tim Allison wrote: What would you think of requiring Java 10 to build Tika 2.0 but still setting 8 as the target? This would allow us to bake modularity in now. Given that I haven't actually tried modularizing/jigsawizing Tika yet, this could be a complete disaster, of course. :) I'm not sure how well it'd work given that most of our dependencies aren't java module-ized? David North (from POI) has done quite a bit on java modules for existing codebases, and hit some snags, and IIRC commons have had problems too. I don't mind either way though! Nick
Re: Build with Java 10, but target 8 in Tika 2.0?
Don't set "Target" to 8. Use the "Release" flag! This ensures that code is compiled against Java 8 method signatures. For module info, just add a separate compilation source with release 9 or 10 and jar them together.

Uwe

On June 19, 2018 7:46:45 PM UTC, Tim Allison wrote:
>All,
> What would you think of requiring Java 10 to build Tika 2.0 but still setting 8 as the target? This would allow us to bake modularity in now. Given that I haven't actually tried modularizing/jigsawizing Tika yet, this could be a complete disaster, of course. :)
>
> Cheers,
>
> Tim

--
Uwe Schindler
Achterdiek 19, 28357 Bremen
https://www.thetaphi.de
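To make the target-versus-release distinction concrete: both settings yield class files stamped with the Java 8 major version (52), but only the release setting also compiles against the Java 8 API signatures. The sketch below reads the version stamp from class-file bytes. It is an illustration, with `ClassFileVersion` as a hypothetical helper name, and it deliberately shows the limit of such a check: the stamp alone cannot reveal calls to APIs added after Java 8.

```java
import java.nio.ByteBuffer;

// Hypothetical helper; reads the header defined by the JVM class-file format.
final class ClassFileVersion {

    /** Class-file major version emitted for Java 8 (Java 9 is 53, Java 10 is 54). */
    static final int JAVA_8_MAJOR = 52;

    /** Returns the major_version field from the first eight bytes of a class file. */
    static int majorVersion(byte[] classBytes) {
        if (classBytes.length < 8) {
            throw new IllegalArgumentException("too short to be a class file");
        }
        ByteBuffer buf = ByteBuffer.wrap(classBytes); // big-endian by default
        if (buf.getInt() != 0xCAFEBABE) {
            throw new IllegalArgumentException("missing class-file magic number");
        }
        buf.getShort();                 // skip minor_version
        return buf.getShort() & 0xFFFF; // major_version, read as unsigned
    }

    public static void main(String[] args) {
        byte[] header = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0, 0, 0, 52};
        System.out.println(majorVersion(header) == JAVA_8_MAJOR);
    }
}
```

A build that wants both a Java 8 baseline and a `module-info.class` would, following the suggestion above, compile the module descriptor in a separate step at release 9 or later and package the outputs into one jar.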
Re: Build with Java 10, but target 8 in Tika 2.0?
Hi Tim, What’s the issue with needing Java 10 for the build? And yes, I think I can install it, but I’m still on 1.8 :) — Ken > On Jun 19, 2018, at 12:46 PM, Tim Allison wrote: > > All, > What would you think of requiring Java 10 to build Tika 2.0 but still > setting 8 as the target? This would allow us to bake modularity in now. > Given that I haven't actually tried modularizing/jigsawizing Tika yet, this > could be a complete disaster, of course. :) > > Cheers, > > Tim -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra
Build with Java 10, but target 8 in Tika 2.0?
All, What would you think of requiring Java 10 to build Tika 2.0 but still setting 8 as the target? This would allow us to bake modularity in now. Given that I haven't actually tried modularizing/jigsawizing Tika yet, this could be a complete disaster, of course. :) Cheers, Tim
[jira] [Closed] (TIKA-2083) Tika 2.0 - Audit master branch against 2.x branch
[ https://issues.apache.org/jira/browse/TIKA-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison closed TIKA-2083. - Resolution: Fixed Current plan is to use 2.x branch as a model, to redo [~bobpaulin]'s awesome work on {{master}}. This will be more work, but we will not risk losing anything done to master. > Tika 2.0 - Audit master branch against 2.x branch > - > > Key: TIKA-2083 > URL: https://issues.apache.org/jira/browse/TIKA-2083 > Project: Tika > Issue Type: Sub-task >Affects Versions: 2.0 >Reporter: Bob Paulin >Assignee: Bob Paulin >Priority: Blocker > Fix For: 2.0 > > > At this point Tika has been doing parallel development on master and the 2.x > for about 9 months. We should audit commit logs for that time to make a best > effort to identify any commits that may not have been applied in 2.x. This > task should be done prior to the 2.0 release -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (TIKA-1983) Tika 2.0 - remove tika-app's legacy server
[ https://issues.apache.org/jira/browse/TIKA-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1983: -- Issue Type: Sub-task (was: Task) Parent: TIKA-2085 > Tika 2.0 - remove tika-app's legacy server > --- > > Key: TIKA-1983 > URL: https://issues.apache.org/jira/browse/TIKA-1983 > Project: Tika > Issue Type: Sub-task >Reporter: Tim Allison >Assignee: Tim Allison >Priority: Trivial > Fix For: 2.0.0 > > > In the Tika 2.0 road map > [discussion|https://wiki.apache.org/tika/Tika2_0RoadMap], we decided to > remove tika-app's legacy server. Users should migrate to the tika-server > package. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-1983) Tika 2.0 - remove tika-app's legacy server
[ https://issues.apache.org/jira/browse/TIKA-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16352513#comment-16352513 ] Hudson commented on TIKA-1983: -- SUCCESS: Integrated in Jenkins build Tika-trunk #1430 (See [https://builds.apache.org/job/Tika-trunk/1430/]) TIKA-1983 -- remove deprecated server from tika-app in Tika 2.0 (tallison: [https://github.com/apache/tika/commit/5244cde44e18f2e4565c2b0e29e8083575429084]) * (edit) tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java > Tika 2.0 - remove tika-app's legacy server > --- > > Key: TIKA-1983 > URL: https://issues.apache.org/jira/browse/TIKA-1983 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Assignee: Tim Allison >Priority: Trivial > Fix For: 2.0.0 > > > In the Tika 2.0 road map > [discussion|https://wiki.apache.org/tika/Tika2_0RoadMap], we decided to > remove tika-app's legacy server. Users should migrate to the tika-server > package. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (TIKA-1983) Tika 2.0 - remove tika-app's legacy server
[ https://issues.apache.org/jira/browse/TIKA-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1983. --- Resolution: Fixed Assignee: Tim Allison Fix Version/s: 2.0.0 Fixed on {{master}}. > Tika 2.0 - remove tika-app's legacy server > --- > > Key: TIKA-1983 > URL: https://issues.apache.org/jira/browse/TIKA-1983 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Assignee: Tim Allison >Priority: Trivial > Fix For: 2.0.0 > > > In the Tika 2.0 road map > [discussion|https://wiki.apache.org/tika/Tika2_0RoadMap], we decided to > remove tika-app's legacy server. Users should migrate to the tika-server > package. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Reopened] (TIKA-1983) Tika 2.0 - remove tika-app's legacy server
[ https://issues.apache.org/jira/browse/TIKA-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison reopened TIKA-1983: --- This was done on the initial 2.x branch. We need to redo it on master. > Tika 2.0 - remove tika-app's legacy server > --- > > Key: TIKA-1983 > URL: https://issues.apache.org/jira/browse/TIKA-1983 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Trivial > > In the Tika 2.0 road map > [discussion|https://wiki.apache.org/tika/Tika2_0RoadMap], we decided to > remove tika-app's legacy server. Users should migrate to the tika-server > package. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (TIKA-1974) Tika 2.0 - remove deprecated metadata properties
[ https://issues.apache.org/jira/browse/TIKA-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1974. --- Resolution: Fixed Fix Version/s: 2.0 If anyone has feedback on this, we can reopen this...or open a separate ticket. > Tika 2.0 - remove deprecated metadata properties > > > Key: TIKA-1974 > URL: https://issues.apache.org/jira/browse/TIKA-1974 > Project: Tika > Issue Type: Sub-task >Reporter: Tim Allison >Priority: Blocker > Fix For: 2.0 > > > We have quite a few metadata properties that are deprecated. We should > remove them for 2.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-1974) Tika 2.0 - remove deprecated metadata properties
[ https://issues.apache.org/jira/browse/TIKA-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16341673#comment-16341673 ]

Hudson commented on TIKA-1974:
--
SUCCESS: Integrated in Jenkins build Tika-trunk #1426 (See [https://builds.apache.org/job/Tika-trunk/1426/])
TIKA-1974 -- remove deprecated metadata properties/keys for Tika 2.0 (tallison: [https://github.com/apache/tika/commit/10a8eec119c7a77be76000b30aaffb96a552cc44])
* (edit) tika-core/src/main/java/org/apache/tika/detect/NameDetector.java
* (edit) tika-xmp/src/test/java/org/apache/tika/xmp/XMPMetadataTest.java
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/pkg/ZipParserTest.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/rtf/TextExtractor.java
* (edit) tika-app/src/test/java/org/apache/tika/cli/TikaCLITest.java
* (edit) tika-core/src/main/java/org/apache/tika/metadata/TikaCoreProperties.java
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/image/xmp/JempboxExtractorTest.java
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/OutlookParserTest.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/image/ImageMetadataExtractor.java
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/jpeg/JpegParserTest.java
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
* (edit) tika-core/src/test/java/org/apache/tika/mime/ProbabilisticMimeDetectionTestWithTika.java
* (edit) tika-core/src/main/java/org/apache/tika/Tika.java
* (edit) tika-core/src/main/java/org/apache/tika/io/FilenameUtils.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/MetadataExtractor.java
* (edit) tika-core/src/test/java/org/apache/tika/mime/MimeTypesReaderTest.java
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/odf/ODFParserTest.java
* (edit) tika-core/src/main/java/org/apache/tika/metadata/DublinCore.java
* (edit) tika-core/src/test/java/org/apache/tika/detect/NameDetectorTest.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/rtf/RTFEmbObjHandler.java
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/SXWPFExtractorTest.java
* (edit) tika-batch/src/main/java/org/apache/tika/batch/fs/FSDocumentSelector.java
* (edit) tika-core/src/main/java/org/apache/tika/metadata/RTFMetadata.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlHandler.java
* (edit) tika-parsers/src/test/java/org/apache/tika/detect/TestContainerAwareDetector.java
* (edit) tika-example/src/main/java/org/apache/tika/example/ParsingExample.java
* (edit) tika-core/src/main/java/org/apache/tika/mime/MimeTypes.java
* (edit) tika-core/src/test/java/org/apache/tika/mime/ProbabilisticMimeDetectionTest.java
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/BouncyCastleDigestingParserTest.java
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ExcelParserTest.java
* (edit) tika-server/src/main/java/org/apache/tika/server/resource/DetectorResource.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/odf/OpenDocumentParser.java
* (edit) tika-core/src/test/java/org/apache/tika/mime/MimeDetectionTest.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/xml/DcXMLParser.java
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ProjectParserTest.java
* (edit) tika-core/src/main/java/org/apache/tika/metadata/Message.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/pkg/CompressorParser.java
* (edit) tika-batch/src/main/java/org/apache/tika/batch/fs/FSFileResource.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/iptc/IptcAnpaParser.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/mbox/MboxParser.java
* (edit) tika-eval/src/main/java/org/apache/tika/eval/io/ExtractReader.java
* (edit) tika-bundle/src/test/java/org/apache/tika/bundle/BundleIT.java
* (delete) tika-core/src/main/java/org/apache/tika/metadata/MSOffice.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/odf/OpenDocumentMetaParser.java
* (edit) tika-core/src/main/java/org/apache/tika/extractor/ParsingEmbeddedDocumentExtractor.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/image/xmp/JempboxExtractor.java
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/mp3/Mp3ParserTest.java
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/AutoDetectParserTest.java
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/executable/ExecutableParserTest.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/pkg/PackageParser.java
* (edit) tika-server/src/main/java/org/apache/tika/server/resource/TikaResource.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/hdf/HDFParser.java
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/rtf
[jira] [Comment Edited] (TIKA-1974) Tika 2.0 - remove deprecated metadata properties
[ https://issues.apache.org/jira/browse/TIKA-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16341416#comment-16341416 ] Tim Allison edited comment on TIKA-1974 at 1/26/18 6:53 PM: I just pushed the [first draft of this|https://github.com/apache/tika/commit/e6e3b8817053e981f3843f1d3b7055b4ae30ed73]. Please take a look and let me know if I botched anything. [~rgauss]...if you have any time, I'd very much appreciate your feedback! was (Author: talli...@mitre.org): I just pushed the first draft of this. Please take a look and let me know if I botched anything. [~rgauss]...if you have any time, I'd very much appreciate your feedback! > Tika 2.0 - remove deprecated metadata properties > > > Key: TIKA-1974 > URL: https://issues.apache.org/jira/browse/TIKA-1974 > Project: Tika > Issue Type: Sub-task >Reporter: Tim Allison >Priority: Blocker > > We have quite a few metadata properties that are deprecated. We should > remove them for 2.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (TIKA-1974) Tika 2.0 - remove deprecated metadata properties
[ https://issues.apache.org/jira/browse/TIKA-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1974: -- Priority: Blocker (was: Major) > Tika 2.0 - remove deprecated metadata properties > > > Key: TIKA-1974 > URL: https://issues.apache.org/jira/browse/TIKA-1974 > Project: Tika > Issue Type: Sub-task >Reporter: Tim Allison >Priority: Blocker > > We have quite a few metadata properties that are deprecated. We should > remove them for 2.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-1974) Tika 2.0 - remove deprecated metadata properties
[ https://issues.apache.org/jira/browse/TIKA-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16341416#comment-16341416 ] Tim Allison commented on TIKA-1974: --- I just pushed the first draft of this. Please take a look and let me know if I botched anything. [~rgauss]...if you have any time, I'd very much appreciate your feedback! > Tika 2.0 - remove deprecated metadata properties > > > Key: TIKA-1974 > URL: https://issues.apache.org/jira/browse/TIKA-1974 > Project: Tika > Issue Type: Sub-task >Reporter: Tim Allison >Priority: Major > > We have quite a few metadata properties that are deprecated. We should > remove them for 2.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-1974) Tika 2.0 - remove deprecated metadata properties
[ https://issues-test.apache.org/jira/browse/TIKA-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16265599#comment-16265599 ] Tim Allison commented on TIKA-1974: --- Another question: we're currently making quite a few metadata properties available in {{Metadata}} via {{implements}}: {noformat} public class Metadata implements CreativeCommons, Geographic, HttpHeaders, Message, ClimateForcast, TIFF, TikaMimeKeys,... {noformat} Do we still want to do this? If so, do we want to add TikaCoreProperties? > Tika 2.0 - remove deprecated metadata properties > > > Key: TIKA-1974 > URL: https://issues-test.apache.org/jira/browse/TIKA-1974 > Project: Tika > Issue Type: Sub-task >Reporter: Tim Allison >Priority: Major > > We have quite a few metadata properties that are deprecated. We should > remove them for 2.0. -- This message was sent by Atlassian JIRA (v7.6.0#76001)
[jira] [Comment Edited] (TIKA-1974) Tika 2.0 - remove deprecated metadata properties
[ https://issues-test.apache.org/jira/browse/TIKA-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16265598#comment-16265598 ] Tim Allison edited comment on TIKA-1974 at 1/24/18 2:46 PM: All, I'm picking up work on this again, any recs for the question above? I'm guessing based on this: {noformat} Property ALTITUDE = Geographic.ALTITUDE;{noformat} That we should go with, e.g.: {noformat} Property DESCRIPTION = DublinCore.DESCRIPTION;{noformat} was (Author: talli...@mitre.org): All, I'm picking up work on this again, any recs for the question above? > Tika 2.0 - remove deprecated metadata properties > > > Key: TIKA-1974 > URL: https://issues-test.apache.org/jira/browse/TIKA-1974 > Project: Tika > Issue Type: Sub-task >Reporter: Tim Allison >Priority: Major > > We have quite a few metadata properties that are deprecated. We should > remove them for 2.0. -- This message was sent by Atlassian JIRA (v7.6.0#76001)
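The aliasing pattern proposed in this comment (`Property DESCRIPTION = DublinCore.DESCRIPTION;`) can be sketched with stand-in types. Everything below is hypothetical: `PropertySketch`, `DublinCoreKeys`, `CoreKeys`, and `MetadataSketch` are illustrative names rather than Tika's classes, and the key string is made up. The sketch only shows that an alias declared in one interface refers to the very same constant object as the original.

```java
// Hypothetical stand-in for a metadata property type.
final class PropertySketch {
    private final String name;

    PropertySketch(String name) {
        this.name = name;
    }

    String getName() {
        return name;
    }
}

// Stand-in for a schema-specific interface that owns the constant.
interface DublinCoreKeys {
    PropertySketch DESCRIPTION = new PropertySketch("dc:description");
}

// Stand-in for a "core" interface that re-exports the constant as an alias.
interface CoreKeys {
    PropertySketch DESCRIPTION = DublinCoreKeys.DESCRIPTION;
}

// Implementing the interface makes the constant reachable by its simple name.
final class MetadataSketch implements CoreKeys {
    static String descriptionKey() {
        return DESCRIPTION.getName();
    }
}
```

One design consequence worth weighing: if a class implemented both `DublinCoreKeys` and `CoreKeys`, an unqualified reference to `DESCRIPTION` inside it would be rejected by the compiler as ambiguous, even though both fields hold the same object, so callers would have to qualify the name.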
[jira] [Commented] (TIKA-1974) Tika 2.0 - remove deprecated metadata properties
[ https://issues-test.apache.org/jira/browse/TIKA-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16265598#comment-16265598 ] Tim Allison commented on TIKA-1974: --- All, I'm picking up work on this again, any recs for the question above? > Tika 2.0 - remove deprecated metadata properties > > > Key: TIKA-1974 > URL: https://issues-test.apache.org/jira/browse/TIKA-1974 > Project: Tika > Issue Type: Sub-task >Reporter: Tim Allison >Priority: Major > > We have quite a few metadata properties that are deprecated. We should > remove them for 2.0. -- This message was sent by Atlassian JIRA (v7.6.0#76001)
RE: steps for Tika 2.0
> I have a 4 week old branch that I've started applying changes to that I could > push up called tika-2.0-demo-update that might provide a head start for you. +1 Also, if you have the stomach and time to redo your work, please take the lead. tf is such a massive dependency that we should break it into its own module, IMHO. I just updated the version in master to 2.0.0-SNAPSHOT. Onward! -Original Message- From: Bob Paulin [mailto:b...@bobpaulin.com] Sent: Wednesday, December 13, 2017 9:52 AM To: dev@tika.apache.org Subject: Re: steps for Tika 2.0 Hey Tim, Happy to help with this effort. I have a 4 week old branch that I've started applying changes to that I could push up called tika-2.0-demo-update that might provide a head start for you. I think we do have to make some decisions on where the captioning, recognition, and sentiment packages go. There was quite a bit of work done integrating all the cool new tensorflow stuff. My initial thought was tika-parser-advanced-module but we could even consider breaking the tensorflow work into it's own. Excited to see this work start in master! - Bob On 12/13/2017 7:51 AM, Allison, Timothy B. wrote: > All, > > I just created branch_1x, where we can put bug fixes and anything else we > want to go into 1.17.1 or 1.18. Unless there are objections, I’m going to > start making some radical changes to master to prep for 2.0.0-BETA over the > next few weeks/months. These changes are all based on Bob Paulin’s amazing > 2.x branch work. > > So, rather than having to remember to make updates to 2.x, we’ll now > have to remember to make updates to branch_1x. Hopefully, this will > get us to 2.0.0 sooner. > > Just keep working on master and treating it as master. > > Let me know if I mis-remembered our earlier conversations about steps to take > for 2.0.0 and/or if you have any other recommendations. Onward! > > Thank you! > > Cheers, > > Tim > >
Re: steps for Tika 2.0
Hey Tim,

Happy to help with this effort. I have a 4 week old branch that I've started applying changes to that I could push up called tika-2.0-demo-update that might provide a head start for you. I think we do have to make some decisions on where the captioning, recognition, and sentiment packages go. There was quite a bit of work done integrating all the cool new tensorflow stuff. My initial thought was tika-parser-advanced-module but we could even consider breaking the tensorflow work into its own.

Excited to see this work start in master!

- Bob

On 12/13/2017 7:51 AM, Allison, Timothy B. wrote:
> All,
>
> I just created branch_1x, where we can put bug fixes and anything else we want to go into 1.17.1 or 1.18. Unless there are objections, I’m going to start making some radical changes to master to prep for 2.0.0-BETA over the next few weeks/months. These changes are all based on Bob Paulin’s amazing 2.x branch work.
>
> So, rather than having to remember to make updates to 2.x, we’ll now have to remember to make updates to branch_1x. Hopefully, this will get us to 2.0.0 sooner.
>
> Just keep working on master and treating it as master.
>
> Let me know if I mis-remembered our earlier conversations about steps to take for 2.0.0 and/or if you have any other recommendations. Onward!
>
> Thank you!
>
> Cheers,
>
> Tim
steps for Tika 2.0
All, I just created branch_1x, where we can put bug fixes and anything else we want to go into 1.17.1 or 1.18. Unless there are objections, I’m going to start making some radical changes to master to prep for 2.0.0-BETA over the next few weeks/months. These changes are all based on Bob Paulin’s amazing 2.x branch work. So, rather than having to remember to make updates to 2.x, we’ll now have to remember to make updates to branch_1x. Hopefully, this will get us to 2.0.0 sooner. Just keep working on master and treating it as master. Let me know if I mis-remembered our earlier conversations about steps to take for 2.0.0 and/or if you have any other recommendations. Onward! Thank you! Cheers, Tim
Re: Tika 2.0?
B it is, proceed.

On 9/12/17, 5:10 AM, "Allison, Timothy B." <talli...@mitre.org> wrote:

    I'd strongly advocate for 2. I _think_ the hard work was laying out the general structure and adding the ProxyParser workaround. Copying and pasting/reworking into that structure will be:

    A) far less dangerous than 1
    And
    B) we'll have a cleaner history.

    On A), I know that we didn't add some major components including: configurability of parsers, completely cleaned up logging, numerous bug fixes and even entire modules (tika-dl).

    On B), there were a few times where I "caught a parser up" in 2.0 not by individual commits based on the original history but based on a copy/paste from the contemporaneous master. This obliterated the history of some commits on the 2.0 branch and would force us to look back at master.

    -----Original Message-----
    From: Bob Paulin [mailto:b...@bobpaulin.com]
    Sent: Monday, September 11, 2017 9:48 PM
    To: dev@tika.apache.org
    Subject: Re: Tika 2.0?

    Just so it's clear are we going to:

    1) Rename the 2.0 branch over to master
    or
    2) Re-apply the changes on master.

    I recall Chris' preference was 1 which would be quicker. However there are very likely missed patches. 2 will be more time consuming but it would be more likely to include all the most recent code. I'm open to either. Not sure how far out of date 2.0 branch is so I defer to Tim on the risk of going with #1.

    - Bob

    On 9/11/2017 5:15 PM, Chris Mattmann wrote:
    > +1000
    >
    > On 9/11/17, 12:03 PM, "Allison, Timothy B." <talli...@mitre.org> wrote:
    >
    >     Y, well, I didn't say _which_ September...
    >
    >     Given my limited availability to work on this in Sept and POI's decision to move to Java 1.8, I propose releasing Tika 1.17 after the release of POI 3.17 and PDFBox 2.0.8. This would be the last version of Tika at the Java 1.7 level, and then we bump the Java requirement to 1.8, switch master to the 2.0 layout and create a 1.x maintenance branch (with Java 1.8) for quick critical bug fixes/security vulnerabilities until we release 2.0.
    >
    >     What do you all think?
    >
    >     -----Original Message-----
    >     From: Allison, Timothy B. [mailto:talli...@mitre.org]
    >     Sent: Monday, August 28, 2017 9:33 AM
    >     To: dev@tika.apache.org
    >     Subject: Tika 2.0?
    >
    >     All,
    >
    >     We're getting some increasing deltas btwn the 2.0 and trunk branches. Many of these are my fault; I gave up making updates to 2.0 around April/May, I think.
    >
    >     What would people think of punting on some of the desired goals of 2.0 (e.g. chaining parsers, more structured but still simple metadata) and releasing 2.0 soonish...say 2.0-BETA end of September?
    >
    >     We've been able to make some major improvements to Tika without breaking backwards compatibility. We _might_ be able to do that with the outstanding issues for 2.0 when someone has time.
    >
    >     We could also do the upgrade to jdk 8 with Tika 2.0.
    >
    >     If this sounds reasonable, I propose creating a 1.x branch from trunk for 1.x maintenance and then reworking trunk to the 2.x structure that Bob Paulin so elegantly worked out. I figure we can either copy/paste from trunk to the current 2.x (and _hope_ we get all the updates) or use Bob's 2.0 as a model for restructuring trunk. At this point, I'd prefer the second option. The key here is to switch "trunk" to 2.0 so that we all have the mindset that 2.0 is what we're focused on.
    >
    >     The main benefit of this proposal is that we'd have a more modular Tika soon.
    >
    >     What do you think?
    >
    >     Best,
    >
    >     Tim
RE: Tika 2.0?
I'd strongly advocate for 2. I _think_ the hard work was laying out the general structure and adding the ProxyParser workaround. Copying and pasting/reworking into that structure will be: A) far less dangerous than 1 And B) we'll have a cleaner history. On A), I know that we didn't add some major components including: configurability of parsers, completely cleaned up logging, numerous bug fixes and even entire modules (tika-dl). On B), there were a few times where I "caught a parser up" in 2.0 not by individual commits based on the original history but based on a copy/paste from the contemporaneous master. This obliterated the history of some commits on the 2.0 branch and would force us to look back at master. -Original Message- From: Bob Paulin [mailto:b...@bobpaulin.com] Sent: Monday, September 11, 2017 9:48 PM To: dev@tika.apache.org Subject: Re: Tika 2.0? Just so it's clear are we going to: 1) Rename the 2.0 branch over to master or 2) Re-apply the changes on master. I recall Chris' preference was 1 which would be quicker. However there is very likely missed patches. 2 will be more time consuming but it would be more likely to include all the most recent code. I'm open to either. Not sure how far out of date 2.0 branch is so I defer to Tim on the risk of going with #1. - Bob On 9/11/2017 5:15 PM, Chris Mattmann wrote: > +1000 > > > > On 9/11/17, 12:03 PM, "Allison, Timothy B." <talli...@mitre.org> wrote: > > Y, well, I didn't say _which_ September... > > Given my limited availability to work on this in Sept and POI's decision > to move to Java 1.8, I propose releasing Tika 1.17 after the release of POI > 3.17 and PDFBox 2.0.8. This would be the last version of Tika at the Java > 1.7 level, and then we bump the Java requirement to 1.8, switch master to the > 2.0 layout and create a 1.x maintenance branch (with Java 1.8) for quick > critical bug fixes/security vulnerabilities until we release 2.0. > > What do you all think? 
> > > -Original Message- > From: Allison, Timothy B. [mailto:talli...@mitre.org] > Sent: Monday, August 28, 2017 9:33 AM > To: dev@tika.apache.org > Subject: Tika 2.0? > > All, > > We're getting some increasing deltas btwn the 2.0 and trunk branches. > Many of these are my fault; I gave up making updates to 2.0 around April/May, > I think. > > What would people think of punting on some of the desired goals of 2.0 > (e.g. chaining parsers, more structured but still simple metadata) and > releasing 2.0 soonish...say 2.0-BETA end of September? > > We've been able to make some major improvements to Tika without > breaking backwards compatibility. We _might_ be able to do that with the > outstanding issues for 2.0 when someone has time. > > We could also do the upgrade to jdk 8 with Tika 2.0. > > If this sounds reasonable, I propose creating a 1.x branch from trunk > for 1.x maintenance and then reworking trunk to the 2.x structure that Bob > Paulin so elegantly worked out. I figure we can either copy/paste from trunk > to the current 2.x (and _hope_ we get all the updates) or use Bob's 2.0 as a > model for restructuring trunk. At this point, I'd prefer the second option. > The key here is to switch "trunk" to 2.0 so that we all have the mindset that > 2.0 is what we're focused on. > >The main benefit of this proposal is that we'd have a more modular > Tika soon. > >What do you think? > > Best, > >Tim > > > >
Re: Tika 2.0?
Just so it's clear are we going to: 1) Rename the 2.0 branch over to master or 2) Re-apply the changes on master. I recall Chris' preference was 1 which would be quicker. However there is very likely missed patches. 2 will be more time consuming but it would be more likely to include all the most recent code. I'm open to either. Not sure how far out of date 2.0 branch is so I defer to Tim on the risk of going with #1. - Bob On 9/11/2017 5:15 PM, Chris Mattmann wrote: > +1000 > > > > On 9/11/17, 12:03 PM, "Allison, Timothy B." <talli...@mitre.org> wrote: > > Y, well, I didn't say _which_ September... > > Given my limited availability to work on this in Sept and POI's decision > to move to Java 1.8, I propose releasing Tika 1.17 after the release of POI > 3.17 and PDFBox 2.0.8. This would be the last version of Tika at the Java > 1.7 level, and then we bump the Java requirement to 1.8, switch master to the > 2.0 layout and create a 1.x maintenance branch (with Java 1.8) for quick > critical bug fixes/security vulnerabilities until we release 2.0. > > What do you all think? > > > -Original Message- > From: Allison, Timothy B. [mailto:talli...@mitre.org] > Sent: Monday, August 28, 2017 9:33 AM > To: dev@tika.apache.org > Subject: Tika 2.0? > > All, > > We're getting some increasing deltas btwn the 2.0 and trunk branches. > Many of these are my fault; I gave up making updates to 2.0 around April/May, > I think. > > What would people think of punting on some of the desired goals of 2.0 > (e.g. chaining parsers, more structured but still simple metadata) and > releasing 2.0 soonish...say 2.0-BETA end of September? > > We've been able to make some major improvements to Tika without > breaking backwards compatibility. We _might_ be able to do that with the > outstanding issues for 2.0 when someone has time. > > We could also do the upgrade to jdk 8 with Tika 2.0. 
> > If this sounds reasonable, I propose creating a 1.x branch from trunk > for 1.x maintenance and then reworking trunk to the 2.x structure that Bob > Paulin so elegantly worked out. I figure we can either copy/paste from trunk > to the current 2.x (and _hope_ we get all the updates) or use Bob's 2.0 as a > model for restructuring trunk. At this point, I'd prefer the second option. > The key here is to switch "trunk" to 2.0 so that we all have the mindset that > 2.0 is what we're focused on. > >The main benefit of this proposal is that we'd have a more modular > Tika soon. > >What do you think? > > Best, > >Tim > > > >
Re: Tika 2.0?
+1000 On 9/11/17, 12:03 PM, "Allison, Timothy B." <talli...@mitre.org> wrote: Y, well, I didn't say _which_ September... Given my limited availability to work on this in Sept and POI's decision to move to Java 1.8, I propose releasing Tika 1.17 after the release of POI 3.17 and PDFBox 2.0.8. This would be the last version of Tika at the Java 1.7 level, and then we bump the Java requirement to 1.8, switch master to the 2.0 layout and create a 1.x maintenance branch (with Java 1.8) for quick critical bug fixes/security vulnerabilities until we release 2.0. What do you all think? -Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Monday, August 28, 2017 9:33 AM To: dev@tika.apache.org Subject: Tika 2.0? All, We're getting some increasing deltas btwn the 2.0 and trunk branches. Many of these are my fault; I gave up making updates to 2.0 around April/May, I think. What would people think of punting on some of the desired goals of 2.0 (e.g. chaining parsers, more structured but still simple metadata) and releasing 2.0 soonish...say 2.0-BETA end of September? We've been able to make some major improvements to Tika without breaking backwards compatibility. We _might_ be able to do that with the outstanding issues for 2.0 when someone has time. We could also do the upgrade to jdk 8 with Tika 2.0. If this sounds reasonable, I propose creating a 1.x branch from trunk for 1.x maintenance and then reworking trunk to the 2.x structure that Bob Paulin so elegantly worked out. I figure we can either copy/paste from trunk to the current 2.x (and _hope_ we get all the updates) or use Bob's 2.0 as a model for restructuring trunk. At this point, I'd prefer the second option. The key here is to switch "trunk" to 2.0 so that we all have the mindset that 2.0 is what we're focused on. The main benefit of this proposal is that we'd have a more modular Tika soon. What do you think? Best, Tim
RE: Tika 2.0?
Y, well, I didn't say _which_ September... Given my limited availability to work on this in Sept and POI's decision to move to Java 1.8, I propose releasing Tika 1.17 after the release of POI 3.17 and PDFBox 2.0.8. This would be the last version of Tika at the Java 1.7 level, and then we bump the Java requirement to 1.8, switch master to the 2.0 layout and create a 1.x maintenance branch (with Java 1.8) for quick critical bug fixes/security vulnerabilities until we release 2.0. What do you all think? -Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Monday, August 28, 2017 9:33 AM To: dev@tika.apache.org Subject: Tika 2.0? All, We're getting some increasing deltas btwn the 2.0 and trunk branches. Many of these are my fault; I gave up making updates to 2.0 around April/May, I think. What would people think of punting on some of the desired goals of 2.0 (e.g. chaining parsers, more structured but still simple metadata) and releasing 2.0 soonish...say 2.0-BETA end of September? We've been able to make some major improvements to Tika without breaking backwards compatibility. We _might_ be able to do that with the outstanding issues for 2.0 when someone has time. We could also do the upgrade to jdk 8 with Tika 2.0. If this sounds reasonable, I propose creating a 1.x branch from trunk for 1.x maintenance and then reworking trunk to the 2.x structure that Bob Paulin so elegantly worked out. I figure we can either copy/paste from trunk to the current 2.x (and _hope_ we get all the updates) or use Bob's 2.0 as a model for restructuring trunk. At this point, I'd prefer the second option. The key here is to switch "trunk" to 2.0 so that we all have the mindset that 2.0 is what we're focused on. The main benefit of this proposal is that we'd have a more modular Tika soon. What do you think? Best, Tim
Re: Tika 2.0?
I am cool to finally get on the 2.0 kool aid and execute the plan as described by Tim below for our next release. +1. Cheers, Chris ++ Chris Mattmann, Ph.D. Principal Data Scientist, Engineering Administrative Office (3010) Manager, NSF & Open Source Projects Formulation and Development Offices (8212) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 180-503E, Mailstop: 180-503 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Director, Information Retrieval and Data Science Group (IRDS) Adjunct Associate Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA WWW: http://irds.usc.edu/ ++ On 8/28/17, 8:12 AM, "Konstantin Gribov" <gros...@gmail.com> wrote: Tim, +1 to making restructuring master to 2.x shape. If we can at least migrate modularization patches, dependency changes and move to java 8 it certainly will be a good step forward and big reduction of technical debt. On Mon, 28 Aug 2017, 16:52 Bob Paulin <b...@bobpaulin.com> wrote: > Tim, > > +1 You've done an admirable job of dual maintenance but it sounds like > it became a heavy tax on development. Releasing would allow us to get > back to "trunk" based development again. Then we could focus on porting > any missed patches and start looking for any regressions. I also like > the idea of picking up Java 8 as many other projects are starting to do > this. > > - Bob > > > > On 8/28/2017 8:32 AM, Allison, Timothy B. wrote: > > All, > > > > We're getting some increasing deltas btwn the 2.0 and trunk branches. > Many of these are my fault; I gave up making updates to 2.0 around > April/May, I think. > > > > What would people think of punting on some of the desired goals of 2.0 > (e.g. chaining parsers, more structured but still simple metadata) and > releasing 2.0 soonish...say 2.0-BETA end of September? > > > > We've been able to make some major improvements to Tika without > breaking backwards compatibility. 
We _might_ be able to do that with the > outstanding issues for 2.0 when someone has time. > > > > We could also do the upgrade to jdk 8 with Tika 2.0. > > > > If this sounds reasonable, I propose creating a 1.x branch from trunk > for 1.x maintenance and then reworking trunk to the 2.x structure that Bob > Paulin so elegantly worked out. I figure we can either copy/paste from > trunk to the current 2.x (and _hope_ we get all the updates) or use Bob's > 2.0 as a model for restructuring trunk. At this point, I'd prefer the > second option. The key here is to switch "trunk" to 2.0 so that we all > have the mindset that 2.0 is what we're focused on. > > > >The main benefit of this proposal is that we'd have a more modular > Tika soon. > > > >What do you think? > > > > Best, > > > >Tim > > > > > -- Best regards, Konstantin Gribov
Re: Tika 2.0?
Tim, +1 to making restructuring master to 2.x shape. If we can at least migrate modularization patches, dependency changes and move to java 8 it certainly will be a good step forward and big reduction of technical debt. On Mon, 28 Aug 2017, 16:52 Bob Paulin <b...@bobpaulin.com> wrote: > Tim, > > +1 You've done an admirable job of dual maintenance but it sounds like > it became a heavy tax on development. Releasing would allow us to get > back to "trunk" based development again. Then we could focus on porting > any missed patches and start looking for any regressions. I also like > the idea of picking up Java 8 as many other projects are starting to do > this. > > - Bob > > > > On 8/28/2017 8:32 AM, Allison, Timothy B. wrote: > > All, > > > > We're getting some increasing deltas btwn the 2.0 and trunk branches. > Many of these are my fault; I gave up making updates to 2.0 around > April/May, I think. > > > > What would people think of punting on some of the desired goals of 2.0 > (e.g. chaining parsers, more structured but still simple metadata) and > releasing 2.0 soonish...say 2.0-BETA end of September? > > > > We've been able to make some major improvements to Tika without > breaking backwards compatibility. We _might_ be able to do that with the > outstanding issues for 2.0 when someone has time. > > > > We could also do the upgrade to jdk 8 with Tika 2.0. > > > > If this sounds reasonable, I propose creating a 1.x branch from trunk > for 1.x maintenance and then reworking trunk to the 2.x structure that Bob > Paulin so elegantly worked out. I figure we can either copy/paste from > trunk to the current 2.x (and _hope_ we get all the updates) or use Bob's > 2.0 as a model for restructuring trunk. At this point, I'd prefer the > second option. The key here is to switch "trunk" to 2.0 so that we all > have the mindset that 2.0 is what we're focused on. > > > >The main benefit of this proposal is that we'd have a more modular > Tika soon. > > > >What do you think? 
> > > > Best, > > > >Tim > > > > > -- Best regards, Konstantin Gribov
Re: Tika 2.0?
Tim, +1 You've done an admirable job of dual maintenance but it sounds like it became a heavy tax on development. Releasing would allow us to get back to "trunk" based development again. Then we could focus on porting any missed patches and start looking for any regressions. I also like the idea of picking up Java 8 as many other projects are starting to do this. - Bob On 8/28/2017 8:32 AM, Allison, Timothy B. wrote: > All, > > We're getting some increasing deltas btwn the 2.0 and trunk branches. Many > of these are my fault; I gave up making updates to 2.0 around April/May, I > think. > > What would people think of punting on some of the desired goals of 2.0 > (e.g. chaining parsers, more structured but still simple metadata) and > releasing 2.0 soonish...say 2.0-BETA end of September? > > We've been able to make some major improvements to Tika without breaking > backwards compatibility. We _might_ be able to do that with the outstanding > issues for 2.0 when someone has time. > > We could also do the upgrade to jdk 8 with Tika 2.0. > > If this sounds reasonable, I propose creating a 1.x branch from trunk for > 1.x maintenance and then reworking trunk to the 2.x structure that Bob Paulin > so elegantly worked out. I figure we can either copy/paste from trunk to the > current 2.x (and _hope_ we get all the updates) or use Bob's 2.0 as a model > for restructuring trunk. At this point, I'd prefer the second option. The > key here is to switch "trunk" to 2.0 so that we all have the mindset that 2.0 > is what we're focused on. > >The main benefit of this proposal is that we'd have a more modular Tika > soon. > >What do you think? > > Best, > >Tim >
Re: Tika 2.0?
Hi Tim Having a new major 2.0 master is a good idea IMHO. It will take time to make it final but it's better to finally make it 'mainstream' and start having new ideas realized or finalized... Sergey On 28/08/17 14:32, Allison, Timothy B. wrote: All, We're getting some increasing deltas btwn the 2.0 and trunk branches. Many of these are my fault; I gave up making updates to 2.0 around April/May, I think. What would people think of punting on some of the desired goals of 2.0 (e.g. chaining parsers, more structured but still simple metadata) and releasing 2.0 soonish...say 2.0-BETA end of September? We've been able to make some major improvements to Tika without breaking backwards compatibility. We _might_ be able to do that with the outstanding issues for 2.0 when someone has time. We could also do the upgrade to jdk 8 with Tika 2.0. If this sounds reasonable, I propose creating a 1.x branch from trunk for 1.x maintenance and then reworking trunk to the 2.x structure that Bob Paulin so elegantly worked out. I figure we can either copy/paste from trunk to the current 2.x (and _hope_ we get all the updates) or use Bob's 2.0 as a model for restructuring trunk. At this point, I'd prefer the second option. The key here is to switch "trunk" to 2.0 so that we all have the mindset that 2.0 is what we're focused on. The main benefit of this proposal is that we'd have a more modular Tika soon. What do you think? Best, Tim
Tika 2.0?
All, We're getting some increasing deltas btwn the 2.0 and trunk branches. Many of these are my fault; I gave up making updates to 2.0 around April/May, I think. What would people think of punting on some of the desired goals of 2.0 (e.g. chaining parsers, more structured but still simple metadata) and releasing 2.0 soonish...say 2.0-BETA end of September? We've been able to make some major improvements to Tika without breaking backwards compatibility. We _might_ be able to do that with the outstanding issues for 2.0 when someone has time. We could also do the upgrade to jdk 8 with Tika 2.0. If this sounds reasonable, I propose creating a 1.x branch from trunk for 1.x maintenance and then reworking trunk to the 2.x structure that Bob Paulin so elegantly worked out. I figure we can either copy/paste from trunk to the current 2.x (and _hope_ we get all the updates) or use Bob's 2.0 as a model for restructuring trunk. At this point, I'd prefer the second option. The key here is to switch "trunk" to 2.0 so that we all have the mindset that 2.0 is what we're focused on. The main benefit of this proposal is that we'd have a more modular Tika soon. What do you think? Best, Tim
[jira] [Updated] (TIKA-2096) Tika 2.0 -- Supply AutoDetectParser for embedded documents if user forgets to pass it in via ParseContext
[ https://issues.apache.org/jira/browse/TIKA-2096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2096: -- Issue Type: Improvement (was: Sub-task) Parent: (was: TIKA-2085) > Tika 2.0 -- Supply AutoDetectParser for embedded documents if user forgets to > pass it in via ParseContext > - > > Key: TIKA-2096 > URL: https://issues.apache.org/jira/browse/TIKA-2096 > Project: Tika > Issue Type: Improvement >Reporter: Tim Allison > > Currently, if users don't specify a Parser.class or an > EmbeddedDocumentExtractor in the ParseContext, then embedded documents will > not be parsed. I propose that we add an AutoDetectParser automatically if a > Parser or EmbeddedDocumentExtractor is not included in the ParseContext. > If a user doesn't want to parse embedded objects, s/he could pass in an > EmptyParser for the Parser.class. > In short, let's make the default be "parse everything", and the user has to > figure out how to parse only the container document if that's the desired > behavior. > This is a breaking change. I propose adding it to 2.0 only. > We were bitten by this on tika-server (TIKA-1584). Solr (SOLR-7189) has been > bitten by this. [Kite|https://github.com/kite-sdk/kite/issues/397] is still > suffering from this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-2096) Tika 2.0 -- Supply AutoDetectParser for embedded documents if user forgets to pass it in via ParseContext
[ https://issues.apache.org/jira/browse/TIKA-2096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15652854#comment-15652854 ] Tim Allison commented on TIKA-2096: --- We may want to accelerate this and put it into Tika 1.15. I just found that the MailContentHandler was supplying an AutoDetectParser, but the others aren't. On TIKA-2159, I removed this from the MailContentHandler. Any objections, if we add this to all parsers now? > Tika 2.0 -- Supply AutoDetectParser for embedded documents if user forgets to > pass it in via ParseContext > - > > Key: TIKA-2096 > URL: https://issues.apache.org/jira/browse/TIKA-2096 > Project: Tika > Issue Type: Sub-task >Reporter: Tim Allison > > Currently, if users don't specify a Parser.class or an > EmbeddedDocumentExtractor in the ParseContext, then embedded documents will > not be parsed. I propose that we add an AutoDetectParser automatically if a > Parser or EmbeddedDocumentExtractor is not included in the ParseContext. > If a user doesn't want to parse embedded objects, s/he could pass in an > EmptyParser for the Parser.class. > In short, let's make the default be "parse everything", and the user has to > figure out how to parse only the container document if that's the desired > behavior. > This is a breaking change. I propose adding it to 2.0 only. > We were bitten by this on tika-server (TIKA-1584). Solr (SOLR-7189) has been > bitten by this. [Kite|https://github.com/kite-sdk/kite/issues/397] is still > suffering from this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1974) Tika 2.0 - remove deprecated metadata properties
[ https://issues.apache.org/jira/browse/TIKA-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15523845#comment-15523845 ] Tim Allison commented on TIKA-1974: --- I'm starting to work on this a bit. For metadata items that map directly to Dublin Core, do we want to have copies of them in TikaCoreProperties, e.g.:

{noformat}
    /**
     * @see DublinCore#FORMAT
     */
    public static final Property FORMAT = DublinCore.FORMAT;

    /**
     * @see DublinCore#IDENTIFIER
     */
    public static final Property IDENTIFIER = DublinCore.IDENTIFIER;
{noformat}

Or, should we delete these in TikaCoreProperties and just use DC?

> Tika 2.0 - remove deprecated metadata properties > > > Key: TIKA-1974 > URL: https://issues.apache.org/jira/browse/TIKA-1974 > Project: Tika > Issue Type: Sub-task >Reporter: Tim Allison > > We have quite a few metadata properties that are deprecated. We should > remove them for 2.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
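A minimal sketch of what the question in this comment amounts to, assuming nothing about Tika beyond the two declarations quoted there. The `Property` class below is a hypothetical stand-in, not Tika's own class, and the `dc:format` / `dc:identifier` key names are assumed for illustration. It shows that a constant declared as `FORMAT = DublinCore.FORMAT` refers to the very same object, so keeping or deleting the copy only changes which name callers write, not the metadata key they end up using.

```java
public class PropertyAliasSketch {

    /** Hypothetical stand-in for a metadata property: just a named key. */
    static final class Property {
        private final String name;

        Property(String name) {
            this.name = name;
        }

        String getName() {
            return name;
        }
    }

    /** Stand-in for the class that owns the canonical Dublin Core definitions. */
    static final class DublinCore {
        static final Property FORMAT = new Property("dc:format");         // key name assumed
        static final Property IDENTIFIER = new Property("dc:identifier"); // key name assumed
    }

    /** Stand-in for the class holding the copies discussed in the comment. */
    static final class TikaCoreProperties {
        static final Property FORMAT = DublinCore.FORMAT;
        static final Property IDENTIFIER = DublinCore.IDENTIFIER;
    }

    /** True when each copy is the same object as its Dublin Core original. */
    public static boolean aliasesAreIdentical() {
        return TikaCoreProperties.FORMAT == DublinCore.FORMAT
                && TikaCoreProperties.IDENTIFIER == DublinCore.IDENTIFIER;
    }

    /** The key name seen through the copy, which is the Dublin Core name. */
    public static String formatKeyViaAlias() {
        return TikaCoreProperties.FORMAT.getName();
    }

    public static void main(String[] args) {
        System.out.println(aliasesAreIdentical());
        System.out.println(formatKeyViaAlias());
    }
}
```

Under these assumptions, deleting the copies would be a source-level break for callers that reference them, while the stored metadata keys would stay the same.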
[jira] [Updated] (TIKA-2096) Tika 2.0 -- Supply AutoDetectParser for embedded documents if user forgets to pass it in via ParseContext
[ https://issues.apache.org/jira/browse/TIKA-2096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2096: -- Issue Type: Sub-task (was: Improvement) Parent: TIKA-2085 > Tika 2.0 -- Supply AutoDetectParser for embedded documents if user forgets to > pass it in via ParseContext > - > > Key: TIKA-2096 > URL: https://issues.apache.org/jira/browse/TIKA-2096 > Project: Tika > Issue Type: Sub-task >Reporter: Tim Allison > > Currently, if users don't specify a Parser.class or an > EmbeddedDocumentExtractor in the ParseContext, then embedded documents will > not be parsed. I propose that we add an AutoDetectParser automatically if a > Parser or EmbeddedDocumentExtractor is not included in the ParseContext. > If a user doesn't want to parse embedded objects, s/he could pass in an > EmptyParser for the Parser.class. > In short, let's make the default be "parse everything", and the user has to > figure out how to parse only the container document if that's the desired > behavior. > This is a breaking change. I propose adding it to 2.0 only. > We were bitten by this on tika-server (TIKA-1584). Solr (SOLR-7189) has been > bitten by this. [Kite|https://github.com/kite-sdk/kite/issues/397] is still > suffering from this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TIKA-2096) Tika 2.0 -- Supply AutoDetectParser for embedded documents if user forgets to pass it in via ParseContext
Tim Allison created TIKA-2096: - Summary: Tika 2.0 -- Supply AutoDetectParser for embedded documents if user forgets to pass it in via ParseContext Key: TIKA-2096 URL: https://issues.apache.org/jira/browse/TIKA-2096 Project: Tika Issue Type: Improvement Reporter: Tim Allison Currently, if users don't specify a Parser.class or an EmbeddedDocumentExtractor in the ParseContext, then embedded documents will not be parsed. I propose that we add an AutoDetectParser automatically if a Parser or EmbeddedDocumentExtractor is not included in the ParseContext. If a user doesn't want to parse embedded objects, s/he could pass in an EmptyParser for the Parser.class. In short, let's make the default be "parse everything", and the user has to figure out how to parse only the container document if that's the desired behavior. This is a breaking change. I propose adding it to 2.0 only. We were bitten by this on tika-server (TIKA-1584). Solr (SOLR-7189) has been bitten by this. [Kite|https://github.com/kite-sdk/kite/issues/397] is still suffering from this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
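The proposed default can be sketched as a lookup with a fallback. This is a self-contained model only: `Context`, `Parser`, `AUTO_DETECT` and `EMPTY` below are hypothetical stand-ins written for illustration, not Tika's `ParseContext`, `AutoDetectParser` or `EmptyParser`, and the real change would also have to consider a user-supplied `EmbeddedDocumentExtractor`.

```java
import java.util.HashMap;
import java.util.Map;

public class EmbeddedParserDefaultSketch {

    /** Hypothetical parser: turns an embedded document name into extracted text. */
    interface Parser {
        String parse(String document);
    }

    /** Stand-in for a parser that handles everything. */
    static final Parser AUTO_DETECT = document -> "parsed:" + document;

    /** Stand-in for a parser that deliberately extracts nothing. */
    static final Parser EMPTY = document -> "";

    /** Minimal type-keyed context, loosely modelled on the idea of a parse context. */
    static final class Context {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key, T defaultValue) {
            Object value = entries.get(key);
            return value == null ? defaultValue : key.cast(value);
        }
    }

    /** Proposed behaviour: fall back to the auto-detecting parser when none was supplied. */
    static String parseEmbedded(Context context, String embedded) {
        return context.get(Parser.class, AUTO_DETECT).parse(embedded);
    }

    /** The user passed nothing, so the embedded document is still parsed. */
    public static String withDefaults(String embedded) {
        return parseEmbedded(new Context(), embedded);
    }

    /** The user opted out explicitly by supplying the empty parser. */
    public static String containerOnly(String embedded) {
        Context context = new Context();
        context.set(Parser.class, EMPTY);
        return parseEmbedded(context, embedded);
    }

    public static void main(String[] args) {
        System.out.println(withDefaults("image1.png"));
        System.out.println(containerOnly("image1.png").isEmpty());
    }
}
```

The point of the design is that skipping embedded documents becomes an explicit choice rather than something that happens silently when a caller forgets to populate the context.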
Re: Plans for the first Tika 2.0 release
NLP/NER is as high a priority to me as the OCR stuff..we have a whole meta framework for doing NER/NLP with NERRecogniser and really cool Tensorflow and other stuff. Hoping 2.0 can help solve this! ☺ ++ Chris Mattmann, Ph.D. Chief Architect, Instrument Software and Science Data Systems Section (398) Manager, Open Source Projects Formulation and Development Office (8212) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 168-519, Mailstop: 168-527 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Director, Information Retrieval and Data Science Group (IRDS) Adjunct Associate Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA WWW: http://irds.usc.edu/ ++ On 9/21/16, 7:40 AM, "Nick Burch" wrote: On Mon, 19 Sep 2016, Bob Paulin wrote: > I think it's a good thing to discuss. I know there are other features > that are targeted for 2.0. Do we have a general sense of where those > features are at? I think the big one we need to crack is allowing multiple parsers to run against a file. OCR is probably the most critical of these from the modularisation perspective, with all those nasty interlinkings between the parsers to allow the manual delegation. If we can crack the problem of multiple parsers, those proxy issues should go away (or at least get better!) As a bonus, it ought to also improve things for error cases (fallback parsers etc), but for your needs, the simplification for "ocr + image metadata" is likely your biggest win! (I think it might also let us tidy up some of the enhancement parsers too, like how the NLP stuff fits into the parsing framework) Nick
Re: Plans for the first Tika 2.0 release
On Mon, 19 Sep 2016, Bob Paulin wrote: I think it's a good thing to discuss. I know there are other features that are targeted for 2.0. Do we have a general sense of where those features are at? I think the big one we need to crack is allowing multiple parsers to run against a file. OCR is probably the most critical of these from the modularisation perspective, with all those nasty interlinkings between the parsers to allow the manual delegation. If we can crack the problem of multiple parsers, those proxy issues should go away (or at least get better!) As a bonus, it ought to also improve things for error cases (fallback parsers etc), but for your needs, the simplification for "ocr + image metadata" is likely your biggest win! (I think it might also let us tidy up some of the enhancement parsers too, like how the NLP stuff fits into the parsing framework) Nick
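One of the cases mentioned in this message, fallback parsers for error cases, can be sketched as follows. This is an illustrative model under stated assumptions, not Tika's design: the `Parser` interface and `parseWithFallback` are hypothetical names, and plain strings stand in for streams and content handlers. Each attempt writes into a fresh buffer so that partial output from a failed parser never reaches the caller.

```java
import java.util.Arrays;
import java.util.List;

public class FallbackChainSketch {

    /** Hypothetical parser: appends extracted text to 'out' or throws on failure. */
    interface Parser {
        void parse(String input, StringBuilder out);
    }

    /** Tries each parser in order and returns the output of the first one that succeeds. */
    static String parseWithFallback(List<Parser> parsers, String input) {
        RuntimeException lastFailure = null;
        for (Parser parser : parsers) {
            // Fresh buffer per attempt: anything a failing parser wrote is discarded.
            StringBuilder buffer = new StringBuilder();
            try {
                parser.parse(input, buffer);
                return buffer.toString();
            } catch (RuntimeException e) {
                lastFailure = e;
            }
        }
        throw new IllegalStateException("no parser succeeded", lastFailure);
    }

    /** A parser that fails midway, followed by a simpler fallback. */
    public static String demo() {
        Parser failing = (input, out) -> {
            out.append("partial ");
            throw new IllegalStateException("cannot handle " + input);
        };
        Parser fallback = (input, out) -> out.append("text:").append(input);
        return parseWithFallback(Arrays.asList(failing, fallback), "scan.tiff");
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The other case in the message, running several parsers on the same file and combining their results (OCR plus image metadata), would need a merge step instead of an early return and is not covered by this sketch.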
Re: Plans for the first Tika 2.0 release
I think that could work! I've also created a custom filter that might help:

https://issues.apache.org/jira/browse/TIKA-2083?filter=12338448

Logic is as follows:

project = TIKA AND affectedVersion = 2.0 AND priority >= Blocker AND status != Closed AND status != Fixed

- Bob

On 9/19/2016 1:40 PM, Allison, Timothy B. wrote:
> Should we create a tika-2_0-blocker label to differentiate from regular "blockers"?
>
> How about a single master issue: TIKA-2085. What else do we need to add?
RE: Plans for the first Tika 2.0 release
> Should we create a tika-2_0-blocker label to differentiate from regular "blockers"?

How about a single master issue: TIKA-2085. What else do we need to add?
[jira] [Updated] (TIKA-1974) Tika 2.0 - remove deprecated metadata properties
[ https://issues.apache.org/jira/browse/TIKA-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1974: -- Issue Type: Sub-task (was: Task) Parent: TIKA-2085 > Tika 2.0 - remove deprecated metadata properties > > > Key: TIKA-1974 > URL: https://issues.apache.org/jira/browse/TIKA-1974 > Project: Tika > Issue Type: Sub-task >Reporter: Tim Allison > > We have quite a few metadata properties that are deprecated. We should > remove them for 2.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-2083) Tika 2.0 - Audit master branch against 2.x branch
[ https://issues.apache.org/jira/browse/TIKA-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2083: -- Issue Type: Sub-task (was: Task) Parent: TIKA-2085 > Tika 2.0 - Audit master branch against 2.x branch > - > > Key: TIKA-2083 > URL: https://issues.apache.org/jira/browse/TIKA-2083 > Project: Tika > Issue Type: Sub-task >Affects Versions: 2.0 >Reporter: Bob Paulin >Assignee: Bob Paulin >Priority: Blocker > Fix For: 2.0 > > > At this point Tika has been doing parallel development on master and the 2.x > for about 9 months. We should audit commit logs for that time to make a best > effort to identify any commits that may not have been applied in 2.x. This > task should be done prior to the 2.0 release -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TIKA-2085) Tika 2.0 -- Overarching task list for what we need to do before 2.0
Tim Allison created TIKA-2085: - Summary: Tika 2.0 -- Overarching task list for what we need to do before 2.0 Key: TIKA-2085 URL: https://issues.apache.org/jira/browse/TIKA-2085 Project: Tika Issue Type: Task Reporter: Tim Allison Let's use this issue to track issues that absolutely, positively have to be completed before we release Tika 2.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
RE: Plans for the first Tika 2.0 release
>> 1) Implement various strategies for chaining multiple parsers against individual files. Much of this has been implemented, but what's holding us up on this one (I think?) is a resettable outputstream.
> I think we need a JIRA for this. Is there any existing design ideas on how this would be achieved?

Opened TIKA-2084 as subtask of TIKA-1509

>> 2) Rich metadata (TIKA-1607)
> This is great. I think we need to ensure we have JIRAs for all the features we consider blockers and label them as such. This looks like there's a lot of good discussion. It also references TIKA-1903 so is that also a Tika 2.0 blocker?

TIKA-1903 is not a blocker on 2.0, and may be obviated by TIKA-1607.

>> 1) Get rid of old metadata tags in favor of "new" Dublin core
> Need JIRA?

Sorry, opened a good while ago: TIKA-1974

> If we can't get a date we should at least try to eliminate the ???. I think we need to close down the feature set.

Y, completely agree. Should we create a tika-2_0-blocker label to differentiate from regular "blockers"?
[jira] [Created] (TIKA-2083) Tika 2.0 - Audit master branch against 2.x branch
Bob Paulin created TIKA-2083: Summary: Tika 2.0 - Audit master branch against 2.x branch Key: TIKA-2083 URL: https://issues.apache.org/jira/browse/TIKA-2083 Project: Tika Issue Type: Task Affects Versions: 2.0 Reporter: Bob Paulin Assignee: Bob Paulin Priority: Blocker Fix For: 2.0 At this point Tika has been doing parallel development on master and the 2.x for about 9 months. We should audit commit logs for that time to make a best effort to identify any commits that may not have been applied in 2.x. This task should be done prior to the 2.0 release
Re: Plans for the first Tika 2.0 release
Thanks Tim! Replies in line.

- Bob

On 9/19/2016 12:33 PM, Allison, Timothy B. wrote:
> Bob,
> As always, thank you for driving 2.0!
>
>> My concern is we have been dual maintaining 2 branches for about 9 months. I think the longer we do this the more risk there is that we miss something.
>
> Agreed. I think we're already missing a few things.

Yikes is there a way we can audit what we might have missed? Perhaps we need a JIRA to do an audit of the commits in master and do a best effort of what might have been missed? I can create the JIRA for this.

>> Would it make sense to at least put a date out there for a feature cut off?
>
> I'd be hesitant to do this. To my mind, the key is the actual features and devs who have time to implement them.

Ok this is a start to understand what the blocking features are. The key will be creating concrete JIRAs for them and identifying where we are at.

> For me, the blocking new features are:
> 1) Implement various strategies for chaining multiple parsers against individual files. Much of this has been implemented, but what's holding us up on this one (I think?) is a resettable outputstream.

I think we need a JIRA for this. Is there any existing design ideas on how this would be achieved?

> 2) Rich metadata (TIKA-1607)

This is great. I think we need to ensure we have JIRAs for all the features we consider blockers and label them as such. This looks like there's a lot of good discussion. It also references TIKA-1903 so is that also a Tika 2.0 blocker?

> The blocking tasks:
> 1) Get rid of old metadata tags in favor of "new" Dublin core

Need JIRA?

> 2) ???

If we can't get a date we should at least try to eliminate the ???. I think we need to close down the feature set.

> I'm full up on other stuff at the moment, perhaps after we get 1.14 out, I can turn to 2.0-specific development. What else do we have to do? Anyone else have some time?

Yes please would be great to see if there are people that want to own work on the above features. Once we have JIRAs we can post to the Apache Help Wanted page as well. Thanks!

> Cheers,
> Tim
>
> -----Original Message-----
> From: Bob Paulin [mailto:b...@bobpaulin.com]
> Sent: Monday, September 19, 2016 10:32 AM
> To: dev@tika.apache.org
> Subject: Re: Plans for the first Tika 2.0 release
>
> Hi,
> I think it's a good thing to discuss. I know there are other features that are targeted for 2.0. Do we have a general sense of where those features are at? My concern is we have been dual maintaining 2 branches for about 9 months. I think the longer we do this the more risk there is that we miss something. Would it make sense to at least put a date out there for a feature cut off? There's always 3.0 if things are not close to being ready.
>
> - Bob
RE: Plans for the first Tika 2.0 release
Bob, As always, thank you for driving 2.0! > My concern is we have been dual maintaining 2 branches for about 9 months. I > think the longer we do this the more risk there is that we miss something. Agreed. I think we're already missing a few things. > Would it make sense to at least put a date out there for a feature cut off? I'd be hesitant to do this. To my mind, the key is the actual features and devs who have time to implement them. For me, the blocking new features are: 1) Implement various strategies for chaining multiple parsers against individual files. Much of this has been implemented, but what's holding us up on this one (I think?) is a resettable outputstream. 2) Rich metadata (TIKA-1607) The blocking tasks: 1) Get rid of old metadata tags in favor of "new" Dublin core 2) ??? I'm full up on other stuff at the moment, perhaps after we get 1.14 out, I can turn to 2.0-specific development. What else do we have to do? Anyone else have some time? Cheers, Tim -Original Message- From: Bob Paulin [mailto:b...@bobpaulin.com] Sent: Monday, September 19, 2016 10:32 AM To: dev@tika.apache.org Subject: Re: Plans for the first Tika 2.0 release Hi, I think it's a good thing to discuss. I know there are other features that are targeted for 2.0. Do we have a general sense of where those features are at? My concern is we have been dual maintaining 2 branches for about 9 months. I think the longer we do this the more risk there is that we miss something. Would it make sense to at least put a date out there for a feature cut off? There's always 3.0 if things are not close to being ready. - Bob
Re: Plans for the first Tika 2.0 release
Hi, I think it's a good thing to discuss. I know there are other features that are targeted for 2.0. Do we have a general sense of where those features are at? My concern is we have been dual maintaining 2 branches for about 9 months. I think the longer we do this the more risk there is that we miss something. Would it make sense to at least put a date out there for a feature cut off? There's always 3.0 if things are not close to being ready. - Bob On 9/19/2016 4:32 AM, Sergey Beryozkin wrote: Hi All Back in May I updated one of our CXF demos on the master 3.2 branch to depend on Tika 2.0 SNAPSHOT to verify the new module system works well. It is feasible that CXF 3.2.0 may be released by the end of the year or early next year. As far as Tika 2.0 dependencies are concerned it will be easy for me to update the demo to temporarily depend on Tika 1.13 or 1.14. But if Tika 2.0 is released by the time CXF 3.2 is about to be released then I'll be happy to keep 2.0 deps. Are there any plans to get Tika 2.0 out in the next few months ? Cheers, Sergey
Plans for the first Tika 2.0 release
Hi All Back in May I updated one of our CXF demos on the master 3.2 branch to depend on Tika 2.0 SNAPSHOT to verify the new module system works well. It is feasible that CXF 3.2.0 may be released by the end of the year or early next year. As far as Tika 2.0 dependencies are concerned it will be easy for me to update the demo to temporarily depend on Tika 1.13 or 1.14. But if Tika 2.0 is released by the time CXF 3.2 is about to be released then I'll be happy to keep 2.0 deps. Are there any plans to get Tika 2.0 out in the next few months ? Cheers, Sergey
Re: PDF with embedded attachments and Tika 2.0 modularity
Sergey, your point is well taken. Y, you'd need most parsers, but you can _probably_ live without advanced or scientific (sorry, Chris!). I'd be hesitant to change the structure much. We should definitely document this well, though!

-----Original Message-----
From: Sergey Beryozkin [mailto:sberyoz...@gmail.com]
Sent: Thursday, September 15, 2016 12:15 PM
To: dev@tika.apache.org
Subject: PDF with embedded attachments and Tika 2.0 modularity

Hi All

As Tim educated me, PDF (and indeed other formats) may have all sorts of embedded attachments. In my demo I've been working with Tika 2.0-SNAPSHOT which offers a nice option for users to pick up only individual parsers. So I've added PDFParser & OpenDocumentParser and tika-core to the project dependencies and all works very nice when I submit to the demo a simple PDF.

But if I were to write the code which can handle the embedded attachments really well then I think I'll probably need to revert to depending on all of tika-parsers - otherwise how would I know which additional parser modules I should add? If this reasoning is right then one can only use individual modules in production if it is well known that the files to be processed will have no unexpected formats embedded in them...

I've been wondering - would it make sense, for Tika 2.0, to add a few more 'helper' modules for the most used formats, which would offer less than tika-parsers but more than individual modules, for example:

this is what 2.x already has:

tika-parser-modules/
  tika-parser-pdf-module (individual parser modules for the most used ones)
tika-parsers (all of the parsers)

and now add:

tika-parser-pdf-module-all (or similarly named)

this tika-parser-pdf-module-all will depend on tika-parser-pdf-module plus the parsers which will be needed to process various PDF attachments? This list of the extra deps will be based on the accumulated knowledge. Similarly for a few other most used formats.

tika-parser-pdf-module-all will be a 'compromise': it will pull in more modules than tika-parser-pdf-module but significantly fewer than tika-parsers.

Cheers, Sergey

--
Sergey Beryozkin
Talend Community Coders
http://coders.talend.com/
Re: PDF with embedded attachments and Tika 2.0 modularity
Hi Sergey,

On 9/15/2016 3:33 PM, Sergey Beryozkin wrote:
> Hi Bob, Tim, All,
>
> On 15/09/16 18:06, Bob Paulin wrote:
>> Hi Sergey,
>>
>> I definitely get the challenges. In fact recently we merged the PDF module into the Multimedia module due to the tight coupling around the TesseractOCR [1] [2]. We could look into separating the PDF parser out again but I'm a bit short on a simple way to do it with TesseractOCR in play. Like Tim I'm hesitant to change structure but we definitely need to address how we handle embedded parsers. I've done some work with the ParserProxy class to remove some of the hard dependencies between parsers. With that we only pull in the parsers available on the class path. There's an example in the JackcessExtractor class in the office module.
>>
>> What is the motivation behind excluding the other parsers in your use case? Smaller footprint? Incompatibility? Performance? Depending on the driver there may be other ways to get you to a similar place.
>>
>> Smaller footprint
>
> This is the one, it is not a big deal to have all of tika-parsers included in my demo, but I've been curious how the smaller footprint can indeed be achieved in Tika 2.x given it already does the best effort at supporting more modular Tika applications...

Totally makes sense. I think you'll end up getting most of what you need by just pulling in the tika-parser-multimedia-module. It's already got all the image parsers for embedded images and TesseractOCR so you can take your demo as far as reading all the images and converting some of the images to text if you have Tesseract installed.

>> You could just include the modules you need and any embedded parsers from other modules could be added via a ParserProxy. This might not remove all the parsers you don't need but might be a good start.
>
> I haven't heard of ParserProxy yet, sorry :-). As a Tika user I'm just learning. How would one use ParserProxy to minimize the dependencies?
> Just found https://issues.apache.org/jira/browse/TIKA-1904

Sorry I took you for a Tika veteran based on your concerns for embedded parsers! The ParserProxy is new in 2.x and you would actually not need to worry about it for coding your demo or a client application. It's more for the framework, to allow the modules to compile without parsers from other modules on the classpath. It pulls them in via reflection at runtime or, if they are not present, falls back to a no-op.

>> The most trimmed down way is what you've provided below in your example creating a tika-parser-pdf-module-all. I'm concerned about the number of combinations we might end up creating.
>
> Sure, if such an option would ever be considered then I'd imagine there would have to be a limit set. Ex, 5 most widely used formats which may have embedded attachments would have an extra module support (core parser like PDF parser plus the support parsers for the embedded attachments).

I agree that a limit would be needed. Would it make sense to hold off on including them in Tika for now and see if some popular combinations emerge? Your demo is a great first step to get some feedback; I think we need more in order to ensure we're making the correct combinations.

> But I'm OK with selecting the individual parser modules that may be needed to have a nearly complete PDF parsing coverage, as long as I know which modules I have to select :-)

Yes, let's start with the multimedia module. I think you'll get quite a bit of cool things within that. Tim, do you know of any other modules that would make sense?

>> Incompatibility
>>
>> You might want to look at the tika-parser-bundle projects since putting the modules in an OSGi container will allow you to isolate the classloaders.
>>
>> Performance
>>
>> A combination of the above or you might look to include a tika-config.xml and just exclude the parsers you don't want. That should prevent them from being a part of your pipeline.
>>
>> Other ideas on this? I think it's an important thing to discuss.
>
> Many thanks, Sergey

Thank you for the feedback!

- Bob

>> [1] http://markmail.org/message/e4ncuid7zrvlitp5
>> [2] https://issues.apache.org/jira/browse/TIKA-2059
>>
>> On 9/15/2016 11:20 AM, Allison, Timothy B. wrote:
>>> Sergey, your point is well taken. Y, you'd need most parsers, but you can _probably_ live without advanced or scientific (sorry, Chris!). I'd be hesitant to change the structure much. We should definitely document this well, though!
>>>
>>> -----Original Message-----
>>> From: Sergey Beryozkin [mailto:sberyoz...@gmail.com]
>>> Sent: Thursday, September 15, 2016 12:15 PM
>>> To: dev@tika.apache.org
>>> Subject: PDF with embedded attachments and Tika 2.0 modularity
>>>
>>> Hi All
>>>
>>> As Tim educated me, PDF (and indeed other formats) may have all sorts of embedded attachments. In my demo I've been working with Tika 2.0-SNAPSHOT which offers a nice option for users to pick up only individual parsers. So I've added PDFParser & OpenDocumentParser and tika-core to the project dependencies and all works very nice when I su
Re: PDF with embedded attachments and Tika 2.0 modularity
Hi Sergey,

I definitely get the challenges. In fact recently we merged the PDF module into the Multimedia module due to the tight coupling around the TesseractOCR [1] [2]. We could look into separating the PDF parser out again but I'm a bit short on a simple way to do it with TesseractOCR in play. Like Tim I'm hesitant to change structure but we definitely need to address how we handle embedded parsers. I've done some work with the ParserProxy class to remove some of the hard dependencies between parsers. With that we only pull in the parsers available on the class path. There's an example in the JackcessExtractor class in the office module.

What is the motivation behind excluding the other parsers in your use case? Smaller footprint? Incompatibility? Performance? Depending on the driver there may be other ways to get you to a similar place.

Smaller footprint

You could just include the modules you need and any embedded parsers from other modules could be added via a ParserProxy. This might not remove all the parsers you don't need but might be a good start. The most trimmed down way is what you've provided below in your example creating a tika-parser-pdf-module-all. I'm concerned about the number of combinations we might end up creating.

Incompatibility

You might want to look at the tika-parser-bundle projects since putting the modules in an OSGi container will allow you to isolate the classloaders.

Performance

A combination of the above or you might look to include a tika-config.xml and just exclude the parsers you don't want. That should prevent them from being a part of your pipeline.

Other ideas on this? I think it's an important thing to discuss.

- Bob

[1] http://markmail.org/message/e4ncuid7zrvlitp5
[2] https://issues.apache.org/jira/browse/TIKA-2059

On 9/15/2016 11:20 AM, Allison, Timothy B. wrote:
> Sergey, your point is well taken. Y, you'd need most parsers, but you can _probably_ live without advanced or scientific (sorry, Chris!). I'd be hesitant to change the structure much. We should definitely document this well, though!
>
> -----Original Message-----
> From: Sergey Beryozkin [mailto:sberyoz...@gmail.com]
> Sent: Thursday, September 15, 2016 12:15 PM
> To: dev@tika.apache.org
> Subject: PDF with embedded attachments and Tika 2.0 modularity
>
> Hi All
>
> As Tim educated me, PDF (and indeed other formats) may have all sorts of embedded attachments. In my demo I've been working with Tika 2.0-SNAPSHOT which offers a nice option for users to pick up only individual parsers. So I've added PDFParser & OpenDocumentParser and tika-core to the project dependencies and all works very nice when I submit to the demo a simple PDF.
>
> But if I were to write the code which can handle the embedded attachments really well then I think I'll probably need to revert to depending on all of tika-parsers - otherwise how would I know which additional parser modules I should add? If this reasoning is right then one can only use individual modules in production if it is well known that the files to be processed will have no unexpected formats embedded in them...
>
> I've been wondering - would it make sense, for Tika 2.0, to add a few more 'helper' modules for the most used formats, which would offer less than tika-parsers but more than individual modules, for example:
>
> this is what 2.x already has:
>
> tika-parser-modules/
>   tika-parser-pdf-module (individual parser modules for the most used ones)
> tika-parsers (all of the parsers)
>
> and now add:
>
> tika-parser-pdf-module-all (or similarly named)
>
> this tika-parser-pdf-module-all will depend on tika-parser-pdf-module plus the parsers which will be needed to process various PDF attachments? This list of the extra deps will be based on the accumulated knowledge. Similarly for a few other most used formats.
>
> tika-parser-pdf-module-all will be a 'compromise': it will pull in more modules than tika-parser-pdf-module but significantly fewer than tika-parsers.
>
> Cheers, Sergey
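
A rough illustration of the ParserProxy idea discussed in this thread: look a class up by name at runtime and only use it if it is on the class path, instead of linking against it at compile time. This is only a sketch and not Tika's ParserProxy code. `OptionalLoader` and `loadOrFallback` are hypothetical names, and JDK list classes stand in for parser classes so that it runs without Tika on the class path.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Tika: shows runtime lookup by class name
// with a caller-supplied fallback when the class is unavailable.
final class OptionalLoader {
    private OptionalLoader() {}

    /**
     * Tries to instantiate className through its no-arg constructor.
     * Returns fallback if the class is missing, cannot be linked,
     * is not a subtype of type, or cannot be instantiated.
     */
    static <T> T loadOrFallback(String className, Class<T> type, T fallback) {
        try {
            Class<?> cls = Class.forName(className);
            if (!type.isAssignableFrom(cls)) {
                return fallback;
            }
            return type.cast(cls.getDeclaredConstructor().newInstance());
        } catch (ReflectiveOperationException | LinkageError e) {
            // Class not on the class path (or not loadable): use the fallback.
            return fallback;
        }
    }

    public static void main(String[] args) {
        // Stand-ins for "parser present" and "parser module not included".
        List<?> present = loadOrFallback("java.util.ArrayList", List.class, Collections.emptyList());
        List<?> missing = loadOrFallback("org.example.NotOnClasspath", List.class, Collections.emptyList());
        System.out.println(present.getClass().getName());
        System.out.println(missing.getClass().getName());
    }
}
```

The calling module compiles against the interface only (here `List`); whether the concrete class is available is decided when the code runs, which is why a module built this way does not need the other module as a hard dependency.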