Re: [VOTE] Release Apache Tika 1.23 Candidate #2

2019-12-04 Thread David Meikle
On Tue, 3 Dec 2019 at 03:15, Tim Allison wrote:

> Please vote on releasing this package as Apache Tika 1.23.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 Tika PMC votes are cast.
>
> [ ] +1 Release this package as Apache Tika 1.23
> [ ] -1 Do not release this package because...
>

 +1

Cheers,
Dave


[jira] [Commented] (TIKA-2546) com.pff:java-libpst is branch EOL

2019-12-04 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-2546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16988203#comment-16988203
 ] 

Hudson commented on TIKA-2546:
--

SUCCESS: Integrated in Jenkins build tika-branch-1x #294 (See 
[https://builds.apache.org/job/tika-branch-1x/294/])
TIKA-2546: upgrade to java-libpst 0.9.3 (nassif.lfcn: 
[https://github.com/apache/tika/commit/274353268154784eccb3452c0859194fd54f15d4])
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/mbox/OutlookPSTParser.java
* (edit) CHANGES.txt
* (edit) tika-parsers/pom.xml


> com.pff:java-libpst is branch EOL
> -
>
> Key: TIKA-2546
> URL: https://issues.apache.org/jira/browse/TIKA-2546
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.16, 1.17
> Environment: All
>Reporter: Richard Jones
>Assignee: Luís Filipe Nassif
>Priority: Major
> Fix For: 1.24
>
>
> com.pff:java-libpst is branch EOL, request Tika moves to active 0.9.3 version



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (TIKA-2546) com.pff:java-libpst is branch EOL

2019-12-04 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-2546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16988202#comment-16988202
 ] 

Hudson commented on TIKA-2546:
--

SUCCESS: Integrated in Jenkins build Tika-trunk #1760 (See 
[https://builds.apache.org/job/Tika-trunk/1760/])
TIKA-2546: upgrade to java-libpst 0.9.3 (nassif.lfcn: 
[https://github.com/apache/tika/commit/c93bf058633059d0eedc899bd8a4d90603ac10e1])
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/mbox/OutlookPSTParser.java
* (edit) tika-parsers/pom.xml
* (edit) CHANGES.txt


> com.pff:java-libpst is branch EOL
> -
>
> Key: TIKA-2546
> URL: https://issues.apache.org/jira/browse/TIKA-2546
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.16, 1.17
> Environment: All
>Reporter: Richard Jones
>Assignee: Luís Filipe Nassif
>Priority: Major
> Fix For: 1.24
>
>
> com.pff:java-libpst is branch EOL, request Tika moves to active 0.9.3 version





[jira] [Created] (TIKA-3004) OutlookPSTParser missing emails attached to other emails

2019-12-04 Thread Jira
Luís Filipe Nassif created TIKA-3004:


 Summary: OutlookPSTParser missing emails attached to other emails
 Key: TIKA-3004
 URL: https://issues.apache.org/jira/browse/TIKA-3004
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.22
Reporter: Luís Filipe Nassif
Assignee: Luís Filipe Nassif


While resolving TIKA-2546, I noticed that emails attached to other emails are 
not currently extracted. We should check if attach.getEmbeddedPSTMessage() 
returns something.
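
A minimal sketch of the check proposed above, not the actual OutlookPSTParser change. To stay self-contained it uses hypothetical stand-in interfaces (Message, Attachment) that mimic only the call named in this issue; in real code these would be java-libpst's PSTMessage and PSTAttachment, whose exact signatures and checked exceptions should be verified against the 0.9.3 sources. It also assumes getEmbeddedPSTMessage() returns null for attachments that are not emails.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class EmbeddedMessageSketch {

    // Hypothetical stand-in for java-libpst's PSTMessage.
    interface Message {
        String subject();
        List<Attachment> attachments();
    }

    // Hypothetical stand-in for java-libpst's PSTAttachment.
    interface Attachment {
        // Assumed to return null when the attachment is not an email.
        Message getEmbeddedPSTMessage();
    }

    // Collects the subject of a message and of every email attached to it, depth-first.
    static List<String> collectSubjects(Message message) {
        List<String> subjects = new ArrayList<>();
        collect(message, subjects);
        return subjects;
    }

    private static void collect(Message message, List<String> out) {
        out.add(message.subject());
        for (Attachment attach : message.attachments()) {
            Message embedded = attach.getEmbeddedPSTMessage();
            if (embedded != null) {
                // An email attached to an email: recurse instead of skipping it.
                collect(embedded, out);
            }
        }
    }

    // Small factory used only to build demo data.
    static Message message(String subject, Attachment... attachments) {
        List<Attachment> list = Arrays.asList(attachments);
        return new Message() {
            public String subject() { return subject; }
            public List<Attachment> attachments() { return list; }
        };
    }

    public static void main(String[] args) {
        Message inner = message("inner");
        // First attachment is an ordinary file (null), second is an attached email.
        Message outer = message("outer", () -> null, () -> inner);
        System.out.println(collectSubjects(outer));
    }
}
```

The recursion matters because an attached email can itself carry attached emails; a single-level check would still miss those.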





[jira] [Commented] (TIKA-2546) com.pff:java-libpst is branch EOL

2019-12-04 Thread Jira


[ 
https://issues.apache.org/jira/browse/TIKA-2546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16988098#comment-16988098
 ] 

Luís Filipe Nassif commented on TIKA-2546:
--

Upgraded the lib and added a check for PSTFile.PST_TYPE_2013_UNICODE. If it 
matches, an exception is thrown to let the user know the support is very 
broken, until [https://github.com/rjohnsondev/java-libpst/issues/60] is fixed.
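
For readers who want the shape of that check, here is a rough sketch under stated assumptions; it is not the committed code. The constant below is a local stand-in for PSTFile.PST_TYPE_2013_UNICODE (the value 36 is an assumption to verify against java-libpst 0.9.3), the exception class is hypothetical, and in the real parser the type would come from the opened PST file rather than being passed in as an int.

```java
class PstTypeCheckSketch {

    // Stand-in for PSTFile.PST_TYPE_2013_UNICODE; the value is an assumption.
    static final int PST_TYPE_2013_UNICODE = 36;

    // Hypothetical exception type used only in this sketch.
    static class UnsupportedPstTypeException extends Exception {
        UnsupportedPstTypeException(String message) {
            super(message);
        }
    }

    // True when the file type is the 2013 Unicode format that is known to be broken.
    static boolean isUnsupported(int pstFileType) {
        return pstFileType == PST_TYPE_2013_UNICODE;
    }

    // Fails fast so the user learns about the limitation instead of getting partial output.
    static void rejectIfUnsupported(int pstFileType) throws UnsupportedPstTypeException {
        if (isUnsupported(pstFileType)) {
            throw new UnsupportedPstTypeException(
                    "2013 Unicode PST/OST support is broken until "
                    + "https://github.com/rjohnsondev/java-libpst/issues/60 is fixed");
        }
    }

    public static void main(String[] args) {
        try {
            rejectIfUnsupported(PST_TYPE_2013_UNICODE);
            System.out.println("accepted");
        } catch (UnsupportedPstTypeException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```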

> com.pff:java-libpst is branch EOL
> -
>
> Key: TIKA-2546
> URL: https://issues.apache.org/jira/browse/TIKA-2546
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.16, 1.17
> Environment: All
>Reporter: Richard Jones
>Assignee: Luís Filipe Nassif
>Priority: Major
>
> com.pff:java-libpst is branch EOL, request Tika moves to active 0.9.3 version





[jira] [Resolved] (TIKA-2546) com.pff:java-libpst is branch EOL

2019-12-04 Thread Jira


 [ 
https://issues.apache.org/jira/browse/TIKA-2546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luís Filipe Nassif resolved TIKA-2546.
--
Fix Version/s: 1.24
   Resolution: Fixed

> com.pff:java-libpst is branch EOL
> -
>
> Key: TIKA-2546
> URL: https://issues.apache.org/jira/browse/TIKA-2546
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.16, 1.17
> Environment: All
>Reporter: Richard Jones
>Assignee: Luís Filipe Nassif
>Priority: Major
> Fix For: 1.24
>
>
> com.pff:java-libpst is branch EOL, request Tika moves to active 0.9.3 version





[jira] [Commented] (TIKA-2224) OneNote formats support - Mime Magic and Parser

2019-12-04 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16988088#comment-16988088
 ] 

Tim Allison commented on TIKA-2224:
---

W00t! I’m tied up today and tomorrow. Any chance Friday morning ET would work? 
What time zone are you in?

> OneNote formats support - Mime Magic and Parser
> ---
>
> Key: TIKA-2224
> URL: https://issues.apache.org/jira/browse/TIKA-2224
> Project: Tika
>  Issue Type: Improvement
>  Components: mime
>Affects Versions: 1.14
>Reporter: Nick Burch
>Priority: Major
> Attachments: Sample1.json, Sample1.one, note-ssn-test-.one
>
>
> As raised at 
> http://stackoverflow.com/questions/41272195/onenote-support-for-apache-tika-parsers,
>  we don't have any magic for the OneNote formats. Several years ago we dug 
> out the file format specs (see 
> http://lucene.472066.n3.nabble.com/Tika-OneNote-Support-td4020393.html), but 
> didn't have volunteer energy to implement a parser. However, armed with those 
> specs, we should be able to come up with some mime magic for detection





[jira] [Assigned] (TIKA-2546) com.pff:java-libpst is branch EOL

2019-12-04 Thread Jira


 [ 
https://issues.apache.org/jira/browse/TIKA-2546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luís Filipe Nassif reassigned TIKA-2546:


Assignee: Luís Filipe Nassif

> com.pff:java-libpst is branch EOL
> -
>
> Key: TIKA-2546
> URL: https://issues.apache.org/jira/browse/TIKA-2546
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.16, 1.17
> Environment: All
>Reporter: Richard Jones
>Assignee: Luís Filipe Nassif
>Priority: Major
>
> com.pff:java-libpst is branch EOL, request Tika moves to active 0.9.3 version





[jira] [Assigned] (TIKA-2415) Upgrade libpst to 0.9.3

2019-12-04 Thread Jira


 [ 
https://issues.apache.org/jira/browse/TIKA-2415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luís Filipe Nassif reassigned TIKA-2415:


Assignee: Luís Filipe Nassif

> Upgrade libpst to 0.9.3 
> 
>
> Key: TIKA-2415
> URL: https://issues.apache.org/jira/browse/TIKA-2415
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Assignee: Luís Filipe Nassif
>Priority: Trivial
>






[jira] [Comment Edited] (TIKA-2224) OneNote formats support - Mime Magic and Parser

2019-12-04 Thread Nicholas DiPiazza (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16988049#comment-16988049
 ] 

Nicholas DiPiazza edited comment on TIKA-2224 at 12/4/19 6:09 PM:
--

OK, I've got it working now. It's parsing all the text as you'd expect.
Tim - are you free today for a zoom session? 
[~tallison]


was (Author: ndipiazza_gmail):
OK, I've got it working now. It's parsing all the text as you'd expect.
Tim - are you free today for a zoom session? 

> OneNote formats support - Mime Magic and Parser
> ---
>
> Key: TIKA-2224
> URL: https://issues.apache.org/jira/browse/TIKA-2224
> Project: Tika
>  Issue Type: Improvement
>  Components: mime
>Affects Versions: 1.14
>Reporter: Nick Burch
>Priority: Major
> Attachments: Sample1.json, Sample1.one, note-ssn-test-.one
>
>
> As raised at 
> http://stackoverflow.com/questions/41272195/onenote-support-for-apache-tika-parsers,
>  we don't have any magic for the OneNote formats. Several years ago we dug 
> out the file format specs (see 
> http://lucene.472066.n3.nabble.com/Tika-OneNote-Support-td4020393.html), but 
> didn't have volunteer energy to implement a parser. However, armed with those 
> specs, we should be able to come up with some mime magic for detection





[jira] [Commented] (TIKA-2224) OneNote formats support - Mime Magic and Parser

2019-12-04 Thread Nicholas DiPiazza (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16988049#comment-16988049
 ] 

Nicholas DiPiazza commented on TIKA-2224:
-

OK, I've got it working now. It's parsing all the text as you'd expect.
Tim - are you free today for a zoom session? 

> OneNote formats support - Mime Magic and Parser
> ---
>
> Key: TIKA-2224
> URL: https://issues.apache.org/jira/browse/TIKA-2224
> Project: Tika
>  Issue Type: Improvement
>  Components: mime
>Affects Versions: 1.14
>Reporter: Nick Burch
>Priority: Major
> Attachments: Sample1.json, Sample1.one, note-ssn-test-.one
>
>
> As raised at 
> http://stackoverflow.com/questions/41272195/onenote-support-for-apache-tika-parsers,
>  we don't have any magic for the OneNote formats. Several years ago we dug 
> out the file format specs (see 
> http://lucene.472066.n3.nabble.com/Tika-OneNote-Support-td4020393.html), but 
> didn't have volunteer energy to implement a parser. However, armed with those 
> specs, we should be able to come up with some mime magic for detection





Re: [EXTERNAL] Do we have a community supported approach for deploying Tika Server in production?

2019-12-04 Thread Chris Mattmann
Thanks for bringing this conversation up, Eric.

Historically, if you look over the last 5 years, I think what you are asking
below has sort of already become the de facto truth. Most people are in fact
using Tika server, whether they are individual devs, govvies, commercial folk
and the like, on big, small and medium projects. This is evidenced by the
expansion of Tika APIs into pretty much every PL I know of and actively use
today.

Given that, we probably should update the main website docs to make this more
prominent. The Tika server docs on the wiki are pretty darn good, but they
don’t get prime real estate. It would be wonderful if someone wants to update
the website to make them more prominent.

The downstream Tika Python lib that I maintain has tons of activity, is used by
350+ projects, and relies solely on Tika-Server. My recommendation to the Solr
folks (having created SOLR-7633) from the 2014 DARPA MEMEX days was to move
towards a Tika Server based SolrCell dep, and that’s the right way to go IMO.

Chris

From: Eric Pugh 
Reply-To: "dev@tika.apache.org" 
Date: Wednesday, December 4, 2019 at 12:24 PM
To: "tika-...@apache.org" 
Subject: [EXTERNAL] Do we have a community supported approach for deploying 
Tika Server in production?

 

Hi all - Hoping this is a reasonable Tika-dev versus Tika-user question!

 

Over in Solr land there has been renewed discussion about streamlining what 
Solr is   

 

In regards to rich content extraction and the Tika project, it seems like the 
two ideas that continue to preserve the existing behavior are:

 

1) To convert the ExtractingRequestHandler into a Package (Plugin) for Solr.   
This slims down the standard Solr download, and *might* make it easier to 
update the version of Tika + dependent jars used?

 

2) The second approach is to instead require Tika-Server to be running 
(https://issues.apache.org/jira/browse/SOLR-7633) and just have Solr delegate 
the call to Tika-Server.

 

 

I was thinking about why I like option 1 better than 2, and I think it boils 
down to how mature the IT organization I am working with is.  Some IT 
organizations have large dev-ops teams, and are working at major scale, and 
managing a fleet of Tika-Server on Kubernetes with Load Balancer dynamically 
scaling up and down is simple and second nature!  However, many organizations 
aren’t like that.

 

So I guess what I’m asking is: do we have a reasonable, supported approach for
deploying Tika Server for non-Tika-savvy organizations? I’m thinking about
Solr, and specifically the fact that Solr has a well-defined set of Service
Installation scripts. When I follow the directions in
https://lucene.apache.org/solr/guide/8_3/taking-solr-to-production.html#taking-solr-to-production
I can feel confident that when the server is rebooted, Solr will come
back up! Plus there is log rotation and all the rest.

 

In contrast, when I look at the Tika website, specifically the
https://tika.apache.org/1.22/gettingstarted.html page, the message is to run
Tika as a command line application, or embedded in your application.

 

I’m wondering if Tika-Server needs to be made more prominent, and treated as 
the “primary method of interacting with Tika”?   Do we need as a community to 
focus more on Tika-Server?   In our getting started documentation, in our usage 
documentation, and in our examples?

 

Do we need to create the equivalent of the Service Installation scripts for 
Tika-Server?   

 

Wanted to stoke the discussion!

 

Eric

 

___

Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 |
http://www.opensourceconnections.com

Co-Author: Apache Solr Enterprise Search Server, 3rd Ed 

   

This e-mail and all contents, including attachments, is considered to be 
Company Confidential unless explicitly stated otherwise, regardless of whether 
attachments are marked as such.

 

 



Do we have a community supported approach for deploying Tika Server in production?

2019-12-04 Thread Eric Pugh
Hi all - Hoping this is a reasonable Tika-dev versus Tika-user question!

Over in Solr land there has been renewed discussion about streamlining what 
Solr is   

In regards to rich content extraction and the Tika project, it seems like the 
two ideas that continue to preserve the existing behavior are:

1) To convert the ExtractingRequestHandler into a Package (Plugin) for Solr.   
This slims down the standard Solr download, and *might* make it easier to 
update the version of Tika + dependent jars used?

2) The second approach is to instead require Tika-Server to be running 
(https://issues.apache.org/jira/browse/SOLR-7633) and just have Solr delegate 
the call to Tika-Server.
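
To make approach 2 concrete, here is a rough sketch of what delegating extraction to Tika-Server could look like from any JVM client. It is not Solr's actual design: the class and method names are hypothetical, and it assumes a Tika-Server instance listening on its default port 9998, with a PUT to /tika returning the extracted plain text. It needs Java 11+ for java.net.http.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

class TikaServerClientSketch {

    // Builds the extraction request: PUT the raw document bytes to <server>/tika.
    static HttpRequest buildExtractRequest(URI serverBase, byte[] document) {
        return HttpRequest.newBuilder(serverBase.resolve("/tika"))
                .header("Accept", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }

    // Sends the document and returns the extracted text, or fails on a non-200 reply.
    static String extractText(HttpClient client, URI serverBase, byte[] document)
            throws IOException, InterruptedException {
        HttpResponse<String> response = client.send(
                buildExtractRequest(serverBase, document),
                HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Tika-Server returned HTTP " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.out.println("usage: TikaServerClientSketch <file>");
            return;
        }
        byte[] document = Files.readAllBytes(Path.of(args[0]));
        // Assumes a server was started separately and listens on localhost:9998.
        URI server = URI.create("http://localhost:9998");
        System.out.println(extractText(HttpClient.newHttpClient(), server, document));
    }
}
```

The caller only needs an HTTP client and the server's address, so Tika and its dependent jars stay out of the calling process; the cost is that someone has to keep the server running.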


I was thinking about why I like option 1 better than 2, and I think it boils 
down to how mature the IT organization I am working with is.  Some IT 
organizations have large dev-ops teams, and are working at major scale, and 
managing a fleet of Tika-Server on Kubernetes with Load Balancer dynamically 
scaling up and down is simple and second nature!  However, many organizations 
aren’t like that.

So I guess what I’m asking is: do we have a reasonable, supported approach for
deploying Tika Server for non-Tika-savvy organizations? I’m thinking about
Solr, and specifically the fact that Solr has a well-defined set of Service
Installation scripts. When I follow the directions in
https://lucene.apache.org/solr/guide/8_3/taking-solr-to-production.html#taking-solr-to-production
I can feel confident that when the server is rebooted, Solr will come
back up! Plus there is log rotation and all the rest.

In contrast, when I look at the Tika website, specifically the
https://tika.apache.org/1.22/gettingstarted.html page, the message is to run
Tika as a command line application, or embedded in your application.

I’m wondering if Tika-Server needs to be made more prominent, and treated as 
the “primary method of interacting with Tika”?   Do we need as a community to 
focus more on Tika-Server?   In our getting started documentation, in our usage 
documentation, and in our examples?

Do we need to create the equivalent of the Service Installation scripts for 
Tika-Server?   

Wanted to stoke the discussion!

Eric

___
Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 |
http://www.opensourceconnections.com
Co-Author: Apache Solr Enterprise Search Server, 3rd Ed 

