date:20170922


Hi,
On 22/09/17 22:02, Eugene Kirpichov wrote:

Sure - with hundreds of different file formats and the abundance of weird /
malformed / malicious files in the wild, it's quite expected that sometimes
the library will crash.

Some kinds of issues are easier to address than others. We can catch
exceptions and return a ParseResult representing a failure to parse this
document. Addressing freezes and native JVM process crashes is much harder
and probably not necessary in the first version.

Sergey - I think, the moment you introduce ParseResult into the code, other
changes I suggested will follow "by construction":
- There'll be 1 ParseResult per document, containing filename, content and
metadata, since per discussion above it probably doesn't make sense to
deliver these in separate PCollection elements


I was still harboring the hope that may be using a container bean like 
ParseResult (with the other changes you proposed) can somehow let us 
stream from Tika into the pipeline.


If it is 1 ParseResult per document then it means that until Tika has 
parsed all the document the pipeline will not see it.


I'm sorry if I may be starting to go in circles. But let me ask this. 
How can a Beam user write a Beam function which will ensure the Tika 
content pieces are seen ordered by the pipeline, without TikaIO ?


May be knowing that will help coming up with the idea how to generalize 
somehow with the help of TikaIO ?



- Since you're returning a single value per document, there's no reason to
use a BoundedReader
- Likewise, there's no reason to use asynchronicity because you're not
delivering the result incrementally

I'd suggest to start the refactoring by removing the asynchronous codepath,
then converting from BoundedReader to ParDo or MapElements, then converting
from String to ParseResult.
This is a good plan, thanks, I guess at least for small documents it 
should work well (unless I've misunderstood a ParseResult idea)


Thanks, Sergey


On Fri, Sep 22, 2017 at 12:10 PM Sergey Beryozkin 
wrote:


Hi Tim, All
On 22/09/17 18:17, Allison, Timothy B. wrote:

Y, I think you have it right.


Tika library has a big problem with crashes and freezes


I wouldn't want to overstate it.  Crashes and freezes are exceedingly

rare, but when you are processing millions/billions of files in the wild
[1], they will happen.  We fix the problems or try to get our dependencies
to fix the problems when we can,

I only would like to add to this that IMHO it would be more correct to
state it's not a Tika library's 'fault' that the crashes might occur.
Tika does its best to get the latest libraries helping it to parse the
files, but indeed there will always be some file there that might use
some incomplete format specific tag etc which may cause the specific
parser to spin - but Tika will include the updated parser library asap.

And with Beam's help the crashes that can kill the Tika jobs completely
will probably become a history...

Cheers, Sergey

but given our past history, I have no reason to believe that these

problems won't happen again.


Thank you, again!

Best,

  Tim

[1] Stuff on the internet or ... some of our users are forensics

examiners dealing with broken/corrupted files


P.S./FTR  
1) We've gathered a TB of data from CommonCrawl and we run regression

tests against this TB (thank you, Rackspace for hosting our vm!) to try to
identify these problems.

2) We've started a fuzzing effort to try to identify problems.
3) We added "tika-batch" for robust single box fileshare/fileshare

processing for our low volume users

4) We're trying to get the message out.  Thank you for working with us!!!

-Original Message-
From: Eugene Kirpichov [mailto:kirpic...@google.com.INVALID]
Sent: Friday, September 22, 2017 12:48 PM
To: d...@beam.apache.org
Cc: dev@tika.apache.org
Subject: Re: TikaIO concerns

Hi Tim,
  From what you're saying it sounds like the Tika library has a big

problem with crashes and freezes, and when applying it at scale (eg. in the
context of Beam) requires explicitly addressing this problem, eg. accepting
the fact that in many realistic applications some documents will just need
to be skipped because they are unprocessable? This would be first example
of a Beam IO that has this concern, so I'd like to confirm that my
understanding is correct.


On Fri, Sep 22, 2017 at 9:34 AM Allison, Timothy B. 
wrote:


Reuven,

Thank you!  This suggests to me that it is a good idea to integrate
Tika with Beam so that people don't have to 1) (re)discover the need
to make their wrappers robust and then 2) have to reinvent these
wheels for robustness.

For kicks, see William Palmer's post on his toe-stubbing efforts with
Hadoop [1].  He and other Tika users independently have wound up
carrying out exactly your recommendation for 1) below.

We have a MockParser that you can get to simulate regular exceptions,
OOMs and permanent hangs by asking Tika to

[jira] [Commented] (TIKA-2470) Another Illegal reflective Access -- more cleanup for Java 9

2017-09-22 Thread Konstantin Gribov (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16177131#comment-16177131
 ] 

Konstantin Gribov commented on TIKA-2470:
-

[~talli...@apache.org], speaking of 1.x -- perfectly ok, since there's no 
slf4j-api in tika-core. I planned to migrate tika-core to slf4j, but, I guess, 
it won't be in near future 'cause lack of time

> Another Illegal reflective Access -- more cleanup for Java 9
> 
>
> Key: TIKA-2470
> URL: https://issues.apache.org/jira/browse/TIKA-2470
> Project: Tika
>  Issue Type: Bug
>Reporter: Tim Allison
> Fix For: 1.17
>
>
> WARNING: Illegal reflective access by org.apache.tika.utils.XMLReaderUtils 
> (file:/C:/data/tika-eval-1.17-SNAPSHOT.jar) to method 
> com.sun.org.apache.xerces.internal.util.SecurityManager.setEntityExpansionLimit(int)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

[jira] [Comment Edited] (TIKA-2470) Another Illegal reflective Access -- more cleanup for Java 9


[ 
https://issues.apache.org/jira/browse/TIKA-2470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16177119#comment-16177119
 ] 

Tim Allison edited comment on TIKA-2470 at 9/22/17 9:03 PM:


[~grossws] I used the JUL Logger in tika-core. Is this the right logger to use 
in there?


was (Author: talli...@mitre.org):
[~grossws] I used the JUL Logger in tika-core is this the right logger to use 
in there?

> Another Illegal reflective Access -- more cleanup for Java 9
> 
>
> Key: TIKA-2470
> URL: https://issues.apache.org/jira/browse/TIKA-2470
> Project: Tika
>  Issue Type: Bug
>Reporter: Tim Allison
> Fix For: 1.17
>
>
> WARNING: Illegal reflective access by org.apache.tika.utils.XMLReaderUtils 
> (file:/C:/data/tika-eval-1.17-SNAPSHOT.jar) to method 
> com.sun.org.apache.xerces.internal.util.SecurityManager.setEntityExpansionLimit(int)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

[jira] [Commented] (TIKA-2470) Another Illegal reflective Access -- more cleanup for Java 9


[ 
https://issues.apache.org/jira/browse/TIKA-2470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16177119#comment-16177119
 ] 

Tim Allison commented on TIKA-2470:
---

[~grossws] I used the JUL Logger in tika-core is this the right logger to use 
in there?

> Another Illegal reflective Access -- more cleanup for Java 9
> 
>
> Key: TIKA-2470
> URL: https://issues.apache.org/jira/browse/TIKA-2470
> Project: Tika
>  Issue Type: Bug
>Reporter: Tim Allison
> Fix For: 1.17
>
>
> WARNING: Illegal reflective access by org.apache.tika.utils.XMLReaderUtils 
> (file:/C:/data/tika-eval-1.17-SNAPSHOT.jar) to method 
> com.sun.org.apache.xerces.internal.util.SecurityManager.setEntityExpansionLimit(int)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

[jira] [Resolved] (TIKA-2470) Another Illegal reflective Access -- more cleanup for Java 9


 [ 
https://issues.apache.org/jira/browse/TIKA-2470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2470.
---
   Resolution: Fixed
Fix Version/s: 1.17

> Another Illegal reflective Access -- more cleanup for Java 9
> 
>
> Key: TIKA-2470
> URL: https://issues.apache.org/jira/browse/TIKA-2470
> Project: Tika
>  Issue Type: Bug
>Reporter: Tim Allison
> Fix For: 1.17
>
>
> WARNING: Illegal reflective access by org.apache.tika.utils.XMLReaderUtils 
> (file:/C:/data/tika-eval-1.17-SNAPSHOT.jar) to method 
> com.sun.org.apache.xerces.internal.util.SecurityManager.setEntityExpansionLimit(int)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

[jira] [Updated] (TIKA-2470) Another Illegal reflective Access -- more cleanup for Java 9


 [ 
https://issues.apache.org/jira/browse/TIKA-2470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2470:
--
Description: WARNING: Illegal reflective access by 
org.apache.tika.utils.XMLReaderUtils 
(file:/C:/data/tika-eval-1.17-SNAPSHOT.jar) to method 
com.sun.org.apache.xerces.internal.util.SecurityManager.setEntityExpansionLimit(int)
  (was: WARNING: Illegal reflective access by 
org.apache.tika.utils.XMLReaderUtils 
(file:/C:/data/fsis_site_crawler/tika-eval-1.17-SNAPSHOT.jar) to method 
com.sun.org.apache.xerces.internal.util.SecurityManager.setEntityExpansionLimit(int))

> Another Illegal reflective Access -- more cleanup for Java 9
> 
>
> Key: TIKA-2470
> URL: https://issues.apache.org/jira/browse/TIKA-2470
> Project: Tika
>  Issue Type: Bug
>Reporter: Tim Allison
>
> WARNING: Illegal reflective access by org.apache.tika.utils.XMLReaderUtils 
> (file:/C:/data/tika-eval-1.17-SNAPSHOT.jar) to method 
> com.sun.org.apache.xerces.internal.util.SecurityManager.setEntityExpansionLimit(int)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

[jira] [Created] (TIKA-2470) Another Illegal reflective Access -- more cleanup for Java 9

Tim Allison created TIKA-2470:
-

 Summary: Another Illegal reflective Access -- more cleanup for 
Java 9
 Key: TIKA-2470
 URL: https://issues.apache.org/jira/browse/TIKA-2470
 Project: Tika
  Issue Type: Bug
Reporter: Tim Allison


WARNING: Illegal reflective access by org.apache.tika.utils.XMLReaderUtils 
(file:/C:/data/fsis_site_crawler/tika-eval-1.17-SNAPSHOT.jar) to method 
com.sun.org.apache.xerces.internal.util.SecurityManager.setEntityExpansionLimit(int)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Re: TikaIO concerns


Hi Tim, All
On 22/09/17 18:17, Allison, Timothy B. wrote:

Y, I think you have it right.


Tika library has a big problem with crashes and freezes


I wouldn't want to overstate it.  Crashes and freezes are exceedingly rare, but 
when you are processing millions/billions of files in the wild [1], they will 
happen.  We fix the problems or try to get our dependencies to fix the problems 
when we can,


I only would like to add to this that IMHO it would be more correct to 
state it's not a Tika library's 'fault' that the crashes might occur. 
Tika does its best to get the latest libraries helping it to parse the 
files, but indeed there will always be some file there that might use 
some incomplete format specific tag etc which may cause the specific 
parser to spin - but Tika will include the updated parser library asap.


And with Beam's help the crashes that can kill the Tika jobs completely 
will probably become a history...


Cheers, Sergey

but given our past history, I have no reason to believe that these problems 
won't happen again.

Thank you, again!

Best,

 Tim

[1] Stuff on the internet or ... some of our users are forensics examiners 
dealing with broken/corrupted files

P.S./FTR  
1) We've gathered a TB of data from CommonCrawl and we run regression tests 
against this TB (thank you, Rackspace for hosting our vm!) to try to identify 
these problems.
2) We've started a fuzzing effort to try to identify problems.
3) We added "tika-batch" for robust single box fileshare/fileshare processing 
for our low volume users
4) We're trying to get the message out.  Thank you for working with us!!!

-Original Message-
From: Eugene Kirpichov [mailto:kirpic...@google.com.INVALID]
Sent: Friday, September 22, 2017 12:48 PM
To: d...@beam.apache.org
Cc: dev@tika.apache.org
Subject: Re: TikaIO concerns

Hi Tim,
 From what you're saying it sounds like the Tika library has a big problem with 
crashes and freezes, and when applying it at scale (eg. in the context of Beam) 
requires explicitly addressing this problem, eg. accepting the fact that in 
many realistic applications some documents will just need to be skipped because 
they are unprocessable? This would be first example of a Beam IO that has this 
concern, so I'd like to confirm that my understanding is correct.

On Fri, Sep 22, 2017 at 9:34 AM Allison, Timothy B. 
wrote:


Reuven,

Thank you!  This suggests to me that it is a good idea to integrate
Tika with Beam so that people don't have to 1) (re)discover the need
to make their wrappers robust and then 2) have to reinvent these
wheels for robustness.

For kicks, see William Palmer's post on his toe-stubbing efforts with
Hadoop [1].  He and other Tika users independently have wound up
carrying out exactly your recommendation for 1) below.

We have a MockParser that you can get to simulate regular exceptions,
OOMs and permanent hangs by asking Tika to parse a  xml [2].


However if processing the document causes the process to crash, then
it

will be retried.
Any ideas on how to get around this?

Thank you again.

Cheers,

Tim

[1]
http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-w
eb-content-nanite/
[2]
https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/test-documents/mock/example.xml

RE: TikaIO concerns

Great.  Thank you!

-Original Message-
From: Chris Mattmann [mailto:mattm...@apache.org] 
Sent: Friday, September 22, 2017 1:46 PM
To: dev@tika.apache.org
Subject: Re: TikaIO concerns

[dropping Beam on this]

Tim, another thing is that you can finally download the TREC-DD Polar data 
either from the NSF Arctic Data Center (70GB zip), or from Amazon S3, as 
described here:

http://github.com/chrismattmann/trec-dd-polar/ 

In case we want to use as part of our regression.

Cheers,
Chris

On 9/22/17, 10:43 AM, "Allison, Timothy B."  wrote:

>>1) We've gathered a TB of data from CommonCrawl and we run regression 
tests against this TB (thank you, Rackspace for hosting our vm!) to try to 
identify these problems.

And if anyone with connections at a big company doing open source + cloud 
would be interested in floating us some storage and cycles,  we'd be happy to 
move off our single vm to increase coverage and improve the speed for our 
large-scale regression tests.  

:D

But seriously, thank you for this discussion and collaboration!

Cheers,

 Tim

Re: TikaIO concerns

2017-09-22 Thread Chris Mattmann

[dropping Beam on this]

Tim, another thing is that you can finally download the TREC-DD Polar data 
either
from the NSF Arctic Data Center (70GB zip), or from Amazon S3, as described 
here:

http://github.com/chrismattmann/trec-dd-polar/ 

In case we want to use as part of our regression.

Cheers,
Chris




On 9/22/17, 10:43 AM, "Allison, Timothy B."  wrote:

>>1) We've gathered a TB of data from CommonCrawl and we run regression 
tests against this TB (thank you, Rackspace for hosting our vm!) to try to 
identify these problems.

And if anyone with connections at a big company doing open source + cloud 
would be interested in floating us some storage and cycles,  we'd be happy to 
move off our single vm to increase coverage and improve the speed for our 
large-scale regression tests.  

:D

But seriously, thank you for this discussion and collaboration!

Cheers,

 Tim

RE: TikaIO concerns

>>1) We've gathered a TB of data from CommonCrawl and we run regression tests 
>>against this TB (thank you, Rackspace for hosting our vm!) to try to identify 
>>these problems.

And if anyone with connections at a big company doing open source + cloud would 
be interested in floating us some storage and cycles,  we'd be happy to move 
off our single vm to increase coverage and improve the speed for our 
large-scale regression tests.  

:D

But seriously, thank you for this discussion and collaboration!

Cheers,

 Tim

RE: TikaIO concerns

Nice!  Thank you!

-Original Message-
From: Ben Chambers [mailto:bchamb...@apache.org] 
Sent: Friday, September 22, 2017 1:24 PM
To: d...@beam.apache.org
Cc: dev@tika.apache.org
Subject: Re: TikaIO concerns

BigQueryIO allows a side-output for elements that failed to be inserted when 
using the Streaming BigQuery sink:

https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/StreamingWriteTables.java#L92

This follows the pattern of a DoFn with multiple outputs, as described here 
https://cloud.google.com/blog/big-data/2016/01/handling-invalid-inputs-in-dataflow

So, the DoFn that runs the Tika code could be configured in terms of how 
different failures should be handled, with the option of just outputting them 
to a different PCollection that is then processed in some other way.

On Fri, Sep 22, 2017 at 10:18 AM Allison, Timothy B. 
wrote:

> Do tell...
>
> Interesting.  Any pointers?
>
> -Original Message-
> From: Ben Chambers [mailto:bchamb...@google.com.INVALID]
> Sent: Friday, September 22, 2017 12:50 PM
> To: d...@beam.apache.org
> Cc: dev@tika.apache.org
> Subject: Re: TikaIO concerns
>
> Regarding specifically elements that are failing -- I believe some 
> other IO has used the concept of a "Dead Letter" side-output,, where 
> documents that failed to process are side-output so the user can 
> handle them appropriately.
>
> On Fri, Sep 22, 2017 at 9:47 AM Eugene Kirpichov 
>  wrote:
>
> > Hi Tim,
> > From what you're saying it sounds like the Tika library has a big 
> > problem with crashes and freezes, and when applying it at scale (eg.
> > in the context of Beam) requires explicitly addressing this problem, 
> > eg. accepting the fact that in many realistic applications some 
> > documents will just need to be skipped because they are unprocessable?
> > This would be first example of a Beam IO that has this concern, so 
> > I'd like to confirm that my understanding is correct.
> >
> > On Fri, Sep 22, 2017 at 9:34 AM Allison, Timothy B.
> > 
> > wrote:
> >
> > > Reuven,
> > >
> > > Thank you!  This suggests to me that it is a good idea to 
> > > integrate Tika with Beam so that people don't have to 1) 
> > > (re)discover the need to make their wrappers robust and then 2) 
> > > have to reinvent these wheels for robustness.
> > >
> > > For kicks, see William Palmer's post on his toe-stubbing efforts 
> > > with Hadoop [1].  He and other Tika users independently have wound 
> > > up carrying out exactly your recommendation for 1) below.
> > >
> > > We have a MockParser that you can get to simulate regular 
> > > exceptions,
> > OOMs
> > > and permanent hangs by asking Tika to parse a  xml [2].
> > >
> > > > However if processing the document causes the process to crash, 
> > > > then it
> > > will be retried.
> > > Any ideas on how to get around this?
> > >
> > > Thank you again.
> > >
> > > Cheers,
> > >
> > >Tim
> > >
> > > [1]
> > >
> > http://openpreservation.org/blog/2014/03/21/tika-ride-characterising
> > -w
> > eb-content-nanite/
> > > [2]
> > >
> > https://github.com/apache/tika/blob/master/tika-parsers/src/test/res
> > ou rces/test-documents/mock/example.xml
> > >
> >
>

Re: TikaIO concerns

2017-09-22 Thread Ben Chambers

BigQueryIO allows a side-output for elements that failed to be inserted
when using the Streaming BigQuery sink:

https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/StreamingWriteTables.java#L92

This follows the pattern of a DoFn with multiple outputs, as described here
https://cloud.google.com/blog/big-data/2016/01/handling-invalid-inputs-in-dataflow

So, the DoFn that runs the Tika code could be configured in terms of how
different failures should be handled, with the option of just outputting
them to a different PCollection that is then processed in some other way.

On Fri, Sep 22, 2017 at 10:18 AM Allison, Timothy B. 
wrote:

> Do tell...
>
> Interesting.  Any pointers?
>
> -Original Message-
> From: Ben Chambers [mailto:bchamb...@google.com.INVALID]
> Sent: Friday, September 22, 2017 12:50 PM
> To: d...@beam.apache.org
> Cc: dev@tika.apache.org
> Subject: Re: TikaIO concerns
>
> Regarding specifically elements that are failing -- I believe some other
> IO has used the concept of a "Dead Letter" side-output,, where documents
> that failed to process are side-output so the user can handle them
> appropriately.
>
> On Fri, Sep 22, 2017 at 9:47 AM Eugene Kirpichov
>  wrote:
>
> > Hi Tim,
> > From what you're saying it sounds like the Tika library has a big
> > problem with crashes and freezes, and when applying it at scale (eg.
> > in the context of Beam) requires explicitly addressing this problem,
> > eg. accepting the fact that in many realistic applications some
> > documents will just need to be skipped because they are unprocessable?
> > This would be first example of a Beam IO that has this concern, so I'd
> > like to confirm that my understanding is correct.
> >
> > On Fri, Sep 22, 2017 at 9:34 AM Allison, Timothy B.
> > 
> > wrote:
> >
> > > Reuven,
> > >
> > > Thank you!  This suggests to me that it is a good idea to integrate
> > > Tika with Beam so that people don't have to 1) (re)discover the need
> > > to make their wrappers robust and then 2) have to reinvent these
> > > wheels for robustness.
> > >
> > > For kicks, see William Palmer's post on his toe-stubbing efforts
> > > with Hadoop [1].  He and other Tika users independently have wound
> > > up carrying out exactly your recommendation for 1) below.
> > >
> > > We have a MockParser that you can get to simulate regular
> > > exceptions,
> > OOMs
> > > and permanent hangs by asking Tika to parse a  xml [2].
> > >
> > > > However if processing the document causes the process to crash,
> > > > then it
> > > will be retried.
> > > Any ideas on how to get around this?
> > >
> > > Thank you again.
> > >
> > > Cheers,
> > >
> > >Tim
> > >
> > > [1]
> > >
> > http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-w
> > eb-content-nanite/
> > > [2]
> > >
> > https://github.com/apache/tika/blob/master/tika-parsers/src/test/resou
> > rces/test-documents/mock/example.xml
> > >
> >
>

RE: TikaIO concerns

Do tell...

Interesting.  Any pointers?

-Original Message-
From: Ben Chambers [mailto:bchamb...@google.com.INVALID] 
Sent: Friday, September 22, 2017 12:50 PM
To: d...@beam.apache.org
Cc: dev@tika.apache.org
Subject: Re: TikaIO concerns

Regarding specifically elements that are failing -- I believe some other IO has 
used the concept of a "Dead Letter" side-output,, where documents that failed 
to process are side-output so the user can handle them appropriately.

On Fri, Sep 22, 2017 at 9:47 AM Eugene Kirpichov  
wrote:

> Hi Tim,
> From what you're saying it sounds like the Tika library has a big 
> problem with crashes and freezes, and when applying it at scale (eg. 
> in the context of Beam) requires explicitly addressing this problem, 
> eg. accepting the fact that in many realistic applications some 
> documents will just need to be skipped because they are unprocessable? 
> This would be first example of a Beam IO that has this concern, so I'd 
> like to confirm that my understanding is correct.
>
> On Fri, Sep 22, 2017 at 9:34 AM Allison, Timothy B. 
> 
> wrote:
>
> > Reuven,
> >
> > Thank you!  This suggests to me that it is a good idea to integrate 
> > Tika with Beam so that people don't have to 1) (re)discover the need 
> > to make their wrappers robust and then 2) have to reinvent these 
> > wheels for robustness.
> >
> > For kicks, see William Palmer's post on his toe-stubbing efforts 
> > with Hadoop [1].  He and other Tika users independently have wound 
> > up carrying out exactly your recommendation for 1) below.
> >
> > We have a MockParser that you can get to simulate regular 
> > exceptions,
> OOMs
> > and permanent hangs by asking Tika to parse a  xml [2].
> >
> > > However if processing the document causes the process to crash, 
> > > then it
> > will be retried.
> > Any ideas on how to get around this?
> >
> > Thank you again.
> >
> > Cheers,
> >
> >Tim
> >
> > [1]
> >
> http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-w
> eb-content-nanite/
> > [2]
> >
> https://github.com/apache/tika/blob/master/tika-parsers/src/test/resou
> rces/test-documents/mock/example.xml
> >
>

RE: TikaIO concerns

Y, I think you have it right.

> Tika library has a big problem with crashes and freezes

I wouldn't want to overstate it.  Crashes and freezes are exceedingly rare, but 
when you are processing millions/billions of files in the wild [1], they will 
happen.  We fix the problems or try to get our dependencies to fix the problems 
when we can, but given our past history, I have no reason to believe that these 
problems won't happen again.

Thank you, again!

Best,

Tim

[1] Stuff on the internet or ... some of our users are forensics examiners 
dealing with broken/corrupted files

P.S./FTR  
1) We've gathered a TB of data from CommonCrawl and we run regression tests 
against this TB (thank you, Rackspace for hosting our vm!) to try to identify 
these problems. 
2) We've started a fuzzing effort to try to identify problems.
3) We added "tika-batch" for robust single box fileshare/fileshare processing 
for our low volume users 
4) We're trying to get the message out.  Thank you for working with us!!!

-Original Message-
From: Eugene Kirpichov [mailto:kirpic...@google.com.INVALID] 
Sent: Friday, September 22, 2017 12:48 PM
To: d...@beam.apache.org
Cc: dev@tika.apache.org
Subject: Re: TikaIO concerns

Hi Tim,
From what you're saying it sounds like the Tika library has a big problem with 
crashes and freezes, and when applying it at scale (eg. in the context of Beam) 
requires explicitly addressing this problem, eg. accepting the fact that in 
many realistic applications some documents will just need to be skipped because 
they are unprocessable? This would be first example of a Beam IO that has this 
concern, so I'd like to confirm that my understanding is correct.

On Fri, Sep 22, 2017 at 9:34 AM Allison, Timothy B. 
wrote:

> Reuven,
>
> Thank you!  This suggests to me that it is a good idea to integrate 
> Tika with Beam so that people don't have to 1) (re)discover the need 
> to make their wrappers robust and then 2) have to reinvent these 
> wheels for robustness.
>
> For kicks, see William Palmer's post on his toe-stubbing efforts with 
> Hadoop [1].  He and other Tika users independently have wound up 
> carrying out exactly your recommendation for 1) below.
>
> We have a MockParser that you can get to simulate regular exceptions, 
> OOMs and permanent hangs by asking Tika to parse a  xml [2].
>
> > However if processing the document causes the process to crash, then 
> > it
> will be retried.
> Any ideas on how to get around this?
>
> Thank you again.
>
> Cheers,
>
>Tim
>
> [1]
> http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-w
> eb-content-nanite/
> [2]
> https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/test-documents/mock/example.xml
>

Re: TikaIO concerns

2017-09-22 Thread Ben Chambers

Regarding specifically elements that are failing -- I believe some other IO
has used the concept of a "Dead Letter" side-output,, where documents that
failed to process are side-output so the user can handle them appropriately.

On Fri, Sep 22, 2017 at 9:47 AM Eugene Kirpichov
 wrote:

> Hi Tim,
> From what you're saying it sounds like the Tika library has a big problem
> with crashes and freezes, and when applying it at scale (eg. in the context
> of Beam) requires explicitly addressing this problem, eg. accepting the
> fact that in many realistic applications some documents will just need to
> be skipped because they are unprocessable? This would be first example of a
> Beam IO that has this concern, so I'd like to confirm that my understanding
> is correct.
>
> On Fri, Sep 22, 2017 at 9:34 AM Allison, Timothy B. 
> wrote:
>
> > Reuven,
> >
> > Thank you!  This suggests to me that it is a good idea to integrate Tika
> > with Beam so that people don't have to 1) (re)discover the need to make
> > their wrappers robust and then 2) have to reinvent these wheels for
> > robustness.
> >
> > For kicks, see William Palmer's post on his toe-stubbing efforts with
> > Hadoop [1].  He and other Tika users independently have wound up carrying
> > out exactly your recommendation for 1) below.
> >
> > We have a MockParser that you can get to simulate regular exceptions,
> OOMs
> > and permanent hangs by asking Tika to parse a  xml [2].
> >
> > > However if processing the document causes the process to crash, then it
> > will be retried.
> > Any ideas on how to get around this?
> >
> > Thank you again.
> >
> > Cheers,
> >
> >Tim
> >
> > [1]
> >
> http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
> > [2]
> >
> https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/test-documents/mock/example.xml
> >
>

RE: TikaIO concerns

Reuven,

Thank you!  This suggests to me that it is a good idea to integrate Tika with 
Beam so that people don't have to 1) (re)discover the need to make their 
wrappers robust and then 2) have to reinvent these wheels for robustness.  

For kicks, see William Palmer's post on his toe-stubbing efforts with Hadoop 
[1].  He and other Tika users independently have wound up carrying out exactly 
your recommendation for 1) below. 

We have a MockParser that you can get to simulate regular exceptions, OOMs and 
permanent hangs by asking Tika to parse a  xml [2]. 

> However if processing the document causes the process to crash, then it will 
> be retried.
Any ideas on how to get around this?

Thank you again.

Cheers,

   Tim

[1] 
http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
[2] 
https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/test-documents/mock/example.xml

RE: TikaIO concerns

>> How will it work now, with new Metadata() passed to the AutoDetect parser, 
>> will this Metadata have a Metadata value per every attachment, possibly 
>> keyed by a name ?

An example of how to call the RecursiveParserWrapper:

https://github.com/apache/tika/blob/master/tika-example/src/main/java/org/apache/tika/example/ParsingExample.java#L138

To serialize the List, use:

https://github.com/apache/tika/blob/master/tika-serialization/src/main/java/org/apache/tika/metadata/serialization/JsonMetadataList.java#L47

Re: TikaIO concerns


Hi Tim

Sorry for getting into the RecursiveParserWrapper discussion first, I 
was certain the time zone difference was on my side :-)


How will it work now, with new Metadata() passed to the AutoDetect 
parser, will this Metadata have a Metadata value per every attachment, 
possibly keyed by a name ?


Thanks, Sergey
On 22/09/17 12:58, Allison, Timothy B. wrote:

@Timothy: can you tell more about this RecursiveParserWrapper? Is this 
something that the user can configure by specifying the Parser on TikaIO if 
they so wish?

Not at the moment, we’d have to do some coding on our end or within Beam.  The 
format is a list of maps/dicts for each file.  Each map contains all of the 
metadata, with one key reserved for the content.  If a file is a file with no 
attachments, the list has length 1; otherwise there’s a map for each embedded 
file.  Unlike our legacy xhtml, this format maintains metadata for attachments.

The downside to this extract format is that it requires a full parse of the 
document and all data to be held in-memory before writing it.  On the other 
hand, while Tika tries to be streaming, and that was one of the critical early 
design goals, for some file formats, we simply have to parse the whole thing 
before we can have any output.

So, y, large files are a problem. :\

Example with purely made-up keys representing a pdf file containing an RTF 
attachment
[
{
Name : “container file”,
Author: “Chris Mattmann”,
Content: “Four score and seven years ago…”,
Content-type: “application/pdf”
   …
},
{
   Name : “embedded file1”
   Author: “Nick Burch”,
   Content: “When in the course of human events…”,
   Content-type: “application/rtf”
}
]

From: Eugene Kirpichov [mailto:kirpic...@google.com]
Sent: Thursday, September 21, 2017 7:42 PM
To: Allison, Timothy B. ; d...@beam.apache.org
Cc: dev@tika.apache.org
Subject: Re: TikaIO concerns

Hi,
@Sergey:
- I already marked TikaIO @Experimental, so we can make changes.
- Yes, the String in KV is the filename. I guess we could 
alternatively put it into ParseResult - don't have a strong opinion.

@Chris: unorderedness of Metadata would have helped if we extracted each 
Metadata item into a separate PCollection element, but that's not what we want 
to do (we want to have an element per document instead).

@Timothy: can you tell more about this RecursiveParserWrapper? Is this 
something that the user can configure by specifying the Parser on TikaIO if 
they so wish?

On Thu, Sep 21, 2017 at 2:23 PM Allison, Timothy B. 
> wrote:
Like Sergey, it’ll take me some time to understand your recommendations.  Thank 
you!

On one small point:

return a PCollection>, where ParseResult is a 
class with properties { String content, Metadata metadata }


For this option, I’d strongly encourage using the Json output from the 
RecursiveParserWrapper that contains metadata and content, and captures 
metadata even from embedded documents.


However, since TikaIO can be applied to very large files, this could produce 
very large elements, which is a bad idea

Large documents are a problem, no doubt about it…

From: Eugene Kirpichov 
[mailto:kirpic...@google.com]
Sent: Thursday, September 21, 2017 4:41 PM
To: Allison, Timothy B. >; 
d...@beam.apache.org
Cc: dev@tika.apache.org
Subject: Re: TikaIO concerns

Thanks all for the discussion. It seems we have consensus that both 
within-document order and association with the original filename are necessary, 
but currently absent from TikaIO.

Association with original file:
Sergey - Beam does not automatically provide a way to associate an element with 
the file it originated from: automatically tracking data provenance is a known 
very hard research problem on which many papers have been written, and obvious 
solutions are very easy to break. See related discussion at 
https://lists.apache.org/thread.html/32aab699db3901d9f0191ac7dbc0091b31cb8be85eee6349deaee671@%3Cuser.beam.apache.org%3E
 .

If you want the elements of your PCollection to contain additional information, 
you need the elements themselves to contain this information: the elements are 
self-contained and have no metadata associated with them (beyond the timestamp 
and windows, universal to the whole Beam model).

Order within a file:
The only way to have any kind of order within a PCollection is to have the elements of the 
PCollection contain something ordered, e.g. have a 
PCollection, where each List is for one file [I'm assuming 
Tika, at a low level, works on a per-file basis?]. However, since TikaIO can be applied to 
very large files, this could produce very large elements, which is a bad idea. Because of 
this, I don't think the result of applying Tika to a single file can be encoded as a

RE: TikaIO concerns

@Eugene: What's the best way to have Beam help us with these issues, or do 
these come for free with the Beam framework? 

1) a process-level timeout (because you can't actually kill a thread in Java)
2) a process-level restart on OOM
3) avoid trying to reprocess a badly behaving document

Re: TikaIO concerns

Hi,
On 22/09/17 00:42, Eugene Kirpichov wrote:

Hi,
@Sergey:
- I already marked TikaIO @Experimental, so we can make changes.

OK, thanks

- Yes, the String in KV is the filename. I guess we
could alternatively put it into ParseResult - don't have a strong opinion.

Sure. If you don't mind then the 1st thing I'd like to try hopefully
early next week is to introduce ParseResult first into the existing code.
I know it won't 'fix' the issues related to the ordering, but starting
with a complete re-write would be a steep curve for me, so I'd try to
experiment first with the idea (which I like very much) of wrapping
several related pieces (content fragment, metadata, and the doc id/file
name) into ParseResult.

By the way, reporting Tika file (output) metadata with every ParseResult
instance will work much better, I thought first it won't because Tika
does not callback when it populates the file metadata; it only does it
for the actual content, but it will update the Metadata instance passed
to it while it keeps parsing and finding the new metadata, so the
metadata pieces will be available to the pipeline as soon as they may
become available. Though Tika (1.17 ?) may need to ensure its Metadata
is backed up by the concurrent map for this approach to work, not sure
yet...

@Chris: unorderedness of Metadata would have helped if we extracted each
Metadata item into a separate PCollection element, but that's not what we
want to do (we want to have an element per document instead).

@Timothy: can you tell more about this RecursiveParserWrapper? Is this
something that the user can configure by specifying the Parser on TikaIO if
they so wish?

As a general note, Metadata passed to the top-level parser acts as a
file (and embedded attachments) metadata sink but also as a 'helper' to
the parser, right now TikaIO uses it to pass a media type hint if
available (to help the auto-detect parser select the correct parser
faster), and also a parser which will be used to parse the embedded
attachments (I did it after Tim hinted about it earlier on...).

Not sure if RecusriveParserWrapper can act as a top-level parser or
needs to be passed as a metadata property to AutoDetectParser, Tim will
know :-)

Thanks, Sergey

On Thu, Sep 21, 2017 at 2:23 PM Allison, Timothy B.
wrote:

Like Sergey, it’ll take me some time to understand your recommendations.
Thank you!

On one small point:

return a PCollection>, where ParseResult

is a class with properties { String content, Metadata metadata }

For this option, I’d strongly encourage using the Json output from the
RecursiveParserWrapper that contains metadata and content, and captures
metadata even from embedded documents.

However, since TikaIO can be applied to very large files, this could

produce very large elements, which is a bad idea

Large documents are a problem, no doubt about it…

*From:* Eugene Kirpichov [mailto:kirpic...@google.com]
*Sent:* Thursday, September 21, 2017 4:41 PM
*To:* Allison, Timothy B. ; d...@beam.apache.org
*Cc:* dev@tika.apache.org
*Subject:* Re: TikaIO concerns

Thanks all for the discussion. It seems we have consensus that both
within-document order and association with the original filename are
necessary, but currently absent from TikaIO.

*Association with original file:*

Sergey - Beam does not *automatically* provide a way to associate an
element with the file it originated from: automatically tracking data
provenance is a known very hard research problem on which many papers have
been written, and obvious solutions are very easy to break. See related
discussion at
https://lists.apache.org/thread.html/32aab699db3901d9f0191ac7dbc0091b31cb8be85eee6349deaee671@%3Cuser.beam.apache.org%3E
.

If you want the elements of your PCollection to contain additional
information, you need the elements themselves to contain this information:
the elements are self-contained and have no metadata associated with them
(beyond the timestamp and windows, universal to the whole Beam model).

*Order within a file:*

The only way to have any kind of order within a PCollection is to have the
elements of the PCollection contain something ordered, e.g. have a
PCollection, where each List is for one file [I'm assuming
Tika, at a low level, works on a per-file basis?]. However, since TikaIO
can be applied to very large files, this could produce very large elements,
which is a bad idea. Because of this, I don't think the result of applying
Tika to a single file can be encoded as a PCollection element.

Given both of these, I think that it's not possible to create a
*general-purpose* TikaIO transform that will be better than manual
invocation of Tika as a DoFn on the result of FileIO.readMatches().

However, looking at the examples at
https://tika.apache.org/1.16/examples.html - almost all of the examples
involve extracting a

RE: TikaIO concerns