[jira] [Commented] (TIKA-1592) It seems dbus and x11 server are invoked, and fails for some reason too

2015-04-03 Thread Michael Couck (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14394153#comment-14394153
 ] 

Michael Couck commented on TIKA-1592:
-

Gentlemen, 

I have wasted your time: I removed Tika from the dependencies and I still have 
the issue. Apologies.

Great project! Thank you :)

 It seems dbus and x11 server are invoked, and fails for some reason too
 ---

 Key: TIKA-1592
 URL: https://issues.apache.org/jira/browse/TIKA-1592
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.7
 Environment: CentOs 6.6, Java 1.7
Reporter: Michael Couck

 Exception running unit tests:
 GConf Error: Failed to contact configuration server; some possible causes are 
 that you need to enable TCP/IP networking for ORBit, or you have stale NFS 
 locks due to a system crash. See http://projects.gnome.org/gconf/ for 
 information. (Details -  1: Not running within active session)
 Is Tika trying to start an x11 server using dbus? Why? This breaks the unit 
 tests, the logging is a gig for each run, and even a 64 core server is 100% 
 cpu during the failure. I am completely confounded. Any ideas?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


RE: Any interest in running Apache Tika as part of CommonCrawl?

2015-04-03 Thread Allison, Timothy B.
Sorry, link wasn’t included:

https://groups.google.com/forum/#!topic/common-crawl/Cv21VRQjGN0

From: tallison314...@gmail.com [mailto:tallison314...@gmail.com]
Sent: Friday, April 03, 2015 8:35 AM
To: d...@pdfbox.apache.org; dev@tika.apache.org; d...@poi.apache.org
Subject: Fwd: Any interest in running Apache Tika as part of CommonCrawl?

All,
  What do we think?

On Friday, April 3, 2015 at 8:23:11 AM UTC-4, 
talliso...@gmail.com wrote:
CommonCrawl currently has the WET format that extracts plain text from web 
pages.  My guess is that this is text stripping from text-y formats.  Let me 
know if I'm wrong!

Would there be any interest in adding another format, WETT (WET-Tika), or in 
supplementing the current WET by using Tika to extract contents from binary 
formats too (PDF, MSWord, etc.)?

Julien Nioche kindly carved out 220 GB for us to experiment with on 
TIKA-1302 (https://issues.apache.org/jira/browse/TIKA-1302) on a Rackspace vm.  
But, I'm wondering now if it would make more sense to have CommonCrawl run Tika 
as part of its regular process and make the output available in one of your 
standard formats.

CommonCrawl consumers would get Tika output, and the Tika dev community 
(including its dependencies, PDFBox, POI, etc.) could get the stacktraces to 
help prioritize bug fixes.

Cheers,

  Tim


[jira] [Commented] (TIKA-1592) It seems dbus and x11 server are invoked, and fails for some reason too

2015-04-03 Thread Michael Couck (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14394356#comment-14394356
 ] 

Michael Couck commented on TIKA-1592:
-

Just for completeness: I upgraded/updated the OS, which was perhaps a confounding 
event. As it turns out, there was a small change in the dbus/display/gconf/x11 
combination, and the following is now required in /etc/profile:

eval $(dbus-launch --sh-syntax)
export DBUS_SESSION_BUS_ADDRESS
export DBUS_SESSION_BUS_PID

A little cryptic perhaps? Well there you have it, several days to get to that, 
hope no one else falls into the same trap.
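
For anyone wanting a more defensive variant of the lines above, here is a minimal sketch. It assumes dbus-launch is on the PATH and that the snippet is sourced from /etc/profile; the function name ensure_dbus_session is made up for illustration. It starts a session bus only when the shell does not already have one, so nested logins reuse the existing bus:

```shell
# Sketch for /etc/profile; ensure_dbus_session is a hypothetical helper name.
# Start a D-Bus session bus only when this shell does not already have one.
ensure_dbus_session() {
    # An address is already set: reuse that bus instead of spawning another.
    if [ -n "${DBUS_SESSION_BUS_ADDRESS:-}" ]; then
        return 0
    fi
    # dbus-launch is not installed: leave the environment untouched.
    if ! command -v dbus-launch >/dev/null 2>&1; then
        return 0
    fi
    # --sh-syntax prints Bourne-shell assignments for the bus address and PID.
    eval "$(dbus-launch --sh-syntax)"
    export DBUS_SESSION_BUS_ADDRESS DBUS_SESSION_BUS_PID
}
```

Call ensure_dbus_session once after the definition. When an address is already set, the function leaves the environment as it found it.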

Cheers,
Michael



Re: FW: Any interest in running Apache Tika as part of CommonCrawl?

2015-04-03 Thread Konstantin Gribov
Tim,
this seems interesting, because it provides a big test dataset.
As far as I can see, they store PDFs/DOCs in WARC files, so the source data for
parsing is there.

-- 
Best regards,
Konstantin Gribov





[jira] [Commented] (TIKA-1593) Doco: Broken link to Parser Quick Start Guide

2015-04-03 Thread Konstantin Gribov (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14394555#comment-14394555
 ] 

Konstantin Gribov commented on TIKA-1593:
-

Thank you, Dan.

It seems it should be something like https://tika.apache.org/1.7/parser_guide.html

 Doco: Broken link to Parser Quick Start Guide
 ---

 Key: TIKA-1593
 URL: https://issues.apache.org/jira/browse/TIKA-1593
 Project: Tika
  Issue Type: Bug
  Components: documentation
Affects Versions: 1.7
Reporter: Dan Rollo
Priority: Minor

 The Tika web page: https://tika.apache.org/contribute.html, under the 
 Section: New Parsers, Detectors and Mime Types, there is a link with the 
 text: Parser Quick Start Guide. The link URL is: 
 https://tika.apache.org/parser_guide.apt, and does not work. 
 The .apt extension seems odd. I don't know what the link should be.





[jira] [Created] (TIKA-1593) Doco: Broken link to Parser Quick Start Guide

2015-04-03 Thread Dan Rollo (JIRA)
Dan Rollo created TIKA-1593:
---

 Summary: Doco: Broken link to Parser Quick Start Guide
 Key: TIKA-1593
 URL: https://issues.apache.org/jira/browse/TIKA-1593
 Project: Tika
  Issue Type: Bug
  Components: documentation
Affects Versions: 1.7
Reporter: Dan Rollo
Priority: Minor


The Tika web page: https://tika.apache.org/contribute.html, under the Section: 
New Parsers, Detectors and Mime Types, there is a link with the text: Parser 
Quick Start Guide. The link URL is: https://tika.apache.org/parser_guide.apt, 
and does not work. 

The .apt extension seems odd. I don't know what the link should be.





Re: FW: Any interest in running Apache Tika as part of CommonCrawl?

2015-04-03 Thread Oleg Tikhonov
Hi Tim,
Having looked at CC, a couple of ideas crossed my mind. I think it's cool.
+1.

BR,
Oleg



[jira] [Closed] (TIKA-1592) It seems dbus and x11 server are invoked, and fails for some reason too

2015-04-03 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich closed TIKA-1592.
-
Resolution: Invalid

Closing as Invalid. Feel free to create additional issues if you run into other 
problems with Tika!

Thank you for updating with the solution! I'm glad you found it. :) (I'm also 
glad this wasn't a Tika issue... Ha.)



Re: FW: Any interest in running Apache Tika as part of CommonCrawl?

2015-04-03 Thread Dominik Stadler
Hi,

I am very interested, as I have been following the Common Crawl activity for
some time. It sounds like a neat idea to do the check right when the crawl
is done. Are the binary documents already part of the crawl data?

Actually I am currently playing around with the Common Crawl URL Index
(http://blog.commoncrawl.org/2013/01/common-crawl-url-index/) which is
a much smaller sized download (230G) and only contains URLs without
all the additional information.

The index is a bit outdated and currently covers only half of the full
Common Crawl; however, there are people working on refreshing it for
the latest crawls.

I wrote a small app which extracts interesting URLs out of these (i.e.
files that POI should be able to open), resulting in approx. 6.6
million links! Based on some tests, a full download would yield
around 3.3 million documents requiring approximately 3 TB of
storage. Note that this is still an old crawl with only half of the
data included, so a current crawl will be considerably bigger!
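
The extraction step can be sketched roughly as follows. This is not the app mentioned above, only an illustration: the function name filter_office_urls and the extension list are assumptions, and it expects one URL per line on stdin.

```shell
# Hypothetical filter: keep URLs whose path ends in an extension that POI
# is likely to handle. The extension list is an assumption for illustration.
filter_office_urls() {
    # Match the extension case-insensitively at the end of the URL,
    # optionally followed by a query string or fragment.
    grep -Ei '\.(doc|docx|xls|xlsx|ppt|pptx)([?#].*)?$'
}
```

For example, `filter_office_urls < urls.txt > poi-candidates.txt`, where both file names are placeholders.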

Running them through the integration testing that we added in POI
(which performs text and property extraction but also some other
POI-related actions) has already shown a few cases where slightly
off-spec documents cause bugs to appear. Some initial related
commits will follow shortly.

Dominik.




Re: FW: Any interest in running Apache Tika as part of CommonCrawl?

2015-04-03 Thread Andreas Beeker
Hi,

Similar to Dominik's approach of checking the file base for parsing errors,
I'd like to scan for certain file constellations, such as the typical
leftover-bytes error or other record combinations which I can't reproduce
with my MS/Libre Office versions.

I haven't thought about how it would actually be done, but I think logging the
location in the integration tests and later checking the corresponding files
manually should be sufficient.

Best wishes,
Andi









Fwd: Any interest in running Apache Tika as part of CommonCrawl?

2015-04-03 Thread tallison314159
All,
  What do we think?

On Friday, April 3, 2015 at 8:23:11 AM UTC-4, talliso...@gmail.com wrote:

 CommonCrawl currently has the WET format that extracts plain text from web 
 pages.  My guess is that this is text stripping from text-y formats.  Let 
 me know if I'm wrong!

 Would there be any interest in adding another format, WETT (WET-Tika), or in 
 supplementing the current WET by using Tika to extract contents from binary 
 formats too (PDF, MSWord, etc.)?

 Julien Nioche kindly carved out 220 GB for us to experiment with on 
 TIKA-1302 https://issues.apache.org/jira/browse/TIKA-1302 on 
 a Rackspace vm.  But, I'm wondering now if it would make more sense to have 
 CommonCrawl run Tika as part of its regular process and make the output 
 available in one of your standard formats.  

 CommonCrawl consumers would get Tika output, and the Tika dev community 
 (including its dependencies, PDFBox, POI, etc.) could get the stacktraces 
 to help prioritize bug fixes.

 Cheers,

   Tim 



Re: FW: Any interest in running Apache Tika as part of CommonCrawl?

2015-04-03 Thread Konstantin Gribov
Dominik,
I've downloaded one of the WARC files from CC-MAIN-2015-06
(https://aws-publicdatasets.s3.amazonaws.com/common-crawl/crawl-data/CC-MAIN-2015-06/segments/1422115855094.38/warc/CC-MAIN-20150124161055-0-ip-10-180-212-252.ec2.internal.warc.gz,
1.2 GB), and it contains at least PDFs and DOCs in the crawled data.
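
A rough way to repeat this kind of check from a shell is sketched below. It assumes the decompressed WARC stream has one header per line; count_content_types is a hypothetical helper, and a line inside a binary payload that happens to start with Content-Type would be miscounted.

```shell
# Tally Content-Type header values found in a decompressed WARC stream on stdin.
# Assumption: headers appear one per line as "Content-Type: <type>[; params]".
count_content_types() {
    grep -a -i '^content-type:' \
        | tr -d '\r' \
        | sed -e 's/^[^:]*:[[:space:]]*//' -e 's/;.*$//' -e 's/[[:space:]]*$//' \
        | tr 'A-Z' 'a-z' \
        | grep -v '^application/http$' \
        | sort | uniq -c | sort -rn
    # application/http is dropped because it is the WARC record's own
    # envelope type, not the type of a crawled document.
}
```

Usage, with a placeholder file name: `gzip -dc segment.warc.gz | count_content_types`. Entries such as application/pdf or application/msword in the tally indicate binary documents in the crawl data.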

-- 
Best regards,
Konstantin Gribov





Re: Any interest in running Apache Tika as part of CommonCrawl?

2015-04-03 Thread Mattmann, Chris A (3980)
+1 this makes immense sense to me. Thanks Juls and Tim.

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++













FW: Any interest in running Apache Tika as part of CommonCrawl?

2015-04-03 Thread Allison, Timothy B.
All,

What do you think?


https://groups.google.com/forum/#!topic/common-crawl/Cv21VRQjGN0


On Friday, April 3, 2015 at 8:23:11 AM UTC-4, 
talliso...@gmail.com wrote:
CommonCrawl currently has the WET format that extracts plain text from web 
pages.  My guess is that this is text stripping from text-y formats.  Let me 
know if I'm wrong!

Would there be any interest in adding another format, WETT (WET-Tika), or in 
supplementing the current WET by using Tika to extract contents from binary 
formats too (PDF, MSWord, etc.)?

Julien Nioche kindly carved out 220 GB for us to experiment with on 
TIKA-1302 (https://issues.apache.org/jira/browse/TIKA-1302) on a Rackspace vm.  
But, I'm wondering now if it would make more sense to have CommonCrawl run Tika 
as part of its regular process and make the output available in one of your 
standard formats.

CommonCrawl consumers would get Tika output, and the Tika dev community 
(including its dependencies, PDFBox, POI, etc.) could get the stacktraces to 
help prioritize bug fixes.

Cheers,

  Tim