[jira] [Commented] (TIKA-1592) It seems dbus and x11 server are invoked, and fails for some reason too
[ https://issues.apache.org/jira/browse/TIKA-1592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14394153#comment-14394153 ]

Michael Couck commented on TIKA-1592:
-------------------------------------

Gentlemen, I have wasted your time: I removed Tika from the dependencies and I still have the issue. Apologies. Great project! Thank you :)

> It seems dbus and x11 server are invoked, and fails for some reason too
> ---
>
> Key: TIKA-1592
> URL: https://issues.apache.org/jira/browse/TIKA-1592
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.7
> Environment: CentOS 6.6, Java 1.7
> Reporter: Michael Couck
>
> Exception running unit tests:
>
> GConf Error: Failed to contact configuration server; some possible causes are that you need to enable TCP/IP networking for ORBit, or you have stale NFS locks due to a system crash. See http://projects.gnome.org/gconf/ for information. (Details - 1: Not running within active session)
>
> Is Tika trying to start an x11 server using dbus? Why? This breaks the unit tests, the logging is a gig for each run, and even a 64 core server is at 100% CPU during the failure. I am completely confounded. Any ideas?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
RE: Any interest in running Apache Tika as part of CommonCrawl?
Sorry, link wasn't included: https://groups.google.com/forum/#!topic/common-crawl/Cv21VRQjGN0

From: tallison314...@gmail.com [mailto:tallison314...@gmail.com]
Sent: Friday, April 03, 2015 8:35 AM
To: d...@pdfbox.apache.org; dev@tika.apache.org; d...@poi.apache.org
Subject: Fwd: Any interest in running Apache Tika as part of CommonCrawl?

All,

What do we think?

On Friday, April 3, 2015 at 8:23:11 AM UTC-4, talliso...@gmail.com wrote:

CommonCrawl currently has the WET format that extracts plain text from web pages. My guess is that this is text stripping from text-y formats. Let me know if I'm wrong!

Would there be any interest in adding another format, WETT (WET-Tika), or supplementing the current WET by using Tika to extract contents from binary formats too: PDF, MSWord, etc.?

Julien Nioche kindly carved out 220 GB for us to experiment with on TIKA-1302 (https://issues.apache.org/jira/browse/TIKA-1302) on a Rackspace vm. But I'm wondering now if it would make more sense to have CommonCrawl run Tika as part of its regular process and make the output available in one of your standard formats. CommonCrawl consumers would get Tika output, and the Tika dev community (including its dependencies, PDFBox, POI, etc.) could get the stacktraces to help prioritize bug fixes.

Cheers,
Tim
[jira] [Commented] (TIKA-1592) It seems dbus and x11 server are invoked, and fails for some reason too
[ https://issues.apache.org/jira/browse/TIKA-1592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14394356#comment-14394356 ]

Michael Couck commented on TIKA-1592:
-------------------------------------

Just for completeness: I upgraded/updated the OS, perhaps a confounding event? As it turns out, a small change in the dbus/display/gconf/x11 combination is required. Put this in /etc/profile:

    eval $(dbus-launch --sh-syntax)
    export DBUS_SESSION_BUS_ADDRESS
    export DBUS_SESSION_BUS_PID

A little cryptic perhaps? Well, there you have it; several days to get to that. Hope no one else falls into the same trap.

Cheers,
Michael

> It seems dbus and x11 server are invoked, and fails for some reason too
> ...

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
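The fix comes down to making a D-Bus session bus address visible to the process that trips GConf. A minimal pre-flight check for that environment (a hypothetical helper sketched for illustration; it is not part of Tika or of the original report):

```shell
# check_session_bus: succeed only if a D-Bus session bus address is exported,
# i.e. the /etc/profile lines above have taken effect in this shell.
check_session_bus() {
    if [ -z "$DBUS_SESSION_BUS_ADDRESS" ]; then
        echo "no D-Bus session bus address set; did /etc/profile run dbus-launch?" >&2
        return 1
    fi
    echo "session bus at: $DBUS_SESSION_BUS_ADDRESS"
}
```

Running such a check before a test suite would separate the "Not running within active session" environment failure from a genuine parser bug.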
Re: FW: Any interest in running Apache Tika as part of CommonCrawl?
Tim,

Seems interesting, because it provides a big test dataset. As I see, they store pdfs/docs in WARC files, so there's source data for parsing.

--
Best regards,
Konstantin Gribov

Fri, 3 Apr 2015 at 17:29, Allison, Timothy B. talli...@mitre.org:

All, What do you think? https://groups.google.com/forum/#!topic/common-crawl/Cv21VRQjGN0 ...

-
To unsubscribe, e-mail: dev-unsubscr...@poi.apache.org
For additional commands, e-mail: dev-h...@poi.apache.org
[jira] [Commented] (TIKA-1593) Doco: Broken link to Parser Quick Start Guide
[ https://issues.apache.org/jira/browse/TIKA-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14394555#comment-14394555 ]

Konstantin Gribov commented on TIKA-1593:
-----------------------------------------

Thank you, Dan. Seems it should be something like https://tika.apache.org/1.7/parser_guide.html

> Doco: Broken link to Parser Quick Start Guide
> ---
>
> Key: TIKA-1593
> URL: https://issues.apache.org/jira/browse/TIKA-1593
> Project: Tika
> Issue Type: Bug
> Components: documentation
> Affects Versions: 1.7
> Reporter: Dan Rollo
> Priority: Minor
>
> On the Tika web page https://tika.apache.org/contribute.html, under the section "New Parsers, Detectors and Mime Types", there is a link with the text "Parser Quick Start Guide". The link URL is https://tika.apache.org/parser_guide.apt, and it does not work. The .apt extension seems odd. I don't know what the link should be.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Created] (TIKA-1593) Doco: Broken link to Parser Quick Start Guide
Dan Rollo created TIKA-1593:
-------------------------------

Summary: Doco: Broken link to Parser Quick Start Guide
Key: TIKA-1593
URL: https://issues.apache.org/jira/browse/TIKA-1593
Project: Tika
Issue Type: Bug
Components: documentation
Affects Versions: 1.7
Reporter: Dan Rollo
Priority: Minor

On the Tika web page https://tika.apache.org/contribute.html, under the section "New Parsers, Detectors and Mime Types", there is a link with the text "Parser Quick Start Guide". The link URL is https://tika.apache.org/parser_guide.apt, and it does not work. The .apt extension seems odd. I don't know what the link should be.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
Re: FW: Any interest in running Apache Tika as part of CommonCrawl?
Hi Tim,

Having looked at CC, a couple of ideas crossed my mind. I think it's cool. +1.

BR,
Oleg

On 3 Apr 2015 17:29, Allison, Timothy B. talli...@mitre.org wrote:

All, What do you think? https://groups.google.com/forum/#!topic/common-crawl/Cv21VRQjGN0 ...
[jira] [Closed] (TIKA-1592) It seems dbus and x11 server are invoked, and fails for some reason too
[ https://issues.apache.org/jira/browse/TIKA-1592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tyler Palsulich closed TIKA-1592.
---------------------------------
    Resolution: Invalid

Closing as Invalid. Feel free to create additional issues if you run into other problems with Tika! Thank you for updating with the solution! I'm glad you found it. :) (I'm also glad this wasn't a Tika issue... Ha.)

> It seems dbus and x11 server are invoked, and fails for some reason too
> ...

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
Re: FW: Any interest in running Apache Tika as part of CommonCrawl?
Hi,

I am very interested, as I have been following the Common Crawl activity for some time already. It sounds like a neat idea to do the check already when the crawl is done. Are the binary documents already part of the crawl-data?

Actually, I am currently playing around with the Common Crawl URL Index (http://blog.commoncrawl.org/2013/01/common-crawl-url-index/), which is a much smaller download (230 GB) and only contains URLs without all the additional information. The index is a bit outdated and currently only covers half of the full common crawl; however, there are people working on refreshing it for the latest crawls.

I wrote a small app which extracts interesting URLs out of these (i.e. files that POI should be able to open), resulting in approx. 6.6 million links! Based on some tests, for the full download there would be around 3.3 million documents requiring approximately 3 TB of storage. Note that this is still an old crawl with only half of the data included, so a current crawl will be considerably bigger!

Running them through the integration testing that we added in POI (which performs text and property extraction but also some other POI-related actions) already showed a few cases where slightly off-spec documents can cause bugs to appear; some initial related commits will follow shortly...

Dominik.

On Fri, Apr 3, 2015 at 4:28 PM, Allison, Timothy B. talli...@mitre.org wrote:

All, What do you think? https://groups.google.com/forum/#!topic/common-crawl/Cv21VRQjGN0 ...
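Dominik's app itself isn't attached to the thread. A minimal sketch of the extension-based filter he describes could look like the following (the extension list is an illustrative assumption, not taken from his code):

```python
from urllib.parse import urlparse

# File extensions for formats POI can open (an illustrative subset).
POI_EXTENSIONS = (".doc", ".docx", ".xls", ".xlsx", ".ppt", ".pptx", ".vsd", ".msg")

def poi_candidate_urls(urls):
    """Yield only URLs whose path ends in a POI-supported extension."""
    for url in urls:
        path = urlparse(url).path.lower()  # ignore query string and letter case
        if path.endswith(POI_EXTENSIONS):
            yield url
```

Streamed over the roughly 230 GB URL index, a filter of this shape is what would yield the "interesting" subset of links.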
Re: FW: Any interest in running Apache Tika as part of CommonCrawl?
Hi,

Similar to Dominik's approach of checking the file base for parsing errors, I'd like to scan for certain file constellations, e.g. for the typical "left over bytes" error or other record combinations which I can't reproduce with my MS/Libre Office versions. I haven't thought about how it's actually done, but I think logging the location in the integration tests and later manually checking the corresponding files should be sufficient.

Best wishes,
Andi

On 03.04.2015 17:51, Dominik Stadler wrote:

Hi, I am very interested as I am following the Common Crawl activity for some time already. It sounds like a neat idea to do the check already when the crawl is done, are the binary documents already part of the crawl-data? ...

Dominik.

On Fri, Apr 3, 2015 at 4:28 PM, Allison, Timothy B. talli...@mitre.org wrote:

All, What do you think?
Fwd: Any interest in running Apache Tika as part of CommonCrawl?
All,

What do we think?

On Friday, April 3, 2015 at 8:23:11 AM UTC-4, talliso...@gmail.com wrote:

CommonCrawl currently has the WET format that extracts plain text from web pages. My guess is that this is text stripping from text-y formats. Let me know if I'm wrong!

Would there be any interest in adding another format, WETT (WET-Tika), or supplementing the current WET by using Tika to extract contents from binary formats too: PDF, MSWord, etc.?

Julien Nioche kindly carved out 220 GB for us to experiment with on TIKA-1302 (https://issues.apache.org/jira/browse/TIKA-1302) on a Rackspace vm. But I'm wondering now if it would make more sense to have CommonCrawl run Tika as part of its regular process and make the output available in one of your standard formats. CommonCrawl consumers would get Tika output, and the Tika dev community (including its dependencies, PDFBox, POI, etc.) could get the stacktraces to help prioritize bug fixes.

Cheers,
Tim
Re: FW: Any interest in running Apache Tika as part of CommonCrawl?
Dominik,

I've downloaded one of the WARC files (from CC-MAIN-2015-01, https://aws-publicdatasets.s3.amazonaws.com/common-crawl/crawl-data/CC-MAIN-2015-06/segments/1422115855094.38/warc/CC-MAIN-20150124161055-0-ip-10-180-212-252.ec2.internal.warc.gz, 1.2 GB) and it contains at least PDFs and DOCs in the crawled data.

--
Best regards,
Konstantin Gribov

Fri, 3 Apr 2015 at 18:52, Dominik Stadler dominik.stad...@gmx.at:

Hi, I am very interested as I am following the Common Crawl activity for some time already. It sounds like a neat idea to do the check already when the crawl is done, are the binary documents already part of the crawl-data? ...
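Inspecting what record types such a WARC file holds takes little machinery. Below is a minimal, stdlib-only sketch of a WARC header reader, written for this thread rather than taken from it; for real work a dedicated WARC library is the safer choice:

```python
import io

def iter_warc_headers(stream):
    """Yield each WARC record's header block as a dict, skipping payloads.

    Assumes a well-formed, uncompressed WARC byte stream; wrap .warc.gz
    files with gzip.open(path, "rb") first.
    """
    while True:
        line = stream.readline()
        if not line:
            return
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)
        headers = {}
        while True:
            line = stream.readline()
            if not line.strip():
                break  # a blank line ends the header block
            key, _, value = line.decode("utf-8", "replace").partition(":")
            headers[key.strip()] = value.strip()
        yield headers
        stream.read(int(headers["Content-Length"]))  # skip the payload
```

Tallying `headers.get("WARC-Type")` over one crawl segment shows the mix of request/response/metadata records; the PDFs and DOCs Konstantin saw sit in the payloads of the response records.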
Re: Any interest in running Apache Tika as part of CommonCrawl?
+1, this makes immense sense to me. Thanks Juls and Tim.

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++

-----Original Message-----
From: tallison314...@gmail.com
Reply-To: dev@tika.apache.org
Date: Friday, April 3, 2015 at 5:35 AM
To: d...@pdfbox.apache.org, dev@tika.apache.org, d...@poi.apache.org
Subject: Fwd: Any interest in running Apache Tika as part of CommonCrawl?

All, What do we think? On Friday, April 3, 2015 at 8:23:11 AM UTC-4, talliso...@gmail.com wrote: ...
FW: Any interest in running Apache Tika as part of CommonCrawl?
All,

What do you think? https://groups.google.com/forum/#!topic/common-crawl/Cv21VRQjGN0

On Friday, April 3, 2015 at 8:23:11 AM UTC-4, talliso...@gmail.com wrote:

CommonCrawl currently has the WET format that extracts plain text from web pages. My guess is that this is text stripping from text-y formats. Let me know if I'm wrong!

Would there be any interest in adding another format, WETT (WET-Tika), or supplementing the current WET by using Tika to extract contents from binary formats too: PDF, MSWord, etc.?

Julien Nioche kindly carved out 220 GB for us to experiment with on TIKA-1302 (https://issues.apache.org/jira/browse/TIKA-1302) on a Rackspace vm. But I'm wondering now if it would make more sense to have CommonCrawl run Tika as part of its regular process and make the output available in one of your standard formats. CommonCrawl consumers would get Tika output, and the Tika dev community (including its dependencies, PDFBox, POI, etc.) could get the stacktraces to help prioritize bug fixes.

Cheers,
Tim