Re: ApacheCon EU Sevilla

2016-06-29 Thread Mattmann, Chris A (3980)
I’m thinking about it :) Would be great to go. ++ Chris Mattmann, Ph.D. Chief Architect Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 168-519, Mailstop:

Re: Nutch uses git?

2016-06-24 Thread Mattmann, Chris A (3980)
On 6/24/16, 5:56 PM, "Mattmann, Chris A (3980)" <chris.a.mattm...@jpl.nasa.gov> wrote: >Fixed, sorry. > > > > >

Re: Nutch uses git?

2016-06-24 Thread Mattmann, Chris A (3980)
Fixed, sorry. On 6/24/16, 5:53 PM, "Gav" wrote: >Hi All, > > >Obvious to you all that you use Git as your primary scm but for potential new >contributors it may not be. > > >This page :- > >http://nutch.apache.org/version_control.html > > >says you use SVN as your

Re: [VOTE] Release Apache Nutch 1.12

2016-06-16 Thread Mattmann, Chris A (3980)
+1 from me, great job Lewis and team! SIGS pass, CHECKSUMS pass: LMC-053601:apache-nutch-1.12-rc1 mattmann$ $HOME/bin/stage_apache_rc apache-nutch 1.12-bin https://dist.apache.org/repos/dist/dev/nutch/1.12/ % Total % Received % Xferd Average Speed Time Time Time Current

Re: Proposal to Use Jira Release Notes within CHANGES.txt

2016-05-17 Thread Mattmann, Chris A (3980)
Neat thanks for sending Lewis! FYI I wrote some tools in Python to parse Tika (and Nutch) style Changes.txt, and to generate an APT output template for e.g., web page release notes. FYI here: https://github.com/chrismattmann/apachestuff/blob/master/extract-tika-issues.py Works with JIRA and
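
The idea of pulling issue references out of a Changes.txt-style file can be illustrated with a short sketch. This is not the linked extract-tika-issues.py script; the regular expression, the function name extract_issue_keys, and the sample entry layout are assumptions made for illustration. The first two sample titles are taken from thread subjects in this listing, and the third line is made up.

```python
import re

# Assumed key shape: an upper-case project key, a hyphen, and a number
# (for example NUTCH-2166). This is an illustrative pattern, not one taken
# from the script linked in the message.
ISSUE_KEY = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")


def extract_issue_keys(changes_text):
    """Return the unique issue keys found in the text, in first-seen order."""
    seen = []
    for match in ISSUE_KEY.finditer(changes_text):
        key = match.group(0)
        if key not in seen:
            seen.append(key)
    return seen


# Hypothetical sample in a CHANGES.txt-like layout.
sample = """\
* NUTCH-2166 Add reverse URL format to dump tool
* NUTCH-2155 Create a "crawl completeness" utility
* NUTCH-2166 follow-up fix
"""

print(extract_issue_keys(sample))
```

A pattern this loose also matches strings such as UTF-8, so a real tool would probably restrict the project key to a known list.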

Re: [DISCUSS] Release Nutch 1.12

2016-05-10 Thread Mattmann, Chris A (3980)
+1 from me!

Need to update version control page and SVN docs to point to Git

2016-04-28 Thread Mattmann, Chris A (3980)
Thanks to Gav for reminding me Sent from my iPhone

Re: Jenkins build failures after git migration

2016-04-21 Thread Mattmann, Chris A (3980)
o normal. > >Sebastian > >On 04/18/2016 05:56 PM, Mattmann, Chris A (3980) wrote: >> Hey Seb, I’ll also take a look. @Lewis could potentially help here >> too. Lewis any time to scope? >> >> >>

Re: Jenkins build failures after git migration

2016-04-18 Thread Mattmann, Chris A (3980)
Hey Seb, I’ll also take a look. @Lewis could potentially help here too. Lewis any time to scope?

Re: Maven Central Plugins

2016-04-10 Thread Mattmann, Chris A (3980)
I am +1 and also +1 to branch and start to just build Maven3 support full out in Nutch.

FW: Apache Tika used to parse the Panama papers!

2016-04-05 Thread Mattmann, Chris A (3980)
FYI: On 4/5/16, 6:46 PM, "Mattmann, Chris A (3980)" <chris.a.mattm...@jpl.nasa.gov> wrote: >FYI: >http://www.forbes.com/sites/thomasbrewster/2016/04/05/panama-papers-amazon-encryption-epic-leak/?utm_campaign=ForbesTech_source=TWITTER_medium=social_channel=Technology

Re: Recent stackoverflow questions

2016-04-05 Thread Mattmann, Chris A (3980)
yeah well we are going to have to accept that some of these will appear on SO, but that we will try as hard as possible to suggest they contact the dev list as you mentioned :) Thanks for commenting Markus.

Re: 1.11 branch/tag

2016-03-19 Thread Mattmann, Chris A (3980)
try: release-1.11-rc2 :)

[NOTICE] Nutch now using Writeable Git repos at the ASF

2016-02-26 Thread Mattmann, Chris A (3980)
Hi Team, Nutch now officially uses Git to manage its source repos. You can see the final elements to that here: https://issues.apache.org/jira/browse/INFRA-11300 I’ve written a guide for the wiki describing how to migrate your existing SVN checkout to Nutch if you are a user or a developer.

Re: [RESULT] [VOTE] Moving to Git

2016-02-23 Thread Mattmann, Chris A (3980)
thern California, Los Angeles, CA 90089 USA >> ++ >> >> >> >> >> >> -Original Message- >> From: Sebastian Nagel <wastl.na...@googlemail.com> >> Reply-To: "dev@nutc

Re: [RESULT] [VOTE] Moving to Git

2016-02-21 Thread Mattmann, Chris A (3980)
Git >Thanks, Chris! > >On 02/20/2016 08:49 AM, Mattmann, Chris A (3980) wrote: >> Team: >> >> https://issues.apache.org/jira/browse/INFRA-11300 >> >> >> to track the progress.. >> >> Cheers, >> Chris >> >> +++

Re: [RESULT] [VOTE] Moving to Git

2016-02-19 Thread Mattmann, Chris A (3980)
Team: https://issues.apache.org/jira/browse/INFRA-11300 to track the progress.. Cheers, Chris

[RESULT] [VOTE] Moving to Git

2016-02-19 Thread Mattmann, Chris A (3980)
Team, This VOTE has PASSED with the following tallies: +1 PMC Chris Mattmann* Sebastian Nagel* Michael Joyce* Asitang Mishra* Dennis Kubes* BlackIce Julien Nioche* Sujen Shah* Given that, I’ll file a ticket with INFRA to move the repos over. thanks! Cheers, Chris

Re: [VOTE] Release Apache Nutch 2.3.1rc2

2016-01-20 Thread Mattmann, Chris A (3980)
My bad I said I would do this! Here you go it’s done: +1 SIGS, checksums check out: [chipotle:~/tmp/apache-nutch-2.3.1-rc2] mattmann% $HOME/bin/stage_apache_rc apache-nutch 2.3.1-src https://dist.apache.org/repos/dist/dev/nutch/2.3.1rc2/ % Total % Received % Xferd Average Speed Time

Re: [VOTE] Release Apache Nutch 2.3.1rc2

2016-01-13 Thread Mattmann, Chris A (3980)
I will review tonight.

Re: [VOTE] Moving to Git

2016-01-10 Thread Mattmann, Chris A (3980)
RE: The note - all Dennis has to do is ask to be back on the PMC and he would be welcomed back in a jiffy, as an Emeritus PMC member. Cheers, Chris

Re: [RELEASE] Apache Nutch 1.11

2015-12-08 Thread Mattmann, Chris A (3980)
Booom Sent from my iPhone On Dec 8, 2015, at 3:18 PM, Michael Joyce wrote: Cheers for pushing this out Lewis. And great job everyone on the hard work!!! -- Jimmy On Tue, Dec 8, 2015 at 1:26 AM, Markus Jelsma

Re: Dropping Nutch 1.11RC#1 Artifacts

2015-12-04 Thread Mattmann, Chris A (3980)
Hey Lewis, 1.11 RC #1 release artifacts dropped from Nexus. You should have perms to remove the release artifacts from dist.apache.org/repos/dist/. I can remove them though? [mattmann-0420740:~/tmp/nutch1.11/release] mattmann% svn rm 1.11 D 1.11 D 1.11/CHANGES-1.11.txt D

Re: [VOTE] Release Apache Nutch 1.11 RC#2

2015-12-04 Thread Mattmann, Chris A (3980)
Hi Lewis, +1 from me. SIGS and CHECKSUMS check out. bash-3.2$ for atype in bin src; do /Users/mattmann/bin/stage_apache_rc apache-nutch 1.11-$atype https://dist.apache.org/repos/dist/dev/nutch/1.11rc2/; done % Total % Received % Xferd Average Speed Time Time Time Current

Re: [DISCUSS] Release Nutch 1.11?

2015-11-20 Thread Mattmann, Chris A (3980)
Sounds great mate let's get the rc up there Sent from my iPhone On Nov 20, 2015, at 10:32 AM, Lewis John Mcgibbney wrote: Hi Folks, Title says it all. There is only one pending issue for 1.11.

Re: [DISCUSS] Moving to Git

2015-11-20 Thread Mattmann, Chris A (3980)
"dev@nutch.apache.org" <dev@nutch.apache.org> Subject: Re: [DISCUSS] Moving to Git >+1 from me > >But, please, after 1.11 and 2.3.1 have been finally released. >There is few work to do, and we should keep the releases on focus first. > >Sebastian > >On 11/19/2015 04:39 A

Re: [jira] [Commented] (NUTCH-2166) Add reverse URL format to dump tool

2015-11-12 Thread Mattmann, Chris A (3980)
We’ll run into file length issues - Giuseppe had the same problem, and so did students who used it from USC hence the solution we have now. I think having nested directory structures is probably the best bet, and making it configurable.
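
The nested-directory idea can be sketched as follows. This is an illustration only, not the Nutch dump tool's implementation: the function name reversed_url_path, the 255-character limit, and the hash-suffix shortening are all assumptions. The host is reversed into one directory per label, so no single file name has to carry the whole URL.

```python
import hashlib
from urllib.parse import urlsplit


def reversed_url_path(url, max_name_len=255):
    """Map a URL to a nested relative path with the host labels reversed.

    max_name_len is an assumed per-component limit; many file systems cap
    a single file or directory name at roughly this length.
    """
    parts = urlsplit(url)
    host_labels = [label for label in (parts.hostname or "").split(".") if label]
    path_segments = [seg for seg in parts.path.split("/") if seg]
    safe_segments = []
    for segment in list(reversed(host_labels)) + path_segments:
        if len(segment) > max_name_len:
            # Shorten over-long components and keep them distinct by
            # appending a hash of the full component.
            digest = hashlib.sha1(segment.encode("utf-8")).hexdigest()
            segment = segment[: max_name_len - len(digest) - 1] + "_" + digest
        safe_segments.append(segment)
    return "/".join(safe_segments)


print(reversed_url_path("http://nutch.apache.org/version_control.html"))
```

Making max_name_len a parameter is one way to keep the layout configurable, as the message suggests; query strings are ignored in this sketch.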

Re: Updates to CHANGES.txt on commit

2015-11-11 Thread Mattmann, Chris A (3980)
Mike I honestly prefer just having it as a text file. If you search way back in the logs Doug talked about this long ago, but I generally agree. JIRA would be nice but I just like to keep it up to date in text and in JIRA. Sorry for the dupe work but it pays off.

The Nutch Webapp

2015-11-06 Thread Mattmann, Chris A (3980)
Hey Everyone, So I just tried the Nutch Webapp for 1.11. It’s brittle, but works. I am REALLY happy with it. Great work Fjodor Vershinin and Lewis on making the application! Since it’s in Wicket and I know my way around Wicket I’m going to work in 1.12 and beyond on really improving this and

Re: [jira] [Created] (NUTCH-2155) Create a "crawl completeness" utility

2015-11-01 Thread Mattmann, Chris A (3980)
Hey Aron, it isn’t yet - @MikeJ and @Sujen want to give it a whack? Cheers, Chris

Re: [VOTE] Apache Nutch 1.11 Release Candidate #1

2015-10-26 Thread Mattmann, Chris A (3980)
1 We usually release tar.gz as well as zip. More importantly we need >to release the sources as well as the binary. We can't even test that it >compiles OK > > >Since you released Tika, why don't we include it before cutting 1.11? > > >Thanks > > >Julien > > > &

[VOTE] Apache Nutch 1.11 Release Candidate #1

2015-10-25 Thread Mattmann, Chris A (3980)
Hi Folks, A first candidate for the Nutch 1.11 release is available at: https://dist.apache.org/repos/dist/dev/nutch/1.11/ The release candidate is a zip archive of the sources in: http://svn.apache.org/repos/asf/nutch/tags/release-1.11-rc1/ The SHA1 checksum of the archive is
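
Checking a downloaded archive against a published SHA1 value can be sketched like this. The helper names sha1_of_file and checksum_matches are hypothetical, and this is not the stage_apache_rc script mentioned in other messages; it only shows the comparison step, assuming the expected value is available as a plain hex string.

```python
import hashlib


def sha1_of_file(path, chunk_size=1 << 20):
    """Compute the SHA1 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path, expected_hex):
    """Return True if the file's SHA1 equals the expected hex string."""
    return sha1_of_file(path) == expected_hex.strip().lower()
```

A matching checksum only shows that the download is intact; verifying the release signature is a separate step.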

Re: [DISCUSS] Release 1.11 RC #1 (70 issues fixed)

2015-10-22 Thread Mattmann, Chris A (3980)
Okey dok. I’m also trying to get 1.11 of Tika pushed too.

[DISCUSS] Release 1.11 RC #1 (70 issues fixed)

2015-10-18 Thread Mattmann, Chris A (3980)
Hey Folks, I’ll cut a 1.11 RC #1 today. We have 70 issues fixed, and I think it would be a great time to release. Going to try for a Tika 1.11 release candidate 1 today too. Cheers, Chris

nutch-python

2015-10-16 Thread Mattmann, Chris A (3980)
Hey Folks, My team at JPL and Continuum Analytics have been building a Python-based interface to Nutch that uses the REST API. It’s pretty much done in its initial version: http://github.com/chrismattmann/nutch-python/ We even have a bin/crawl like functionality, crawl.py, here:

Re: Interactive selenium plugin issue

2015-10-08 Thread Mattmann, Chris A (3980)
You should be using nutch 1.11-trunk for your assignment Sent from my iPhone On Oct 8, 2015, at 1:55 PM, Junpeng Luo wrote: Hi everyone, I am using nutch 1.10 and try to use the interactive selenium plugin of the following link:

Re: [VOTE] Release Apache Nutch 2.3.1

2015-09-30 Thread Mattmann, Chris A (3980)
I’ll download and VOTE on the release right now Lewis.

Re: [VOTE] Release Apache Nutch 2.3.1

2015-09-30 Thread Mattmann, Chris A (3980)
+1 from me: [chipotle:~/tmp/nutch-2.3.1-rc1] mattmann% $HOME/bin/stage_apache_rc apache-nutch 2.3.1 https://dist.apache.org/repos/dist/dev/nutch/2.3.1 [chipotle:~/tmp/nutch-2.3.1-rc1] mattmann% $HOME/bin/stage_apache_rc apache-nutch 2.3.1-src https://dist.apache.org/repos/dist/dev/nutch/2.3.1 %

Re: CSCI - 572: Team 18 : Questions

2015-09-27 Thread Mattmann, Chris A (3980)
Hi Team 18, This would be a good question and discussion to move to the dev@nutch.apache.org list. So I’m moving it there. Mike Joyce and Kim Whitehall who are working on Nutch and Selenium can help there. Cheers, Chris

Re: Webcast : Apache Nutch on EMR

2015-09-23 Thread Mattmann, Chris A (3980)
Thanks Julien, great work

Re: Questions regarding CS-572 assignment 1

2015-09-20 Thread Mattmann, Chris A (3980)
Hi Charan, Thanks for your questions. Please copy your emails to dev@nutch.apache.org and subscribe there, as you will find more help I believe. Here are the answers: -Original Message- From: Charan Shampur Date: Sunday, September 20, 2015 at 3:55 PM To: jpluser

Re: Job Opening at Common Crawl - Crawl Engineer / Data Scientist

2015-09-18 Thread Mattmann, Chris A (3980)
awesome thanks for sharing!

Re: Introducing myself (Aron Ahmadia)

2015-09-14 Thread Mattmann, Chris A (3980)
Woo hoo welcome Aron!!!

Re: [ANNOUNCE] New Nutch committer and PMC - Asitang Mishra

2015-09-09 Thread Mattmann, Chris A (3980)
welcome Asitang!!

Re: [DISCUSS] Release Nutch trunk 1.11

2015-08-26 Thread Mattmann, Chris A (3980)
Great thanks. I would love to see the web GUI ported from 2.x: NUTCH-2086 Sujen, do you think you can throw up a Pull Request by today? Cheers, Chris

Nutch-Python

2015-08-20 Thread Mattmann, Chris A (3980)
Hey Folks, My team at JPL and I have an initial prototype Nutch-Python Python library. We are going to integrate it into Memex Explorer, our crawl UI/tool [1], and we have other plans for it too (building D3 viz and charts, etc.) Thanks to Brian Wilson/JPL, and to Sujen Shah/JPL+USC for their

Re: [jira] [Commented] (NUTCH-2048) parse-tika: fix dependencies in plugin.xml

2015-07-23 Thread Mattmann, Chris A (3980)
+1 Sent from my iPhone On Jul 23, 2015, at 1:48 PM, Sebastian Nagel (JIRA) j...@apache.org wrote: [ https://issues.apache.org/jira/browse/NUTCH-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14639407#comment-14639407 ] Sebastian Nagel

Re: [jira] [Updated] (NUTCH-2042) parse-html increase chunk size used to detect charset

2015-07-23 Thread Mattmann, Chris A (3980)
+1 Sent from my iPhone On Jul 23, 2015, at 1:47 PM, Sebastian Nagel (JIRA) j...@apache.org wrote: [ https://issues.apache.org/jira/browse/NUTCH-2042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2042:

Jenkins now publishes Nutch test results (added test-plugins)

2015-07-13 Thread Mattmann, Chris A (3980)
Hey Everyone, Per: https://issues.apache.org/jira/browse/NUTCH-2059 I’ve updated the Jenkins job to correctly record and publish the test results - just added test-plugins as a target as well. See: https://builds.apache.org/job/Nutch-trunk/ Cheers, Chris

Re: Squashing Git Commits

2015-07-03 Thread Mattmann, Chris A (3980)
Agreed! I’ve had to do a lot of this work myself since Mike Joyce challenged me to become a Git master ;) Challenge accepted. But the more contributors can help to squash this stuff the better. Otherwise, my advice is --include and --exclude are your friends :) See #43 for how to use that.

RE: Github Spam

2015-06-24 Thread Mattmann, Chris A (3980)
Hey Lewis, Yeah to be honest, this is no different than ReviewBoard, JIRA, etc. At least it's not as bad as Spark :/ I did a review of Asitang's patch and it took each one of my comments and sent a mail. B/c of Apache's requirement that things happen on the list, we have to have the mails replicated

RE: Github Spam

2015-06-24 Thread Mattmann, Chris A (3980)
Sorry I wasn't clear. I'm *not* fine with getting rid of Github. I was simply proposing for the mail spam to be moved to a different list. But, to me JIRA/SVN, is no different than Github comments and pull requests and so forth. To each their own :) The ASF fully supports Git and Github integration

Added g...@git.apache.org to dev

2015-06-18 Thread Mattmann, Chris A (3980)
Hey Guys, I got sick of moderating Git messages so I used my apmail karma to add g...@git.apache.org to the lists. List moderation for Git should be going away! :) (yay) Cheers, Chris

RE: [jira] [Created] (NUTCH-2000) Link inversion fails with .locked already exists.

2015-06-17 Thread Mattmann, Chris A (3980)
Please follow the instructions on the website to unsubscribe. From: Sahil Shah [sahilshah2...@gmail.com] Sent: Wednesday, June 17, 2015 6:01 PM To: dev@nutch.apache.org Subject: Re: [jira] [Created] (NUTCH-2000) Link inversion fails with .locked already exists.

NASA/JPL's involvement in Memex: now public

2015-05-27 Thread Mattmann, Chris A (3980)
Hey Folks, Just wanted to share publicly some articles recently on NASA and JPL’s involvement in Memex. It’s basically focused around Tika, Nutch and Solr, so keep up the great work on all projects. A sampling of the recent press/articles: E. Landau. Deep Web Search May Help Scientists. NASA Jet

Re: NASA/JPL's involvement in Memex: now public

2015-05-27 Thread Mattmann, Chris A (3980)
, Karl On Wed, May 27, 2015 at 8:26 AM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: Hey Folks, Just wanted to share publicly some articles recently on NASA and JPL’s involvement in Memex. It’s basically focused around Tika, Nutch and Solr, so keep up the great work on all

Re: [Nutch Wiki] Trivial Update of ContributorsGroup by LewisJohnMcgibbney

2015-05-18 Thread Mattmann, Chris A (3980)
So please send an email to dev-unsubscr...@nutch.apache.org as it indicates on the website. http://nutch.apache.org/mailing_lists.html This goes for all the rest of the students recently sending the same email - the instructions are above. -Original Message- From: Haishan Ye

Re: Reverse Geocoding with Nutch 1.10

2015-05-06 Thread Mattmann, Chris A (3980)
awesome job Lewis

Re: [VOTE] Release Apache Nutch 1.10

2015-05-03 Thread Mattmann, Chris A (3980)
+1 from me! SIGS, CHECKSUMS check out, looks gr8. [chipotle:~/tmp/apache-nutch-1.10-rc1] mattmann% $HOME/bin/stage_apache_rc apache-nutch 1.10-src https://dist.apache.org/repos/dist/dev/nutch/1.10/ % Total % Received % Xferd Average Speed Time Time Time Current

All issues fixed for 1.10 - Tika 1.8 build issue

2015-04-27 Thread Mattmann, Chris A (3980)
Hey Folks, All the 1.10 issues are resolved and fixed. There is still the issue that the upgrade to Tika 1.8 broke the build. I’m still trying to figure it out. Cheers, Chris

Re: [PROPOSE] Kick off Apache Nutch 1.8 by EoB Friday 04232015

2015-04-23 Thread Mattmann, Chris A (3980)
s/1.8/1.10/ right? If so +1!

Re: [jira] [Updated] (NUTCH-1927) Create a whitelist of IPs/hostnames to allow skipping of RobotRules parsing

2015-04-17 Thread Mattmann, Chris A (3980)
+1 please commit! Thanks seb Sent from my iPhone On Apr 17, 2015, at 4:15 PM, Sebastian Nagel (JIRA) j...@apache.org wrote: [ https://issues.apache.org/jira/browse/NUTCH-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1927:

DARPA Memex

2015-04-17 Thread Mattmann, Chris A (3980)
Hey Everyone, Here’s what we’ve been involved in: http://www.forbes.com/sites/thomasbrewster/2015/04/17/darpa-nasa-and-partners-show-off-memex/ :) Nutch, Tika, Solr FTW! Cheers, Chris

Re: HTTP Post Authentication

2015-04-07 Thread Mattmann, Chris A (3980)
Thanks Tizy - adding Tyler to this in case he didn’t see it. Tyler is this what you were running into? Thoughts?

Re: Hello!

2015-04-06 Thread Mattmann, Chris A (3980)
Welcome Nipurn! Looking forward to your awesome contributions this summer! :)

Re: Warm hello!

2015-04-06 Thread Mattmann, Chris A (3980)
Dear Shivika, I am very excited and fully expect you to rock your contributions to Nutch! You will be awesome thanks! Cheers, Chris P.S. CC’ing Lauren Wong who I also expect will be doing awesome

Re: [DISCUSS] Release Apache Nutch 1.10

2015-03-31 Thread Mattmann, Chris A (3980)
I’m happy to roll the release. It’s been a while! :) I’ll start right away. Cheers, Chris

Re: GSOC RDF Microformats Support

2015-03-27 Thread Mattmann, Chris A (3980)
Hi Remzi - thanks! You may want to consider this as a Tika or Any23 project since Nutch delegates its parsing to Tika (and Any23 uses Tika [and vice versa] to handle micro formats). Cheers, Chris

Re: GSOC RDF Microformats Support

2015-03-27 Thread Mattmann, Chris A (3980)
, Thanks for your feedback. I was planning to use any23 and tika but I don't have detailed grasp of both projects. I guess I'm gonna need to dive in both. I would appreciate if you could guide me thanks On Fri, Mar 27, 2015 at 4:07 PM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: Hi

Re: Crawl images and store locally

2015-03-24 Thread Mattmann, Chris A (3980)
Hi Tizy, After you crawl the images, take a look at ./bin/nutch dump to get the images out. ./bin/nutch commoncrawldumper also will dump into the common crawl format. Cheers, Chris

Re: TestGDALParser.testParseBasicInfo and TestGDALParser.testParseMetadata errors

2015-03-22 Thread Mattmann, Chris A (3980)
Agreed Seb, moving dev@nutch.a.o into BCC and moving this to the Tika list.

Re: [ANNOUNCE] New Nutch committer and PMC - Mo Omer

2015-03-22 Thread Mattmann, Chris A (3980)
Welcome, Mo!

FW: Curating Issues

2015-03-01 Thread Mattmann, Chris A (3980)
If anyone wants to take a crack at closing issues based on the following criteria, good thread from the dev@tika list.

Re: tika to parse url data content

2015-02-25 Thread Mattmann, Chris A (3980)
Hi Nancy, Tika is what put the metadata into the parsed content in the file you are looking at. See the parse-tika plugin. You don’t need to use Tika further than the information that is in your crawled data. Cheers, Chris

Re: questions about the webui packages

2015-02-24 Thread Mattmann, Chris A (3980)
Yep, Seb, that’s right. I have a student (Sujen Shah) at USC working on Nutch REST 1.x API, with the goal of eventually making D3 visualizations of crawl graphs and seeing what’s going on in a crawl while it’s happening! :) We are working on Wiki pages and have some patches coming on that that

Re: Nutch-Selenium Plugin Truncates Binary Data

2015-02-23 Thread Mattmann, Chris A (3980)
Binary Data Sure, I've just uploaded the updated patch. On Sun, Feb 22, 2015 at 4:50 PM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: I think this is fantastic Mohammad! Can you update the patch on NUTCH-1933 with this improvement, so we can get it into the sources? Cheers

Re: How to read metadata/content of an URL in URLFilter?

2015-02-22 Thread Mattmann, Chris A (3980)
INFO exactdup.ExactDupURLFilter - Processed 5 links 2015-02-22 21:07:13,899 INFO exactdup.ExactDupURLFilter - Processed 6 links Not sure if it is configurable? On Sun, Feb 22, 2015 at 8:56 PM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: That’s one way - for sure - but what

Re: Tesseract OCR and GDAL in Tika plugin for Nutch?

2015-02-22 Thread Mattmann, Chris A (3980)
You need to install 1.8-SNAPSHOT version of Tika in your assignment. Please read the assignment instructions again. http://sunset.usc.edu/classes/cs572_2015/ Cheers, Chris

Re: How to read metadata/content of an URL in URLFilter?

2015-02-22 Thread Mattmann, Chris A (3980)
In the constructor of your URLFilter, why not consider passing in a NutchConfiguration object, and then reading the path to e.g, the LinkDb from the config. Then have a private member variable for the LinkDbReader (maybe static initialized for efficiency) and use that in your interface method.

Re: Vagrant Crushed When using Nutch-Selenium

2015-02-22 Thread Mattmann, Chris A (3980)
@nutch.apache.org Subject: Re: Vagrant Crushed When using Nutch-Selenium No problem! How'd it work out? Mo This message was drafted on a tiny touch screen; please forgive brevity tpyos On Feb 22, 2015, at 6:19 PM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: Thanks Mo, great advice

Re:

2015-02-22 Thread Mattmann, Chris A (3980)
Exactly, Jiaxin, great answer. Cheers, Chris

Re: Nutch-Selenium Error

2015-02-22 Thread Mattmann, Chris A (3980)
Hi Mohammad, did you get this fixed? Cheers, Chris

Re: Nutchpy crawled statistics

2015-02-22 Thread Mattmann, Chris A (3980)
Exactly, Mohammad, thank you. Cheers, Chris

Re: How to read metadata/content of an URL in URLFilter?

2015-02-22 Thread Mattmann, Chris A (3980)
, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: In the constructor of your URLFilter, why not consider passing in a NutchConfiguration object, and then reading the path to e.g, the LinkDb from the config. Then have a private member variable for the LinkDbReader (maybe static initialized

Re: Nutch-Selenium Error

2015-02-22 Thread Mattmann, Chris A (3980)
Good to hear!

Re: Vagrant Crushed When using Nutch-Selenium

2015-02-22 Thread Mattmann, Chris A (3980)
at 11:34 AM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: Thank you Mo. I sincerely appreciate your guidance and contribution. I will work to get your nutch selenium grid plugin contributed to work with Nutch 1.x. Cheers

Re: Problem installing Selenium on Ubuntu with Nutch trunk 1.10

2015-02-22 Thread Mattmann, Chris A (3980)
You are using the Github version of the patch which only works with Nutch2 - you need to use NUTCH-1933. Cheers, Chris

Re: Problem installing Selenium on Ubuntu with Nutch trunk 1.10

2015-02-22 Thread Mattmann, Chris A (3980)
Hi Nikunj, Please see this: https://en.wikipedia.org/wiki/Patch_(Unix) Cheers, Chris

Re: Nutch-Selenium Plugin Truncates Binary Data

2015-02-22 Thread Mattmann, Chris A (3980)
I think this is fantastic Mohammad! Can you update the patch on NUTCH-1933 with this improvement, so we can get it into the sources? Cheers, Chris

Re: linkdb/current/part-00000/data does not exist

2015-02-22 Thread Mattmann, Chris A (3980)
What command are you using to crawl? Are you using bin/crawl, and/or doing incremental crawling? Cheers, Chris

Re: How to read metadata/content of an URL in URLFilter?

2015-02-22 Thread Mattmann, Chris A (3980)
On Sun, Feb 22, 2015 at 4:53 PM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: In the constructor of your URLFilter, why not consider passing in a NutchConfiguration object, and then reading the path to e.g, the LinkDb from the config. Then have a private member variable

Re: How to read metadata/content of an URL in URLFilter?

2015-02-22 Thread Mattmann, Chris A (3980)
detection? Thanks, Renxia On Sun, Feb 22, 2015 at 8:39 PM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: There is nothing stating in your assignment that you can’t use *previously* crawled data to train your model - you should have at least 2 full sets of this. Cheers, Chris

Re: [ANNOUNCE] New Nutch committer and PMC - Jorge Luis Betancourt Gonzalez

2015-02-19 Thread Mattmann, Chris A (3980)
Welcome to the party, Jorge! Cheers, Chris

Re: Tesseract OCR and GDAL in Tika plugin for Nutch?

2015-02-18 Thread Mattmann, Chris A (3980)
Parser checker Sent from my iPhone On Feb 18, 2015, at 3:03 PM, Jiaxin Ye jiaxi...@usc.edu wrote: Hi Tyler, Is there any way to test if newest version of tika is working on Nutch or not? On Wednesday, February 18, 2015, Tyler Palsulich

Re: Vagrant Crushed When using Nutch-Selenium

2015-02-13 Thread Mattmann, Chris A (3980)
Hi Shuo, Thanks for your email. I wonder if using selenium grid would help? Please see this plugin: https://github.com/momer/nutch-selenium-grid-plugin I’m CC’ing Mo the author of the plugin to see if he experienced this while running the original selenium plugin - Mo did using selenium grid

Re: Vagrant Crushed When using Nutch-Selenium

2015-02-13 Thread Mattmann, Chris A (3980)
, Feb 13, 2015 at 10:25 AM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: Hi Shuo, Thanks for your email. I wonder if using selenium grid would help? Please see this plugin: https://github.com/momer/nutch-selenium-grid-plugin I’m CC’ing Mo the author of the plugin to see if he

Integrate Splash with Nutch akin to Selenium

2015-02-13 Thread Mattmann, Chris A (3980)
Hi Guys, As we bring Nutch into the realm of the dynamic deep web, I would like to be working on a plugin that has a similar idea to the Selenium stuff that Mo started and that Lewis and I are integrating - I would like to bring Splash as a component into Nutch too:

Re: Nutch-Selenium in Nutch 1.10

2015-02-12 Thread Mattmann, Chris A (3980)
have modified. 1. patch -p0 YOUR_PATCH_FILE 2. ant clean jar 3. ant runtime Will try crawling using selenium later on. Hope this helped. _ On Thu, Feb 12, 2015 at 9:20 AM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote

Re: Nutch-Selenium in Nutch 1.10

2015-02-12 Thread Mattmann, Chris A (3980)
You need Selenium Jiaxin, in order to crawl dynamic pages in the polar dataset you have been assigned in my CSCI 572 search engines class. The instructions for integrating Selenium with Nutch 1.10-trunk are here: https://issues.apache.org/jira/browse/NUTCH-1933 Cheers, Chris

Re: Nutch-Selenium in Nutch 1.10

2015-02-12 Thread Mattmann, Chris A (3980)
, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: You need Selenium Jiaxin, in order to crawl dynamic pages in the polar dataset you have been assigned in my CSCI 572 search engines class. The instructions for integrating Selenium with Nutch 1.10-trunk are here
