I’m thinking about it :) Would be great to go.
++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop:
of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++
On 6/24/16, 5:56 PM, "Mattmann, Chris A (3980)" <chris.a.mattm...@jpl.nasa.gov>
wrote:
>Fixed, sorry.
>
>
>
>
>
Fixed, sorry.
On 6/24/16, 5:53 PM, "Gav" wrote:
>Hi All,
>
>
>Obvious to you all that you use Git as your primary SCM, but for potential new
>contributors it may not be.
>
>
>This page :-
>
>http://nutch.apache.org/version_control.html
>
>
>says you use SVN as your
+1 from me, great job Lewis and team!
SIGS pass, CHECKSUMS pass:
LMC-053601:apache-nutch-1.12-rc1 mattmann$ $HOME/bin/stage_apache_rc
apache-nutch 1.12-bin https://dist.apache.org/repos/dist/dev/nutch/1.12/
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
Neat thanks for sending Lewis!
FYI I wrote some tools in Python to parse Tika (and Nutch)
style Changes.txt, and to generate an APT output template for
e.g., web page release notes. FYI here:
https://github.com/chrismattmann/apachestuff/blob/master/extract-tika-issues.py
Works with JIRA and
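A minimal sketch of the kind of parsing described above, not the linked script itself: it pulls JIRA-style issue keys out of Changes.txt-style text and renders a simple APT-like bullet list. The function names, the sample entries and the output layout are assumptions made for illustration.

```python
import re
from collections import OrderedDict

# Matches JIRA-style keys such as NUTCH-2059 or TIKA-1234.
# Note: this simple pattern would also match tokens like "UTF-8".
ISSUE_KEY = re.compile(r"\b([A-Z][A-Z0-9]+-\d+)\b")


def extract_issues(changes_text):
    """Return (key, summary) pairs, keeping only the first mention of each key."""
    seen = OrderedDict()
    for line in changes_text.splitlines():
        for key in ISSUE_KEY.findall(line):
            # Drop the key itself and any leading bullet characters.
            summary = line.replace(key, "").strip(" *\t")
            seen.setdefault(key, summary)
    return list(seen.items())


def to_apt(issues, title):
    """Render issues as an APT-like bullet list (illustrative layout only)."""
    lines = [title, ""]
    for key, summary in issues:
        lines.append("  * %s: %s" % (key, summary))
    return "\n".join(lines)


# Made-up sample entries in a Changes.txt-like layout.
SAMPLE = """\
Nutch Change Log

* NUTCH-2059 Record and publish test results in Jenkins (example entry)
* NUTCH-1933 Selenium protocol plugin (example entry)
* NUTCH-2059 duplicate mention
"""

if __name__ == "__main__":
    print(to_apt(extract_issues(SAMPLE), "Release notes"))
```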
+1 from me!
++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
Thanks to Gav for reminding me
Sent from my iPhone
o normal.
>
>Sebastian
>
>On 04/18/2016 05:56 PM, Mattmann, Chris A (3980) wrote:
>> Hey Seb, I’ll also take a look. @Lewis could potentially help here
>> too. Lewis any time to scope?
>>
>>
>>
Hey Seb, I’ll also take a look. @Lewis could potentially help here
too. Lewis any time to scope?
I am +1 and also +1 to branch and start to just build Maven3 support
full out in Nutch.
FYI:
On 4/5/16, 6:46 PM, "Mattmann, Chris A (3980)" <chris.a.mattm...@jpl.nasa.gov>
wrote:
>FYI:
>http://www.forbes.com/sites/thomasbrewster/2016/04/05/panama-papers-amazon-encryption-epic-leak/?utm_campaign=ForbesTech_source=TWITTER_medium=social_channel=Technology
yeah well we are going to have to accept that some of these will
appear on SO, but that we will try as hard as possible to suggest
they contact the dev list as you mentioned :)
Thanks for commenting Markus.
try: release-1.11-rc2 :)
Hi Team,
Nutch now officially uses Git to manage its source repos. You can
see the final elements to that here:
https://issues.apache.org/jira/browse/INFRA-11300
I’ve written a guide for the wiki describing how to migrate your
existing SVN checkout to Nutch if you are a user or a developer.
thern California, Los Angeles, CA 90089 USA
>> ++
>>
>>
>>
>>
>>
>> -----Original Message-----
>> From: Sebastian Nagel <wastl.na...@googlemail.com>
>> Reply-To: "dev@nutc
Git
>Thanks, Chris!
>
>On 02/20/2016 08:49 AM, Mattmann, Chris A (3980) wrote:
>> Team:
>>
>> https://issues.apache.org/jira/browse/INFRA-11300
>>
>>
>> to track the progress..
>>
>> Cheers,
>> Chris
>>
>> +++
Team:
https://issues.apache.org/jira/browse/INFRA-11300
to track the progress..
Cheers,
Chris
Team,
This VOTE has PASSED with the following tallies:
+1 PMC
Chris Mattmann*
Sebastian Nagel*
Michael Joyce*
Asitang Mishra*
Dennis Kubes*
BlackIce
Julien Nioche*
Sujen Shah*
Given that I’ll file a ticket with INFRA to move the repos over.
thanks!
Cheers,
Chris
My bad I said I would do this!
Here you go it’s done:
+1
SIGS, checksums check out:
[chipotle:~/tmp/apache-nutch-2.3.1-rc2] mattmann%
$HOME/bin/stage_apache_rc apache-nutch 2.3.1-src
https://dist.apache.org/repos/dist/dev/nutch/2.3.1rc2/
  % Total    % Received % Xferd  Average Speed   Time
I will review tonight.
RE: The note - all Dennis has to do is ask to be back on the PMC
and he would be welcomed back in a jiffy, as an Emeritus PMC member.
Cheers,
Chris
Booom
Sent from my iPhone
On Dec 8, 2015, at 3:18 PM, Michael Joyce wrote:
Cheers for pushing this out Lewis. And great job everyone on the hard work!!!
-- Jimmy
On Tue, Dec 8, 2015 at 1:26 AM, Markus Jelsma
Hey Lewis,
1.11 RC #1 release artifacts dropped from Nexus.
You should have perms to remove the release artifacts from
dist.apache.org/repos/dist/. I can remove them though?
[mattmann-0420740:~/tmp/nutch1.11/release] mattmann% svn rm 1.11
D 1.11
D 1.11/CHANGES-1.11.txt
D
Hi Lewis,
+1 from me. SIGS and CHECKSUMS check out.
bash-3.2$ for atype in bin src; do /Users/mattmann/bin/stage_apache_rc
apache-nutch 1.11-$atype
https://dist.apache.org/repos/dist/dev/nutch/1.11rc2/; done
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
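For readers wondering what the checksum half of "SIGS and CHECKSUMS check out" amounts to, here is a rough sketch. It is not the stage_apache_rc script, whose contents are not shown here; the function names are made up, and it assumes the published checksum is a hex digest optionally followed by a file name. Verifying the GPG signatures is a separate step that this does not cover.

```python
import hashlib


def file_digest(path, algorithm="sha1", chunk_size=1 << 20):
    """Hash a file in chunks so large release archives need not fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def checksum_matches(path, published, algorithm="sha1"):
    """Compare a file's digest with a published 'digest [filename]' string.

    Assumption: the digest is the first whitespace-separated token.
    Other checksum file layouts would need their own parsing.
    """
    tokens = published.split()
    expected = tokens[0].lower() if tokens else ""
    return file_digest(path, algorithm) == expected
```

A caller would read the published checksum file that sits next to the archive and pass its contents as `published`.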
Sounds great mate let's get the rc up there
Sent from my iPhone
On Nov 20, 2015, at 10:32 AM, Lewis John Mcgibbney wrote:
Hi Folks,
Title says it all.
There is only one pending issue for 1.11.
"dev@nutch.apache.org" <dev@nutch.apache.org>
Subject: Re: [DISCUSS] Moving to Git
>+1 from me
>
>But, please, after 1.11 and 2.3.1 have been finally released.
>There is a little work to do, and we should keep the focus on the releases first.
>
>Sebastian
>
>On 11/19/2015 04:39 A
We’ll run into file length issues - Giuseppe had the same problem,
and so did students who used it from USC hence the solution we have
now. I think having nested directory structures is probably the best
bet, and making it configurable.
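One way to picture the nested-directory idea is the sketch below. It assumes the length problem comes from file names derived from long URLs, which this snippet does not spell out, and it is not the Nutch implementation: the function name and the `depth` and `width` parameters are hypothetical. Each file is named by a fixed-length hash and placed under a few short directory levels taken from that hash.

```python
import hashlib
import os


def nested_path(root, url, depth=2, width=2, extension=""):
    """Map a URL to root/ab/cd/<digest><extension> with a bounded file name length.

    depth: number of nested directory levels (configurable).
    width: number of hex characters used for each level's name.
    """
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    parts = [digest[i * width:(i + 1) * width] for i in range(depth)]
    return os.path.join(root, *parts, digest + extension)
```

Because the file name is always a 64-character digest plus the extension, its length no longer depends on the URL, and the directory levels keep any single directory from growing too large.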
Mike I honestly prefer just having it as a text file. If you search
way back in the logs Doug talked about this long ago, but I generally
agree. JIRA would be nice but I just like to keep it up to date in text
and in JIRA.
Sorry for the dupe work but it pays off.
Hey Everyone,
So I just tried the Nutch Webapp for 1.11. It’s brittle, but works.
I am REALLY happy with it. Great work Fjodor Vershinin and Lewis on
making the application!
Since it’s in Wicket and I know my way around Wicket I’m going to
work in 1.12 and beyond on really improving this and
Hey Aron, it isn’t yet - @MikeJ and @Sujen want to give it a whack?
Cheers,
Chris
1 We usually release tar.gz as well as zip. More importantly we need
>to release the sources as well as the binary. We can't even test that it
>compiles OK
>
>
>Since you released Tika, why don't we include it before cutting 1.11?
>
>
>Thanks
>
>
>Julien
>
>
>
Hi Folks,
A first candidate for the Nutch 1.11 release is available at:
https://dist.apache.org/repos/dist/dev/nutch/1.11/
The release candidate is a zip archive of the sources in:
http://svn.apache.org/repos/asf/nutch/tags/release-1.11-rc1/
The SHA1 checksum of the archive is
Okey dok. I’m also trying to get 1.11 of Tika pushed too.
Hey Folks,
I’ll cut a 1.11 RC #1 today. We have 70 issues fixed, and I think
it would be a great time to release.
Going to try for a Tika 1.11 release candidate 1 today too.
Cheers,
Chris
Hey Folks,
My team at JPL and Continuum Analytics have been building a
Python-based interface to Nutch that uses the REST API.
It’s pretty much done in its initial version:
http://github.com/chrismattmann/nutch-python/
We even have a bin/crawl like functionality, crawl.py, here:
You should be using nutch 1.11-trunk for your assignment
Sent from my iPhone
On Oct 8, 2015, at 1:55 PM, Junpeng Luo wrote:
Hi everyone,
I am using nutch 1.10 and try to use the interactive selenium plugin of the
following link:
I’ll download and VOTE on the release right now Lewis.
+1 from me:
[chipotle:~/tmp/nutch-2.3.1-rc1] mattmann% $HOME/bin/stage_apache_rc
apache-nutch 2.3.1 https://dist.apache.org/repos/dist/dev/nutch/2.3.1
[chipotle:~/tmp/nutch-2.3.1-rc1] mattmann% $HOME/bin/stage_apache_rc
apache-nutch 2.3.1-src https://dist.apache.org/repos/dist/dev/nutch/2.3.1
%
Hi Team 18,
This would be a good question and discussion to move
to the dev@nutch.apache.org list. So I’m moving it there.
Mike Joyce and Kim Whitehall who are working on Nutch and
Selenium can help there.
Cheers,
Chris
Thanks Julien, great work
Hi Charan,
Thanks for your questions. Please copy your emails to
dev@nutch.apache.org and subscribe there, as you will
find more help I believe.
Here are the answers:
-----Original Message-----
From: Charan Shampur
Date: Sunday, September 20, 2015 at 3:55 PM
To: jpluser
awesome thanks for sharing!
Woo hoo welcome Aron!!!
welcome Asitang!!
Great thanks. I would love to see the web GUI ported from 2.x:
NUTCH-2086
Sujen, do you think you can throw up a Pull Request by today?
Cheers,
Chris
Hey Folks,
My team at JPL and I have an initial prototype Nutch-Python
Python library. We are going to integrate it into Memex Explorer,
our crawl UI/tool [1], and we have other plans for it too (building
D3 viz and charts, etc.)
Thanks to Brian Wilson/JPL, and to Sujen Shah/JPL+USC for their
+1
Sent from my iPhone
On Jul 23, 2015, at 1:48 PM, Sebastian Nagel (JIRA) j...@apache.org wrote:
[
https://issues.apache.org/jira/browse/NUTCH-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14639407#comment-14639407
]
Sebastian Nagel
+1
Sent from my iPhone
On Jul 23, 2015, at 1:47 PM, Sebastian Nagel (JIRA) j...@apache.org wrote:
[
https://issues.apache.org/jira/browse/NUTCH-2042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel updated NUTCH-2042:
Hey Everyone,
Per: https://issues.apache.org/jira/browse/NUTCH-2059
I’ve updated the Jenkins job to correctly record and publish
the test results - just added test-plugins as a target as well.
See:
https://builds.apache.org/job/Nutch-trunk/
Cheers,
Chris
Agreed! I’ve had to do a lot of this work myself since Mike Joyce
challenged me to become a Git master ;) Challenge accepted.
But the more contributors can help to squash this stuff the better.
Otherwise, my advice is --include and --exclude are your friends :)
See #43 for how to use that.
Hey Lewis,
Yeah to be honest, this is no different than ReviewBoard, JIRA, etc.
At least it's not as bad as Spark :/ I did a review of Asitang's patch
and it took each one of my comments and sent a mail. B/c of Apache's
requirement that things happen on the list, we have to have the mails
replicated
Sorry I wasn't clear. I'm *not* fine with getting rid of Github.
I was simply proposing for the mail spam to be moved to a different
list. But, to me JIRA/SVN, is no different than Github comments and
pull requests and so forth. To each their own :) The ASF fully supports
Git and Github integration
Hey Guys,
I got sick of moderating Git messages so I used my apmail karma to add
g...@git.apache.org to the lists.
List moderation for Git should be going away! :)
(yay)
Cheers,
Chris
Please follow the instructions on the website to unsubscribe.
From: Sahil Shah [sahilshah2...@gmail.com]
Sent: Wednesday, June 17, 2015 6:01 PM
To: dev@nutch.apache.org
Subject: Re: [jira] [Created] (NUTCH-2000) Link inversion fails with .locked
already exists.
Hey Folks,
Just wanted to share publicly some articles recently on NASA and
JPL’s involvement in Memex. It’s basically focused around Tika,
Nutch and Solr, so keep up the great work on all projects. A sampling
of the recent press/articles:
E. Landau. Deep Web Search May Help Scientists. NASA Jet
,
Karl
On Wed, May 27, 2015 at 8:26 AM, Mattmann, Chris A (3980)
chris.a.mattm...@jpl.nasa.gov wrote:
Hey Folks,
Just wanted to share publicly some articles recently on NASA and
JPL’s involvement in Memex. It’s basically focused around Tika,
Nutch and Solr, so keep up the great work on all
So please send an email to dev-unsubscr...@nutch.apache.org as it
indicates on the website.
http://nutch.apache.org/mailing_lists.html
This goes for all the rest of the students recently sending the
same email - the instructions are above.
-----Original Message-----
From: Haishan Ye
awesome job Lewis
+1 from me! SIGS, CHECKSUMS check out, looks gr8.
[chipotle:~/tmp/apache-nutch-1.10-rc1] mattmann% $HOME/bin/stage_apache_rc
apache-nutch 1.10-src https://dist.apache.org/repos/dist/dev/nutch/1.10/
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
Hey Folks,
All the 1.10 issues are resolved and fixed. There is still the
issue that the upgrade to Tika 1.8 broke the build. I’m still
trying to figure it out.
Cheers,
Chris
s/1.8/1.10/ right?
If so +1!
+1 please commit! Thanks seb
Sent from my iPhone
On Apr 17, 2015, at 4:15 PM, Sebastian Nagel (JIRA) j...@apache.org wrote:
[
https://issues.apache.org/jira/browse/NUTCH-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel updated NUTCH-1927:
Hey Everyone,
Here’s what we’ve been involved in:
http://www.forbes.com/sites/thomasbrewster/2015/04/17/darpa-nasa-and-partners-show-off-memex/
:) Nutch, Tika, Solr FTW!
Cheers,
Chris
Thanks Tizy - adding Tyler to this in case he didn’t see it.
Tyler is this what you were running into? Thoughts?
Welcome Nipurn! Looking forward to your awesome contributions
this summer! :)
Dear Shivika,
I am very excited and fully expect you to rock your contributions
to Nutch! You will be awesome thanks!
Cheers,
Chris
P.S. CC’ing Lauren Wong who I also expect will be doing awesome
I’m happy to roll the release. It’s been a while! :)
I’ll start right away.
Cheers,
Chris
Hi Remzi - thanks! You may want to consider this as a Tika or
Any23 project since Nutch delegates its parsing to Tika (and
Any23 uses Tika [and vice versa] to handle micro formats).
Cheers,
Chris
,
Thanks for your feedback.
I was planning to use any23 and tika but I don't have a detailed grasp of
both projects. I guess I'm gonna need to dive into both.
I would appreciate if you could guide me
thanks
On Fri, Mar 27, 2015 at 4:07 PM, Mattmann, Chris A (3980)
chris.a.mattm...@jpl.nasa.gov wrote:
Hi
Hi Tizy,
After you crawl the images, take a look at ./bin/nutch dump to
get the images out. ./bin/nutch commoncrawldumper also will
dump into the common crawl format.
Cheers,
Chris
Agreed Seb, moving dev@nutch.a.o into BCC and moving this to
the Tika list.
Welcome, Mo!
If anyone wants to take a crack at closing issues based on the
following criteria, good thread from the dev@tika list.
Hi Nancy,
Tika is what put the metadata into the parsed content
in the file you are looking at. See the parse-tika
plugin. You don’t need to use Tika further than the
information that is in your crawled data.
Cheers,
Chris
Yep, Seb, that’s right.
I have a student (Sujen Shah) at USC working on
Nutch REST 1.x API, with the goal of eventually
making D3 visualizations of crawl graphs and
seeing what’s going on in a crawl while it’s
happening! :)
We are working on Wiki pages and have some patches
coming on that that
Binary Data
Sure, I've just uploaded the updated patch.
On Sun, Feb 22, 2015 at 4:50 PM, Mattmann, Chris A (3980)
chris.a.mattm...@jpl.nasa.gov wrote:
I think this is fantastic Mohammad!
Can you update the patch on NUTCH-1933 with this improvement,
so we can get it into the sources?
Cheers
INFO exactdup.ExactDupURLFilter - Processed 5
links
2015-02-22 21:07:13,899 INFO exactdup.ExactDupURLFilter - Processed 6
links
Not sure if it is configurable?
On Sun, Feb 22, 2015 at 8:56 PM, Mattmann, Chris A (3980)
chris.a.mattm...@jpl.nasa.gov wrote:
That’s one way - for sure - but what
You need to install 1.8-SNAPSHOT version of Tika in your assignment.
Please read the assignment instructions again.
http://sunset.usc.edu/classes/cs572_2015/
Cheers,
Chris
In the constructor of your URLFilter, why not consider passing
in a NutchConfiguration object, and then reading the path to e.g.,
the LinkDb from the config. Then have a private member variable
for the LinkDbReader (maybe static initialized for efficiency)
and use that in your interface method.
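The shape of that suggestion can be sketched as follows. Nutch itself is Java, so this Python version only illustrates the pattern: the class names, the `linkdb.path` key and the stand-in reader are all hypothetical, not Nutch APIs. The filter reads the LinkDb path from the configuration in its constructor, opens one shared reader lazily, and uses it in the filter method, which returns the URL to keep it or None to reject it.

```python
class FakeLinkDbReader:
    """Hypothetical stand-in for a LinkDb reader; not the Nutch class."""

    opened = 0  # counts how many readers were constructed

    def __init__(self, path):
        FakeLinkDbReader.opened += 1
        self.path = path
        # Toy data: URL -> list of inlinking URLs.
        self._inlinks = {"http://example.com/a": ["http://example.com/"]}

    def get_inlinks(self, url):
        return self._inlinks.get(url, [])


class InlinkURLFilter:
    """Keeps only URLs that have at least one known inlink."""

    _reader = None  # shared by all instances, created on first use

    def __init__(self, conf):
        # 'linkdb.path' is an assumed configuration key for this sketch.
        self.linkdb_path = conf["linkdb.path"]

    def _get_reader(self):
        if InlinkURLFilter._reader is None:
            InlinkURLFilter._reader = FakeLinkDbReader(self.linkdb_path)
        return InlinkURLFilter._reader

    def filter(self, url):
        """Return the URL to accept it, or None to reject it."""
        return url if self._get_reader().get_inlinks(url) else None
```

Sharing the reader means the LinkDb is opened once rather than once per filter instance, which is the efficiency point made above.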
@nutch.apache.org
Subject: Re: Vagrant Crushed When using Nutch-Selenium
No problem! How'd it work out?
Mo
This message was drafted on a tiny touch screen; please forgive brevity
tpyos
On Feb 22, 2015, at 6:19 PM, Mattmann, Chris A (3980)
chris.a.mattm...@jpl.nasa.gov wrote:
Thanks Mo, great advice
Exactly, Jiaxin, great answer.
Cheers,
Chris
Hi Mohammad, did you get this fixed?
Cheers,
Chris
Exactly, Mohammad, thank you.
Cheers,
Chris
, Chris A (3980)
chris.a.mattm...@jpl.nasa.gov wrote:
In the constructor of your URLFilter, why not consider passing
in a NutchConfiguration object, and then reading the path to e.g.,
the LinkDb from the config. Then have a private member variable
for the LinkDbReader (maybe static initialized
Good to hear!
at 11:34 AM, Mattmann, Chris A (3980)
chris.a.mattm...@jpl.nasa.gov wrote:
Thank you Mo. I sincerely appreciate your guidance and contribution.
I will work to get your nutch selenium grid plugin contributed
to work with Nutch 1.x.
Cheers
You are using the Github version of the patch which only works
with Nutch2 - you need to use NUTCH-1933.
Cheers,
Chris
Hi Nikunj,
Please see this:
https://en.wikipedia.org/wiki/Patch_(Unix)
Cheers,
Chris
I think this is fantastic Mohammad!
Can you update the patch on NUTCH-1933 with this improvement,
so we can get it into the sources?
Cheers,
Chris
What command are you using to crawl? Are you using bin/crawl, and/or
doing incremental crawling?
Cheers,
Chris
On Sun, Feb 22, 2015 at 4:53 PM, Mattmann, Chris A (3980)
chris.a.mattm...@jpl.nasa.gov wrote:
In the constructor of your URLFilter, why not consider passing
in a NutchConfiguration object, and then reading the path to e.g.,
the LinkDb from the config. Then have a private member variable
detection?
Thanks,
Renxia
On Sun, Feb 22, 2015 at 8:39 PM, Mattmann, Chris A (3980)
chris.a.mattm...@jpl.nasa.gov wrote:
There is nothing stating in your assignment that you can’t
use *previously* crawled data to train your model - you
should have at least 2 full sets of this.
Cheers,
Chris
Welcome to the party, Jorge!
Cheers,
Chris
Parser checker
Sent from my iPhone
On Feb 18, 2015, at 3:03 PM, Jiaxin Ye
jiaxi...@usc.edu wrote:
Hi Tyler,
Is there any way to test if the newest version of Tika is working on Nutch or not?
On Wednesday, February 18, 2015, Tyler Palsulich
Hi Shuo,
Thanks for your email. I wonder if using selenium grid would
help?
Please see this plugin:
https://github.com/momer/nutch-selenium-grid-plugin
I’m CC’ing Mo the author of the plugin to see if he experienced
this while running the original selenium plugin - Mo did using
selenium grid
, Feb 13, 2015 at 10:25 AM, Mattmann, Chris A (3980)
chris.a.mattm...@jpl.nasa.gov wrote:
Hi Shuo,
Thanks for your email. I wonder if using selenium grid would
help?
Please see this plugin:
https://github.com/momer/nutch-selenium-grid-plugin
I’m CC’ing Mo the author of the plugin to see if he
Hi Guys,
As we bring Nutch into the realm of the dynamic deep web,
I would like to be working on a plugin that has a similar
idea to the Selenium stuff that Mo started and that Lewis
and I are integrating - I would like to bring Splash as a
component into Nutch too:
have modified.
1. patch -p0 < YOUR_PATCH_FILE
2. ant clean jar
3. ant runtime
Will try crawling using selenium later on. Hope this helped.
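The three steps listed above could be scripted roughly like this. It is only a sketch: the function names are made up, it assumes `patch` and `ant` are on the PATH and that it is run from the top of the Nutch source tree, and it feeds the patch file to `patch` on standard input.

```python
import subprocess


def build_steps(patch_file):
    """Return (argv, stdin_path) pairs for the patch and build steps."""
    return [
        (["patch", "-p0"], patch_file),   # patch reads the diff from standard input
        (["ant", "clean", "jar"], None),  # rebuild the jar from a clean tree
        (["ant", "runtime"], None),       # regenerate the runtime directory
    ]


def run_steps(patch_file, cwd="."):
    """Run each step in order, stopping at the first failure.

    patch_file is opened relative to the current process directory,
    while the commands themselves run in cwd.
    """
    for argv, stdin_path in build_steps(patch_file):
        if stdin_path is None:
            subprocess.run(argv, cwd=cwd, check=True)
        else:
            with open(stdin_path, "rb") as fh:
                subprocess.run(argv, cwd=cwd, stdin=fh, check=True)
```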
On Thu, Feb 12, 2015 at 9:20 AM, Mattmann, Chris A (3980)
chris.a.mattm...@jpl.nasa.gov wrote:
You need Selenium Jiaxin, in order to crawl dynamic pages in the
polar dataset you have been assigned in my CSCI 572 search engines class.
The instructions for integrating Selenium with Nutch 1.10-trunk
are here:
https://issues.apache.org/jira/browse/NUTCH-1933
Cheers,
Chris
, Chris A (3980)
chris.a.mattm...@jpl.nasa.gov wrote:
You need Selenium Jiaxin, in order to crawl dynamic pages in the
polar dataset you have been assigned in my CSCI 572 search engines class.
The instructions for integrating Selenium with Nutch 1.10-trunk
are here