RE: Filtering pages during crawling

2012-07-03 Thread Markus Jelsma
You can try the fetch filter: https://issues.apache.org/jira/browse/NUTCH-828 -Original message- From: shekhar sharma shekhar2...@gmail.com Sent: Tue 03-Jul-2012 06:42 To: user@nutch.apache.org Subject: Filtering pages during crawling Hello, Is it possible to define a filtering

Re: Using multiple proxies with Nutch 1.5

2012-07-03 Thread Lewis John Mcgibbney
Hi, Do you have Nutch working with one proxy? Is NUTCH-208 [0] of any use to you as well? If so then please test the patch out. This particular issue has been dormant for an age. I assume that you've seen the wiki entry for using Nutch with lightweight tinyproxy? Lewis [0]

Suitable index-plugin to add ip_address

2012-07-03 Thread Lewis John Mcgibbney
Hi, In trunk and the Nutchgora branch we committed storing of ip_address (NUTCH-1360). Would it be beneficial for this to be indexed? If so, which existing plugin would be most suitable? Lewis -- Lewis

Re: Suitable index-plugin to add ip_address

2012-07-03 Thread Julien Nioche
Can't this be done with index-metadata and configured accordingly if necessary? Where is the IP info stored? On 3 July 2012 13:52, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi, In trunk and Nutchgora branch we committed storing of ip_address (NUTCH-1360) Would it be beneficial

Re: parsechecker fetches url but fetcher fails

2012-07-03 Thread arijit
Hi, I did some more digging around and noticed this in the output from readseg: Recno:: 0 URL:: http://en.wikipedia.org/wiki/Districts_of_India/ CrawlDatum:: Version: 7 Status: 1 (db_unfetched) Fetch time: Tue Jul 03 16:52:09 IST 2012 Modified time: Thu Jan 01 05:30:00 IST 1970 Retries

Re: [VOTE] Apache Nutch 2.0 Release Candidate #3

2012-07-03 Thread Mattmann, Chris A (388J)
Hi Guys, Unfortunately, -1 from me, please read on: Release SIGS check out: [chipotle:~/tmp/nutch2] mattmann% $HOME/bin/verify_gpg_sigs Verifying Signature for file apache-nutch-2.0-src.tar.gz.asc gpg: Signature made Mon Jun 25 09:28:36 2012 PDT using RSA key ID C601BCA7 gpg: Good signature

Re: [VOTE] Apache Nutch 2.0 Release Candidate #3

2012-07-03 Thread Julien Nioche
Hi Chris [chipotle:~/tmp/nutch2] mattmann% $HOME/bin/verify_gpg_sigs Verifying Signature for file apache-nutch-2.0-src.tar.gz.asc gpg: Signature made Mon Jun 25 09:28:36 2012 PDT using RSA key ID C601BCA7 gpg: Good signature from Lewis John McGibbney (CODE SIGNING KEY) lewi...@apache.org

Re: [VOTE] Apache Nutch 2.0 Release Candidate #3

2012-07-03 Thread Mattmann, Chris A (388J)
Hey Julien, I ran this command: rm -rf /Users/mattmann/.ivy2/ But it still failed with the below messages: [ivy:resolve] :: problems summary :: [ivy:resolve] WARNINGS [ivy:resolve] [FAILED ] org.apache.hadoop#hadoop-core;1.0.3!hadoop-core.jar: invalid sha1:

Re: javascript in href does not get into outlink

2012-07-03 Thread arijit
Thanks a lot. That will be of quite some help. -Arijit From: remi tassing tassingr...@gmail.com To: user@nutch.apache.org Cc: arijit pari...@yahoo.com Sent: Tuesday, July 3, 2012 1:56 PM Subject: Re: javascript in href does not get into outlink I have a

[VOTE] Apache Nutch 1.5.1 RC#3

2012-07-03 Thread Lewis John Mcgibbney
Hi Everyone, A candidate for the Apache Nutch 1.5.1 RC#3 is available at: http://people.apache.org/~lewismc/apache-nutch-1.5.1-rc3 The release candidate is a src.zip, src.tar.gz, bin-zip and bin-tar.gz archive of the sources in: http://svn.apache.org/repos/asf/nutch/tags/release-1.5.1-rc3/

Re: parse and solrindex in nutch-2.0

2012-07-03 Thread alxsss
Hi, I was planning to parse img tags from a URL's content and put them in the metadata field of the Webpage storage class in nutch2.0 to retrieve them later in the indexing step. However, since there is no metadata data type variable in the Parse class (compare with outlinks), this cannot be done in nutch

Nutch Any23 plugin

2012-07-03 Thread Prasanna. Suman
Is Any23 already integrated into Tika as planned? If not, is it on the way? -- -- -- Prasanna Suman #Any program is only as good as it is useful. - Linus Torvalds

problem with connecting to zookeeper (2.0 rc3)

2012-07-03 Thread Tianwei
Hi, all, I am trying to build the 2.0 rc3, but can't make it work. I strictly followed the wiki page (http://wiki.apache.org/nutch/Nutch2Tutorial). Before that, I also ensured that HBase works well, as: hbase(main):004:0> create 'test1', 'cf' 0 row(s) in 1.3080 seconds The following is what I