Re: Will Solr/Nutch crawl multi websites (aka a mini google with faceted search)?

2011-09-13 Thread Alexander Aristov
Hi, I would start from a different angle. Crawling such sites is not an easy task; your parser will have to be very smart. I would investigate whether your web sites have a public API that could be used to do searches, and then aggregate the results into one set. Best Regards Alexander Aristov On 12 September 2011
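The aggregation idea Alexander suggests can be sketched very simply. This is a hand-rolled illustration, not Nutch or Solr code; the `url` and `title` result fields are assumptions about what a site's search API might return:

```python
def aggregate_results(*result_lists):
    """Merge search results from several site APIs, de-duplicating by URL."""
    seen = set()
    merged = []
    for results in result_lists:
        for hit in results:
            if hit["url"] not in seen:
                seen.add(hit["url"])
                merged.append(hit)
    return merged

site_a = [{"url": "http://a.example/1", "title": "One"}]
site_b = [{"url": "http://a.example/1", "title": "One"},
          {"url": "http://b.example/2", "title": "Two"}]
print(len(aggregate_results(site_a, site_b)))  # → 2
```

A real aggregator would also have to merge relevance scores across sources, which is the hard part this sketch skips.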

Re: Modifying fetch order with ScoringFilter

2011-09-13 Thread lewis john mcgibbney
Yes this is true (if the assumption is that by 'compar' you are referring to 'compare'), but it sounds like you are after some kind of similarity analysis. As Markus said, this is not the fetcher's job but another element of the Nutch architecture. In general... I would ask you to understand how

Re: Relative outlinks without base

2011-09-13 Thread Dinçer Kavraal
Hi Markus, Please correct me if I'm wrong, but isn't there a document signature check to detect whether a page contains the same content as some other page that has already been parsed and indexed? Dinçer 2011/9/12 Markus Jelsma markus.jel...@openindex.io Hi, Would it not be a good idea to patch DomContentUtils
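The signature check Dinçer refers to does exist in Nutch (it ships `Signature` implementations such as `MD5Signature` and `TextProfileSignature`). A minimal sketch of the underlying idea, hand-rolled rather than taken from Nutch's code:

```python
import hashlib

def content_signature(text: str) -> str:
    """Hash the page text; identical content yields identical signatures."""
    # Normalize whitespace and case so trivial formatting differences do
    # not change the signature (a rough stand-in for TextProfileSignature).
    normalized = " ".join(text.split()).lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

a = content_signature("Hello   World")
b = content_signature("hello world")
print(a == b)  # → True
```

Pages with equal signatures can then be treated as duplicates at dedup time, which is separate from filtering bad outlinks at parse time.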

Re: Not able to index url which is giving http 302

2011-09-13 Thread Anshuman Mor
Hi Lewis, My fault, sorry for that. I had enabled some of the logging for httpclient. Please find the attached log file. Please let me know if you need more information on this. http://lucene.472066.n3.nabble.com/file/n3332184/hadoop.log hadoop.log -- View this message in context:

Re: Relative outlinks without base

2011-09-13 Thread Markus Jelsma
On Tuesday 13 September 2011 13:12:41 Alexander Aristov wrote: yes you can stop but how do you know if a URL is good or not? You can use a URL filter to discard unwanted URLs. We see that many sites with relative URLs without a base href produce erroneous links. As with the example there is a
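The URL filter Markus mentions is typically configured in Nutch's `conf/regex-urlfilter.txt`, where each line is a `+` (accept) or `-` (reject) prefix followed by a regex matched against the URL. A hedged example; the actual reject patterns depend entirely on what erroneous links your sites produce:

```
# Reject URLs carrying session-style parameters (illustrative pattern)
-[?&](sid|jsessionid)=

# Accept anything not rejected above
+.
```

Rules are evaluated top to bottom, so the catch-all `+.` must come last.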

Outlinks with embedded params

2011-09-13 Thread Markus Jelsma
Hi, Another complaint about Nutch's handling of outlinks. Since NUTCH-436 there is better support for embedded segment parameters. This exotic feature, however, causes a lot of invalid outlinks to be generated. For some reason (most likely bad webmasters, as in my other thread) I see a lot of
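One common way to tame parameter-laden outlinks is to normalize them by stripping session-style query parameters before they enter the crawldb (in Nutch this is usually done with the regex URL normalizer plugin). A standalone sketch of that normalization; the parameter names in `JUNK_PARAMS` are illustrative:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Parameters often injected by misbehaving sites; purely illustrative.
JUNK_PARAMS = {"jsessionid", "phpsessid", "sid"}

def strip_junk_params(url: str) -> str:
    """Drop session-style query parameters so outlinks normalize cleanly."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k.lower() not in JUNK_PARAMS]
    return urlunsplit(parts._replace(query=urlencode(kept)))

print(strip_junk_params("http://example.com/page?id=3&sid=abc123"))
# → http://example.com/page?id=3
```

Normalizing this way also collapses many near-duplicate URLs into one, shrinking the fetch list.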

Re: Permission error trying to read map file.

2011-09-13 Thread Markus Jelsma
Committed in 1.4 per https://issues.apache.org/jira/browse/NUTCH-1110 Thanks for pointing me to NUTCH-1029. This will fix it. It seems completely Hadoop-related though, because Nutch opens map files using Hadoop code only. CrawlDbReader delegates it to

Re: Crawl fails - Input path does not exist

2011-09-13 Thread alxsss
Comparing with nutch-1.2, I do not see any content folder under the segment ones. Does this mean that we cannot set store.content to false in nutch-1.3? Thanks. Alex. -- View this message in context: http://lucene.472066.n3.nabble.com/Crawl-fails-Input-path-does-not-exist-tp996823p3334709.html Sent
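For reference, the property governing whether raw page content is written to the segment's `content` directory is `fetcher.store.content` (defined in `nutch-default.xml`). A minimal override in `conf/nutch-site.xml` would look like this; whether disabling it breaks later steps in a given 1.3 crawl pipeline is exactly the question raised above:

```xml
<!-- nutch-site.xml: do not store raw fetched content in segments -->
<property>
  <name>fetcher.store.content</name>
  <value>false</value>
</property>
```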

How to search on specific file types?

2011-09-13 Thread ahmad ajiloo
Hello, I want to search articles via Solr, so I need to find only specific file types like doc, docx, and pdf. I don't need any HTML pages; thus the results of our search should consist only of doc, docx, and pdf files. I'm using Nutch to crawl web pages and send Nutch's data to Solr for
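One way to restrict results to certain file types is a Solr filter query on the document's content type. This sketch assumes the index has a `type` field holding the MIME type, as populated by Nutch's `index-more` plugin against the example Solr schema; it only builds the request URL and does not call a live server:

```python
from urllib.parse import urlencode

def build_filetype_query(base_url, query, mime_types):
    """Build a Solr select URL whose fq restricts hits to given MIME types.

    Assumes a 'type' field in the schema (Nutch's index-more plugin).
    """
    fq = " OR ".join(f'type:"{t}"' for t in mime_types)
    params = urlencode({"q": query, "fq": fq, "wt": "json"})
    return f"{base_url}/select?{params}"

url = build_filetype_query(
    "http://localhost:8983/solr",
    "nutch",
    ["application/pdf",
     "application/msword",
     "application/vnd.openxmlformats-officedocument.wordprocessingml.document"],
)
print(url)
```

Using `fq` rather than folding the types into `q` keeps the relevance ranking driven by the user's query while the type restriction is cached and reused by Solr.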